I've been writing a Flask + JS website recently, and when I finally finished it I realized: I have no idea how to deploy it?!

After some research I decided to use Docker to separate MongoDB from the Flask application. For the Flask part I relied on the following article:

用docker部署flask+gunicorn+nginx (Deploying flask + gunicorn + nginx with Docker)
https://www.cnblogs.com/xuanmanstein/p/7692256.html

The author actually wrote quite a lot, but much of it is already prepared in the repository and can simply be run as-is. Two notes on what I changed and what confused me:

The original author COPYs all of the code into the image at build time. I changed it to COPY only requirements.txt, which is enough for installing the Python packages; the code itself I prefer to mount with -v.

The -v mount for the Flask code is in fact already declared in docker-compose.yml. It took me quite a while to realize this; I assumed I still had to pass some extra option at run time. For a beginner like me, even after reading everything the author wrote I still couldn't work out how to start it. In the end, all you have to do is run the following command in the project folder:

docker-compose up
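For reference, here is a minimal sketch of what such a docker-compose.yml might look like. The service names, the ./app path and the other details below are my own illustrative assumptions rather than the repository's actual file; the point is the volumes: entry, which mounts the code instead of baking it into the image:

version: '3'
services:
  web:
    build: .             # the Dockerfile only COPYs requirements.txt and installs the packages
    image: cnf10
    ports:
      - "80:80"
    volumes:
      - ./app:/app       # mount the Flask code at run time instead of COPYing it at build time
  mongo:
    image: mongo
    ports:
      - "27017:27017"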

In summary, the concrete steps are:

  1. Download the whole nginx-gunicorn-flask folder from the article's GitHub repository.
  2. Put your finished Flask app into the App folder.
  3. From the command line, build the image inside the target folder nginx-gunicorn-flask (or whatever you named the directory):
sudo docker build -t 'cnf10' .
Note: cnf10 is a custom name for the image, and the command ends with ' . ' (the build context); leaving out the ' . ' makes the build fail.
  4. Once the image is built, you can run it temporarily for testing with:
sudo docker run -it --rm -p 80:80 -p 27017:27017  cnf10
Notes:
-t allocates a terminal (TTY)
-i keeps STDIN open so you can interact with the container
--rm makes Docker automatically clean up the container's filesystem when the container exits
-p publishes the required ports
cnf10 is the image name
  5. Finally, start everything directly with:
docker-compose up
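Two small extras of my own, not from the original article: add -d to keep the stack running in the background, and use docker-compose down to stop and remove the containers afterwards.

docker-compose up -d
docker-compose down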

Mis-installation and removal

I recently wrote an app and planned to use Docker images to separate MongoDB from the Flask app. Locally, after building the image, docker-compose up ran perfectly, but as soon as I moved the code onto my VPS I got the following error:

The program 'docker-compose' is currently not installed. You can install it by typing:
apt install docker-compose

I had installed Docker ages ago and had no idea docker-compose needs to be installed separately, which seemed odd. So I did what the prompt said and ran apt install docker-compose. Once that finished, another error appeared:

root@ubuntu:~/cnF10# docker-compose up
ERROR: Version in "./docker-compose.yml" is unsupported. You might be seeing this error because you're using the wrong Compose file version. Either specify a version of "2" (or "2.0") and place your service definitions under the `services` key, or omit the `version` key and place your service definitions at the root of the file to use version 1.
For more on the Compose file format versions, see https://docs.docker.com/compose/compose-file/

What on earth is going on?!

root@ubuntu:~/cnF10# docker-compose --version
docker-compose version 1.8.0, build unknown

I checked the installed version and it showed 1.8.0, build unknown. My first thought was that the wrong version had been installed, so I decided to remove it and reinstall. I ran rm /usr/local/bin/docker-compose to delete docker-compose, and got:

root@ubuntu:~/cnF10# rm /usr/local/bin/docker-compose
rm: cannot remove '/usr/local/bin/docker-compose': No such file or directory

After googling through a pile of posts, I eventually got things sorted with the following command:

ln -sf /usr/local/bin/docker-compose /usr/bin/docker-compose

Source: https://github.com/docker/compose/issues/3371
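For what it's worth, the rm above failed because the apt package installs its binary under /usr/bin, not /usr/local/bin (the curl-based install described in the next section is what uses /usr/local/bin). A quick way to check where the command actually lives, as a side note rather than part of the original fix:

which docker-compose                 # the apt package installs it to /usr/bin/docker-compose
dpkg -L docker-compose | grep bin    # list the files the apt package actually installed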

Reinstalling

After the pitfalls above I realized a plain apt install wouldn't do: the apt package ships docker-compose 1.8.0, which only understands Compose file format versions 1 and 2.0, hence the unsupported-version error. So I went to the official documentation, followed the proper steps, and in the end it ran successfully.

Official docs: https://docs.docker.com/compose/install/#install-compose

  1. Run this command to download the latest version of Docker Compose:
sudo curl -L "https://github.com/docker/compose/releases/download/1.22.0/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose

  2. Apply executable permissions to the binary:

sudo chmod +x /usr/local/bin/docker-compose

  3. Test the installation:

docker-compose --version

If nothing unexpected happens, docker-compose should now run normally.

Result page

To start with, here is the final result: https://www.cnvar.cn/ipostatus/

Steps

Crawl the page data and Excel files -> read the Excel files, merge them and compute statistics -> convert the resulting tables to Markdown (so they can be displayed on HEXO)

Directory structure

+--main.py
+--processing
|      +--data
|      |      +--graph.html
|      |      +--index.md
|      |      +--IPOstatus
|      |      |      +--data
|      |      |      |      +--20180727.xls
|      |      |      |      +--20180803.xls
|      |      |      |      +--20180810.xls
|      |      |      |      +--20180817.xls
|      |      |      |      +--20180824.xls
|      |      |      +--md
|      |      |      |      +--20180727.md
|      |      |      |      +--20180803.md
|      |      |      |      +--20180810.md
|      |      |      |      +--20180817.md
|      |      |      |      +--20180824.md
|      |      |      +--stat.csv
|      |      |      +--termination
|      |      |      |      +--20180803.xls
|      |      |      |      +--20180810.xls
|      |      |      |      +--20180817.xls
|      |      |      |      +--20180824.xls
|      +--datatomd.py
|      +--data_crawler.py
|      +--generator.py
|      +--__init__.py

Final step

At this point all the individual pieces are done. Now main.py is used to call the functions defined in those files.

# -*- coding: utf-8 -*-
"""
@author: 柯西君_BingWong
"""

from processing.datatomd import data2md
from processing.data_crawler import parse
from processing.generator import create_charts


def main():
    # crawl the page, download the Excel files and generate the markdown page
    parse(url, DIRTH_DATA, DIRTH_TERMINATION, DIRTH_MD)
    # build the pyecharts page and render it to an HTML file
    create_charts(DIRTH_DATA).render('processing/data/graph.html')
    # convert the latest Excel tables to markdown and append them to index.md
    data2md(DIRTH_DATA, DIRTH_TERMINATION, DIRTH_MD)

if __name__ == '__main__':    
    DIRTH_DATA = 'processing/data/IPOstatus/data/'
    DIRTH_TERMINATION = 'processing/data/IPOstatus/termination/'
    DIRTH_MD = 'processing/data/IPOstatus/md/'
    url = "http://www.csrc.gov.cn/pub/zjhpublic/G00306202/201803/t20180324_335702.htm"

    main()

Generating the Markdown tables

By this point all the charts we need have been generated. The next step is not strictly necessary: it just converts the three tables in the most recently downloaded Excel file to Markdown so they can be displayed as well. The rough flow is: read the Excel file with pandas, merge all the sheets, then read the existing index.md file and append the table data to it. After reading and merging the data with pandas, we use the tabulate module to convert it into Markdown.
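As a quick illustration of what tabulate does here, its 'pipe' format produces a HEXO/GitHub-friendly Markdown table directly from a DataFrame (a toy example, not part of the project code):

import pandas as pd
from tabulate import tabulate

df = pd.DataFrame({'企业名称': ['公司A', '公司B'], '状态': ['已反馈', '已受理']})
print(tabulate(df, headers='keys', tablefmt='pipe'))
# prints roughly:
# |    | 企业名称 | 状态   |
# |---:|:---------|:-------|
# |  0 | 公司A    | 已反馈 |
# |  1 | 公司B    | 已受理 |
# (the index column is included by default, which is what the df_to_markdown helper below strips off)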

# -*- coding: utf-8 -*-
"""
Created on Mon Jul 23 00:07:51 2018

@author: 柯西君_BingWong
"""
import pandas as pd
from tabulate import tabulate
import glob
import os

# helper that converts a pandas DataFrame into a Markdown (pipe) table
def df_to_markdown(df, y_index=False):
    blob = tabulate(df, headers='keys', tablefmt='pipe')
    if not y_index:
        # Remove the index with some creative splicing and iteration
        return '\n'.join(['| {}'.format(row.split('|', 2)[-1]) for row in blob.split('\n')])
    return blob

def save_file(content, filename):
    f = open(filename, 'w',encoding="utf-8")
    f.write(content)
    f.close()

# read the Excel data and append the tables to the markdown page
def data2md(data_path,termination_path,md_path):
    # pick the most recently modified file in each folder
    data_file = sorted(glob.iglob(data_path + '/*'), key=os.path.getmtime)[-1]
    termination_file = sorted(glob.iglob(termination_path + '/*'), key=os.path.getmtime)[-1]
    md_file = sorted(glob.iglob(md_path + '/*'), key=os.path.getmtime)[-1]

    # status data: start from the markdown page generated by the crawler
    with open(md_file, 'r', encoding='utf8') as f:
        markdown = f.read()


    markdown += '\n# 正常审核状态申请企业情况 \n\n'
    status_file = pd.ExcelFile(data_file, sort = True)
    for sheet in status_file.sheet_names[1:4]:
        markdown += '\n\n## ' + sheet + '\n\n'
        data = status_file.parse(sheet,header=[2],index_col=0,skipfooter=1)
        new_columns = [data.columns[i-1] + "二" if data.columns[i].find("Unnamed") >= 0 else data.columns[i] for i in range(len(data.columns))]
        data.columns = new_columns

        table = df_to_markdown(data)
        markdown += table

    #terminated data
    markdown += '\n\n# 中止审查和终止审查企业情况 \n\n'
    terminated_file = pd.ExcelFile(termination_file, sort = True)
    for sheet in terminated_file.sheet_names[1:4]:
        markdown += '\n\n## ' + sheet + '\n\n'
        data = terminated_file.parse(sheet,header=[2],index_col=0)
        new_columns = [data.columns[i-1] + "二" if data.columns[i].find("Unnamed") >= 0 else data.columns[i] for i in range(len(data.columns))]
        data.columns = new_columns

        table = df_to_markdown(data)
        markdown += table
    # the 5th sheet has a different layout, so it is parsed separately
    sheet = terminated_file.sheet_names[4]
    data = terminated_file.parse(sheet, header=[2], index_col=0)
    table = df_to_markdown(data)
    markdown += '\n\n## ' + sheet + '\n\n'
    markdown += table

    save_file(markdown,'processing/data/index.md')

# data_path = 'data/IPOstatus/data/'
# termination_path = 'data/IPOstatus/termination/'
# md_path = 'data/IPOstatus/md/'
# data2md(data_path,termination_path,md_path)

Generating the ECharts charts

Now that the data and Excel files have been crawled, the next step is to clean the data with pandas and generate ECharts charts with pyecharts. pyecharts offers all sorts of chart types; one of them involves aggregating the number of IPO applicants per province and plotting it on the China map that pyecharts provides. However, pyecharts does not automatically recognize a city and group it under its province (e.g. 广州市 is not grouped under 广东省), so we need another Python module, chinese_province_city_area_mapper (imported as cpca), to classify the data.
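A minimal sketch of how cpca is used for this, with made-up addresses; transform() takes a list of address strings and returns a DataFrame whose 省 column holds the matched province, which is exactly how it is called in create_charts below:

from cpca import transform

addresses = ['广州市', '杭州市', '北京市海淀区']   # hypothetical registered locations
provinces = transform(addresses)['省']
print(provinces.tolist())                          # e.g. ['广东省', '浙江省', '北京市']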

# -*- coding: utf-8 -*-
"""
Created on Mon Jul 23 00:07:51 2018

@author: 柯西君_BingWong
"""
from pyecharts import Style,Map
from pyecharts import  Line, Scatter, EffectScatter, Pie, Gauge, Bar,WordCloud
from pyecharts import Grid, Page
import pandas as pd
from cpca import *
import numpy as np
import glob
import os
import re


#path = 'data/'

WIDTH = 1100
def geo_formatter(params):
    return params.name + ' : ' + params.value + ' 家'

def create_charts(path):
    #read the latest file
    latest_file = sorted(glob.iglob(path + '/*'), key=os.path.getmtime)[-1]

    df_data = pd.DataFrame()
    # extract the date from the file name
    date = latest_file[len(path):-4]
    status_file = pd.ExcelFile(latest_file, sort = True)
    stat_file = pd.read_csv('processing/data/IPOstatus/stat.csv')
    
    # read the Excel file downloaded earlier by the crawler and merge all of its sheets
    for sheet in status_file.sheet_names[1:]:
        data = status_file.parse(sheet,header=[2],index_col=0,skipfooter=1)
        new_columns = [data.columns[i-1] + "二" if data.columns[i].find("Unnamed") >= 0 else data.columns[i] for i in range(len(data.columns))]
        data.columns = new_columns
        data['date'] = date
        data['板块'] = sheet
        df_data = df_data.append(data, ignore_index=True)
    province = transform(df_data['注册地'].tolist())['省']
    df_data['省'] = [x[:-1]  if len(x)==3 else x for x in province.values]
    df_data.replace('', np.nan, inplace=True)
    df_data['省'].fillna(df_data['注册地'], inplace=True)

    latest_stat = stat_file.iloc[-1]
    date_stat = stat_file['date']
    total_stat = stat_file['total']
    diff_stat = stat_file['total'] - stat_file['total'].shift(1)
    passed_stat = list(stat_file['passed'])
    queue_stat = list(stat_file['queue'])
    failed_stat = list(stat_file['failed'])

##################################################################################
    # the code below builds the ECharts page
    page = Page()

    style = Style(
        width=1100, height=600
    )
    value = df_data['省'].value_counts().tolist()
    attr = df_data['省'].value_counts().index.tolist()
    data = [(name,val) for (name,val) in zip(attr,value)]
    chart = Map("IPO申报企业分布图","cnVaR.cn",title_pos='center', **style.init_style)
    chart.add("", attr, value, maptype='china',
              is_visualmap=True,
              is_label_show=True,
              visual_text_color='#000',
              tooltip_formatter=geo_formatter,  # the key point: pass the Python function itself as the parameter
              label_emphasis_textsize=15,
              label_emphasis_pos='right',
              )
    page.add(chart)

    # bar charts: total number of applicant firms per week and the week-on-week change
    bar_diff = Bar("")
    bar_diff.add("受理企业总数", date_stat, total_stat)
    bar_diff.add("增长(减少)企业数", date_stat, diff_stat, legend_pos="15%")

    bar_stat = Bar("申报企业情况")
    bar_stat.add("已过会", date_stat, passed_stat, is_stack=True)
    bar_stat.add("待审企业", date_stat, queue_stat, is_stack=True)
    bar_stat.add("中止审查企业", date_stat, failed_stat, is_stack=True,legend_pos="60%")
    
    chart = Grid(width=WIDTH)
    chart.add(bar_stat, grid_left="60%")
    chart.add(bar_diff, grid_right="60%")
    page.add(chart)


    # pie chart: distribution of applicants by industry
    v1 = df_data['所属行业'].value_counts().tolist()
    attr = df_data['所属行业'].value_counts().index.tolist()
    pie = Pie("所属行业分布","cnVaR.cn", title_pos="center", **style.init_style)
    pie.add("", attr, v1, radius=[45, 55], center=[50, 50],
            legend_pos="85%", legend_orient='vertical')
    page.add(pie)

    # ring charts: share of applicants on each board
    chart = Pie('申报企业所占板块的比例', "数据来自中国证券监督管理委员会 ",
                title_pos='center', **style.init_style)
    total_counts = df_data['板块'].count()
    for exchange,counts,position in zip(df_data['板块'].unique(),df_data['板块'].value_counts(),range(1,4)):
        chart.add("", [exchange, "总数"], [counts, total_counts], center=[25*position, 30], radius=[28, 34],
                  label_pos='center', is_label_show=True, label_text_color=None, legend_top="center" )
    page.add(chart)
    
    # bar charts: accounting firms and sponsors
    # (the replace() chain below strips the '特殊普通合伙' suffix in its different parenthesis variants)
    attr1 = [attr.replace("(特殊普通合伙)","").replace('(特殊普通合伙)','').replace('(特殊普通合伙)','')for attr in df_data['会计师事务所'].unique().tolist()]
    attr2 = df_data['保荐机构'].unique().tolist()
    v1 = df_data['会计师事务所'].value_counts().tolist()
    v2 = df_data['保荐机构'].value_counts().tolist()
    #chart_accountants
    chart_accountants = Bar("会计师事务所 - 统计图","cnVaR.cn",title_pos="center", **style.init_style)
    chart_accountants.add("会计师事务所", attr1, v1,legend_pos="75%",
              mark_point=["max", "min"], is_datazoom_show=True, datazoom_range=[0, 40], datazoom_type='both',
              xaxis_interval=0, xaxis_rotate=30, yaxis_rotate=30)
    chart = Grid(width=WIDTH)
    chart.add(chart_accountants, grid_bottom="30%")
    page.add(chart)
    #chart_sponsor
    chart_sponsor = Bar("保荐机构 - 统计图","cnVaR.cn",title_pos="center", **style.init_style)
    chart_sponsor.add("保荐机构", attr2, v2,legend_pos="75%",
              mark_point=["max", "min"], is_datazoom_show=True, datazoom_range=[0, 40], datazoom_type='both',
              xaxis_interval=0, xaxis_rotate=30, yaxis_rotate=30,yaxis_margin=50)
    chart = Grid(width=WIDTH)
    chart.add(chart_sponsor, grid_bottom="30%")
    page.add(chart)

    return page

#create_charts().render('./graph.html')


Background

Recently a colleague of mine brought up the IPO filing statistics he had seen on some WeChat public account and said he was curious about where the numbers came from. So I combed through the entire CSRC website and eventually found a rather unremarkable page: 【行政许可事项】发行监管部首次公开发行股票审核工作流程及申请企业情况. After a bit of digging I confirmed that this page is the source of the weekly IPO filing data. That could have been the end of it, but since the Excel-based presentation of the data is not very intuitive, I decided to visualize it.

Data crawling

Looking at the 【行政许可事项】发行监管部首次公开发行股票审核工作流程及申请企业情况 page, you can see that the CSRC updates the data on this page every week without changing the page URL, which saves the crawler some work: urllib and BeautifulSoup are all we need. First create a new .py file, data_crawler.py, and paste in the code below. It consists of two functions; calling parse() from main.py ultimately produces an index.md Markdown file.

# -*- coding: utf-8 -*-
"""
Created on Mon Jul 23 00:07:51 2018

@author: 柯西君_BingWong
"""
import urllib.request
import urllib.error
from bs4 import BeautifulSoup
import re
import os
import csv

#DIRTH_DATA = './data/'
#DIRTH_MD = './md/'
#url = "http://www.csrc.gov.cn/pub/zjhpublic/G00306202/201803/t20180324_335702.htm"

# save the crawled content to a file
def save_file(content, filename):
    f = open(filename, 'w',encoding="utf-8")
    f.write(content)
    f.close()

# fetch and parse the page with urllib and BeautifulSoup
def parse(url,DIRTH_DATA,DIRTH_TERMINATION,DIRTH_MD):
    try:
        html = urllib.request.urlopen(url)
        soup = BeautifulSoup(html, 'lxml')
        
        title = soup.title.string[8:]
                
        #description & statistics
        text = ""
        stat = []
        for p in soup.find_all(name='p')[-1]:
            text += str(p).replace("<span>","").replace("</span>","")
        description = (text[4:])    
        stat = re.findall('\d+', description ) # get all the stat number such as total firms, passed and failed ones
        stat = stat[3:]
        #file links
        links = ""
        for link in soup.find_all(re.compile("^a"))[:3]:
            links += "[{}]({}{})\n".format(link.string,url[:-21],link['href'][1:])
        
        #date    
        date = soup.select("#headContainer span")[2].string
        date = date.replace('年','').replace('月','').replace('日','')
        
        #generate markdown as output file
        markdown = """---\ntitle: 首次公开发行股票申请企业情况\ncomment: false\ndate: \n---\n"""
        markdown += """\n{}\n{}\n<iframe src = "graph.html" width="1200px"  height="3000px" frameborder=0 marginheight=0  marginwidth=0 scrolling="no"></iframe> \n
        """.format( description,  links)

        if not os.path.exists(DIRTH_DATA + date +".xls"): 
            #save md
            file_name = DIRTH_MD + date + '.md'
            save_file(markdown,file_name)
            save_file(markdown,'processing/data/index.md')
                
            #download xls file
            status_name = DIRTH_DATA + date + '.xls'
            status_file = soup.find_all(re.compile("^a"))[1]
            status_file = url[:-21] + status_file['href'][1:]
            urllib.request.urlretrieve(status_file, status_name)

            termination_name = DIRTH_TERMINATION + date + '.xls'
            termination_file = soup.find_all(re.compile("^a"))[2]
            termination_file = url[:-21] + termination_file['href'][1:]
            urllib.request.urlretrieve(termination_file, termination_name)

            # append the stats to the csv file
            stat.insert(0,date)
            with open('processing/data/IPOstatus/stat.csv', 'a') as f:
                writer = csv.writer(f)
                writer.writerow(stat) 
        
        else:
            print('数据已于'+ date +'更新,无需再次操作!')
        
    except urllib.error.URLError as e:
        print(e)

#parse(url,DIRTH_DATA,DIRTH_MD)