Demo page

First, here is the URL of the final result: https://www.cnvar.cn/ipostatus/

Download

Steps

Crawl the page data and Excel files -> read the Excel files and merge/aggregate them -> convert the resulting tables to markdown (for easy display on HEXO)

Directory structure

+--main.py
+--processing
|      +--data
|      |      +--graph.html
|      |      +--index.md
|      |      +--IPOstatus
|      |      |      +--data
|      |      |      |      +--20180727.xls
|      |      |      |      +--20180803.xls
|      |      |      |      +--20180810.xls
|      |      |      |      +--20180817.xls
|      |      |      |      +--20180824.xls
|      |      |      +--md
|      |      |      |      +--20180727.md
|      |      |      |      +--20180803.md
|      |      |      |      +--20180810.md
|      |      |      |      +--20180817.md
|      |      |      |      +--20180824.md
|      |      |      +--stat.csv
|      |      |      +--termination
|      |      |      |      +--20180803.xls
|      |      |      |      +--20180810.xls
|      |      |      |      +--20180817.xls
|      |      |      |      +--20180824.xls
|      +--datatomd.py
|      +--data_crawler.py
|      +--generator.py
|      +--__init__.py

Generating the markdown tables

At this point all the charts we need have been produced, so this step is not strictly necessary: it simply converts the three tables in the most recently downloaded Excel file to markdown and displays them. The rough flow is: read the Excel file with pandas, merge all of its sheets, then read the existing index.md file and append the table data to it. After reading and merging the data with pandas, we use the tabulate module to convert it into markdown form.
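For reference, this is roughly what tabulate's pipe format looks like on a small, made-up DataFrame (the column names and values below are invented purely for illustration):

import pandas as pd
from tabulate import tabulate

# a tiny example frame, just to show the pipe-style output that df_to_markdown builds on
df = pd.DataFrame({'company': ['A', 'B'], 'status': ['accepted', 'feedback']})
print(tabulate(df, headers='keys', tablefmt='pipe'))
# |    | company   | status   |
# |---:|:----------|:---------|
# |  0 | A         | accepted |
# |  1 | B         | feedback |

The full datatomd.py script: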

# -*- coding: utf-8 -*-
"""
Created on Mon Jul 23 00:07:51 2018

@author: 柯西君_BingWong
#"""
import pandas as pd
from tabulate import tabulate
import glob
import os

# function that converts a pandas df to a markdown (pipe) table
def df_to_markdown(df, y_index=False):
    blob = tabulate(df, headers='keys', tablefmt='pipe')
    if not y_index:
        # Remove the index with some creative splicing and iteration
        return '\n'.join(['| {}'.format(row.split('|', 2)[-1]) for row in blob.split('\n')])
    return blob

def save_file(content, filename):
    f = open(filename, 'w',encoding="utf-8")
    f.write(content)
    f.close()

# read the Excel data
def data2md(data_path,termination_path,md_path):
    # read the most recent Excel file in each folder
    data_file = sorted(glob.iglob(data_path + '/*'), key=os.path.getmtime)[-1]
    termination_file = sorted(glob.iglob(termination_path + '/*'), key=os.path.getmtime)[-1]
    md_file = sorted(glob.iglob(md_path + '/*'), key=os.path.getmtime)[-1]
   
    #status data
    f = open(md_file, 'r',encoding='utf8')
    markdown = f.read()


    markdown += '\n# 正常审核状态申请企业情况 \n\n'
    status_file = pd.ExcelFile(data_file, sort = True)
    for sheet in status_file.sheet_names[1:4]:
        markdown += '\n\n## ' + sheet + '\n\n'
        data = status_file.parse(sheet,header=[2],index_col=0,skipfooter=1)
        new_columns = [data.columns[i-1] + "二" if data.columns[i].find("Unnamed") >= 0 else data.columns[i] for i in range(len(data.columns))]
        data.columns = new_columns

        table = df_to_markdown(data)
        markdown += table

    #terminated data
    markdown += '\n\n# 中止审查和终止审查企业情况 \n\n'
    terminated_file = pd.ExcelFile(termination_file, sort = True)
    for sheet in terminated_file.sheet_names[1:4]:
        markdown += '\n\n## ' + sheet + '\n\n'
        data = terminated_file.parse(sheet,header=[2],index_col=0)
        new_columns = [data.columns[i-1] + "二" if data.columns[i].find("Unnamed") >= 0 else data.columns[i] for i in range(len(data.columns))]
        data.columns = new_columns

        table = df_to_markdown(data)
        markdown += table
    # the 5th sheet has a different format, so parse it separately
    sheet = terminated_file.sheet_names[4]
    data = terminated_file.parse(sheet, header=[2], index_col=0)
    table = df_to_markdown(data)
    markdown += '\n\n## ' + sheet + '\n\n'
    markdown += table

    save_file(markdown,'processing/data/index.md')

# data_path = 'data/IPOstatus/data/'
# termination_path = 'data/IPOstatus/termination/'
# md_path = 'data/IPOstatus/md/'
# data2md(data_path,termination_path,md_path)

Demo page

First, here is the URL of the final result: https://www.cnvar.cn/ipostatus/

Download

Steps

Crawl the page data and Excel files -> read the Excel files and merge/aggregate them -> convert the resulting tables to markdown (for easy display on HEXO)

Directory structure

+--main.py
+--processing
|      +--data
|      |      +--graph.html
|      |      +--index.md
|      |      +--IPOstatus
|      |      |      +--data
|      |      |      |      +--20180727.xls
|      |      |      |      +--20180803.xls
|      |      |      |      +--20180810.xls
|      |      |      |      +--20180817.xls
|      |      |      |      +--20180824.xls
|      |      |      +--md
|      |      |      |      +--20180727.md
|      |      |      |      +--20180803.md
|      |      |      |      +--20180810.md
|      |      |      |      +--20180817.md
|      |      |      |      +--20180824.md
|      |      |      +--stat.csv
|      |      |      +--termination
|      |      |      |      +--20180803.xls
|      |      |      |      +--20180810.xls
|      |      |      |      +--20180817.xls
|      |      |      |      +--20180824.xls
|      +--datatomd.py
|      +--data_crawler.py
|      +--generator.py
|      +--__init__.py

Generating the Echarts charts

Now that the data and Excel files have been crawled, the next step is to clean the data with pandas and generate Echarts charts with pyecharts. pyecharts offers a wide variety of chart types; one of them involves aggregating the number of IPO applicants by province and plotting the totals on the map chart that pyecharts provides. However, pyecharts does not automatically map cities to their provinces (for example, Guangzhou is not classified under Guangdong), so we need another Python module, chinese_province_city_area_mapper, to classify the data.
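As a quick illustration of the province mapping (assuming the cpca API used in the script below, where transform() takes a list of address strings and returns a DataFrame with 省/市/区 columns; the addresses here are made up):

from cpca import transform

# map free-form registered locations to province / city / district
addresses = ['广州市天河区', '北京市海淀区']
result = transform(addresses)
print(result['省'].tolist())   # e.g. ['广东省', '北京市']

The full generator.py script: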

# -*- coding: utf-8 -*-
"""
Created on Mon Jul 23 00:07:51 2018

@author: 柯西君_BingWong
#"""
from pyecharts import Style,Map
from pyecharts import  Line, Scatter, EffectScatter, Pie, Gauge, Bar,WordCloud
from pyecharts import Grid, Page
import pandas as pd
from cpca import *
import numpy as np
import glob
import os
import re


#path = 'data/'

WIDTH = 1100
def geo_formatter(params):
    return params.name + ' : ' + params.value + ' 家'

def create_charts(path):
    #read the latest file
    latest_file = sorted(glob.iglob(path + '/*'), key=os.path.getmtime)[-1]

    df_data = pd.DataFrame()
    # extract the date from the file name
    date = latest_file[len(path):-4]
    status_file = pd.ExcelFile(latest_file, sort = True)
    stat_file = pd.read_csv('processing/data/IPOstatus/stat.csv')
    
    # read the Excel file downloaded earlier by the crawler
    for sheet in status_file.sheet_names[1:]:
        data = status_file.parse(sheet,header=[2],index_col=0,skipfooter=1)
        new_columns = [data.columns[i-1] + "二" if data.columns[i].find("Unnamed") >= 0 else data.columns[i] for i in range(len(data.columns))]
        data.columns = new_columns
        data['date'] = date
        data['板块'] = sheet
        df_data = df_data.append(data, ignore_index=True)
    province = transform(df_data['注册地'].tolist())['省']
    df_data['省'] = [x[:-1]  if len(x)==3 else x for x in province.values]
    df_data.replace('', np.nan, inplace=True)
    df_data['省'].fillna(df_data['注册地'], inplace=True)

    latest_stat = stat_file.iloc[-1]
    date_stat = stat_file['date']
    total_stat = stat_file['total']
    diff_stat = stat_file['total'] - stat_file['total'].shift(1)
    passed_stat = list(stat_file['passed'])
    queue_stat = list(stat_file['queue'])
    failed_stat = list(stat_file['failed'])

##################################################################################
    # the code below generates the echarts charts
    page = Page()

    style = Style(
        width=1100, height=600
    )
    value = df_data['省'].value_counts().tolist()
    attr = df_data['省'].value_counts().index.tolist()
    data = [(name,val) for (name,val) in zip(attr,value)]
    chart = Map("IPO申报企业分布图","cnVaR.cn",title_pos='center', **style.init_style)
    chart.add("", attr, value, maptype='china',
              is_visualmap=True,
              is_label_show=True,
              visual_text_color='#000',
              tooltip_formatter=geo_formatter,  # the key point: the Python function itself is passed as the parameter
              label_emphasis_textsize=15,
              label_emphasis_pos='right',
              )
    page.add(chart)

    #
    bar_diff = Bar("")
    bar_diff.add("受理企业总数", date_stat, total_stat)
    bar_diff.add("增长(减少)企业数", date_stat, diff_stat, legend_pos="15%")

    bar_stat = Bar("申报企业情况")
    bar_stat.add("已过会", date_stat, passed_stat, is_stack=True)
    bar_stat.add("待审企业", date_stat, queue_stat, is_stack=True)
    bar_stat.add("中止审查企业", date_stat, failed_stat, is_stack=True,legend_pos="60%")
    
    chart = Grid(width=WIDTH)
    chart.add(bar_stat, grid_left="60%")
    chart.add(bar_diff, grid_right="60%")
    page.add(chart)


    #
    v1 = df_data['所属行业'].value_counts().tolist()
    attr = df_data['所属行业'].value_counts().index.tolist()
    pie = Pie("所属行业分布","cnVaR.cn", title_pos="center", **style.init_style)
    pie.add("", attr, v1, radius=[45, 55], center=[50, 50],
            legend_pos="85%", legend_orient='vertical')
    page.add(pie)

    #
    chart = Pie('申报企业所占板块的比例', "数据来自中国证券监督管理委员会 ",
                title_pos='center', **style.init_style)
    total_counts = df_data['板块'].count()
    for exchange,counts,position in zip(df_data['板块'].unique(),df_data['板块'].value_counts(),range(1,4)):
        chart.add("", [exchange, "总数"], [counts, total_counts], center=[25*position, 30], radius=[28, 34],
                  label_pos='center', is_label_show=True, label_text_color=None, legend_top="center" )
    page.add(chart)
    
    #
    attr1 = [attr.replace("(特殊普通合伙)","").replace('(特殊普通合伙)','').replace('(特殊普通合伙)','')for attr in df_data['会计师事务所'].unique().tolist()]
    attr2 = df_data['保荐机构'].unique().tolist()
    v1 = df_data['会计师事务所'].value_counts().tolist()
    v2 = df_data['保荐机构'].value_counts().tolist()
    #chart_accountants
    chart_accountants = Bar("会计师事务所 - 统计图","cnVaR.cn",title_pos="center", **style.init_style)
    chart_accountants.add("会计师事务所", attr1, v1,legend_pos="75%",
              mark_point=["max", "min"], is_datazoom_show=True, datazoom_range=[0, 40], datazoom_type='both',
              xaxis_interval=0, xaxis_rotate=30, yaxis_rotate=30)
    chart = Grid(width=WIDTH)
    chart.add(chart_accountants, grid_bottom="30%")
    page.add(chart)
    #chart_sponsor
    chart_sponsor = Bar("保荐机构 - 统计图","cnVaR.cn",title_pos="center", **style.init_style)
    chart_sponsor.add("保荐机构", attr2, v2,legend_pos="75%",
              mark_point=["max", "min"], is_datazoom_show=True, datazoom_range=[0, 40], datazoom_type='both',
              xaxis_interval=0, xaxis_rotate=30, yaxis_rotate=30,yaxis_margin=50)
    chart = Grid(width=WIDTH)
    chart.add(chart_sponsor, grid_bottom="30%")
    page.add(chart)

    return page

#create_charts().render('./graph.html')


Demo page

First, here is the URL of the final result: cnVar.cn

Download

Background

Recently a colleague of mine brought up the weekly IPO filing figures he had seen on a WeChat public account and said he was curious about where the data came from. So I combed through the entire CSRC website and eventually found a rather inconspicuous page: 【行政许可事项】发行监管部首次公开发行股票审核工作流程及申请企业情况. After some digging I confirmed that this page is the source of the weekly IPO status data. That could have been the end of it, but since the data, published as Excel files, is not very intuitive to read, I decided to visualize it.

Steps

Crawl the page data and Excel files -> read the Excel files and merge/aggregate them -> convert the resulting tables to markdown (for easy display on HEXO)

Directory structure

+--main.py
+--processing
|      +--data
|      |      +--graph.html
|      |      +--index.md
|      |      +--IPOstatus
|      |      |      +--data
|      |      |      |      +--20180727.xls
|      |      |      |      +--20180803.xls
|      |      |      |      +--20180810.xls
|      |      |      |      +--20180817.xls
|      |      |      |      +--20180824.xls
|      |      |      +--md
|      |      |      |      +--20180727.md
|      |      |      |      +--20180803.md
|      |      |      |      +--20180810.md
|      |      |      |      +--20180817.md
|      |      |      |      +--20180824.md
|      |      |      +--stat.csv
|      |      |      +--termination
|      |      |      |      +--20180803.xls
|      |      |      |      +--20180810.xls
|      |      |      |      +--20180817.xls
|      |      |      |      +--20180824.xls
|      +--datatomd.py
|      +--data_crawler.py
|      +--generator.py
|      +--__init__.py

Data crawling

Analyzing the page 【行政许可事项】发行监管部首次公开发行股票审核工作流程及申请企业情况 shows that the CSRC updates the data on this page every week without changing the page URL, which saves the crawler some work; urllib and BeautifulSoup are all we need. First create a new py file, data_crawler.py, and paste in the code below. It consists of two functions; calling parse from main.py ultimately generates a markdown file named index.md.

# -*- coding: utf-8 -*-
"""
Created on Mon Jul 23 00:07:51 2018

@author: 柯西君_BingWong
#"""
import urllib.request
import urllib.error
from bs4 import BeautifulSoup
import re
import os
import csv

#DIRTH_DATA = './data/'
#DIRTH_MD = './md/'
#url = "http://www.csrc.gov.cn/pub/zjhpublic/G00306202/201803/t20180324_335702.htm"

# save the crawled data to a file
def save_file(content, filename):
    f = open(filename, 'w',encoding="utf-8")
    f.write(content)
    f.close()

# fetch and parse the page with urllib and BeautifulSoup
def parse(url,DIRTH_DATA,DIRTH_TERMINATION,DIRTH_MD):
    try:
        html = urllib.request.urlopen(url)
        soup = BeautifulSoup(html, 'lxml')
        
        title = soup.title.string[8:]
                
        #description & statistics
        text = ""
        stat = []
        for p in soup.find_all(name='p')[-1]:
            text += str(p).replace("<span>","").replace("</span>","")
        description = (text[4:])    
        stat = re.findall('\d+', description ) # get all the stat number such as total firms, passed and failed ones
        stat = stat[3:]
        #file links
        links = ""
        for link in soup.find_all(re.compile("^a"))[:3]:
            links += "[{}]({}{})\n".format(link.string,url[:-21],link['href'][1:])
        
        #date    
        date = soup.select("#headContainer span")[2].string
        date = date.replace('年','').replace('月','').replace('日','')
        
        #generate markdown as output file
        markdown = """---\ntitle: 首次公开发行股票申请企业情况\ncomment: false\ndate: \n---\n"""
        markdown += """\n{}\n{}\n<iframe src = "graph.html" width="1200px"  height="3000px" frameborder=0 marginheight=0  marginwidth=0 scrolling="no"></iframe> \n
        """.format( description,  links)

        if not os.path.exists(DIRTH_DATA + date +".xls"): 
            #save md
            file_name = DIRTH_MD + date + '.md'
            save_file(markdown,file_name)
            save_file(markdown,'processing/data/index.md')
                
            #download xls file
            status_name = DIRTH_DATA + date + '.xls'
            status_file = soup.find_all(re.compile("^a"))[1]
            status_file = url[:-21] + status_file['href'][1:]
            urllib.request.urlretrieve(status_file, status_name)

            termination_name = DIRTH_TERMINATION + date + '.xls'
            termination_file = soup.find_all(re.compile("^a"))[2]
            termination_file = url[:-21] + termination_file['href'][1:]
            urllib.request.urlretrieve(termination_file, termination_name)

            # append the stats to the csv file
            stat.insert(0,date)
            with open('processing/data/IPOstatus/stat.csv', 'a') as f:
                writer = csv.writer(f)
                writer.writerow(stat) 
        
        else:
            print('数据已于'+ date +'更新,无需再次操作!')
        
    except urllib.error.URLError as e:
        print(e)

#parse(url,DIRTH_DATA,DIRTH_MD)
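
For completeness, here is a minimal sketch of how main.py might wire the three modules together. The original post does not show main.py, so the exact paths and call order below are assumptions based on the directory structure and the commented-out examples in each script:

# main.py -- hypothetical glue script, not the original
from processing.data_crawler import parse
from processing.generator import create_charts
from processing.datatomd import data2md

URL = 'http://www.csrc.gov.cn/pub/zjhpublic/G00306202/201803/t20180324_335702.htm'
DATA_PATH = 'processing/data/IPOstatus/data/'
TERMINATION_PATH = 'processing/data/IPOstatus/termination/'
MD_PATH = 'processing/data/IPOstatus/md/'

# 1. crawl the page, download the two Excel files and write index.md
parse(URL, DATA_PATH, TERMINATION_PATH, MD_PATH)

# 2. build the pyecharts page and render it to graph.html (embedded in index.md via the iframe)
create_charts(DATA_PATH).render('processing/data/graph.html')

# 3. optionally append the markdown tables to index.md
data2md(DATA_PATH, TERMINATION_PATH, MD_PATH)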

chaqi.net

This navigation website was originally designed by BYR-Navi with Jekyll.
I use it as my personal navigation site and I really like its design concept, but I am not familiar with Jekyll and wanted something more than a static HTML navigation page, so I rebuilt it with Python and Plotly DASH.

In general I use Plotly DASH only for data visualization, but I wanted to see what else Dash can do, so here it is.

You may notice that the header, footer and layout are created dynamically on page load. This is because the data behind them is regenerated and updated on every page refresh, without restarting the whole DASH app.
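
In Dash this is usually done by assigning a function, rather than a static component tree, to app.layout, so the layout is rebuilt on every page load. A minimal sketch (the component contents are placeholders, not the actual ChaQi.net code):

import datetime
import dash
import dash_html_components as html  # on Dash 2.x: from dash import html

app = dash.Dash(__name__)

def serve_layout():
    # called on every page load, so header/footer data can change without restarting the app
    return html.Div([
        html.Header('ChaQi.net'),
        html.P('Page generated at {}'.format(datetime.datetime.now())),
        html.Footer('footer rebuilt on each refresh'),
    ])

# assigning the function itself (no parentheses) makes Dash call it on every page load
app.layout = serve_layout

if __name__ == '__main__':
    app.run_server(debug=True)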

Demo

Link: ChaQi.net

Download


Bugs

  • [ ] The DASH app won't interact with custom JS for the Semantic UI Popup.
  • [ ] With html.Form, pressing the Enter key will trigger the callback even when dcc.Input is empty.
  • [ ] Favicon and meta tags won't be added or displayed on the VPS.

[Guangdong plan released: Guangzhou and Shenzhen designated as megacities] Cailian Press reported on March 13 that the Guangdong provincial government recently published the Guangdong Province Population Development Plan (2017-2030). By 2020 the province is to have two megacities and two very large cities: the megacities are Guangzhou and Shenzhen, and the very large cities are Foshan and Dongguan. Foshan, as a "very large city", will further concentrate population, reaching a population tier of 5 to 10 million by 2020.

Ren Zeping once put it this way: after ten years of studying real estate, I can sum it up in one sentence, "in the long run it is about population, in the medium run about land, in the short run about credit". Just remember that sentence.

The central bank has issued new rules in the past few days; people living near these areas may consider investing in property there. I won't go into the specific reasons here.