Scraping Weather Data from 真气网 with Python

July 1, 2019. Source: 最爱python编程

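Contents

I. Goal
II. Implementation
  1. Analyzing the HTML and URL structure
  2. Writing the crawler
    2.1 Building the URLs
    2.2 Fetching monthly averages
    2.3 Daily averages by city
IV. Limitations
V. Additional notes
  1. Other air pollution data
  2. City latitude and longitude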

I. Goal

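The goal is to scrape monthly and daily average air pollution data, from 2013 onward, for 384 cities across China from the 中国空气质量在线监测分析平台 (China Air Quality Online Monitoring and Analysis Platform).
Data source: https://www.aqistudy.cn/historydata/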

II. Implementation

1. Analyzing the HTML and URL structure

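I started with the monthly data as a crawling test: pick one city on the site and analyze the structure of its monthly-average page.
I chose Beijing. The page lists Beijing's air pollution data for every month since December 2013. Right-click any value in the table and inspect it with Chrome DevTools; the Elements panel shows the complete HTML as rendered. There you can see that the dates and pollutant values we need are all stored in <td> elements. (At this point I took it for granted that requesting the page would return this complete HTML directly, ready for extraction. It turned out not to be that simple.)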

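Having found the data in the HTML, the next step was to check whether the URLs of the pages to crawl can be assembled programmatically. I need both monthly and daily averages per city. For Beijing, the monthly-average URL is https://www.aqistudy.cn/historydata/monthdata.php?city=北京 and the daily-average URL is https://www.aqistudy.cn/historydata/daydata.php?city=北京&month=201312. Monthly averages are stored one page per city; daily averages are stored one page per city per month. So the only part of the monthly URL that changes is the city name, and the only parts of the daily URL that change are the month and the city. Both are easy to assemble.
With the data located in the HTML and the URL pattern confirmed, I started writing the program.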
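The URL pattern described above can be sketched as two small helpers. This is a minimal illustration, not the author's code: the names `BASE`, `month_url`, and `day_url` are made up here, and percent-encoding the city name with `urllib.parse.quote` is an assumption (a browser does this automatically when given the raw Chinese name).

```python
from urllib.parse import quote

# Common prefix of both page types (taken from the URLs quoted above)
BASE = 'https://www.aqistudy.cn/historydata/'

def month_url(city):
    """Monthly-average page: one page per city."""
    return '%smonthdata.php?city=%s' % (BASE, quote(city))

def day_url(city, month):
    """Daily-average page: one page per city and month (month is 'YYYYMM')."""
    return '%sdaydata.php?city=%s&month=%s' % (BASE, quote(city), month)

print(month_url('北京'))
print(day_url('北京', '201312'))
```

Only the city and the month vary, so the full crawl reduces to looping over a city list and a month list.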

2. Writing the crawler

2.1 Building the URLs

The list of months to crawl is easy to build:

def get_month_set():
    # The data starts in December 2013
    month_set = ['201312']
    # Every month of the full years 2014-2018
    for year in [2014,2015,2016,2017,2018]:
        for month in ['01','02','03','04','05','06',
                      '07','08','09','10','11','12']:
            month_set.append('%s%s'%(year,month))
    # The first three months of 2019
    month_set.extend(['201901','201902','201903'])
    return month_set
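The hard-coded years above need editing every time the crawl is rerun later. As an alternative sketch (the function name `month_range` is hypothetical, not from the original post), the same list can be generated from a start and end month:

```python
def month_range(start, end):
    """Return 'YYYYMM' strings from start to end, both inclusive."""
    year, month = int(start[:4]), int(start[4:])
    end_year, end_month = int(end[:4]), int(end[4:])
    months = []
    # Tuple comparison orders (year, month) pairs chronologically
    while (year, month) <= (end_year, end_month):
        months.append('%04d%02d' % (year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months

month_set = month_range('201312', '201903')
print(len(month_set), month_set[0], month_set[-1])  # 64 201312 201903
```

For the range used in this post it produces the same 64 entries as get_month_set().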

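The city list takes more work, but every city is listed at https://www.aqistudy.cn/historydata/. I wrote a Python program that fetches that page's HTML, uses a regular expression to extract the required city names from ”*”, and writes them to a text file.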
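The original extraction script is not shown, so the following is only a sketch of the idea under stated assumptions: it assumes each city appears as a link whose href contains `monthdata.php?city=<name>`, and it runs on a made-up HTML sample rather than the live page. The names `SAMPLE_HTML`, `CITY_LINK`, `extract_cities`, and `save_cities` are hypothetical.

```python
import re

# Hypothetical sample of the city-list markup; the real page may differ.
SAMPLE_HTML = '''
<li><a href="monthdata.php?city=北京">北京</a></li>
<li><a href="monthdata.php?city=上海">上海</a></li>
<li><a href="monthdata.php?city=北京">北京</a></li>
'''

# Capture whatever follows "city=" up to the closing quote of the href
CITY_LINK = re.compile(r'monthdata\.php\?city=([^"&]+)"')

def extract_cities(html):
    """Return city names in first-seen order, without duplicates."""
    seen = []
    for name in CITY_LINK.findall(html):
        if name not in seen:
            seen.append(name)
    return seen

def save_cities(cities, path='cities.txt'):
    """Write one city per line (encoding is an assumption; the reader must match it)."""
    with open(path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(cities) + '\n')

print(extract_cities(SAMPLE_HTML))
```

Removing duplicates matters if the same city is linked more than once on the page, for example in a "popular cities" block as well as in the alphabetical list.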

2.2 Fetching monthly averages

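Taking Beijing as the example again, its monthly-average URL is https://www.aqistudy.cn/historydata/monthdata.php?city=北京. Following the same approach as for the city list, first fetch the HTML with requests, then search it with BeautifulSoup.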

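import requests
from bs4 import BeautifulSoup

city = '北京'
url = 'https://www.aqistudy.cn/historydata/monthdata.php?city=%s'%(city)
# Fetch the raw HTML and collect every <td> cell
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
td_lists=soup.find_all('td')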

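However, the resulting td_lists contained no useful values. Printing the full HTML showed that the table content we want is not there at all; the only td markup appears inside a JavaScript function. This means the data is inserted into the page dynamically by JavaScript, so requests alone cannot retrieve the complete HTML we need.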

function showTable(items) {
  items.forEach(function(item) {
    // $('.table tbody').append(`
    //   <tr>
    //     <td align="center"><a href="daydata.php?city=${city}&month=${item.time_point}">${item.time_point}</a></td>
    //     <td align="center">${item.aqi}</td>
    //     <td align="center">${item.min_aqi}~${item.max_aqi}</td>
    //     <td align="center"><span style="display:block;width:60px;text-align:center; ${getAQIStyle(item.aqi)}">${item.quality}</span></td>
    //     <td align="center">${item.pm2_5}</td>
    //     <td align="center">${item.pm10}</td>
    //     <td align="center">${item.so2}</td>
    //     <td align="center">${item.co}</td>
    //     <td align="center">${item.no2}</td>
    //     <td align="center">${item.o3}</td>
    //   </tr>`);
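A quick way to confirm this kind of problem before writing a full crawler is to count the table cells that exist as real markup in the fetched HTML. The sketch below uses only the standard library; `CellCounter` and `count_static_cells` are illustrative names, and the two miniature pages are made up. Python's `HTMLParser` treats the body of a `<script>` element as plain text, so cells that only appear inside JavaScript are not counted.

```python
from html.parser import HTMLParser

class CellCounter(HTMLParser):
    """Count <td> tags that exist as real markup in the document."""
    def __init__(self):
        super().__init__()
        self.cells = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.cells += 1

def count_static_cells(html):
    parser = CellCounter()
    parser.feed(html)
    parser.close()
    return parser.cells

# Hypothetical miniature pages, for illustration only
static_page = '<table><tr><td>2013-12</td><td>80</td></tr></table>'
dynamic_page = ('<table><tbody></tbody></table>'
                '<script>function showTable(items) {'
                ' var row = "<tr><td>" + items[0].aqi + "</td></tr>"; }</script>')

print(count_static_cells(static_page))   # 2
print(count_static_cells(dynamic_page))  # 0
```

A count of zero on a page that visibly shows a table in the browser is a strong hint that the rows are filled in by JavaScript after loading.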

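I looked into several approaches and settled on selenium.webdriver, which drives a real browser and returns the complete, rendered HTML.
Put simply, selenium.webdriver wraps the browser's native API in a more object-oriented Selenium WebDriver API that can directly manipulate elements in a page, and even the browser itself.
I use Chrome here; download the chromedriver that matches your local Chrome version from http://chromedriver.chromium.org/.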

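The corresponding code follows. Note that a wait is needed after driver.get(url): the browser needs time to load the page and return the complete content, and without the wait the page source you read back is incomplete.
The code also uses the read_html function from pandas, which extracts the tables from a page directly and returns them as pandas.DataFrame objects.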

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from urllib import parse
import pandas as pd

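# Read the city names, one per line, from cities.txt
def get_city_set():
    with open('./cities.txt' ,'r')as f:
        reader = f.readlines()
    # Strip the trailing newline from each name
    for i in range(len(reader)):
        reader[i]= reader[i].split('\n')[0]
    return reader
# Load the city list
city_set = get_city_set()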

# Run Chrome headless (no visible browser window)
chrome_options = Options()
chrome_options.add_argument('--headless')

count=0

file_name = 'C:/Temp/data_out/全国城市空气污染月均数据.txt'
fp = open(file_name, 'w')
# Monthly averages
for city in city_set:
#    city = '南京'
    # Build the monthly-average URL for this city
    url = 'https://www.aqistudy.cn/historydata/monthdata.php?city=%s'%(city)
    # Start the browser (Selenium 3 call style; Selenium 4 takes the driver
    # path through a Service object and the options through options=)
    driver = webdriver.Chrome('C:/Program Files (x86)/Google/Chrome/Application/chromedriver.exe',chrome_options=chrome_options)
    # Load the page
    driver.get(url)
    # Wait for the browser to render the page (important)
    time.sleep(1)
    # Parse the first table on the page
    dfs = pd.read_html(driver.page_source,header=0)[0]
    # If the table is empty, reload the page and wait longer
    if len(dfs)==0:
        driver.get(url)
        print('Please wait %s seconds.'%(3))
        time.sleep(3)
        dfs = pd.read_html(driver.page_source,header=0)[0]
    # Write every row to the file. Per the showTable template above, the
    # monthly table has ten columns: month, AQI, AQI range, quality grade,
    # PM2.5, PM10, SO2, CO, NO2, O3. Writing the whole row keeps all of them.
    for j in range(0,len(dfs)):
        values = [str(v) for v in dfs.iloc[j]]
        fp.write('\t'.join([city] + values) + '\n')
    print('%d---%s---DONE' % (len(dfs), city))
    # Close the browser
    driver.quit()
    count +=1
print('%d cities crawled; please check the output.' % count)
fp.close()

The program then starts running.
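The "load, wait, check, reload with a longer wait" logic in the loop above can be factored into a small helper. The sketch below is one possible shape, not the author's code: `fetch_rows`, `open_page`, and `read_rows` are hypothetical names, and the two callables stand in for `driver.get(url)` and `pd.read_html(driver.page_source, header=0)[0]`. Passing `sleep` as a parameter lets the helper be exercised without real delays.

```python
import time

def fetch_rows(open_page, read_rows, waits=(1, 3), sleep=time.sleep):
    """Load the page and read its rows, retrying with longer waits.

    open_page: callable that (re)loads the page, e.g. lambda: driver.get(url)
    read_rows: callable that returns the parsed rows (anything with len())
    waits:     seconds to wait after each load attempt, in order
    """
    rows = []
    for seconds in waits:
        open_page()
        sleep(seconds)        # give the page's JavaScript time to fill the table
        rows = read_rows()
        if len(rows) > 0:     # got data, stop retrying
            break
    return rows

# Stand-ins for a browser whose table only fills in on the second load
attempts = []
def fake_open():
    attempts.append('open')
def fake_read():
    return [] if len(attempts) < 2 else ['row']

print(fetch_rows(fake_open, fake_read, sleep=lambda s: None))  # ['row']
```

Selenium also provides explicit waits (WebDriverWait) that poll until a condition holds, which avoids fixed sleeps altogether; the helper above simply mirrors the fixed-delay approach used in this post.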

2.3 Daily averages by city

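The daily-average crawler follows the same logic as the monthly one. The only differences are that the month is added when building the URL and that each city's data is written to its own file. The code is as follows: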

import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from urllib import parse
import pandas as pd

# Download the daily data for every city and month
def download_city(city_set,month_set):
    count= 0
    base_url = 'https://www.aqistudy.cn/historydata/daydata.php?city='
    for city in city_set:
    #city='成都'
        # One browser session and one output file per city
        # (chrome_options is the module-level object created under __main__)
        driver = webdriver.Chrome('C:/Program Files (x86)/Google/Chrome/Application/chromedriver.exe',chrome_options=chrome_options)
        file_name = 'C:/Temp/data_out/日均值_%s.txt'%(city)
        fp = open(file_name, 'w')
        for month in month_set:
            # Build the daily-average URL for this city and month
            weburl = ('%s%s&month=%s' % (base_url, parse.quote(city),month))
            driver.get(weburl)
            # Wait for the page's JavaScript to fill in the table
            time.sleep(1)
            dfs = pd.read_html(driver.page_source,header=0)[0]
            # Empty table: reload the page and wait longer
            if len(dfs)==0:
                driver.get(weburl)
                print('Please wait %s seconds.'%(3))
                time.sleep(3)
                dfs = pd.read_html(driver.page_source,header=0)[0]
                # Still empty: log it and skip this month
                if len(dfs)==0:
                    print('%d---%d---%s---%s---DONE' % (count,len(dfs), city,month))
                    continue
            # Write one tab-separated line per day
            for j in range(0,len(dfs)):
                date = dfs.iloc[j,0]
                aqi = dfs.iloc[j,1]
                grade = dfs.iloc[j,2]
                pm25 = dfs.iloc[j,3]
                pm10 = dfs.iloc[j,4]
                so2 = dfs.iloc[j,5]
                co = dfs.iloc[j,6]
                no2 = dfs.iloc[j,7]
                o3 = dfs.iloc[j,8]
                fp.write(('%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' % (city,date,aqi,grade,pm25,pm10,so2,co,no2,o3)))
            print('%d---%d---%s---%s---DONE' % (count,len(dfs), city,month))
        fp.close()
        driver.quit()
        print ('%s finished; please check the output.'%(city))
        count +=1
# Read the city names, one per line, from cities.txt
def get_city_set():
    with open('./cities.txt' ,'r')as f:
        reader = f.readlines()
    for i in range(len(reader)):
        reader[i]= reader[i].split('\n')[0]
    return reader
# Build the list of months to crawl
def get_month_set():
    month_set = ['201312']
    for year in [2014,2015,2016,2017,2018]:
        for month in ['01','02','03','04','05','06',
                      '07','08','09','10','11','12']:
            month_set.append('%s%s'%(year,month))
    month_set.extend(['201901','201902','201903'])
    return month_set

if __name__ == '__main__':
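    # Build the list of months
    month_set=get_month_set()
    # Load the city list
    city_set = get_city_set()

    # Run Chrome headless (no visible browser window)
    chrome_options = Options()
    chrome_options.add_argument('--headless')

    # Download the daily data for every city and month
    download_city(city_set,month_set)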

IV. Limitations

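Fetching all the data for a single city does not take long, but with nearly 400 cities a single thread takes a lot of time. The next step is to run the crawl concurrently with multiple threads to save time.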
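The post does not include a multithreaded version, so the following is only a sketch of one way to do it with the standard library's `concurrent.futures`. The names `crawl_all` and `fake_crawl` are hypothetical; in a real run, `crawl_city` would be a function that opens its own browser, crawls one city, and writes that city's file. Each thread needs its own driver and its own output file, which the per-city layout of the daily crawler already provides.

```python
from concurrent.futures import ThreadPoolExecutor

def crawl_all(cities, crawl_city, max_workers=4):
    """Run crawl_city(city) for every city on a small thread pool.

    Returns {city: result}. A failure is stored as the exception object,
    so one bad city does not stop the others.
    """
    def safe(city):
        try:
            return crawl_city(city)
        except Exception as exc:
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with cities
        results = list(pool.map(safe, cities))
    return dict(zip(cities, results))

# Stand-in for the real per-city crawl (hypothetical)
def fake_crawl(city):
    if city == '上海':
        raise ValueError('no table found')
    return len(city)

print(crawl_all(['北京', '上海', '成都'], fake_crawl))
```

A small `max_workers` is a deliberate choice: every worker runs a full headless Chrome, and many parallel requests put more load on the site being crawled.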

V. Additional notes

1. Other air pollution data

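While looking for historical air pollution data, I also found another source: 全国空气质量历史数据 (national air quality historical data). That site compiles hourly historical data for many cities and monitoring stations across China and is updated daily.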

2. City latitude and longitude

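I collected the latitude and longitude of every city and region on the site. Most came from city coordinates already available online; a few regions could not be found that way, so I searched for them manually and filled them in. The complete city-to-coordinates file is available from the original post (extraction code: s4vj).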

    Original author: 最爱python编程
    Original URL: https://www.jianshu.com/p/c588c6c532c8