Dear readers, please bear with me; this post is a work in progress and will keep being filled in...
1. Crawler Basics
To extract the information I want from the wealth of data on the internet, I decided to learn web-crawling techniques and scrape the data myself. Chosen tools: Python + MongoDB.
There is plenty of introductory material on this topic online; a few examples:
- China's Prices Project (CPP) research group, Xiamen University
After a bit more than a week of study, the pipeline of scraping data from simple web pages with Python and writing it into MongoDB is now working. As a first example I am scraping the real-time transaction data published by the Chengdu Urban and Rural Housing Administration Bureau (成都市城乡房产管理局) and storing it in MongoDB.
The script currently runs every night at 23:00 and collects that day's transaction data. Once enough data has accumulated, I will analyse it and present it as charts; that is a later to-do item.
Components used:
- requests — HTTP requests
- BeautifulSoup — HTML parsing
- MongoClient (pymongo) — MongoDB operations
- copy — object copying
- time — system time and date handling
- schedule — scheduled jobs
- logging, logging.config — runtime logging
Source code:
# -*- coding: utf-8 -*-
__Author__ = "Leng Fuping"
__Doc__ = "get marketing info from http://www.cdfgj.gov.cn/, and store the data into csv files and MongoDB."

import requests
from bs4 import BeautifulSoup
import codecs
import csv
from pymongo import MongoClient
import copy
import time
import schedule
import logging
import logging.config

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
                    datefmt='%a, %d %b %Y %H:%M:%S',
                    filename='get_house_marketing_info_for_CD.log',
                    filemode='a')  # default is 'a'; with 'w' the log file is rewritten on every run
def getHTML(url):
    # fetch the raw page content
    r = requests.get(url)
    logging.debug("The content fetched from url [%s] is: %s", url, r.content)
    return r.content
def parseHTML(html):
    # parse the transaction tables out of the page
    utfhtml = html.decode('UTF-8', 'ignore')
    soup = BeautifulSoup(utfhtml, 'html.parser')
    marketingList = []
    targetTables = soup.find_all('table', attrs={"width": "550px;"})
    for targetTable in targetTables:
        marketingInfo = []
        marketingType = None
        try:
            marketingType = targetTable.find('td', attrs={'align': 'center', 'class': 'yellow'}).get_text()
        except Exception:
            logging.exception("Failed to extract the marketing type from this table")
        else:
            logging.debug("The marketingType is: %s", marketingType)
        if marketingType is not None:
            marketingDict = {'marketingTitle': marketingType.replace('\r\n', '').replace(' ', '')}
            marketingTable = targetTable.find('table', attrs={'bgcolor': '#BCBCBC', 'class': 'blank'})
            marketingTrItems = marketingTable.find_all('tr', attrs={'bgcolor': '#FFFFFF'})
            for marketingTrItem in marketingTrItems:
                marketingItem = []
                for marketingTdItem in marketingTrItem.find_all('td', attrs={'bgcolor': '#FFFFFF'}):
                    marketingItem.append(marketingTdItem.get_text().replace('\r\n', '').replace(' ', ''))
                marketingInfo.append(marketingItem)
            marketingDict['marketingInfo'] = marketingInfo
            marketingList.append(marketingDict)
        else:
            logging.warning("this table is not needed, skipping it")
    return marketingList
def writeCSV(file_name, data_list):
    with codecs.open(file_name, 'w') as f:
        writer = csv.writer(f)
        for data in data_list:
            writer.writerow(data)
def writeMongoDB(collection, data_list):
    for data in data_list:
        # (dist, type, cdate) identifies a record; upsert so the same day's
        # record is updated instead of inserted twice
        targetTuple = {
            'dist': data['dist'],
            'type': data['type'],
            'cdate': data['cdate']
        }
        collection.update(targetTuple, data, True)
        logging.info("Upserted one record into mongoDB: %s", data)
def formatDataForMongoDB(data_list):
    formated_data_list = []
    for datarow in data_list:
        dataTitle = ('dist', 'area', 'harea', 'hnum')
        today = time.strftime('%Y%m%d', time.localtime(time.time()))
        commonmap = {}
        commonmap["type"] = datarow["marketingTitle"]
        commonmap["cdate"] = today
        marketingInfo = datarow["marketingInfo"]
        for item in marketingInfo:
            i = 0
            datamap = copy.copy(commonmap)
            datamap["lutime"] = time.asctime(time.localtime())
            while i < len(item):
                datamap[dataTitle[i]] = item[i]
                i = i + 1
            formated_data_list.append(datamap)
    return formated_data_list
def job():
    # get the current time and format it as a date string
    today = time.strftime('%Y%m%d', time.localtime(time.time()))
    logging.info("begin to collect INFO...Today is: %s", today)
    # 1. get and parse INFO from the network
    logging.info("get and parse INFO from network begin...")
    URL = 'http://www.cdfgj.gov.cn/SCXX/Default.aspx'
    html = getHTML(URL)
    marketingList = parseHTML(html)
    logging.info("get and parse INFO from network FINISHED!")
    # 2. write the info to csv files
    logging.info("write info to csv file begin...")
    for marketingInfo in marketingList:
        writeCSV(marketingInfo['marketingTitle'] + today + ".csv", marketingInfo['marketingInfo'])
    logging.info("write info to csv file FINISHED!")
    # 3. write the info to mongoDB
    # create the MongoDB connection and select the database and collection
    client = MongoClient("localhost", 27017)
    db = client.housemarketing
    collection = db.housemarketing
    logging.info("write INFO into mongodb begin...")
    writeMongoDB(collection, formatDataForMongoDB(marketingList))
    logging.info("write INFO into mongodb FINISHED! written data list size is: %s", len(marketingList))
    logging.info("collect data finished. see you tomorrow!")
    logging.info("**************************************************************************************")
# run the job every night at 23:00
schedule.every().day.at("23:00").do(job)

while True:
    schedule.run_pending()
    time.sleep(1)
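For orientation, each record that formatDataForMongoDB produces (and writeMongoDB upserts) has roughly this shape; all values below are illustrative placeholders, not real scraped data:

# illustrative only: field names come from commonmap and dataTitle above,
# the values are made-up placeholders
sample_document = {
    'type': '...',                 # marketingTitle of the table the row came from
    'cdate': '20160729',           # collection date, YYYYMMDD
    'lutime': '...',               # time.asctime() when the row was loaded
    'dist': '...',                 # first cell of the row
    'area': '...',                 # remaining cells, mapped via dataTitle
    'harea': '...',
    'hnum': '...',
}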
2. Applying the Crawler to the 'Belt and Road' (一带一路) Topic
Coming back to the main goal, applying crawler techniques to the 'Belt and Road' topic raises several more problems to solve:
1. Selecting useful websites. The relevant 'Belt and Road' data is scattered across many sites; how do we pick, out of a huge directory of websites, the ones that actually carry key 'Belt and Road' information and reporting? (A minimal relevance check is sketched after this list.)
2. Extracting useful data. Once the target sites are chosen, we still have to pull the useful data out of them. Every site has a different HTML structure, so hand-writing a crawler for each one is time-consuming and laborious; is there a better way?
3. Classifying and storing the data. The scraped data needs to be organised into an indicator system, for example: investment volume, major projects, government-to-government cooperation, company-to-company cooperation, preferential government policies, and so on, and then stored by category.
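For problem 1, a crude starting point (the keyword list and the idea of scanning only the front page are assumptions, not a finished design) might look like this:

import requests
from bs4 import BeautifulSoup

KEYWORDS = ['一带一路']  # extend with related terms as needed

def isRelevantSite(url):
    # crude relevance check: does the site's front page mention any keyword?
    try:
        html = requests.get(url, timeout=10).content.decode('UTF-8', 'ignore')
    except requests.exceptions.RequestException:
        return False
    text = BeautifulSoup(html, 'html.parser').get_text()
    return any(kw in text for kw in KEYWORDS)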
added at 20160729
This week I put together a crawler to pull useful information from Xinhuanet (新华网). The site has anti-crawling measures, apparently a per-IP request-rate limit. As a temporary workaround the crawler sleeps for 60 seconds after every 20 requests to the target site (a minimal sketch of this is below). The workaround does the job, but it is far too slow, so next I will try to solve the problem at the root.
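A rough illustration of that temporary workaround; the wrapper name and the 20-requests / 60-seconds values are just the ones mentioned above, not a fixed design:

import time
import requests

REQUESTS_PER_BATCH = 20    # after this many requests...
BATCH_SLEEP_SECONDS = 60   # ...pause for this long

_request_count = 0

def throttledGet(url, **kwargs):
    # hypothetical wrapper: sleep 60s after every 20 requests to the target site
    global _request_count
    _request_count += 1
    if _request_count % REQUESTS_PER_BATCH == 0:
        time.sleep(BATCH_SLEEP_SECONDS)
    return requests.get(url, **kwargs)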
Reference: 网站常见的反爬虫和应对方法 (common anti-crawler measures used by websites and how to counter them).
Following the article 如何让你的scrapy爬虫不再被ban之二(利用第三方平台crawlera做scrapy爬虫防屏蔽), I registered for crawlera, only to find that it is a paid service, so I dropped that option.
Instead, I plan to scrape proxy IPs from public lists and manage them myself. Commonly used proxy-IP list sites include:
http://www.freeproxylists.net/zh/
http://www.xroxy.com/proxylist.php
A quick test shows that requests works fine through a proxy (see the sketch below). Next I need to wire proxies into the crawler and also work out proxy-IP management and task allocation.
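A minimal example of the kind of test I ran; the proxy address and target URL are placeholders, not working values:

import requests

# placeholder proxy taken from one of the lists above; substitute a live one
proxies = {
    'http': 'http://1.2.3.4:8080',
    'https': 'http://1.2.3.4:8080',
}
r = requests.get('http://example.com/', proxies=proxies, timeout=10)
print(r.status_code)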
Later I probed the target site a bit more and found that its IP-blocking policy is not fixed: a blocked IP can usually access the site normally again after a few seconds. So I decided to handle the IP block by simply having the program keep retrying.
The strategy in brief: whenever the target site refuses service (e.g. it returns a 403, or demands an image CAPTCHA), retry adaptively; each additional attempt increases the sleep interval by a configured step, until a correct result is obtained or a time budget runs out. In practice this basically solves the problem. The code is as follows:
def selfAdaptionTry(bizProcess, sleepTime, stepTime, timeout):
    # retry bizProcess with a sleep that grows by stepTime after each failure,
    # until it returns a truthy result or the timeout budget is used up
    tryTime = 1
    while True:
        result = bizProcess()
        if not result:
            logging.info("The target biz process failed, continue to try. try the %s time", tryTime)
            print("The target biz process failed, continue to try. try the %s time" % tryTime)
            if timeout:
                if sleepTime < timeout:
                    time.sleep(sleepTime)
                    timeout = timeout - sleepTime
                    sleepTime = sleepTime + stepTime
                else:
                    logging.error("selfAdaptionTry timed out, giving up.")
                    return None
            else:
                time.sleep(sleepTime)
                sleepTime = sleepTime + stepTime
        else:
            return result
        tryTime = tryTime + 1
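A usage sketch, assuming a hypothetical fetchTarget() that returns a falsy value whenever the site refuses service (403, CAPTCHA page, etc.); the URL and headers are placeholders:

import requests

url = 'http://example.com/target-page'    # placeholder target URL
headers = {'User-Agent': 'Mozilla/5.0'}   # minimal browser-like header

def fetchTarget():
    # hypothetical bizProcess: a falsy return value means "refused, try again"
    r = requests.get(url, headers=headers)
    if r.status_code != 200:
        return None
    return r.content

# start with a 5s pause, add 5s per retry, give up after a 300s budget
content = selfAdaptionTry(fetchTarget, 5, 5, 300)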
The IP-blocking problem apparently is not over yet; this time it shows up at the network level: the server forcibly closed the client connection. I need to make the program more robust by catching and handling this exception.
r = requests.get(url,headers=headers);
File "D:\Users\Administrator\AppData\Local\Programs\Python\Python35\lib\site-p ackages\requests\api.py", line 71, in get
return request('get', url, params=params, **kwargs)
File "D:\Users\Administrator\AppData\Local\Programs\Python\Python35\lib\site-p ackages\requests\api.py", line 57, in request
return session.request(method=method, url=url, **kwargs)
File "D:\Users\Administrator\AppData\Local\Programs\Python\Python35\lib\site-p ackages\requests\sessions.py", line 475, in request
resp = self.send(prep, **send_kwargs)
File "D:\Users\Administrator\AppData\Local\Programs\Python\Python35\lib\site-p ackages\requests\sessions.py", line 585, in send
r = adapter.send(request, **kwargs)
File "D:\Users\Administrator\AppData\Local\Programs\Python\Python35\lib\site-p ackages\requests\adapters.py", line 453, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetErro
r(10054, '远程主机强迫关闭了一个现有的连接。', None, 10054, None))
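One way to survive this is to catch the exception around the request; a minimal sketch, assuming the retry loop above can treat None as a failed attempt:

import logging
import requests

def safeGet(url, headers=None):
    # hypothetical wrapper: return None instead of crashing when the remote
    # host resets the connection, so the retry logic above can try again
    try:
        return requests.get(url, headers=headers, timeout=30)
    except requests.exceptions.ConnectionError:
        logging.exception("Connection aborted by the remote host, will retry")
        return None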
It ran all night, and two problems showed up:
1. status_code == 200, but the HTML content is actually a 404 error page;
2. status_code == 200, but the response is a CDN notice page:
<HTML><HEAD><script>window.location.href=\'target url\';</script></HEAD><BODY><BR clear="all"><HR noshade size="1px"><ADDRESS>Generated Mon, 01 Aug 2016 14:09:19 GMT by cache.51cdn.com (Cdn Cache Server V2.0)</ADDRESS></BODY></HTML>
I treat these two cases as site-specific quirks: when parsing the HTML content they count as failures, are discarded, and the crawler keeps polling the target site.
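A sketch of that special-case check; the markers it looks for are assumptions based on the two responses observed above:

def looksLikeRealPage(html):
    # hypothetical check for the two fake-200 cases described above
    text = html.decode('UTF-8', 'ignore') if isinstance(html, bytes) else html
    if 'cache.51cdn.com' in text:                   # CDN notice page
        return False
    if '404' in text and 'error' in text.lower():   # disguised 404 error page
        return False
    return True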
Analysing last night's logs also shows that the target site's IP-block freeze period is 10 minutes; after 10 minutes access returns to normal.
added at 20160731
Yesterday the site kept returning the 404 error page, so presumably my IP had been blocked. Tonight I switched to a different proxy IP and could access the site again, so the proxy mechanism really does need to be brought in.
Also, the site is still reachable from an ordinary browser, which needs further investigation.
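A first sketch of such a proxy mechanism; the addresses are placeholders, and a real pool would be filled from the proxy-list sites mentioned earlier and refreshed as proxies die:

import itertools
import requests

# placeholder proxies; fill this list from the proxy-list sites above
PROXIES = [
    {'http': 'http://1.2.3.4:8080', 'https': 'http://1.2.3.4:8080'},
    {'http': 'http://5.6.7.8:3128', 'https': 'http://5.6.7.8:3128'},
]
_proxy_cycle = itertools.cycle(PROXIES)

def getWithRotatingProxy(url, headers=None):
    # try each proxy in turn until one returns a usable response
    for _ in range(len(PROXIES)):
        proxy = next(_proxy_cycle)
        try:
            r = requests.get(url, headers=headers, proxies=proxy, timeout=10)
            if r.status_code == 200:
                return r
        except requests.exceptions.RequestException:
            continue
    return None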