aspider
A web scraping micro-framework based on asyncio.
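aspider is a lightweight asynchronous scraping framework built on asyncio. It aims to make single-page scrapers quicker and easier to write, and it uses asynchronous execution to make crawling faster by reducing the time spent waiting on I/O.
Introduction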
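To make the I/O claim concrete, here is a minimal sketch that uses only the standard library, not aspider itself. `fake_fetch`, `fetch_all`, and `timed_crawl` are hypothetical names, and `asyncio.sleep` stands in for a real network request. Because the waits overlap, five simulated requests of 0.2 s each should finish in roughly the time of one rather than the sum of all five.

```python
import asyncio
import time


async def fake_fetch(url: str, delay: float = 0.2) -> str:
    """Stand-in for a network request: sleeps instead of doing real I/O."""
    await asyncio.sleep(delay)
    return f"<html>{url}</html>"


async def fetch_all(urls):
    # gather() schedules all coroutines at once, so their waits overlap.
    return await asyncio.gather(*(fake_fetch(u) for u in urls))


def timed_crawl(urls):
    """Run the crawl and return the pages plus the elapsed wall-clock time."""
    start = time.perf_counter()
    pages = asyncio.run(fetch_all(urls))
    return pages, time.perf_counter() - start


if __name__ == "__main__":
    pages, elapsed = timed_crawl([f"https://example.com/{i}" for i in range(5)])
    print(len(pages), f"{elapsed:.2f}s")
```

A sequential loop over the same URLs would take about one second in total; the concurrent version is bounded by the slowest single request.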
pip install aspider
Item
For a single page, implementing the Item class defined by the framework is enough to scrape the target data:
import asyncio
from aspider import Request
request = Request("https://news.ycombinator.com/")
response = asyncio.get_event_loop().run_until_complete(request.fetch())
print(response)
# Output
# [2018-07-25 11:23:42,620]-Request-INFO <GET: https://news.ycombinator.com/>
# <Response url[text]: https://news.ycombinator.com/ status:200 metadata:{}>
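The Item idea in this section, declaring fields on a class and letting the framework fill them in, can be illustrated with a toy model. This sketch is not aspider's implementation: `RegexField`, `ToyItem`, and `from_text` are hypothetical names, and a regular expression stands in for real CSS selection. It also assumes that a `clean_<field>` coroutine, if defined, post-processes the extracted value.

```python
import asyncio
import re


class RegexField:
    """Hypothetical field: extracts the first regex group from a text snippet."""

    def __init__(self, pattern: str):
        self.pattern = re.compile(pattern)

    def extract(self, text: str) -> str:
        match = self.pattern.search(text)
        return match.group(1) if match else ""


class ToyItem:
    """Toy base class: fills each declared field, then applies clean_<name> if defined."""

    @classmethod
    async def from_text(cls, text: str) -> "ToyItem":
        item = cls()
        for name, field in vars(cls).items():
            if not isinstance(field, RegexField):
                continue
            value = field.extract(text)
            cleaner = getattr(item, f"clean_{name}", None)
            if cleaner is not None:
                value = await cleaner(value)
            setattr(item, name, value)
        return item


class StoryItem(ToyItem):
    title = RegexField(r'class="storylink">([^<]*)<')
    url = RegexField(r'href="([^"]*)"')

    async def clean_title(self, value: str) -> str:
        # Collapse runs of whitespace in the extracted title.
        return " ".join(value.split())


if __name__ == "__main__":
    row = '<a href="https://example.com/post" class="storylink">  Hello   world </a>'
    story = asyncio.run(StoryItem.from_text(row))
    print(story.title, story.url)
```

Keeping extraction rules on the class and cleanup in per-field hooks means the scraping code only has to describe what it wants, not how to walk the page.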
Spider
When there are many target pages and deeper crawling is needed, Spider comes in handy:
import aiofiles

from aspider import AttrField, TextField, Item, Spider


class HackerNewsItem(Item):
    target_item = TextField(css_select='tr.athing')
    title = TextField(css_select='a.storylink')
    url = AttrField(css_select='a.storylink', attr='href')

    async def clean_title(self, value):
        return value


class HackerNewsSpider(Spider):
    start_urls = ['https://news.ycombinator.com/', 'https://news.ycombinator.com/news?p=2']

    async def parse(self, res):
        items = await HackerNewsItem.get_items(html=res.body)
        for item in items:
            async with aiofiles.open('./hacker_news.txt', 'a') as f:
                await f.write(item.title + '\n')


if __name__ == '__main__':
    HackerNewsSpider.start()
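The general shape of a spider like the one above (a list of start URLs, concurrent fetching, and a `parse` callback per page) can be sketched with the standard library alone. This is a toy model under stated assumptions, not how aspider works internally: `ToySpider`, its `concurrency` attribute, and the fake `fetch` method are all hypothetical, and no real HTTP request is made.

```python
import asyncio


class ToySpider:
    """Toy spider: fetches every start URL concurrently and hands each page to parse()."""

    start_urls = []
    concurrency = 2  # hypothetical limit on simultaneous requests

    async def fetch(self, url: str) -> str:
        # Stand-in for a real HTTP request.
        await asyncio.sleep(0.01)
        return f"<title>{url}</title>"

    async def parse(self, url: str, body: str) -> None:
        raise NotImplementedError

    async def _handle(self, url: str, limit: asyncio.Semaphore) -> None:
        # The semaphore caps how many fetches are in flight at once.
        async with limit:
            body = await self.fetch(url)
        await self.parse(url, body)

    async def run(self) -> None:
        limit = asyncio.Semaphore(self.concurrency)
        await asyncio.gather(*(self._handle(u, limit) for u in self.start_urls))

    @classmethod
    def start(cls) -> "ToySpider":
        spider = cls()
        asyncio.run(spider.run())
        return spider


class TitleSpider(ToySpider):
    start_urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]

    def __init__(self):
        self.titles = {}

    async def parse(self, url: str, body: str) -> None:
        # Strip the surrounding <title> tags from the fake page.
        self.titles[url] = body[len("<title>"):-len("</title>")]


if __name__ == "__main__":
    spider = TitleSpider.start()
    print(sorted(spider.titles))
```

The semaphore is one simple way to keep a crawler polite; a subclass only needs to supply `start_urls` and `parse`.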
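JavaScript loading support
The Request class also works well on its own and returns the page content. The example below scrapes a page whose content is only available after its JavaScript has been loaded: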
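import asyncio
from aspider import Request
# load_js=True asks the request to load the page's JavaScript before returning
request = Request("https://www.jianshu.com/", load_js=True)
response = asyncio.get_event_loop().run_until_complete(request.fetch())
print(response.body)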
If you like it, give it a try. The project is on GitHub: aspider