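spider: parses the responses returned by the downloader, produces scraped items, and generates additional crawl requests.
item pipelines: process the items produced by spiders as a sequence of stages (cleaning, validation, de-duplication) and store the data in a database.
downloader middleware: modifies the requests and responses passing between the engine, scheduler, and downloader.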
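The item pipeline stage described above can be sketched roughly as follows. This is a minimal illustration, not the notes' own code: the class name CleanAndDedupePipeline and the "code" field are hypothetical, items are assumed to be plain dicts, and the database step is left out. process_item(item, spider) and scrapy.exceptions.DropItem are the standard Scrapy hooks; the fallback class exists only so the sketch runs without Scrapy installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback (assumption) so the sketch can run without Scrapy installed.
    class DropItem(Exception):
        pass


class CleanAndDedupePipeline:
    """Hypothetical pipeline: clean, validate, and de-duplicate dict items."""

    def __init__(self):
        # Codes already seen during this crawl.
        self.seen_codes = set()

    def process_item(self, item, spider):
        # Clean: strip surrounding whitespace from every string field.
        cleaned = {
            key: value.strip() if isinstance(value, str) else value
            for key, value in item.items()
        }
        # Validate: the hypothetical "code" field must be present and non-empty.
        code = cleaned.get("code")
        if not code:
            raise DropItem("missing code")
        # De-duplicate: drop items whose code was already processed.
        if code in self.seen_codes:
            raise DropItem(f"duplicate code: {code}")
        self.seen_codes.add(code)
        # Returning the item hands it to the next pipeline stage.
        return cleaned
```

Raising DropItem stops further processing of that item, while returning the item passes it on; a storage stage would normally come last.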
scrapy -h lists the available commands; the common ones are startproject, genspider, settings, crawl, list, and shell.
1: Create the crawler project and its template files: scrapy startproject BaiduStocks
2: Write the spider: cd BaiduStocks, then scrapy genspider example example.com
3: Write the item pipeline
4: Tune the configuration settings
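For step 4, tuning usually means editing the project's settings.py. The fragment below is a sketch under assumptions: the setting names are standard Scrapy settings, but the values are illustrative rather than recommendations, and the pipeline path refers to a hypothetical class.

```python
# settings.py (fragment); values are illustrative only

# Respect robots.txt rules on the target site.
ROBOTSTXT_OBEY = True

# Maximum concurrent requests overall and per domain.
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# Seconds to wait between requests to the same site.
DOWNLOAD_DELAY = 0.5

# Enable pipelines; lower numbers run first (conventionally 0 to 1000).
ITEM_PIPELINES = {
    # Hypothetical class name, assumed to live in BaiduStocks/pipelines.py.
    "BaiduStocks.pipelines.CleanAndDedupePipeline": 300,
}
```

Raising the concurrency values speeds up a crawl at the cost of more load on the target site, so a delay is often paired with them.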
Request class: class scrapy.http.Request(). Attributes and methods: .url, .method, .headers, .body, .meta, .copy()
Response class: class scrapy.http.Response(). Attributes and methods: .url, .status, .headers, .body, .flags, .request, .copy()
Scrapy supports several HTML parsing approaches: Beautiful Soup, lxml, re, XPath Selector, and CSS Selector.
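Of the approaches listed, re needs nothing beyond the standard library, so it makes a convenient small example. The sketch below assumes a made-up HTML fragment standing in for a decoded response body, and an assumed stock-code format of "sh" or "sz" followed by six digits; neither comes from the notes.

```python
import re

# Made-up HTML standing in for the decoded text of a response body.
html = """
<ul>
  <li><a href="http://quote.example.com/sh600000.html">Stock A</a></li>
  <li><a href="http://quote.example.com/sz000001.html">Stock B</a></li>
  <li><a href="http://quote.example.com/about.html">About</a></li>
</ul>
"""

# Capture the href value of every <a> tag.
hrefs = re.findall(r'<a\s+href="([^"]+)"', html)

# Keep only links that contain a code in the assumed format.
codes = []
for href in hrefs:
    match = re.search(r"s[hz]\d{6}", href)
    if match:
        codes.append(match.group(0))

print(codes)  # ['sh600000', 'sz000001']
```

Regular expressions are brittle on irregular markup; the selector-based approaches in the list tend to be more robust when the page structure matters.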
def gen(n):
    # Generator: lazily yields the square of each integer in range(n)
    for i in range(n):
        yield i**2
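A short usage sketch may help show what the generator buys you. gen is repeated so the snippet runs on its own; squares_list is a hypothetical list-building counterpart added only for comparison. Spider callbacks in Scrapy commonly use yield in the same way to hand back items and requests one at a time.

```python
def gen(n):
    # Yields one square at a time; nothing is computed until requested.
    for i in range(n):
        yield i**2


def squares_list(n):
    # Hypothetical counterpart: builds the whole list in memory up front.
    return [i**2 for i in range(n)]


g = gen(5)
first = next(g)   # pulls only the first value: 0
rest = list(g)    # consumes the remaining values: [1, 4, 9, 16]

print(first, rest)
print(squares_list(5))  # [0, 1, 4, 9, 16]
```

The generator keeps only its current position in memory, which matters when n is large or when each value is expensive to produce.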