Scrapy provides two kinds of middleware: downloader middleware (Downloader Middleware) and spider middleware (Spider Middleware).
Downloader middleware
Downloader middleware is Scrapy's hook for modifying Requests and Responses during a crawl, and a way to extend Scrapy's functionality. For example:
- adding headers to a request before it is downloaded;
- decompressing or otherwise post-processing the response body after a request completes.
How to activate a downloader middleware:
Add key-value pairs to the DOWNLOADER_MIDDLEWARES setting in settings.py. The key is the middleware to enable, and the value is a number representing its priority: the lower the value, the higher the priority. Scrapy also ships a built-in downloader middleware setting, DOWNLOADER_MIDDLEWARES_BASE (which should not be overridden). At startup, Scrapy merges DOWNLOADER_MIDDLEWARES_BASE with DOWNLOADER_MIDDLEWARES; to disable a middleware that Scrapy enables by default, set its value to None in DOWNLOADER_MIDDLEWARES.
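For example, a minimal sketch of such a configuration (the path `myproject.middlewares.CustomProxyMiddleware` is a made-up placeholder for your own middleware):

```python
# settings.py (sketch): enable a custom middleware and disable a built-in one
DOWNLOADER_MIDDLEWARES = {
    # hypothetical custom middleware, priority 543
    'myproject.middlewares.CustomProxyMiddleware': 543,
    # disable a middleware that DOWNLOADER_MIDDLEWARES_BASE enables by default
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
```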
```python
DOWNLOADER_MIDDLEWARES_BASE = {
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
}
```
How to write a downloader middleware
class scrapy.downloadermiddlewares.DownloaderMiddleware
process_request(request, spider)
Called for each Request that passes through the downloader middleware; middlewares with higher priority are called first. The method should return one of the following: None, a Response object, a Request object, or raise an IgnoreRequest exception.
- Return None: Scrapy continues executing the corresponding methods of the other middlewares;
- Return a Response object: Scrapy will not call any other middleware's process_request method and will not start the download, but returns that Response directly;
- Return a Request object: Scrapy will not call any other middleware's process_request() method; instead, the returned Request is handed to the scheduler to be downloaded later;
- Raise IgnoreRequest: the process_exception() methods of the installed middlewares are called; if none of them catches the exception, Request.errback is called; if it is still unhandled, the request is ignored and not logged.
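As a sketch of the first two cases, here is a minimal middleware whose process_request tags every request with a header and returns None; the class and header names are made up, and the stub request only mimics enough of Scrapy's Request interface to let the snippet run stand-alone:

```python
class DemoHeaderMiddleware:
    """Hypothetical middleware: tag every request, then let the chain continue."""

    def process_request(self, request, spider):
        # Scrapy's request.headers supports dict-style setdefault
        request.headers.setdefault('X-Demo', 'middleware-was-here')
        return None  # None -> Scrapy keeps calling the next process_request


# stand-in for scrapy.Request, just enough to exercise the method
class FakeRequest:
    def __init__(self):
        self.headers = {}


req = FakeRequest()
result = DemoHeaderMiddleware().process_request(req, spider=None)
print(result, req.headers)
```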
process_response(request, response, spider)
Called for each Response that passes back through the downloader middleware; middlewares with higher priority are called later, the reverse of process_request(). The method must return one of the following: a Response object, a Request object, or raise an IgnoreRequest exception.
- Return a Response object: Scrapy continues calling the process_response methods of the other middlewares;
- Return a Request object: the middleware chain stops, and the returned Request is handed to the scheduler to be downloaded later;
- Raise IgnoreRequest: Request.errback is called to handle it; if it goes unhandled, the response is ignored and not logged.
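A common use of the Request-returning branch is retrying on certain status codes. The sketch below (class name and stubs are made up; real Scrapy responses carry a `status` attribute just like the stub) re-schedules the request when the server replies 503:

```python
class RetryOn503Middleware:
    """Hypothetical middleware: re-schedule the request when the server says 503."""

    def process_response(self, request, response, spider):
        if response.status == 503:
            return request   # Request -> stop the chain, back to the scheduler
        return response      # Response -> next middleware's process_response runs


# stand-ins that mimic just enough of Scrapy's Request/Response interface
class FakeRequest:
    pass

class FakeResponse:
    def __init__(self, status):
        self.status = status


mw = RetryOn503Middleware()
req = FakeRequest()
print(mw.process_response(req, FakeResponse(503), None) is req)  # True
print(mw.process_response(req, FakeResponse(200), None) is req)  # False
```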
process_exception(request, exception, spider)
Called when a download handler or a process_request() method raises an exception. It should return one of the following: None, a Response object, or a Request object.
- Return None: Scrapy continues calling the process_exception() methods of the other middlewares;
- Return a Response object: the process_response() chain of the installed middlewares starts, and no other process_exception() methods are called;
- Return a Request object: no other process_exception() methods are called, and the returned Request is handed to the scheduler to be downloaded later.
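For instance, a failover sketch that swaps in a backup proxy when the download fails with a connection error; the class name, the proxy address, and the stub request are all made up for illustration:

```python
class ProxyFailoverMiddleware:
    """Hypothetical middleware: retry through a backup proxy on connection errors."""

    BACKUP_PROXY = 'http://backup-proxy.example:8080'  # made-up address

    def process_exception(self, request, exception, spider):
        if isinstance(exception, ConnectionError):
            request.meta['proxy'] = self.BACKUP_PROXY
            return request  # Request -> re-schedule it, skip other process_exception
        return None         # None -> let the remaining middlewares handle it


# stand-in for scrapy.Request with just a meta dict
class FakeRequest:
    def __init__(self):
        self.meta = {}


mw = ProxyFailoverMiddleware()
req = FakeRequest()
print(mw.process_exception(req, ConnectionError('boom'), None) is req)  # True
print(req.meta['proxy'])
```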
from_crawler(cls, crawler)
If this class method exists, it is called with the crawler to create the middleware instance, and it must return a middleware object. Through the crawler you can access all of Scrapy's core components, such as settings and signals.
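A minimal sketch of the pattern, with a made-up middleware and setting name; the stub crawler exposes a dict as `settings`, which supports the same `.get` lookup used on Scrapy's real settings object:

```python
class TimingMiddleware:
    """Hypothetical middleware constructed via from_crawler."""

    def __init__(self, enabled):
        self.enabled = enabled

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.settings gives access to the project settings
        return cls(enabled=crawler.settings.get('TIMING_ENABLED', False))


# stand-in for the crawler, just enough to run the classmethod
class FakeCrawler:
    def __init__(self, settings):
        self.settings = settings


mw = TimingMiddleware.from_crawler(FakeCrawler({'TIMING_ENABLED': True}))
print(mw.enabled)  # True
```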
Downloader middlewares shipped with Scrapy
The following covers a few commonly used downloader middlewares; for the rest, see the documentation and the source code.
HttpProxyMiddleware
scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware sets the proxy server for outgoing requests. The proxy is set through Request.meta['proxy'], falling back to the environment variables http_proxy, https_proxy and no_proxy in that order. Let's test it against the output of http://httpbin.org/ip:
```shell
# shell command
export http_proxy='http://193.112.216.55:1234'
```

```python
# -*- coding: utf-8 -*-
import scrapy


class ProxySpider(scrapy.Spider):
    name = 'proxy'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/ip']

    def parse(self, response):
        print(response.text)
```
Running scrapy crawl proxy --nolog gives:

```json
{"origin":"111.231.115.150, 193.112.216.55"}
```

The response contains the proxy IP we set.
UserAgentMiddleware
scrapy.downloadermiddlewares.useragent.UserAgentMiddleware sets the user agent from the USER_AGENT setting. Let's test it against the output of http://httpbin.org/headers:
```python
# settings.py
# ...
# UserAgentMiddleware is enabled by default
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'
# ...
```

```python
# -*- coding: utf-8 -*-
import scrapy


class UserAgentSpider(scrapy.Spider):
    name = 'user_agent'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/headers']

    def parse(self, response):
        print(response.text)
```
Running scrapy crawl user_agent --nolog gives:

```json
{"headers":{"Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8","Accept-Encoding":"gzip,deflate","Accept-Language":"en","Connection":"close","Host":"httpbin.org","User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36"}}
```

The response contains the user agent we configured.
Using random user agents and proxy IPs
Some sites detect crawlers by inspecting the client IP and User-Agent: a large number of requests from the same IP is taken as crawling, and the requests are refused. Some sites also check the User-Agent. By rotating multiple proxy IPs and different User-Agents we can crawl a site's data while avoiding an IP ban.
We can subclass HttpProxyMiddleware and UserAgentMiddleware and adapt them so that Scrapy picks proxies and user agents the way we want. For the originals, see httpproxy.py and useragent.py in the Scrapy source. The code:
```python
# middlewares.py
# -*- coding: utf-8 -*-
# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
from collections import defaultdict
from urllib.parse import urlparse
import random

from scrapy import signals
from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware
from scrapy.exceptions import NotConfigured
from faker import Faker  # pip install faker


class RandomHttpProxyMiddleware(HttpProxyMiddleware):

    def __init__(self, auth_encoding='latin-1', proxy_list=None):
        if not proxy_list:
            raise NotConfigured
        self.auth_encoding = auth_encoding
        self.proxies = defaultdict(list)
        for proxy in proxy_list:
            parsed = urlparse(proxy)
            # build a dict: key is the scheme, value is a list of proxy URLs
            self.proxies[parsed.scheme].append(proxy)

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.get('HTTP_PROXY_LIST'):
            raise NotConfigured
        http_proxy_list = crawler.settings.get('HTTP_PROXY_LIST')  # read from settings
        auth_encoding = crawler.settings.get('HTTPPROXY_AUTH_ENCODING', 'latin-1')
        return cls(auth_encoding, http_proxy_list)

    def _set_proxy(self, request, scheme):
        # pick a random proxy matching the request's scheme
        proxy = random.choice(self.proxies[scheme])
        request.meta['proxy'] = proxy


class RandomUserAgentMiddleware(object):

    def __init__(self):
        self.faker = Faker(locale='zh_CN')  # note: the keyword is `locale`
        self.user_agent = ''

    @classmethod
    def from_crawler(cls, crawler):
        o = cls()
        crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
        return o

    def spider_opened(self, spider):
        self.user_agent = getattr(spider, 'user_agent', self.user_agent)

    def process_request(self, request, spider):
        self.user_agent = self.faker.user_agent()  # a fresh random User-Agent
        request.headers.setdefault(b'User-Agent', self.user_agent)
```
```python
# settings.py
# ...
DOWNLOADER_MIDDLEWARES = {
    'newproject.middlewares.RandomHttpProxyMiddleware': 543,
    'newproject.middlewares.RandomUserAgentMiddleware': 550,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
HTTP_PROXY_LIST = [
    'http://193.112.216.55:1234',
    'http://118.24.172.34:1234',
]
# ...
```
```python
# anything.py
# -*- coding: utf-8 -*-
import scrapy
import json
import pprint


class AnythingSpider(scrapy.Spider):
    name = 'anything'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/anything']

    def parse(self, response):
        ret = json.loads(response.text)
        pprint.pprint(ret)
```
The code above pulls in the faker library, a very handy fake-data generator. By requesting http://httpbin.org/anything we get our own request echoed back; two runs give:

```
# scrapy crawl anything --nolog
{'args': {},
 'data': '',
 'files': {},
 'form': {},
 'headers': {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
             'Accept-Encoding': 'gzip,deflate',
             'Accept-Language': 'en',
             'Cache-Control': 'max-age=259200',
             'Connection': 'close',
             'Host': 'httpbin.org',
             'User-Agent': 'Opera/8.85.(Windows NT 5.2; sc-IT) Presto/2.9.177 '
                           'Version/10.00'},
 'json': None,
 'method': 'GET',
 'origin': '193.112.216.55',
 'url': 'http://httpbin.org/anything'}
```

```
# scrapy crawl anything --nolog
{'args': {},
 'data': '',
 'files': {},
 'form': {},
 'headers': {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
             'Accept-Encoding': 'gzip,deflate',
             'Accept-Language': 'en',
             'Cache-Control': 'max-age=259200',
             'Connection': 'close',
             'Host': 'httpbin.org',
             'User-Agent': 'Mozilla/5.0 (Macintosh; PPC Mac OS X 10_12_3) '
                           'AppleWebKit/5342 (KHTML, like Gecko) '
                           'Chrome/40.0.810.0 Safari/5342'},
 'json': None,
 'method': 'GET',
 'origin': '118.24.172.34',
 'url': 'http://httpbin.org/anything'}
```
As you can see, through the downloader middlewares our spider kept changing its IP and User-Agent.
Summary
This article covered what downloader middleware is, how to write and enable a custom one, and ended with a hands-on custom middleware. Next we will look at the other kind of middleware: spider middleware.