Scrapy增加随机请求头user_agent

  1. 为什么要增加随机请求头:更好地伪装浏览器,防止被Ban。
  • 如何在每次请求时,更换不同的user_agent,Scrapy使用Middleware即可

Spider 中间件(Middleware) 下载器中间件是介入到 Scrapy 的 spider 处理机制的钩子框架,可以添加代码来处理发送给Spiders的 response 及 spider 产生的 item 和 request。

官网说明在这里:Spider Middleware

  • 添加middleware的步骤:
    1)创建一个中间件(RandomAgentMiddleware)
    设置请求时使用随机user_agent
  1. 在settings.py中配置,激活中间件。
    网上文章基本上转的都是下面这段代码:

    《Scrapy增加随机请求头user_agent》

  • 这段代码中的疑问:
    1)自己写的Middleware放在哪个目录下
    2)settings.py中的MIDDLEWARES的路径是如何定

    1)
    自己编写的中间件放在items.py和settings.py的同一级目录。

2)
settings.py中的MIDDLEWARES的路径,应该是:

     yourproject.middlewares(文件名).middleware类

如果你的中间件的类名和文件名都使用了RandomUserAgentMiddleware,那这个路径应该写成:

xiaozhu.RandomUserAgentMiddleware.RandomUserAgentMiddleware

这一点,大家可以比较引入自己写的pipelines,只不过Scrapy框架本身为我们创建了一个pipelines.py

3) 在middleware中间件中导入settings中的USER_AGENT_LIST
我使用的是mac,因为settings.py与RandomUserAgentMiddleware在同一级目录

   from settings import USER_AGENT_LIST

Scrapy增加随机user_agent的完整代码:

from settings import USER_AGENT_LIST

import random
from scrapy import log

class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        ua  = random.choice(USER_AGENT_LIST)
        if ua:
            request.headers.setdefault('User-Agent', ua)

settings.py中:

USER_AGENT_LIST=[
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]


DOWNLOADER_MIDDLEWARES = {
    'xiaozhu.user_agent_middleware.RandomUserAgentMiddleware': 400,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
}


代码Github: https://github.com/ppy2790/xiaozhu

点赞