Setting up proxy IPs for a Scrapy-based crawler

Find a free or paid proxy site and get its API for fetching IPs, then add a ProxyMiddleware component in middlewares.py and enable it in settings.py.
The full code is as follows:

import logging
import random
import urllib.request

class ProxyMiddleware(object):

    logger = logging.getLogger(__name__)

    def get_random_ip(self):
        # order is the order/serial number issued by the proxy provider
        order = "xxxxxxxxxxxxxxxxxxx"
        APIurl = "http://xxxxxxxxxxxxxxxxxxxx" + order + ".html"
        # The API returns one ip:port per line; pick one at random.
        res = urllib.request.urlopen(APIurl).read().decode("utf-8")
        IPs = [line.strip() for line in res.split("\n") if line.strip()]
        # random.choice returns a single string (random.choices would return a list)
        proxyip = random.choice(IPs)
        return 'http://' + proxyip

    def process_request(self, request, spider):
        ip = self.get_random_ip()
        self.logger.debug("Current IP:Port is %s", ip)
        request.meta['proxy'] = ip

    def process_response(self, request, response, spider):
        return response
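
Free proxies die frequently, and since Scrapy's built-in RetryMiddleware is disabled in the settings below, it can help to add a process_exception hook to the same middleware so a failed request is retried through a fresh proxy. A minimal sketch, not part of the original code: process_exception is the standard downloader-middleware hook, and the dont_filter flag mirrors what the built-in RetryMiddleware does; a real implementation would also cap the number of retries:

    def process_exception(self, request, exception, spider):
        # The current proxy probably died; swap in a fresh one and retry.
        self.logger.debug("Proxy failed with %r, retrying", exception)
        retry_req = request.copy()
        retry_req.meta['proxy'] = self.get_random_ip()
        retry_req.dont_filter = True  # bypass the dupefilter, like RetryMiddleware does
        # Returning a Request reschedules it instead of propagating the error.
        return retry_req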

The configuration in settings.py is as follows:

DOWNLOADER_MIDDLEWARES = {
    'example.middlewares.RotateUserAgentMiddleware': 100,
    # Our own ProxyMiddleware; a smaller number means higher priority,
    # i.e. its process_request runs earlier.
    'example.middlewares.ProxyMiddleware': 110,
    'example.middlewares.ExampleDownloaderMiddleware': 543,
    # Setting a built-in middleware to None disables it.
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': None,
}
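
To check that requests really go out through the proxy, a quick way is to crawl an IP-echo endpoint. A minimal sketch assuming the settings above are active in the project; httpbin.org/ip simply returns the caller's IP as JSON, and the spider name is illustrative:

import scrapy

class IPCheckSpider(scrapy.Spider):
    name = "ipcheck"
    start_urls = ["https://httpbin.org/ip"]

    def parse(self, response):
        # The "origin" field should show the proxy's IP, not your own.
        self.logger.info("Outgoing IP: %s", response.text)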

Exactly how you use proxy IPs also depends on how often you may call the API: most providers limit the number of IPs you can extract per minute, which calls for more careful programming, balanced against your crawling needs and your budget.
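
One common way to stay within such a per-minute extraction limit is to fetch a whole batch of IPs in one API call, cache them locally, and only call the API again after a minimum interval. A minimal sketch along the lines of get_random_ip above; CachedProxyPool, refresh_interval, and _refresh_pool are illustrative names, not part of the original code:

import time
import random
import urllib.request

class CachedProxyPool(object):

    def __init__(self, api_url, refresh_interval=60):
        self.api_url = api_url
        self.refresh_interval = refresh_interval  # seconds between API calls
        self.pool = []
        self.last_fetch = 0.0

    def _refresh_pool(self):
        # One API call fetches a whole batch; split on newlines as before.
        res = urllib.request.urlopen(self.api_url).read().decode("utf-8")
        self.pool = [line.strip() for line in res.split("\n") if line.strip()]
        self.last_fetch = time.time()

    def get_random_ip(self):
        # Only hit the API when the cache is empty or stale,
        # so we never exceed the provider's extraction rate.
        if not self.pool or time.time() - self.last_fetch > self.refresh_interval:
            self._refresh_pool()
        return 'http://' + random.choice(self.pool)

ProxyMiddleware.get_random_ip could then delegate to a single shared pool instance instead of hitting the API on every request.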

    Original author: Nise9s
    Original article: https://www.jianshu.com/p/074c36a7948c
    This article is reposted from the web purely to share knowledge; if it infringes your rights, please contact the blogger to have it removed.