Find a free or paid proxy site and get its API for fetching IPs, then add a ProxyMiddleware component to Middlewares and enable it in settings.py.
The code is as follows:
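import logging
import random
import urllib.request


class ProxyMiddleware(object):
    logger = logging.getLogger(__name__)

    def get_random_ip(self):
        # order is the order number or serial number
        order = "xxxxxxxxxxxxxxxxxxx"
        APIurl = "http://xxxxxxxxxxxxxxxxxxxx" + order + ".html"
        res = urllib.request.urlopen(APIurl).read().decode("utf-8")
        # The response is split on newlines; drop blank lines and stray whitespace
        IPs = [line.strip() for line in res.split("\n") if line.strip()]
        # random.choice returns a single item (random.choices returns a list,
        # which cannot be concatenated to a string)
        proxyip = random.choice(IPs)
        return 'http://' + proxyip

    def process_request(self, request, spider):
        ip = self.get_random_ip()
        self.logger.info("Current IP:Port is %s", ip)
        # Scrapy's built-in proxy support reads the proxy from request.meta['proxy']
        request.meta['proxy'] = ip

    def process_response(self, request, response, spider):
        return response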
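The configuration in settings is as follows:
DOWNLOADER_MIDDLEWARES = {
    "example.middlewares.RotateUserAgentMiddleware": 100,
    # The entry below is our own ProxyMiddleware. A smaller number means a higher
    # priority: its process_request runs earlier (process_response runs in reverse order)
    "example.middlewares.ProxyMiddleware": 110,
    "example.middlewares.ExampleDownloaderMiddleware": 543,
    # None disables these built-in middlewares
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": None,
    "scrapy.downloadermiddlewares.redirect.RedirectMiddleware": None,
}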
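How exactly to use the proxy IPs also depends on how often you call the API. Most APIs limit the number of IPs that can be extracted per minute, so this calls for more specific programming, tailored to your own crawling needs and budget.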
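
One possible way to stay under a per-minute limit is to cache the IPs from the last API call and only call the API again after a minimum interval has passed. The sketch below is only an illustration under assumptions: the class name CachedProxyPool, the fetch callable, and the 60-second default are all hypothetical and not part of Scrapy or of any particular proxy provider's API.

```python
import random
import time


class CachedProxyPool:
    """Hypothetical helper: caches proxies and refreshes at most once per interval."""

    def __init__(self, fetch, min_interval=60.0, clock=time.monotonic):
        # fetch: a callable returning a list of "ip:port" strings (assumed format)
        self._fetch = fetch
        self._min_interval = min_interval  # seconds between API calls (assumed limit)
        self._clock = clock
        self._proxies = []
        self._last_fetch = None

    def _refresh_if_due(self):
        now = self._clock()
        due = self._last_fetch is None or now - self._last_fetch >= self._min_interval
        if due:
            # Drop blank entries and stray whitespace such as "\r"
            fresh = [p.strip() for p in self._fetch() if p.strip()]
            self._last_fetch = now
            if fresh:
                # Keep the old list if the API returned nothing usable
                self._proxies = fresh

    def get(self):
        """Return a random cached proxy URL, or None if no proxy is available."""
        self._refresh_if_due()
        if not self._proxies:
            return None
        return "http://" + random.choice(self._proxies)
```

In a middleware like the one above, process_request could call pool.get() instead of hitting the API on every request, and set request.meta['proxy'] only when the result is not None. Whether one refresh per minute is enough depends on how long the provider's IPs stay valid.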