Python 3 Crawler Notes 5: Proxy IPs, Time Settings, and Exception Handling

May 18, 2019. Source: X_xxieRiemann

Besides spoofing a browser as described in Note 2, anti-crawler mechanisms can also be handled with proxy IPs and time settings.

I. Proxy IPs

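Use case: the site restricts access by IP address. Proxies also help when "clicking too frequently" makes the site demand a CAPTCHA at login.
The best approach here is to maintain a pool of proxy IPs. Many free proxy IPs are available online, but their quality is uneven, so they need to be filtered to find the ones that work. For the "clicking too frequently" case, we can also limit how often the crawler visits the site to avoid being banned.
Here are a few sites that offer free proxy IPs:
(1) http://www.xicidaili.com/
(2) http://haoip.cc/tiqu.htm
With requests: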

import requests
# format: IP address:port
proxies = {'http': 'http://XX.XX.XX.XX:XXXX'}
# or, without the scheme:
# proxies = {'http': 'XX.XX.XX.XX:XXXX'}
response = requests.get(url=url, proxies=proxies)

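Let's try this in practice. Using the IP lookup tool from Chinaz (站长之家), we can check whether the IP the site sees has actually changed.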

import requests
# a site that reports the IP address of the current visitor
url = 'http://ip.chinaz.com/'
# store the proxy IP in the proxies dict
proxies = {'http': '218.86.128.100:8118'}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.103 Safari/537.36', 'Connection': 'keep-alive'}
response = requests.get(url=url, proxies=proxies, headers=headers)
response.encoding = 'utf-8'
html = response.text
print(html)
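
One detail worth noting: requests chooses a proxy by the scheme of the target URL, so a dict with only an 'http' key leaves https:// requests unproxied. The helper below is a minimal sketch, not part of the original notes; the name make_proxies is made up for illustration, and it assumes the proxy can also relay HTTPS traffic.

```python
def make_proxies(addr):
    """Build a requests-style proxies dict from 'host:port'.

    Assumption: the same proxy can handle both http:// and https:// targets.
    """
    # Add the scheme only if the caller did not supply one.
    proxy_url = addr if '://' in addr else 'http://' + addr
    return {'http': proxy_url, 'https': proxy_url}
```

It would be used as requests.get(url, proxies=make_proxies('XX.XX.XX.XX:XXXX')).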

With urllib:

import urllib.request
# format: IP address:port
proxies = {'http': 'http://XX.XX.XX.XX:XXXX'}
# or, without the scheme:
# proxies = {'http': 'XX.XX.XX.XX:XXXX'}
proxy_support = urllib.request.ProxyHandler(proxies)
opener = urllib.request.build_opener(proxy_support)
# install the opener; every later call to urlopen() will go through it
urllib.request.install_opener(opener)
response = urllib.request.urlopen(url)

import urllib.request
url = 'http://ip.chinaz.com/'
proxies = {'http': '218.86.128.100:8118'}
proxy_support = urllib.request.ProxyHandler(proxies)
# create the opener
opener = urllib.request.build_opener(proxy_support)
# install the opener; every later call to urlopen() will go through it
urllib.request.install_opener(opener)
response = urllib.request.urlopen(url)
html = response.read().decode('utf-8')
print(html)

As we can see, in the returned HTML our IP has changed.
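
The filtering step mentioned at the start of this section (keeping only the free proxies that work) could be sketched with urllib roughly as follows. This is an illustration under assumptions, not the author's code: proxy_works and filter_proxies are made-up names, and "works" here simply means one test request through the proxy returns status 200 within the timeout.

```python
import http.client
import urllib.request


def proxy_works(addr, test_url='http://ip.chinaz.com/', timeout=5):
    """Return True if an HTTP request through the proxy at addr succeeds."""
    handler = urllib.request.ProxyHandler({'http': 'http://' + addr})
    # opener.open() uses this opener only, so install_opener() is not needed.
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(test_url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Covers URLError, HTTPError, timeouts and malformed proxy replies.
        return False


def filter_proxies(candidates, **kwargs):
    """Keep only the candidates that pass proxy_works()."""
    return [addr for addr in candidates if proxy_works(addr, **kwargs)]
```

Because each candidate is tested through its own opener, the global opener installed above is left untouched.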


If we can switch randomly among several IPs, the crawler becomes more robust. Here is a brief look at random IP switching.

import random
# put the IPs collected from the proxy sites into iplist, each in "IP address:port" format
iplist = ['XXX.XXX.XXX.XXX:XXXX', 'XXX.XXX.XXX.XXX:XXXX']
proxies = {'http': random.choice(iplist)}
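
Random switching pays off most when a failed proxy is skipped on the next attempt. The sketch below shows one way to do that. pick_proxy and fetch_with_rotation are hypothetical names, and fetch is assumed to be any callable taking (url, proxies) that raises an OSError subclass on failure; requests.exceptions.RequestException and urllib.error.URLError both qualify.

```python
import random


def pick_proxy(iplist, failed=()):
    """Pick a random 'host:port' entry that has not failed yet."""
    candidates = [ip for ip in iplist if ip not in failed]
    if not candidates:
        raise ValueError('no usable proxy left')
    return random.choice(candidates)


def fetch_with_rotation(url, iplist, fetch, max_tries=3):
    """Try up to max_tries different proxies and return the first success."""
    failed = set()
    last_error = None
    # Never try more times than there are distinct proxies.
    for _ in range(min(max_tries, len(set(iplist)))):
        addr = pick_proxy(iplist, failed)
        try:
            return fetch(url, {'http': 'http://' + addr})
        except OSError as e:
            failed.add(addr)  # do not pick this proxy again
            last_error = e
    raise RuntimeError('all attempted proxies failed') from last_error
```

With requests, fetch could be as simple as lambda url, proxies: requests.get(url, proxies=proxies, timeout=10).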

II. Time settings

Use case: the site limits request frequency.
With both requests and urllib, we can use the sleep() function from the time library:

import time
time.sleep(1)  # unit: seconds
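
A fixed one-second pause produces a very regular request pattern. A common variation is to add a small random jitter to each pause. The sketch below is only an illustration: polite_sleep and crawl are made-up names, the default values are arbitrary, and fetch stands for whatever function downloads a single URL.

```python
import random
import time


def polite_sleep(base=1.0, jitter=0.5):
    """Sleep for base plus a random extra of up to jitter seconds."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay


def crawl(urls, fetch, base=1.0, jitter=0.5):
    """Fetch each URL in turn, pausing between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:  # no need to wait before the first request
            polite_sleep(base, jitter)
        results.append(fetch(url))
    return results
```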

III. Timeout

The timeout parameter sets how long to wait before giving up on a request, which limits the damage done by sites that respond very slowly.

import requests
response = requests.get(url, timeout=10)

import urllib.request
response = urllib.request.urlopen(url, timeout=10)
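
Setting a timeout means the call can now raise when the limit is reached, so it is usually paired with a handler. With urllib, a timeout while connecting arrives wrapped in URLError, while a timeout while waiting for or reading the response is raised as socket.timeout directly. The sketch below handles both. fetch_or_none is a hypothetical helper name, and returning None on timeout is just one possible policy.

```python
import socket
import urllib.error
import urllib.request


def fetch_or_none(url, timeout=10):
    """Return the response body, or None if the request timed out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except socket.timeout:
        # Raised directly when waiting for or reading the response takes too long.
        return None
    except urllib.error.URLError as e:
        # A timeout while connecting arrives wrapped in URLError.
        if isinstance(e.reason, socket.timeout):
            return None
        raise  # any other failure is not a timeout, so let it propagate
```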

IV. Exception handling

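According to the official requests documentation, a network problem (DNS failure, refused connection, and so on) raises ConnectionError, Response.raise_for_status() raises HTTPError for an unsuccessful status code, a request that times out raises Timeout, and exceeding the maximum number of redirects raises TooManyRedirects. All exceptions that requests explicitly raises inherit from requests.exceptions.RequestException.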

A simple exception handler:

from requests.exceptions import RequestException
try:
    XXXXX  # placeholder for the request code
except RequestException as e:
    print('Crawler error, reason:', e)
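
For comparison, the same pattern with urllib might look like the sketch below. urllib.error.HTTPError is a subclass of urllib.error.URLError, so it has to be caught first. safe_get and its (body, error) return convention are assumptions made for this illustration, not something from the original notes.

```python
import urllib.error
import urllib.request


def safe_get(url, timeout=10):
    """Return (body, error); exactly one of the two is None."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read(), None
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status such as 404 or 503.
        return None, 'HTTP error %d' % e.code
    except urllib.error.URLError as e:
        # No usable answer at all: DNS failure, refused connection, and so on.
        return None, 'connection problem: %s' % e.reason
    except OSError as e:
        # For example, a timeout while reading the response.
        return None, 'I/O error: %s' % e
```

Returning the error instead of printing it lets the caller decide whether to retry, switch proxy, or skip the page.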

Reference:
https://github.com/lining0806/PythonSpiderNotes

    Original author: X_xxieRiemann
    Original URL: https://www.jianshu.com/p/0fd9d52d4347