Python Web Scraping: Using urllib

May 20, 2019. Source: LAworker

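1. Using urllib

urllib is Python's most basic HTTP request library. It ships with the standard library, so it needs no extra installation, and it contains four modules.

–request is the basic HTTP request module. It simulates sending a request, much like typing a URL into a browser and pressing Enter: pass the URL and any extra parameters to the library function and it carries out the whole process.

–error is the exception-handling module. If a request fails, we can catch the exception and then retry or take some other action so that the program does not terminate unexpectedly.

–parse is a utility module that provides many URL-handling functions, such as splitting, parsing and joining.

–robotparser reads a site's robots.txt file and determines which pages may be crawled and which may not. In practice it is used less often.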
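
As a rough illustration of the least familiar of the four modules, the sketch below feeds robotparser a hand-written set of rules rather than downloading a real robots.txt. The rules and the example.com URLs are made up for the demonstration.

```python
from urllib import robotparser

# Hypothetical robots.txt content; a real crawler would fetch the file
# with set_url() and read() instead of supplying the lines by hand.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = robotparser.RobotFileParser()
parser.parse(rules)

allowed = parser.can_fetch("*", "http://example.com/index.html")
blocked = parser.can_fetch("*", "http://example.com/private/data.html")
print(allowed, blocked)
```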

2. Sending requests with request

2.1 urlopen

The urllib.request module provides the most basic way to construct HTTP requests. It can simulate the request a browser would make, and it also handles authentication, redirections, cookies and more.

Example:

import urllib.request
response = urllib.request.urlopen('http://www.python.org')
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))

Output:

200
[('Server', 'nginx'), ('Content-Type', 'text/html; charset=utf-8'), ('X-Frame-Options', 'DENY'), ('Via', '1.1 vegur'), ('Via', '1.1 varnish'), ('Content-Length', '49086'), ('Accept-Ranges', 'bytes'), ('Date', 'Mon, 13 May 2019 01:27:18 GMT'), ('Via', '1.1 varnish'), ('Age', '218'), ('Connection', 'close'), ('X-Served-By', 'cache-iad2127-IAD, cache-hnd18721-HND'), ('X-Cache', 'HIT, HIT'), ('X-Cache-Hits', '1, 321'), ('X-Timer', 'S1557710838.119871,VS0,VE0'), ('Vary', 'Cookie'), ('Strict-Transport-Security', 'max-age=63072000; includeSubDomains')]
nginx

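The urlopen() function: urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

This is the signature as of Python 3.7, the version used here. The cafile, capath and cadefault parameters have been deprecated since Python 3.6 in favor of context and were removed in Python 3.13.

The data parameter is optional. If supplied, it must be a byte stream, that is, a bytes object, which the bytes() function can produce. Passing data also changes the request method from GET to POST.

Example: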

import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({'world': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())

Output:

b'{\n  "args": {}, \n  "data": "", \n  "files": {}, \n  "form": {\n    "world": "hello"\n  }, \n  "headers": {\n    "Accept-Encoding": "identity", \n    "Content-Length": "11", \n    "Content-Type": "application/x-www-form-urlencoded", \n    "Host": "httpbin.org", \n    "User-Agent": "Python-urllib/3.7"\n  }, \n  "json": null, \n  "origin": "61.153.150.104, 61.153.150.104", \n  "url": "https://httpbin.org/post"\n}\n'
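
The switch from GET to POST can be checked without touching the network. The sketch below builds two Request objects (the class is covered in section 2.2) and asks each which method it would use; nothing is actually sent.

```python
import urllib.parse
import urllib.request

url = 'http://httpbin.org/post'
# str.encode() is an equivalent alternative to bytes(..., encoding='utf8').
data = urllib.parse.urlencode({'world': 'hello'}).encode('utf8')

without_data = urllib.request.Request(url)
with_data = urllib.request.Request(url, data=data)

print(data)
print(without_data.get_method())
print(with_data.get_method())
```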


2.2 Request

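As shown above, urlopen() can send a basic request, but its few simple parameters are not enough to build a complete one. When the request needs extra information such as headers, we can build it with the more powerful Request class.

The Request constructor: class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

Example with several arguments: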

from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0(compatible;MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
dic = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(dic), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

Output:

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "name": "Germey"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Content-Length": "11", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/4.0(compatible;MSIE 5.5; Windows NT)"
  }, 
  "json": null, 
  "origin": "61.153.150.104, 61.153.150.104", 
  "url": "https://httpbin.org/post"
}
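
Headers do not have to be passed to the constructor. A minimal sketch, assuming the same httpbin.org endpoint as above: add_header() attaches a header after the Request has been created, and the object can be inspected before it is sent. No request is made here.

```python
from urllib import parse, request

data = parse.urlencode({'name': 'Germey'}).encode('utf8')
req = request.Request('http://httpbin.org/post', data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0(compatible;MSIE 5.5; Windows NT)')

# Request stores header names capitalized, e.g. 'User-agent'.
print(req.get_method())
print(req.full_url)
print(req.get_header('User-agent'))
```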


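3. Handling exceptions

3.1 URLError

The URLError class comes from urllib's error module. It inherits from OSError and is the base class of the error module, so any exception raised by the request module can be handled by catching it.

It has one attribute, reason, which holds the cause of the error.

Example: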

from urllib import request, error
try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.URLError as e:
    print(e.reason)

Instead of crashing with a traceback, the program prints the reason for the failure. Handling the exception this way keeps the program from terminating unexpectedly.
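
The inheritance chain described above can be verified directly. This sketch constructs a URLError by hand, with a made-up reason string, instead of triggering a real network failure.

```python
from urllib import error

exc = error.URLError('connection refused')  # made-up reason

print(exc.reason)
print(isinstance(exc, OSError))
print(issubclass(error.HTTPError, error.URLError))
```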

3.2 HTTPError

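HTTPError is a subclass of URLError dedicated to HTTP request errors, such as a failed authentication request.

It has three attributes:

–code: the HTTP status code, for example 404 (page not found) or 500 (internal server error).

–reason: the cause of the error, as in the parent class.

–headers: the response headers returned by the server.

Example: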

from urllib import request, error
try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers)

Output:

Not Found 404 Server: nginx/1.10.3 (Ubuntu)
Date: Mon, 13 May 2019 02:47:07 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: close
Vary: Cookie
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Cache-Control: no-cache, must-revalidate, max-age=0
Link: <https://cuiqingcai.com/wp-json/>; rel="https://api.w.org/"


Because URLError is the parent class of HTTPError, we can catch the subclass first and then the parent class. A better way to write the code above is therefore:

from urllib import request, error

try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')
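
The same subclass-first ordering applies when the handling logic lives in a helper. The function describe_error() below is a hypothetical name, not part of urllib, and the exceptions are built by hand so that the sketch runs offline.

```python
import io
from urllib import error


def describe_error(exc):
    """Return a short label for an exception raised by urlopen()."""
    if isinstance(exc, error.HTTPError):   # check the subclass first
        return 'HTTP {}: {}'.format(exc.code, exc.reason)
    if isinstance(exc, error.URLError):    # then the parent class
        return 'URL error: {}'.format(exc.reason)
    return 'unexpected error'


# Hand-built exceptions standing in for real failures.
not_found = error.HTTPError('http://example.com/missing', 404,
                            'Not Found', {}, io.BytesIO())
unreachable = error.URLError('no route to host')

print(describe_error(not_found))
print(describe_error(unreachable))
```

Reversing the two isinstance checks would label every HTTPError as a plain URL error, which is the same reason the except clauses above are ordered the way they are.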

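4. Parsing URLs

urllib also provides the parse module, which defines a standard interface for working with URLs, such as extracting their components, joining them and converting links.

4.1 urlparse()

Usage: urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

It takes three parameters.

urlstring is required; it is the URL to parse.

scheme is the default scheme (http, https and so on). It is used only when the URL itself carries no scheme.

Example: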

from urllib.parse import urlparse

result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)
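
One detail of this example is easy to miss: without a leading scheme and //, urlparse() does not treat www.baidu.com as a host. The sketch below shows where each piece ends up and that the scheme argument only acts as a fallback.

```python
from urllib.parse import urlparse

no_scheme = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
# The default scheme is applied, but the host lands in path, not netloc.
print(no_scheme.scheme, repr(no_scheme.netloc), no_scheme.path)

with_scheme = urlparse('http://www.baidu.com/index.html;user?id=5#comment', scheme='https')
# A scheme present in the URL takes precedence over the default.
print(with_scheme.scheme, with_scheme.netloc, with_scheme.path)

# ParseResult is a named tuple: index and attribute access both work.
print(with_scheme[1] == with_scheme.netloc)
```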


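allow_fragments controls whether the fragment is parsed separately. If it is set to False, the fragment is not split out: it is treated as part of the path, params or query, and the fragment field is left empty.

Example: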

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False)
print(result)

Output:

ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5#comment', fragment='')
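
When the URL has no params or query, the ignored fragment becomes part of the path instead. A small sketch of that case:

```python
from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
print(result.path)
print(repr(result.fragment))
```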

4.2 urlunparse()

urlparse() has a counterpart, urlunparse().

It accepts an iterable whose length must be exactly 6; otherwise it raises an error about too few or too many values.

Example:

from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))
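
Because urlunparse() is the inverse of urlparse(), feeding it a parsed result should reproduce the original string. The sketch below checks that round trip and also builds a URL from a six-item tuple; the /s path and wd=python query are arbitrary.

```python
from urllib.parse import urlparse, urlunparse

url = 'http://www.baidu.com/index.html;user?id=5#comment'
rebuilt = urlunparse(urlparse(url))
print(rebuilt == url)

# Empty strings stand in for the parts that are not needed.
simple = urlunparse(('https', 'www.baidu.com', '/s', '', 'wd=python', ''))
print(simple)
```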


4.3 urlsplit()

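urlsplit() is very similar to urlparse(), except that it does not parse the params part separately (params stay in the path), so it returns only five components.

Example: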

from urllib.parse import urlsplit

res = urlsplit('http://www.baidu.com/index.html;user?id=5#comment')
print(res)
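
To make the difference from urlparse() concrete, this sketch parses the same URL with both functions and compares the results.

```python
from urllib.parse import urlparse, urlsplit

url = 'http://www.baidu.com/index.html;user?id=5#comment'
split = urlsplit(url)
parsed = urlparse(url)

print(len(split), len(parsed))
print(split.path)                  # params stay attached to the path
print(parsed.path, parsed.params)  # urlparse() separates them
```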


4.4 urlunsplit()

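Like urlunparse(), urlunsplit() assembles the parts of a URL into a complete link. It also takes an iterable such as a list or tuple; the only difference is that its length must be 5.

Example: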

from urllib.parse import urlunsplit

data = ['http', 'www.baidu.com', 'index.html', 'a=6', 'comment']
print(urlunsplit(data))
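
A common use of urlsplit() together with urlunsplit() is changing one part of an existing URL. The sketch below swaps the query string using _replace(), the standard named-tuple method; the new query value is arbitrary.

```python
from urllib.parse import urlsplit, urlunsplit

parts = urlsplit('http://www.baidu.com/index.html?id=5#comment')
updated = urlunsplit(parts._replace(query='id=6'))
print(updated)
```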


4.5 urljoin()

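urlunparse() and urlunsplit() can assemble a link, but only from an object of a fixed length in which every part of the URL is clearly separated.

urljoin() is another way to build a link. We pass a base_url (the base link) as the first argument and the new link as the second; the function uses the scheme, netloc and path of base_url to fill in whatever the new link is missing and returns the result.

Example: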

from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'FAQ.html'))
print(urljoin('http://www.baidu.com', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html?question=2'))
print(urljoin('http://www.baidu.com?wd=abc', 'https://cuiqingcai.com/index.php'))
print(urljoin('http://www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com#comment', '?category=2'))
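
In a crawler, urljoin() is mostly used to turn the relative links found on a page into absolute ones. A sketch, assuming a made-up page URL and a handful of typical href values:

```python
from urllib.parse import urljoin

page_url = 'http://www.baidu.com/about/index.html'  # hypothetical page
hrefs = ['FAQ.html', '/FAQ.html', '../contact.html',
         '//cdn.example.com/app.js', 'https://cuiqingcai.com/FAQ.html']

# Resolve every href against the URL of the page it was found on.
absolute = [urljoin(page_url, href) for href in hrefs]
for link in absolute:
    print(link)
```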


4.6 urlencode()

urlencode() is very useful for building GET request parameters.

Example:

from urllib.parse import urlencode

params = {
    'name': 'germey',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)

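We first declare a dictionary holding the parameters, then call urlencode() to serialize it into standard GET request parameters.

This function is used all the time: it is often more convenient to keep parameters in a dictionary and call urlencode() only when they need to become part of a URL.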
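
urlencode() also percent-encodes values and can expand lists. A brief sketch; the parameter names are arbitrary:

```python
from urllib.parse import urlencode

print(urlencode({'name': 'germey', 'age': 22}))
print(urlencode({'wd': '壁纸'}))                    # non-ASCII text is encoded as UTF-8
print(urlencode({'q': 'a b'}))                      # spaces become '+'
print(urlencode({'tag': ['a', 'b']}, doseq=True))   # one pair per list item
```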

4.7 parse_qs()

Where there is serialization there is deserialization: given a string of GET request parameters, parse_qs() converts it back into a dictionary.

Example:

from urllib.parse import parse_qs

query = 'name=germey&age=22'
print(parse_qs(query))
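
Note that parse_qs() maps every name to a list, because a name may appear more than once in a query string. A sketch with a repeated key (the tag parameter is made up):

```python
from urllib.parse import parse_qs

parsed = parse_qs('name=germey&age=22&tag=a&tag=b')
print(parsed)
print(parsed['name'][0])  # single values still come back inside a list
```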


4.8 parse_qsl()

parse_qsl() converts the parameters into a list of tuples instead.

Example:

from urllib.parse import parse_qsl

query = 'name=germey&age=22'
print(parse_qsl(query))

The result is a list in which every element is a tuple: the first item is the parameter name and the second is its value.
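
Since the pairs keep their order, the list can be passed to dict() for a flat mapping or back to urlencode() to rebuild the query. A sketch:

```python
from urllib.parse import parse_qsl, urlencode

query = 'name=germey&age=22'
pairs = parse_qsl(query)

flat = dict(pairs)          # values are plain strings, not lists
rebuilt = urlencode(pairs)  # urlencode() also accepts a sequence of pairs
print(flat)
print(rebuilt == query)
```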

4.9 quote()

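quote() converts text into URL-encoded (percent-encoded) form. A URL that contains Chinese characters in its parameters can end up garbled, so we use this function to encode the text first.

Example: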

from urllib.parse import quote

keyword = '壁纸'
url = 'https://www.baidu.com/s?wd=' + quote(keyword)
print(url)
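
By default quote() leaves '/' untouched, which matters when the text is a query value rather than a path. The sketch below contrasts the default with safe='' and with quote_plus(); the sample strings are arbitrary.

```python
from urllib.parse import quote, quote_plus

print(quote('壁纸'))
print(quote('a b/c'))            # '/' is kept by default
print(quote('a b/c', safe=''))   # encode '/' as well
print(quote_plus('a b/c'))       # space becomes '+', '/' is encoded
```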


4.10 unquote()

The counterpart of quote() is unquote(), which decodes URL-encoded text.

Example:

from urllib.parse import unquote

url = 'https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8'
print(unquote(url))
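
unquote() leaves '+' alone; unquote_plus() is the variant that also turns '+' back into a space, matching what urlencode() produces. A sketch with arbitrary sample strings:

```python
from urllib.parse import unquote, unquote_plus

print(unquote('%E5%A3%81%E7%BA%B8'))
print(unquote('a+b%20c'))        # '+' is left as-is
print(unquote_plus('a+b%20c'))   # '+' is decoded to a space
```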


    Original author: LAworker
    Original URL: https://www.cnblogs.com/lal666/p/10855552.html