Beautiful Soup4学习笔记(一):安装

该系列是按照Beautiful Soup教程抄袭,原文链接:
http://beautifulsoup.readthedocs.io/zh_CN/latest/

工欲善其事,必先利其器。下面我们安装 beautifulsoup4:

#pip install   beautifulsoup4 (Centos系统)
Collecting beautifulsoup4
  Downloading beautifulsoup4-4.5.3-py3-none-any.whl (85kB)
    100% |████████████████████████████████| 92kB 669kB/s 
Installing collected packages: beautifulsoup4
Successfully installed beautifulsoup4-4.5.3

安装解析器:
Beautiful Soup支持Python标准库中的HTML解析器,还支持一些第三方的解析器,其中一个是 lxml .根据操作系统不同,可以选择下列方法来安装lxml:

# pip install lxml
Collecting lxml
  Downloading lxml-3.7.3-cp35-cp35m-manylinux1_x86_64.whl (7.1MB)
    100% |████████████████████████████████| 7.1MB 83kB/s 
Installing collected packages: lxml
Successfully installed lxml-3.7.3

安装完成之后,如何使用:
将一段文档传入BeautifulSoup 的构造方法,就能得到一个文档的对象,可以传入一段字符串或一个文件句柄。

from bs4 import BeautifulSoup

soup = BeautifulSoup(open("index.html"))

soup = BeautifulSoup("<html>data</html>")

首先,文档被转换成Unicode,并且HTML的实例都被转换成Unicode编码

BeautifulSoup("Sacr&eacute; bleu!")
<html><head></head><body>Sacré bleu!</body></html>

然后,Beautiful Soup选择最合适的解析器来解析这段文档,如果手动指定解析器那么Beautiful Soup会选择指定的解析器来解析文档.

首先是一段HTML代码的字符串:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

使用BeautifulSoup解析这段代码,能够得到一个 BeautifulSoup 的对象,并能按照标准的缩进格式的结构输出:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

几个浏览结构化数据的方法:

>>> soup.title
<title>The Dormouse's story</title>
>>> soup.title.name
'title'
>>> soup.title.string
"The Dormouse's story"
>>> soup.title.parent.name
'head'
>>> soup.p
<p class="title"><b>The Dormouse's story</b></p>
>>> soup.p['class']
['title']
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>> soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/l
acie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>> soup.find(id="link2")
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

从文档中找到所有<a>标签的链接:

>>> for link in soup.find_all('a'):
...     print(link.get('href')) 
...

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie

从文档中获得所有文字:

>>> print(soup.get_text())

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
    原文作者:海贼之路飞
    原文地址: https://www.jianshu.com/p/18abb6f53371
    本文转自网络文章,转载此文章仅为分享知识,如有侵权,请联系博主进行删除。
点赞