python-3.x - How do I build a basic web crawler for Wikipedia pages to collect links?

February 2, 2023, 249 views

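I have been watching Bucky Roberts' Python videos for beginners, and I am trying to build a basic web crawler for Wikipedia pages using code similar to what is shown in the videos.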

import requests
from bs4 import BeautifulSoup

def main_page_spider(max_pages):
    page_list = {1: "Contents",
                 2: "Overview",
                 3: "Outlines",
                 4: "Lists",
                 5: "Portals",
                 6: "Glossaries",
                 7: "Categories",
                 8: "Indices",
                 9: "Reference",
                 10: "Culture",
                 11: "Geography",
                 12: "Health",
                 13: "History",
                 14: "Mathematics",
                 15: "Nature",
                 16: "People",
                 17: "Philosophy",
                 18: "Religion",
                 19: "Society",
                 20: "Technology"}
    for page in range(1, max_pages + 1):
        if page == 1:
            url = "https://en.wikipedia.org/wiki/Portal:Contents"
        else:
            url = "https://en.wikipedia.org/wiki/Portal:Contents/" + str(page_list[page])
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        divs = soup.find('div', {'class': "mw-body-content", 'id': "bodyContent"})

        for link in divs.findAll('a'):
            href = "https://en.wikipedia.org" + str(link.get("href"))
            get_link_data(href)
            print(href)

def get_link_data(link_url):
    source_code = requests.get(link_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    divs = soup.find('div',{'class': "mw-body-content", 'id': "bodyContent"})
    for link in divs.findAll('a'):
        link_href_data = link.get("href")
        print(link_href_data)

main_page_spider(3)

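The problem is that when I comment out the call to get_link_data(), the program works fine and I get all the links from the number of pages I specified.
But when I uncomment it, the program collects a few links and then gives me these errors: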

socket.gaierror, urllib3.exceptions.NewConnectionError, urllib3.exceptions.MaxRetryError, requests.exceptions.ConnectionError

How do I fix this?

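Best answer

Any time you scrape, you should introduce a delay so that you do not overwhelm the site's resources, or your own. As you noted, running the script with the get_link_data line commented out produces 2763 lines of output. That is 2763 URLs that you would then be fetching as fast as possible. This will usually trigger errors, either because the site throttles you or because your own network or DNS server chokes.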

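Add a delay before each call to get_link_data; I recommend at least one second. This will take a while, but remember: you are collecting data from a freely available resource. Don't abuse it.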
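One way to apply such a delay is a small throttle that sleeps only for whatever is left of the minimum interval. This is a minimal sketch, not part of the original script: the `Throttle` class and its `min_interval` parameter are hypothetical names introduced here for illustration.

```python
import time


class Throttle:
    """Make sure at least `min_interval` seconds pass between calls to wait()."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last = None  # time of the previous wait() call, if any

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Sleep only for the part of the interval not yet elapsed.
                time.sleep(remaining)
        self._last = time.monotonic()
```

With this in place, the crawler would create one `Throttle(1.0)` up front and call its `wait()` method immediately before every `get_link_data(href)` call. A plain `time.sleep(1)` achieves much the same thing; the throttle just avoids waiting longer than necessary when the previous request was itself slow.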

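You should also be more selective about which links you follow. Of the 2763 URLs in the output, only 2291 are unique, which means you would scrape nearly 500 pages twice. Keep track of the URLs you have already processed and do not request them again.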
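Tracking processed URLs can be done with a set. The sketch below assumes a hypothetical helper named `filter_unseen`; it is not part of the original script, and the URLs in the usage note are only examples.

```python
def filter_unseen(urls, seen):
    """Yield only the URLs that are not in `seen`, recording each one as it is yielded."""
    for url in urls:
        if url in seen:
            continue  # already processed, so skip the duplicate request
        seen.add(url)
        yield url
```

The `seen` set would live outside the loops, so that it is shared across all portal pages. For example, `list(filter_unseen(["/wiki/A", "/wiki/B", "/wiki/A"], set()))` keeps only the first occurrence of `/wiki/A`.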

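You can refine this further: about 100 of the URLs contain a fragment (the part after the #). When scraping like this, fragments should generally be ignored, since they usually only tell the browser where on the page to focus. If you strip the # and everything after it from each URL, you are left with 2189 unique pages.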
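Removing fragments does not need manual string handling, because the standard library's `urllib.parse.urldefrag` does it. A minimal sketch follows; the wrapper name `strip_fragment` is a hypothetical helper chosen here.

```python
from urllib.parse import urldefrag


def strip_fragment(url):
    """Return `url` without its #fragment; URLs without a fragment come back unchanged."""
    return urldefrag(url).url
```

Applying this before checking whether a URL was already processed makes links such as `.../Portal:Contents#Overview` and `.../Portal:Contents` count as the same page.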

Some of the links you build are also malformed. They look like this:

https://en.wikipedia.org//en.wikipedia.org/w/index.php?title=Portal:Contents/Outlines/Society_and_social_sciences&action=edit

You will probably want to fix these, and possibly skip "edit" links altogether.
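The malformed URL above arises because the href already starts with `//en.wikipedia.org/...` (a protocol-relative link) and the script prepends the host a second time. One possible fix, sketched here under the assumption that links are resolved against the portal page they were found on, is to let `urllib.parse.urljoin` build the absolute URL and to drop links whose query string contains `action=edit`. The names `BASE_URL` and `resolve_link` are hypothetical.

```python
from urllib.parse import parse_qs, urljoin, urlparse

# Assumed base: the page on which the links were found.
BASE_URL = "https://en.wikipedia.org/wiki/Portal:Contents"


def resolve_link(href, base_url=BASE_URL):
    """Return an absolute http(s) URL for `href`, or None if it should be skipped."""
    if not href:
        return None  # <a> tags without an href attribute
    absolute = urljoin(base_url, href)  # handles "/wiki/...", "//host/..." and full URLs
    parsed = urlparse(absolute)
    if parsed.scheme not in ("http", "https"):
        return None  # e.g. mailto: links
    if parse_qs(parsed.query).get("action") == ["edit"]:
        return None  # skip "edit" links
    return absolute
```

In the crawler loop, this would take the place of the string concatenation: call `resolve_link(link.get("href"))` and continue to the next link when the result is `None`.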

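Finally, even if you do all of these things, you will probably still run into some exceptions. The internet is a messy place :) So you will want to include error handling. Something along these lines: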

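import time  # needed for time.sleep; put this with the other imports at the top of the script

for link in divs.findAll('a'):
    href = "https://en.wikipedia.org" + str(link.get("href"))
    time.sleep(1)  # pause between requests so the site is not hammered
    try:
        get_link_data(href)
    except Exception as e:
        # report the failure and carry on with the next link
        print("Failed to get url {}\nError: {}".format(href, e.__class__.__name__))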
