Navigating with Selenium and scraping with BeautifulSoup in Python

OK, this is what I'm trying to achieve:

> Call a URL with a dynamically filtered list of search results
> Click the first search result (5 per page)
> Scrape the heading, paragraphs and images and store them as a JSON object in a separate file, e.g.

{
  "title": "heading element of the individual entry",
  "content": "paragraphs and images in DOM order within the individual entry"
}

> Navigate back to the search results overview page and repeat steps 2 – 3
> After 5/5 results, grab the next page (click the pagination link)
> Repeat steps 2 – 5 until no entries are left
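Step 3 above (one JSON file per entry) could be sketched like this. The filename sanitizer is my own addition, not part of the original post — raw titles may contain characters such as `/` that are invalid in filenames:

```python
import json
import re
import tempfile
from pathlib import Path

def save_entry(title, content, out_dir):
    """Write one scraped entry as its own JSON file (step 3 above)."""
    # Hypothetical sanitizer: strip characters that are invalid in filenames
    safe_name = re.sub(r"[^\w\- ]", "_", title).strip() or "untitled"
    path = Path(out_dir) / f"{safe_name}.json"
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"title": title, "content": content}, f, ensure_ascii=False)
    return path

# Round-trip one fake entry through a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    p = save_entry("A/B testing: results?", ["First paragraph.", "images/fig1.png"], tmp)
    entry = json.loads(p.read_text(encoding="utf-8"))
```

The original title is kept intact inside the JSON; only the filename is sanitized.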

What I have so far:

#import libraries
from selenium import webdriver
from bs4 import BeautifulSoup

#URL
url = "https://URL.com"

#Create a browser session
driver = webdriver.Chrome("PATH TO chromedriver.exe")
driver.implicitly_wait(30)
driver.get(url)

#click consent btn on destination URL ( overlays rest of the content )
python_consentButton = driver.find_element_by_id('acceptAllCookies')
python_consentButton.click() #click cookie consent btn

#Seleium hands the page source to Beautiful Soup
soup_results_overview = BeautifulSoup(driver.page_source, 'lxml')


for link in soup_results_overview.findAll("a", class_="searchResults__detail"):

  #Selenium visits each Search Result Page
  searchResult = driver.find_element_by_class_name('searchResults__detail')
  searchResult.click() #click Search Result

  #Ask Selenium to go back to the search results overview page
  driver.back()

#Tell Selenium to click paginate "next" link
#probably needs to be in a surrounding for loop?
paginate = driver.find_element_by_class_name('pagination-link-next')
paginate.click() #click paginate next

driver.quit()
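The "surrounding for loop" from the comment above can be sketched as a while loop that stops once no "next" link is found. `get_results` and `click_next` are hypothetical stand-ins for the Selenium calls (in Selenium, `click_next` would find and click the pagination link and catch `NoSuchElementException` when it is absent); here they walk a fake list of pages so the control flow itself is runnable:

```python
pages = [["r1", "r2"], ["r3", "r4"], ["r5"]]  # fake paginated search results
page_index = 0

def get_results():
    """Stand-in for scraping the current results page."""
    return pages[page_index]

def click_next():
    """Stand-in for clicking the 'next' pagination link.

    Returns True if a next page exists, False otherwise.
    """
    global page_index
    if page_index + 1 < len(pages):
        page_index += 1
        return True
    return False

visited = []
while True:
    for result in get_results():   # steps 2-4: visit each result on this page
        visited.append(result)
    if not click_next():           # step 5: paginate; stop when no "next" link
        break
```

With real Selenium calls substituted in, the loop ends naturally on the last page instead of needing a hard-coded page count.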

Problem
Every time Selenium navigates back to the search results overview page, the list count resets,
so it clicks the first entry five times, then moves on to the next five items and stops.

This might be a textbook case for a recursive approach, but I'm not sure.

Any suggestions on how to solve this are appreciated.
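The reset happens because `driver.back()` reloads the DOM, so `find_element_by_class_name` always returns the first match again. One common fix (my suggestion, not from the original post) is to collect all result URLs from the page source first and then visit each with `driver.get()`, so nothing has to be re-clicked. The parsing half runs without a browser:

```python
from bs4 import BeautifulSoup

# Minimal stand-in for driver.page_source of one results page
html = """
<ul>
  <li><a class="searchResults__detail" href="/article/1">One</a></li>
  <li><a class="searchResults__detail" href="/article/2">Two</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
# Collect every result URL up front instead of re-clicking the first link
urls = [a["href"] for a in soup.find_all("a", class_="searchResults__detail")]

# With Selenium you would then do:
# for url in urls:
#     driver.get("https://URL.com" + url)  # visit and scrape; no driver.back() needed
```

Because the URLs are captured once per results page, going "back" is no longer required and the click counter cannot reset.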

Best answer You can scrape this without Selenium, using only requests and BeautifulSoup. It will be faster and consume fewer resources:

import json
import requests
from bs4 import BeautifulSoup

# Get 1000 results
params = {"$filter": "TemplateName eq 'Application Article'", "$orderby": "ArticleDate desc", "$top": "1000",
          "$inlinecount": "allpages", }
response = requests.get("https://www.cst.com/odata/Articles", params=params).json()

# iterate 1000 results
articles = response["value"]
for article in articles:
    article_json = {}
    article_content = []

    # title of article
    article_title = article["Title"]
    # article url
    article_url = str(article["Url"]).split("|")[1]
    print(article_title)

    # request article page and parse it
    article_page = requests.get(article_url).text
    page = BeautifulSoup(article_page, "html.parser")

    # get header
    header = page.select_one("h1.head--bordered").text
    article_json["Title"] = str(header).strip()
    # get body content with images links and descriptions
    content = page.select("section.content p, section.content img, section.content span.imageDescription, "
                          "section.content  em")
    # collect content to json format
    for x in content:
        if x.name == "img":
            article_content.append("https://cst.com/solutions/article/" + x.attrs["src"])
        else:
            article_content.append(x.text)

    article_json["Content"] = article_content

    # write to json file
    with open(f"{article_json['Title']}.json", 'w') as to_json_file:
        to_json_file.write(json.dumps(article_json))

print("the end")
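The `params` dict in the answer uses OData query options (`$filter`, `$orderby`, `$top`, `$inlinecount`). If the endpoint ever holds more than 1000 articles, OData's `$skip` option can page through them. The sketch below only builds the request URLs without hitting the network; the endpoint and filter values are taken from the answer above, while the three-page loop is illustrative:

```python
from urllib.parse import urlencode

base = "https://www.cst.com/odata/Articles"
page_size = 1000

urls = []
for offset in (0, page_size, 2 * page_size):
    params = {
        "$filter": "TemplateName eq 'Application Article'",
        "$orderby": "ArticleDate desc",
        "$top": str(page_size),
        "$skip": str(offset),          # OData paging: skip already-fetched rows
        "$inlinecount": "allpages",
    }
    urls.append(base + "?" + urlencode(params))
```

Each URL would then be fetched with `requests.get(url).json()`, stopping once `"value"` comes back empty.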