OK, here is what I am trying to achieve:
>Call a URL with a dynamically filtered list of search results
>Click the first search result (5 per page)
>Scrape the headline, paragraphs and images and store them as a JSON object in a separate file, e.g.
{
"Title": "Title element of the individual entry",
"Content": "Paragraphs and images of the individual entry in DOM order"
}
>Navigate back to the search results overview page and repeat steps 2 - 3
>After 5/5 results, go to the next page (click the pagination link)
>Repeat steps 2 - 5 until there are no more entries
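The per-entry file from step 3 can be written with the standard json module; the field values below are illustrative placeholders, not real scraped data:

```python
import json

# One scraped entry (placeholder data, matching the target shape above)
entry = {
    "Title": "Title element of the individual entry",
    "Content": ["First paragraph...", "https://example.com/image-1.png"],
}

# One file per entry, named after the title
with open(f"{entry['Title']}.json", "w", encoding="utf-8") as f:
    json.dump(entry, f, ensure_ascii=False)
```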
What I have so far:
#import libraries
from selenium import webdriver
from bs4 import BeautifulSoup

#URL
url = "https://URL.com"

#Create a browser session
driver = webdriver.Chrome("PATH TO chromedriver.exe")
driver.implicitly_wait(30)
driver.get(url)

#click consent btn on destination URL (overlays rest of the content)
python_consentButton = driver.find_element_by_id('acceptAllCookies')
python_consentButton.click()  #click cookie consent btn

#Selenium hands the page source to Beautiful Soup
soup_results_overview = BeautifulSoup(driver.page_source, 'lxml')

for link in soup_results_overview.findAll("a", class_="searchResults__detail"):
    #Selenium visits each Search Result Page
    searchResult = driver.find_element_by_class_name('searchResults__detail')
    searchResult.click()  #click Search Result
    #Ask Selenium to go back to the search results overview page
    driver.back()

#Tell Selenium to click paginate "next" link
#probably needs to be in a surrounding for loop?
paginate = driver.find_element_by_class_name('pagination-link-next')
paginate.click()  #click paginate next

driver.quit()
The problem:
Every time Selenium navigates back to the search results overview page, the element lookup starts from the top again, so it clicks the first entry five times, then navigates to the next five items and stops.
This is probably a textbook case for a recursive approach, but I'm not sure.
Any suggestions on how to solve this are appreciated.
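One common way around the "always clicks the first entry" behaviour is to re-locate the result links after every driver.back() and click them by index, rather than re-clicking the first match. A minimal, Selenium-free sketch of that loop structure (the plain lists stand in for the real driver.find_elements_by_class_name(...) lookups):

```python
# Sketch of the loop structure: after "going back" to an overview page,
# re-locate the result list and pick result number i, so each iteration
# visits a different entry. Plain lists stand in for Selenium elements.

def scrape_site(pages):
    """pages: one list of result links per overview page (stand-ins)."""
    visited = []
    for page in pages:                  # step 6: repeat until no more pages
        for i in range(len(page)):      # steps 2-5: the results on one page
            results = page              # re-"find" the result list each time,
            visited.append(results[i])  # then click result number i
        # here Selenium would click the pagination "next" link
    return visited

print(scrape_site([["a", "b"], ["c"]]))  # ['a', 'b', 'c']
```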
Best answer: You can scrape this without Selenium, using only requests and BeautifulSoup. It will be faster and consume fewer resources:
import json

import requests
from bs4 import BeautifulSoup

# Get 1000 results
params = {"$filter": "TemplateName eq 'Application Article'", "$orderby": "ArticleDate desc", "$top": "1000",
          "$inlinecount": "allpages", }
response = requests.get("https://www.cst.com/odata/Articles", params=params).json()

# iterate 1000 results
articles = response["value"]
for article in articles:
    article_json = {}
    article_content = []
    # title of article
    article_title = article["Title"]
    # article url
    article_url = str(article["Url"]).split("|")[1]
    print(article_title)
    # request article page and parse it
    article_page = requests.get(article_url).text
    page = BeautifulSoup(article_page, "html.parser")
    # get header
    header = page.select_one("h1.head--bordered").text
    article_json["Title"] = str(header).strip()
    # get body content with images links and descriptions
    content = page.select("section.content p, section.content img, section.content span.imageDescription, "
                          "section.content em")
    # collect content to json format
    for x in content:
        if x.name == "img":
            article_content.append("https://cst.com/solutions/article/" + x.attrs["src"])
        else:
            article_content.append(x.text)
    article_json["Content"] = article_content
    # write to json file
    with open(f"{article_json['Title']}.json", 'w') as to_json_file:
        to_json_file.write(json.dumps(article_json))
print("the end")
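One caveat about the answer's f"{article_json['Title']}.json" filename (my addition, not part of the answer): article titles can contain characters such as / or : that are invalid in file names, so sanitizing the title first is safer:

```python
import re

def safe_filename(title: str) -> str:
    """Replace characters that are invalid in Windows/Unix file names."""
    return re.sub(r'[\\/:*?"<>|]', "_", title).strip()

print(safe_filename("Antenna Design: 5G/LTE"))  # Antenna Design_ 5G_LTE
```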