Selenium Python: get a list of all loaded URLs (images, scripts, stylesheets, etc.)

March 20, 2023

When Google Chrome loads a web page through Selenium, it may also load other files the page needs, for example those referenced by <img src="example.com/a.png"> or <script src="example.com/a.js"> tags, as well as CSS files.

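How can I get a list of all the URLs the browser downloads while loading a page? (Programmatically, using Selenium and chromedriver from Python.)
In other words, the list of files shown in the "Network" tab of Chrome's developer tools.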

Example code using Selenium and chromedriver:

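from selenium import webdriver
from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
# Path to the Chrome/Chromium binary to launch
options.binary_location = "/usr/bin/x-www-browser"
# Selenium 4 takes the chromedriver path through a Service object
driver = webdriver.Chrome(service=Service("./chromedriver"), options=options)
# Load some page
driver.get("https://example.com")
# Now, how do I see a list of downloaded URLs that took place when loading the page above?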

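Best answer: You may want to look at BrowserMob Proxy. It can capture performance data for web applications (in HAR format), and it can also manipulate browser behavior and traffic, for example by whitelisting and blacklisting content, simulating network traffic and latency, and rewriting HTTP requests and responses.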

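The example below is taken from its Read the Docs page. Usage is simple, and it integrates well with the Selenium WebDriver API. You can read more about BMP in its documentation.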

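from browsermobproxy import Server
from selenium import webdriver

# Start the BrowserMob Proxy server and open a proxy port on it
server = Server("path/to/browsermob-proxy")
server.start()
proxy = server.create_proxy()

# Route Firefox through the proxy (Selenium 4 style; older releases used FirefoxProfile.set_proxy)
options = webdriver.FirefoxOptions()
options.proxy = proxy.selenium_proxy()
driver = webdriver.Firefox(options=options)

# Begin recording a HAR named "google", then load the page
proxy.new_har("google")
driver.get("http://www.google.co.uk")
har = proxy.har  # the recorded HAR as a dict (parsed JSON)

# Shut down the browser first, then the proxy server
driver.quit()
server.stop()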
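
The HAR returned by the proxy still has to be reduced to a plain list of URLs, which is what the question asks for. The following is a minimal sketch of that step, assuming the HAR follows the standard layout where each request sits under log -> entries -> request -> url. The helper name urls_from_har and the sample_har data are hypothetical and only stand in for a real proxy.har value.

```python
def urls_from_har(har):
    """Return the request URLs recorded in a HAR dict, in order, without duplicates."""
    seen = set()
    urls = []
    # Each HAR entry describes one request/response pair made by the browser
    for entry in har.get("log", {}).get("entries", []):
        url = entry.get("request", {}).get("url")
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Hypothetical, hand-written HAR fragment standing in for proxy.har
sample_har = {
    "log": {
        "entries": [
            {"request": {"url": "https://example.com/"}},
            {"request": {"url": "https://example.com/a.png"}},
            {"request": {"url": "https://example.com/a.js"}},
            {"request": {"url": "https://example.com/a.png"}},  # repeated request
        ]
    }
}

for url in urls_from_har(sample_har):
    print(url)
```

In a real session, the call would presumably be urls_from_har(proxy.har), made after driver.get(...) has returned and before the proxy server is stopped. Requests the page fires later (for example from delayed scripts) would only appear if the HAR is read after they complete.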