scrapy+selenium+chrome headless

September 1, 2023 · Source: 羽煊

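When writing crawlers with Scrapy, some JavaScript-driven dynamic pages call for an automation tool to render and analyze the page. selenium + PhantomJS is a decent choice, but in practice I ran into a painful problem: when parsing a page times out, PhantomJS hangs indefinitely. I had always resisted selenium + Chrome, because it has to open the browser's visible window.
Just when I was out of ideas, I discovered Chrome headless. Headless mode brings all the modern web platform features provided by Chromium and the Blink rendering engine to the command line.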
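To make "on the command line" concrete, here is a minimal sketch that drives headless Chrome directly, without selenium. It assumes Chrome is installed at the same path used later in this post and relies on Chrome's --dump-dom flag; the helper names build_headless_command and dump_dom are made up for illustration. The subprocess timeout is there so that a stuck page raises an error instead of hanging forever.

```python
import subprocess


def build_headless_command(url, chrome_binary="/usr/bin/google-chrome-stable"):
    """Build the argument list for a one-shot headless Chrome run.

    chrome_binary is an assumption; adjust it to wherever Chrome lives.
    """
    return [chrome_binary, "--headless", "--disable-gpu", "--dump-dom", url]


def dump_dom(url, timeout=30):
    """Return the rendered DOM of url as text.

    Raises subprocess.TimeoutExpired if Chrome does not finish in time,
    and subprocess.CalledProcessError if Chrome exits with an error.
    """
    result = subprocess.run(
        build_headless_command(url),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return result.stdout
```

Calling dump_dom('http://example.com') should return the page's HTML after scripts have run, provided Chrome is actually installed at that path.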
The Scrapy middleware code is as follows:

from selenium import webdriver
from scrapy.http import HtmlResponse
from selenium.webdriver.chrome.options import Options

class SeleniumMiddleware(object):
  def process_request(self,request,spider):
    options = Options()
    options.add_argument('--headless')
    options.add_argument('--disable-gpu')
    options.binary_location = '/usr/bin/google-chrome-stable'
    capabilities = {}
    capabilities['platform'] = 'Linux'
    capabilities['version'] = '16.04'
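    if spider.name == "music163":
      print('start parse {0}'.format(request.url))
      #driver = webdriver.PhantomJS()
      # Selenium 3-era keyword arguments; Selenium 4 replaces them with Service and options=.
      driver = webdriver.Chrome(executable_path='/usr/local/bin/chromedriver',chrome_options=options,desired_capabilities=capabilities)
      try:
        driver.get(request.url)
        # Switch into the g_iframe frame before reading the page source.
        driver.switch_to.frame('g_iframe')
        body = driver.page_source
        print('finished parse {0}'.format(request.url))
        return HtmlResponse(driver.current_url,body=body,encoding='utf-8',request=request)
      except Exception as e:
        # Falling through returns None, so Scrapy uses its normal downloader instead.
        print('failed to parse {0}: {1}'.format(request.url, e))
      finally:
        # Always close the browser; otherwise each request leaves a Chrome process behind.
        driver.quit()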
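
Scrapy only calls a downloader middleware once it is registered in the project settings. A minimal sketch of that entry, assuming the class above is saved in a hypothetical module myproject/middlewares.py; the priority 543 is only the placeholder value from Scrapy's project template, not a recommendation.

```python
# settings.py (sketch): "myproject.middlewares" is a hypothetical module path;
# replace it with the real location of SeleniumMiddleware in your project.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.SeleniumMiddleware": 543,
}
```

Because process_request returns an HtmlResponse for the music163 spider, Scrapy skips its own download for those requests and passes the rendered page straight to the spider.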

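To use Chrome, selenium needs the chromedriver driver. ChromeDriver controls the browser through Chrome's automation proxy framework. The chromedriver and Chrome versions must match each other; otherwise Chrome will not start.
For the mapping between chromedriver and Chrome versions, see: http://chromedriver.storage.googleapis.com/2.33/notes.txt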
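
The version check can be scripted once the notes file is at hand. The sketch below assumes the notes use a "ChromeDriver vX.Y" header line followed by a "Supports Chrome vA-B" line; SAMPLE_NOTES is an illustrative excerpt written for this example, so take authoritative values from the real notes.txt. The function names are hypothetical.

```python
import re

# Illustrative excerpt only, written in the assumed layout of the notes file.
SAMPLE_NOTES = """\
----------ChromeDriver v2.33 (2017-10-03)----------
Supports Chrome v60-62

----------ChromeDriver v2.32 (2017-08-30)----------
Supports Chrome v59-61
"""

# A header line naming the driver version, then the supported Chrome range.
PATTERN = re.compile(
    r"ChromeDriver v(?P<driver>\d+\.\d+).*\n"
    r"Supports Chrome v(?P<low>\d+)-(?P<high>\d+)"
)


def parse_supported_ranges(notes_text):
    """Map each ChromeDriver version to its (min, max) Chrome major version."""
    return {
        m.group("driver"): (int(m.group("low")), int(m.group("high")))
        for m in PATTERN.finditer(notes_text)
    }


def is_compatible(driver_version, chrome_major, ranges):
    """Return True if chrome_major falls inside the driver's supported range."""
    low, high = ranges[driver_version]
    return low <= chrome_major <= high
```

With the real file downloaded, parse_supported_ranges(text) gives a lookup table, and is_compatible('2.33', 61, table) answers whether a given pair should work together.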

    Original author: 羽煊
    Original URL: https://www.jianshu.com/p/78af96c883a5