Preface
- This post shows how to crawl static data with Scrapy and JS-rendered data with Selenium + Headless Chrome, so that the complete app data of the Google Play Indonesian market can be collected.
- Note that different country markets present their data in different formats, so the parsing logic differs as well. To crawl app data from another country's market, the parsing code (in the `GooglePlaySpider.py` file below) has to be adapted.
- Project environment:
  - Platform: macOS
  - Python version: 3.6
  - IDE: Sublime Text
Installation
- Scrapy, the crawling framework
$ sudo easy_install pip
$ pip install scrapy
- Selenium, the browser automation framework
$ pip install selenium
- Chrome Driver, the browser driver
  - simply download it and unzip it
- Chrome, the browser
$ curl https://intoli.com/install-google-chrome.sh | bash
- SQLAlchemy, the SQL framework
$ pip install sqlalchemy
$ pip install sqlalchemy_utils
- MySQL
Create the project with Scrapy
$ scrapy startproject gp
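The command generates the standard Scrapy project skeleton; the files edited below (`items.py`, `middlewares.py`, `pipelines.py`, `settings.py` and the `spiders` folder) all live in the inner `gp` package:

```
gp/
├── scrapy.cfg
└── gp/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py
```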
Define the crawler data Items
Add the following to the `items.py` file:

```python
# Product
class ProductItem(scrapy.Item):
    gp_icon = scrapy.Field()  # icon
    gp_name = scrapy.Field()  # Google Play name
    # ...


# Review
class GPReviewItem(scrapy.Item):
    avatar_url = scrapy.Field()  # avatar URL
    user_name = scrapy.Field()   # user name
    # ...
```
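The elided fields are not reproduced above, but judging from the pipeline code and the sample output later in this post, `ProductItem` also needs at least the following fields (an assumed sketch, not the original definition):

```python
# Assumed additional ProductItem fields, referenced later in the pipeline:
gp_url = scrapy.Field()      # Google Play URL, used as the unique key when upserting
updated_at = scrapy.Field()  # timestamp of the last crawl
gp_review = scrapy.Field()   # list of review dicts built from GPReviewItem fields
```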
Create the spider
Create `GooglePlaySpider.py` in the `spiders` folder:

```python
import scrapy

from gp.items import ProductItem, GPReviewItem


class GooglePlaySpider(scrapy.Spider):
    name = 'gp'
    allowed_domains = ['play.google.com']

    def __init__(self, *args, **kwargs):
        urls = kwargs.pop('urls', [])  # read the -a urls=... argument
        super().__init__(*args, **kwargs)
        if urls:
            self.start_urls = urls.split(',')
        print('start urls = ', self.start_urls)

    def parse(self, response):
        print('Begin parse ', response.url)
        item = ProductItem()
        content = response.xpath('//div[@class="LXrl4c"]')
        exception_count = 0  # number of fields that failed to parse
        try:
            item['gp_icon'] = response.urljoin(
                content.xpath('//img[@class="T75of ujDFqe"]/@src')[0].extract())
        except Exception as error:
            exception_count += 1
            print('gp_icon except = ', error)
            item['gp_icon'] = ''
        try:
            item['gp_name'] = content.xpath('//h1[@class="AHFaub"]/span/text()')[0].extract()
        except Exception as error:
            exception_count += 1
            print('gp_name except = ', error)
            item['gp_name'] = ''
        # ...
        yield item
```
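Since `__init__` splits the `urls` argument on commas, several Google Play detail-page URLs can be crawled in a single run by passing them as one comma-separated `-a urls=...` value.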
Run the spider:
$ scrapy crawl gp -a urls='https://play.google.com/store/apps/details?id=id.danarupiah.weshare.jiekuan&hl=id'
Review data:
'gp_review': []
The review data cannot be retrieved because it is generated dynamically by JavaScript, so the page has to be requested by a simulated browser that actually executes the scripts.
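Before wiring Selenium into Scrapy, a quick standalone check helps confirm that a headless browser can render the page (a minimal sketch only; it assumes Chrome and chromedriver are reachable on `$PATH` and uses the same sample URL as above):

```python
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run Chrome without a visible window
driver = webdriver.Chrome(chrome_options=options)
try:
    driver.get('https://play.google.com/store/apps/details?id=id.danarupiah.weshare.jiekuan&hl=id')
    html = driver.page_source  # HTML after the page's JavaScript has run
    print('rendered page length =', len(html))  # the review markup only exists in this rendered HTML
finally:
    driver.quit()
```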
Fetch the review data with Selenium + Headless Chrome
Create a configuration file `configs.py` in the innermost `gp` folder and add the browser paths:

```python
# Browser paths
CHROME_PATH = r''         # absolute path to the Chrome binary; if empty, it is looked up in $PATH
CHROME_DRIVER_PATH = r''  # absolute path to chromedriver; if empty, it is looked up in $PATH
```
Create `ChromeDownloaderMiddleware` in the `middlewares.py` file:

```python
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.common.exceptions import TimeoutException

from gp.configs import *


class ChromeDownloaderMiddleware(object):

    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument('--headless')  # run Chrome without a UI
        if CHROME_PATH:
            options.binary_location = CHROME_PATH
        if CHROME_DRIVER_PATH:
            # initialize the Chrome driver with an explicit chromedriver path
            self.driver = webdriver.Chrome(chrome_options=options, executable_path=CHROME_DRIVER_PATH)
        else:
            # initialize the Chrome driver found on $PATH
            self.driver = webdriver.Chrome(chrome_options=options)

    def __del__(self):
        self.driver.close()

    def process_request(self, request, spider):
        try:
            print('Chrome driver begin...')
            self.driver.get(request.url)  # load the page in the headless browser
            # return the rendered HTML to Scrapy
            return HtmlResponse(url=request.url, body=self.driver.page_source,
                                request=request, encoding='utf-8', status=200)
        except TimeoutException:
            return HtmlResponse(url=request.url, request=request, encoding='utf-8', status=500)
        finally:
            print('Chrome driver end...')
```
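Returning an `HtmlResponse` from `process_request` is what makes this work: when a downloader middleware returns a `Response` object, Scrapy skips its own downloader for that request, and the browser-rendered HTML is what eventually reaches the spider's `parse` method.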
Add to the `settings.py` file:

```python
DOWNLOADER_MIDDLEWARES = {
    'gp.middlewares.ChromeDownloaderMiddleware': 543,
}
```
Run the spider again:
$ scrapy crawl gp -a urls='https://play.google.com/store/apps/details?id=id.danarupiah.weshare.jiekuan&hl=id'
Review data:
```
'gp_review': [
    {'avatar_url': 'https://lh3.googleusercontent.com/-RZM2NdsDoWQ/AAAAAAAAAAI/AAAAAAAAAAA/ACLGyWCJIbUq9MxjbT2dmsotE2knI_t1xQ/s48-c-rw-mo/photo.jpg',
     'rating_star': '5', 'review_text': 'Euis Suharani', 'user_name': 'Euis Suharani'},
    {'avatar_url': 'https://lh3.googleusercontent.com/-ppBNQHj5SUs/AAAAAAAAAAI/AAAAAAAAAAA/X8z6OBBBnwc/s48-c-rw/photo.jpg',
     'rating_star': '3', 'review_text': 'Pengguna Google', 'user_name': 'Pengguna Google'},
    {'avatar_url': 'https://lh3.googleusercontent.com/-lLkaJ4GjUhY/AAAAAAAAAAI/AAAAAAAABfA/UPoS4CbDOpQ/s48-c-rw/photo.jpg',
     'rating_star': '5', 'review_text': 'novi anna', 'user_name': 'novi anna'},
    {'avatar_url': 'https://lh3.googleusercontent.com/-XZDMrSc_pxE/AAAAAAAAAAI/AAAAAAAAAAA/awl5OkP7uR4/s48-c-rw/photo.jpg',
     'rating_star': '4', 'review_text': 'Pengguna Google', 'user_name': 'Pengguna Google'}
]
```
Operate MySQL with sqlalchemy
Add the database connection information to the configuration file `configs.py`:

```python
# Database connection information
DATABASES = {
    'DRIVER': 'mysql+pymysql',
    'HOST': '127.0.0.1',
    'PORT': 3306,
    'NAME': 'gp',
    'USER': 'root',
    'PASSWORD': 'root',
}
```
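Note that the `mysql+pymysql` driver string requires the PyMySQL package, which is not part of the installation list above; install it with `$ pip install pymysql`.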
Create the database connection file `connections.py` in the innermost `gp` folder:

```python
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy_utils import database_exists, create_database

from gp.configs import *

# Base class for the sqlalchemy models
Base = declarative_base()


# Database engine, used to connect to the database
def db_connect_engine():
    engine = create_engine("%s://%s:%s@%s:%s/%s?charset=utf8"
                           % (DATABASES['DRIVER'],
                              DATABASES['USER'],
                              DATABASES['PASSWORD'],
                              DATABASES['HOST'],
                              DATABASES['PORT'],
                              DATABASES['NAME']),
                           echo=False)
    if not database_exists(engine.url):
        create_database(engine.url)      # create the database
    Base.metadata.create_all(engine)     # create the tables
    return engine


# Session factory, used to operate on the database tables
def db_session():
    return sessionmaker(bind=db_connect_engine())
```
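A minimal usage sketch (assuming the `Product` model defined in the next step): `db_session()` returns a session factory, and each call to that factory opens a new session.

```python
from gp.connections import db_session
from gp.models import Product  # defined in models.py below

Session = db_session()  # session factory bound to the database engine
session = Session()     # open a new session
try:
    count = session.query(Product).count()
    print('products stored so far =', count)
finally:
    session.close()
```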
Create the sqlalchemy model file `models.py` in the innermost `gp` folder:

```python
from sqlalchemy import Column, ForeignKey
from sqlalchemy.dialects.mysql import TEXT, INTEGER
from sqlalchemy.orm import relationship

from gp.connections import Base


class Product(Base):
    # table name:
    __tablename__ = 'product'

    # table columns:
    id = Column(INTEGER, primary_key=True, autoincrement=True)  # ID
    updated_at = Column(INTEGER)  # time of the last update
    gp_icon = Column(TEXT)        # icon
    gp_name = Column(TEXT)        # Google Play name
    # ...


class GPReview(Base):
    # table name:
    __tablename__ = 'gp_review'

    # table columns:
    id = Column(INTEGER, primary_key=True, autoincrement=True)  # ID
    product_id = Column(INTEGER, ForeignKey(Product.id))
    avatar_url = Column(TEXT)  # avatar URL
    user_name = Column(TEXT)   # user name
    # ...
```
Add the database code to the `pipelines.py` file:

```python
from gp.connections import *
from gp.items import ProductItem
from gp.models import *


class GoogleplayspiderPipeline(object):

    def __init__(self):
        self.session = db_session()

    def process_item(self, item, spider):
        print('process item from gp url = ', item['gp_url'])
        if isinstance(item, ProductItem):
            session = self.session()
            model = Product()
            model.gp_icon = item['gp_icon']
            model.gp_name = item['gp_name']
            # ...
            try:
                m = session.query(Product).filter(Product.gp_url == model.gp_url).first()
                if m is None:
                    # insert a new row
                    print('add model from gp url ', model.gp_url)
                    session.add(model)
                    session.flush()
                    product_id = model.id
                    for review in item['gp_review']:
                        r = GPReview()
                        r.product_id = product_id
                        r.avatar_url = review['avatar_url']
                        r.user_name = review['user_name']
                        # ...
                        session.add(r)
                else:
                    # update the existing row
                    print("update model from gp url ", model.gp_url)
                    m.updated_at = item['updated_at']
                    m.gp_icon = item['gp_icon']
                    m.gp_name = item['gp_name']
                    # ...
                    product_id = m.id
                    session.query(GPReview).filter(GPReview.product_id == product_id).delete()
                    session.flush()
                    for review in item['gp_review']:
                        r = GPReview()
                        r.product_id = product_id
                        r.avatar_url = review['avatar_url']
                        r.user_name = review['user_name']
                        # ...
                        session.add(r)
                session.commit()
                print('spider_success')
            except Exception as error:
                session.rollback()
                print('gp error = ', error)
                print('spider_failure_exception')
                raise
            finally:
                session.close()
        return item
```
Uncomment `ITEM_PIPELINES` in the `settings.py` file:

```python
ITEM_PIPELINES = {
    'gp.pipelines.GoogleplayspiderPipeline': 300,
}
```
Run the spider once more:
$ scrapy crawl gp -a urls='https://play.google.com/store/apps/details?id=id.danarupiah.weshare.jiekuan&hl=id'
Check the crawled data stored in the MySQL database:
- Connect to MySQL: `$ mysql -u root -p`, then enter the password: root
- List all databases: `mysql> show databases;`, which shows the newly created `gp`
- Switch to `gp`: `mysql> use gp;`
- List all tables: `mysql> show tables;`, which shows the newly created `product` and `gp_review`
- View the product data: `mysql> select * from product;`
- View the review data: `mysql> select * from gp_review;`
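As an optional sanity check (this query is not from the original post, only an illustration of the schema above), a join counts how many reviews were stored for each product: `mysql> select p.gp_name, count(r.id) as review_count from product p left join gp_review r on r.product_id = p.id group by p.id;`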