Scrapy Learning Notes (3) - Looping Through Pages and Database Operations

Preface

System environment: CentOS 7

This article assumes you already have virtualenv installed and the virtual environment ENV1 activated; if not, see: Creating a Python sandbox (virtual) environment with virtualenv. In the previous article (Scrapy Learning Notes (2) - Running your first spider in a virtual environment with PyCharm) we used Scrapy's command-line tool to create the project and the spider, wrote the code in PyCharm, ran the spider inside the virtual environment to scrape the article and author information from http://quotes.toscrape.com/, and finally saved the results to a txt file. That spider could only scrape a single page, so today we build on it and take things a step further.

Goal

Follow the next (next page) links to crawl the article and author information from every page of http://quotes.toscrape.com/, and save the results to a MySQL database.

Walkthrough

1. Since we will be operating a MySQL database from Python, we first need to install the relevant Python module; this article uses MySQLdb:

#sudo yum install mysql-devel
#pip install MySQL-python
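To confirm the driver is importable inside ENV1 before moving on, a quick check (a sanity step of my own, not from the original walkthrough):

# Fails with ImportError if the MySQLdb driver is missing
import MySQLdb
print MySQLdb.__version__  # the installed driver version, e.g. 1.2.5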

2. Create the target table quotes in the database; the table definition is as follows:

CREATE TABLE `quotes` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `article` varchar(500) DEFAULT NULL,
  `author` varchar(50) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
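Optionally, confirm from Python that the table exists. This is a sketch of my own; the connection parameters are placeholders matching the ones the pipeline will use in step 5:

# Sketch: connect and describe the quotes table to confirm it was created
import MySQLdb

conn = MySQLdb.connect(host="192.168.0.107", db="scrapy",
                       user="root", passwd="123456", charset="utf8")
cur = conn.cursor()
cur.execute("DESCRIBE quotes")  # lists the table's columns
for row in cur.fetchall():
    print row
conn.close()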

3. The complete items.py looks like this:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class QuotesItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    article = scrapy.Field()
    author = scrapy.Field()
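An Item behaves like a dict whose keys are restricted to the declared Fields, which is what lets both the spider and the pipeline index it with item['article'] later on. A quick interactive illustration (my own, for orientation):

>>> from quotes.items import QuotesItem
>>> item = QuotesItem()
>>> item['article'] = 'some quote text'
>>> item['author'] = 'some author'
>>> dict(item)
{'article': 'some quote text', 'author': 'some author'}
>>> item['tags'] = 'x'  # raises KeyError: only declared Fields are allowed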

4. Modify quotes_spider.py as follows:

# -*- coding: utf-8 -*-
import scrapy
from ..items import QuotesItem
from urlparse import urljoin
from scrapy.http import Request


class QuotesSpiderSpider(scrapy.Spider):
    name = "quotes_spider"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        articles = response.xpath("//div[@class='quote']")
        next_page = response.xpath("//li[@class='next']/a/@href").extract_first()
        for article in articles:
            item = QuotesItem()
            content = article.xpath("span[@class='text']/text()").extract_first()
            author = article.xpath("span/small[@class='author']/text()").extract_first()
            item['article'] = content.encode('utf-8')
            item['author'] = author.encode('utf-8')
            yield item  # yield hands the item over without stopping the loop
        if next_page:  # is there a next link?
            url = urljoin(self.start_urls[0], next_page)  # build the absolute url
            yield Request(url, callback=self.parse)
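Side note: Scrapy 1.0+ responses also provide response.urljoin(), which resolves the relative href against the URL of the page it came from, so the tail of parse() could be written without importing urlparse; a minimal variant of the last lines:

        if next_page:
            # response.urljoin resolves next_page against the current page's url
            yield Request(response.urljoin(next_page), callback=self.parse)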

5. Modify pipelines.py to save the scraped data to the database:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

from twisted.enterprise import adbapi
import MySQLdb
import MySQLdb.cursors


class QuotesPipeline(object):
    def __init__(self):
        db_args = dict(
            host="192.168.0.107",  # database host ip
            db="scrapy",  # database name
            user="root",  # user name
            passwd="123456",  # password
            charset='utf8',  # database character encoding
            cursorclass=MySQLdb.cursors.DictCursor,  # return result sets as dicts
            use_unicode=True,
        )
        self.dbpool = adbapi.ConnectionPool('MySQLdb', **db_args)

    def process_item(self, item, spider):
        self.dbpool.runInteraction(self.insert_into_quotes, item)
        return item

    def insert_into_quotes(self, conn, item):
        conn.execute(
            '''
            INSERT INTO quotes(article,author)
            VALUES(%s,%s)
            ''',
            (item['article'], item['author'])
        )
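One caveat: runInteraction returns a Twisted Deferred, and the code above never attaches an error handler, so a failed INSERT only shows up (at best) as an unhandled error in the logs. A sketch of handling failures explicitly; the _handle_error helper is my own naming, not from the original:

    def process_item(self, item, spider):
        d = self.dbpool.runInteraction(self.insert_into_quotes, item)
        d.addErrback(self._handle_error, item, spider)  # surface insert failures
        return item

    def _handle_error(self, failure, item, spider):
        # hypothetical helper: log the Twisted failure instead of dropping it
        spider.logger.error("DB insert failed for %r: %s" % (item, failure))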

6. Enable the pipeline in settings.py; apart from the ITEM_PIPELINES entry the file keeps its defaults:

# -*- coding: utf-8 -*-

# Scrapy settings for quotes project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'quotes'

SPIDER_MODULES = ['quotes.spiders']
NEWSPIDER_MODULE = 'quotes.spiders'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'quotes.pipelines.QuotesPipeline': 300,
}

7. Run the spider:

(ENV1) [eason@localhost quotes]$ scrapy crawl quotes_spider

8. Check the results. Done!
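To verify from Python rather than the mysql client, a quick row count (same placeholder credentials as the pipeline); quotes.toscrape.com serves 10 quotes on each of its 10 pages, so a complete crawl should insert 100 rows (there is no dedup, so re-running the spider will add 100 more):

import MySQLdb

conn = MySQLdb.connect(host="192.168.0.107", db="scrapy",
                       user="root", passwd="123456", charset="utf8")
cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM quotes")
print cur.fetchone()[0]  # expect 100 after one complete crawl
conn.close()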

