Scraping a Chinese Name Database with Python and Scrapy

Welcome to my personal blog: fizzyi

Project Overview

Target URL: http://www.resgain.net/xmdq.html

What is scraped: every surname listed on the site, and the given names under each surname

Crawl steps:

  • First scrape all surnames: the surname slug, the surname in Chinese characters, and each surname's URL.
  • Then visit each surname's site and scrape the names under it. Every surname has ten listing pages, although it turns out not every page actually contains names.
  • Finally visit each name's detail page and scrape the number of people sharing that name, plus its Five Elements (wuxing) and Three Talents (sancai) readings.

Environment and framework: Python 3, Scrapy

Data collected: 435 surnames, about 1.94 million names

Code

1 Setup

Create a new Scrapy project, then generate two spiders inside it:

scrapy startproject baijiaxing1
scrapy genspider baijiaxing2 resgain.net/xmdq.html
scrapy genspider spider_xingming resgain.net/xmdq.html

Why two spiders?

Scrapy crawls concurrently. I originally wrote everything in one spider, which scraped surnames and names at the same time. But each name record carries a surname id as a foreign key, so a name could be fetched before its surname had been saved to the database, causing errors. There are surely cleaner ways to solve this, but as a Scrapy beginner I took the blunt approach of splitting the crawl in two.

baijiaxing2 scrapes the surnames; spider_xingming scrapes the names.

2 Define the items

items.py

# -*- coding: utf-8 -*-

import scrapy


class Xingshi_Item(scrapy.Item):
    xingshi = scrapy.Field()
    href = scrapy.Field()
    xingshi_zhongwen = scrapy.Field()


class Xingming_Item(scrapy.Item):
    name = scrapy.Field()
    the_same_people_number = scrapy.Field()
    boy_ratio = scrapy.Field()
    girl_ratio = scrapy.Field()
    five_elements = scrapy.Field()
    three_talents = scrapy.Field()
    xingshi = scrapy.Field()

3 Scrape all surnames

# -*- coding: utf-8 -*-
import scrapy

from baijiaxing1.items import Xingshi_Item


class Baijiaxing2Spider(scrapy.Spider):
    name = 'baijiaxing2'
    # allowed_domains = ['resgain.net/xmdq.html']
    start_urls = ('http://www.resgain.net/xmdq.html',)

    def parse(self, response):
        # each surname link is an <a> under div.col-xs-12
        content = response.xpath('//div[@class="col-xs-12"]/a')

        for i in content:
            href_raw = i.xpath('./@href').extract_first()
            # the href is protocol-relative and its subdomain is the surname slug,
            # e.g. "//zhaoshi.resgain.net/name.html" -> "zhaoshi"
            xingshi = href_raw.split('.')[0].split('/')[-1]
            href = 'http:' + href_raw
            item = Xingshi_Item()
            item['xingshi'] = xingshi
            item['href'] = href
            # link text is the surname followed by "姓名..."; keep the surname part
            item['xingshi_zhongwen'] = i.xpath('./text()').extract_first().split('姓名')[0]

            yield item
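The slug extraction in parse depends entirely on the shape of the href. A quick standalone sanity check of that split chain (the sample href below is an assumed example of the site's link format, not taken from the live page):

```python
# hypothetical href in the format the spider expects: a protocol-relative
# link whose subdomain is the surname slug
href = '//zhaoshi.resgain.net/name.html'

# same chain of splits used in Baijiaxing2Spider.parse
xingshi = href.split('.')[0].split('/')[-1]  # "//zhaoshi" -> "zhaoshi"
full_url = 'http:' + href

print(xingshi)    # -> zhaoshi
print(full_url)   # -> http://zhaoshi.resgain.net/name.html
```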

4 Scrape all names

# -*- coding: utf-8 -*-
from urllib.parse import urljoin

import scrapy

from baijiaxing1.items import Xingming_Item


class SpiderXingmingSpider(scrapy.Spider):
    name = 'spider_xingming'
    # allowed_domains = ['www.resgain.net/xmdq.html']
    start_urls = ('http://www.resgain.net/xmdq.html',)

    def parse(self, response):
        content = response.xpath('//div[@class="col-xs-12"]/a/@href').extract()

        for i in content:
            page = 0
            href = 'http:' + i
            base = href.split('/name')[0] + '/name_list_'
            while page < 10:
                url = base + str(page) + '.html'
                page += 1
                yield scrapy.Request(url, callback=self.parse_in_html)

    # parse one listing page of names
    def parse_in_html(self, response):
        person_info = response.xpath('//div[@class="col-xs-12"]/a')
        base_url = 'http://' + response.url.split('/')[2]
        # the subdomain carries the surname slug, e.g. "zhaoshi.resgain.net" -> "zhaoshi"
        xingshi = response.url.split('/')[2].split('.')[0]
        for every_one in person_info:
            name = every_one.xpath('./text()').extract_first()
            href = every_one.xpath('./@href').extract_first()
            the_person_info_url = base_url + href
            the_item = Xingming_Item()
            the_item['name'] = name
            the_item['xingshi'] = xingshi
            yield scrapy.Request(the_person_info_url, meta={'the_item': the_item},
                                 callback=self.parse_every_html)

    # parse one name's detail page
    def parse_every_html(self, response):
        the_item = response.meta['the_item']
        # the navbar text reads like "...有N人...", so split out the number N
        the_same_people_number = \
            response.xpath('//div[@class="navbar-brand"]/text()').extract_first().split('人')[0].split('有')[1]
        # note the quotes around "progress-bar": without them the XPath is not a
        # string literal and contains() silently matches everything
        ratios = response.xpath(
            '//div[@class="progress"]/div[contains(@class, "progress-bar")]/text()').extract()
        boy_ratio = ratios[0].split('情况')[0]
        girl_ratio = ratios[1].split('情况')[0]
        quotes = response.xpath(
            '//div[@class="panel-body"]/div[@class="col-xs-6"]/blockquote/text()').extract()
        five_elements = quotes[0]
        three_talents = quotes[1]
        # no trailing commas here -- they would silently wrap every value in a tuple
        the_item['the_same_people_number'] = the_same_people_number
        the_item['boy_ratio'] = boy_ratio
        the_item['girl_ratio'] = girl_ratio
        the_item['five_elements'] = five_elements
        the_item['three_talents'] = three_talents

        yield the_item
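Both the pagination and the people-count extraction are plain string surgery, so they can be checked without Scrapy. The sample href and navbar text below are assumed examples of the formats the spider expects, not strings copied from the live site:

```python
# hypothetical surname link, in the format spider_xingming assumes
href = '//zhaoshi.resgain.net/name.html'

# pagination: pages 0-9 of that surname's name list, as built in parse()
base = ('http:' + href).split('/name')[0] + '/name_list_'
urls = [base + str(page) + '.html' for page in range(10)]
print(urls[0])   # -> http://zhaoshi.resgain.net/name_list_0.html

# detail page: the navbar text is assumed to read like "...有N人...";
# the same split chain as parse_every_html recovers the count
navbar_text = '全国共有1234人使用此姓名'
count = navbar_text.split('人')[0].split('有')[1]
print(count)     # -> 1234
```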

5 pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql

from baijiaxing1.items import Xingshi_Item, Xingming_Item


class XingShiPipeline(object):
    def __init__(self, host, database, user, password, port):
        self.host = host
        self.database = database
        self.user = user
        self.password = password
        self.port = port

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            host=crawler.settings.get('PYMYSQL_HOST'),
            database=crawler.settings.get('PYMYSQL_DATABASE'),
            user=crawler.settings.get('PYMYSQL_USER'),
            password=crawler.settings.get('PYMYSQL_PASSWORD'),
            port=crawler.settings.get('PYMYSQL_PORT'),
        )

    def open_spider(self, spider):
        # recent pymysql versions only accept keyword arguments
        self.db = pymysql.connect(host=self.host, user=self.user, password=self.password,
                                  database=self.database, port=self.port)
        self.cursor = self.db.cursor()

    def close_spider(self, spider):
        self.db.close()

    def process_item(self, item, spider):
        if isinstance(item, Xingshi_Item):
            sql = 'INSERT INTO baijiaxing(xingshi, href, xingshi_zhongwen) VALUES (%s, %s, %s);'
            self.cursor.execute(sql, (item['xingshi'], str(item['href']), item['xingshi_zhongwen']))
            self.db.commit()
            return item
        elif isinstance(item, Xingming_Item):
            # look up the surname's primary key in the baijiaxing table;
            # execute() only returns a row count, the id comes from fetchone()
            self.cursor.execute('SELECT id FROM baijiaxing WHERE xingshi = %s', (item['xingshi'],))
            row = self.cursor.fetchone()
            xingshi_id = row[0] if row else None
            sql = ('INSERT INTO xingming(name, the_same_people_number, boy_ratio, girl_ratio, '
                   'five_elements, three_talents, xingshi_id) VALUES (%s, %s, %s, %s, %s, %s, %s);')
            self.cursor.execute(sql, (item['name'], item['the_same_people_number'],
                                      item['boy_ratio'], item['girl_ratio'],
                                      item['five_elements'], item['three_talents'], xingshi_id))
            self.db.commit()
            return item


Because there are two item types, process_item has to check which item class it received before deciding which table to write to.

6 settings.py

Finally, configure the request headers, the pipeline, and the database connection in settings.py:

USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)'
ROBOTSTXT_OBEY = False

# enable the pipeline (the path follows the project layout above)
ITEM_PIPELINES = {
    'baijiaxing1.pipelines.XingShiPipeline': 300,
}

# MySQL settings read by XingShiPipeline.from_crawler
PYMYSQL_HOST = '127.0.0.1'
PYMYSQL_DATABASE = 'test1'
PYMYSQL_USER = 'root'
PYMYSQL_PASSWORD = '123456'
PYMYSQL_PORT = 3306

7 Database tables

baijiaxing table

id | xingshi | href | xingshi_zhongwen

xingming table

id | name | the_same_people_number | boy_ratio | girl_ratio | five_elements | three_talents | xingshi_id
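The two tables can be created with DDL along these lines. The column types and lengths are my own guesses based on the fields the spiders produce; the original post does not give the schema, so adjust as needed:

```sql
CREATE TABLE baijiaxing (
    id INT AUTO_INCREMENT PRIMARY KEY,
    xingshi VARCHAR(50),             -- surname slug, e.g. "zhaoshi"
    href VARCHAR(255),               -- surname page URL
    xingshi_zhongwen VARCHAR(10)     -- surname in Chinese characters
);

CREATE TABLE xingming (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(50),
    the_same_people_number VARCHAR(20),
    boy_ratio VARCHAR(20),
    girl_ratio VARCHAR(20),
    five_elements VARCHAR(50),
    three_talents VARCHAR(50),
    xingshi_id INT                   -- references baijiaxing.id
);
```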

8 Running the crawl

There are two spiders, and baijiaxing2 must finish before spider_xingming starts, so I wrote a run.py:

run.py

import os

# os.system blocks until each command finishes, so the spiders run sequentially
os.system("scrapy crawl baijiaxing2")
os.system("scrapy crawl spider_xingming")

GitHub: https://github.com/Fizzyi/baijiaxing/tree/master

    Original author: Fizz翊
    Original article: https://www.jianshu.com/p/18bae338949a