Scraping meizitu.com with Python 3 and Scrapy

June 11, 2019 | 506 views | Source: 慢慢慢慢热

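Preface

Before I learned Scrapy, I did all my scraping with requests + BeautifulSoup + lxml. That works too, but the code tends to get messy and you have to organize the project yourself. Scrapy solves this nicely: it separates the stages of a crawl into components that each do one job, so the project structure stays clean. The framework also ships many useful utilities that you would otherwise write by hand. The fun of scraping is in collecting things you are interested in, so this post uses meizitu.com as a practice target.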

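Understanding the site and planning the crawl

The home page of meizitu.com has a bar of category tags in the middle, as shown below:

[Figure: tag.png, the category tag bar on the home page]

Clicking a tag leads to a paginated list. Each page lists many photo albums, and each album contains the images we want. That gives the rough plan: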

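Starting from the home page (http://www.meizitu.com/), collect each tag's name tag_name and link tag_href
From each tag link, collect all pages under that tag as page_list
From each page, collect the album names album_name and album links album_href listed on it
From each album link, collect the title img_title and link img_src of every image in the album
From each image link, download the image into the configured folder with Scrapy's built-in ImagesPipeline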

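This plan settles several questions:

Which fields should the item contain?
The fields to define are tag_name, tag_href, page_list, album_name, album_href, img_title and img_src.
What is the entry point of the crawl?
The home page, http://www.meizitu.com/.
How many levels should the spider have?
The first four steps each follow a different kind of link, so the spider needs four levels.
Level 1, tag links: parse_tag >>>> parse (Note: I first named this method parse_tag to make the levels and their content easy to tell apart, but the spider failed with NotImplementedError. Scrapy sends the responses for start_urls to parse by default, so a spider that relies on start_urls has to implement parse, and the method was renamed accordingly.)
Level 2, page links under a tag: parse_page
Level 3, album links on a page: parse_album
Level 4, image links in an album: parse_img
How are the images saved?
Scrapy provides an item pipeline for saving images, ImagesPipeline. We only need to subclass it and override two methods, get_media_requests(item, info) and item_completed(results, item, info).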

Writing the code

1. Define the `item`

items.py

import scrapy


class MeizituItem(scrapy.Item):
    # Tag name
    tag_name = scrapy.Field()
    # Tag link
    tag_href = scrapy.Field()
    # URL of a page under the tag (with page number)
    page_list = scrapy.Field()
    # Album name
    album_name = scrapy.Field()
    # Album link
    album_href = scrapy.Field()
    # Image title
    img_title = scrapy.Field()
    # Image link
    img_src = scrapy.Field()

2. Write the data extraction code

mzt.py

# -*- coding: utf-8 -*-
import copy

import scrapy

from meizitu.items import MeizituItem


class MztSpider(scrapy.Spider):
    name = 'mzt'
    allowed_domains = ['meizitu.com']
    start_urls = ['http://www.meizitu.com/']
    
    def parse(self, response):
        """
        Extract the tag names and links
        :param response:
        :return:
        """
        
        tags = response.xpath(".//*[@class='tags']/span/a")
        for i in tags:
            item = MeizituItem()
            tag_href = i.xpath(".//@href").extract()[0]
            tag_name = i.xpath(".//@title").extract()[0]
            item['tag_name'] = tag_name
            item['tag_href'] = tag_href
            yield scrapy.Request(url=item['tag_href'], meta={'item': copy.deepcopy(item)}, callback=self.parse_page)
    
    def parse_page(self, response):
        """
        Extract the page links under a tag
        :param response:
        :return:
        """
        
        # After entering a tag, grab the pagination buttons at the bottom of the page
        page_lists = response.xpath(".//*[@id='wp_page_numbers']/ul/li")
        # Read the button labels; they tell us how many pages the tag has
        page_list = page_lists.xpath('.//text()')
        # With several pages, the count depends on whether the first button is
        # '首页' ("first page"): some pages start with '首页', others start with '1'
        if len(page_lists) > 0:
            if page_list[0].extract() == '首页':
                page_num = len(page_lists) - 3
            
            else:
                page_num = len(page_lists) - 2
        else:
            page_num = 1
        
        # Build the numbered page URLs from the URL of the current tag page
        if '_' in response.url:
            index = response.url[::-1].index('_')
            href_pre = response.url[:-index]
        else:
            if page_num == 1:
                href_pre = response.url.split('.html')[0]
            else:
                href_pre = response.url.split('.html')[0] + '_'
        for i in range(1, page_num + 1):
            if page_num == 1:
                href = href_pre + '.html'
            else:
                href = href_pre + str(i) + '.html'
            item = response.meta['item']
            item['page_list'] = href
            # Problem: printing item['page_list'] here shows every URL correctly, but inside
            # parse_album only the last URL ever shows up
            # Fix: change meta={'item': item} to meta={'item': copy.deepcopy(item)}
            # Reference: https://blog.csdn.net/bestbzw/article/details/52894883
            yield scrapy.Request(url=item['page_list'], meta={'item': copy.deepcopy(item)}, callback=self.parse_album)
    
    def parse_album(self, response):
        """
        Extract the album names and album links
        :param response:
        :return:
        """
        
        albums = response.xpath(".//*[@class='pic']")
        for album in albums:
            album_href = album.xpath(".//a/@href").extract()[0]
            album_name = album.xpath(".//a/img/@alt").extract()[0]
            item = response.meta['item']
            item['album_name'] = album_name
            item['album_href'] = album_href
            
            yield scrapy.Request(url=item['album_href'], meta={'item': copy.deepcopy(item)}, callback=self.parse_img)
    
    def parse_img(self, response):
        """
        Extract the image titles and links
        :param response:
        :return:
        """
        
        img_list = response.xpath(".//p/img")
        
        # enumerate numbers the images, so untitled images get distinct fallback titles
        for index, img in enumerate(img_list, start=1):
            item = response.meta['item']
            img_title = img.xpath(".//@alt").extract_first(default='')
            if img_title == '':
                # No alt text: fall back to "<album name>_<position in the album>"
                img_title = item['album_name'] + '_' + str(index)
            img_src = img.xpath(".//@src").extract()[0]
            # Only fields declared in MeizituItem can be assigned; img_urls is not one of them
            item['img_title'] = img_title
            item['img_src'] = img_src
            
            yield copy.deepcopy(item)

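All of this is data extraction. The awkward part is parse_page: the link structure differs from tag to tag, so building the page URLs needs many special cases. The practical approach is to debug step by step and add another check whenever a new error shows up.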
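The URL-building branch of parse_page can be pulled out into a pure function, which makes the different link shapes easy to try without running the crawler. The sketch below is one possible factoring, not the code the spider uses: build_page_urls and the example URLs are hypothetical, and unlike the spider it returns a single-page tag URL unchanged.

```python
def build_page_urls(tag_url, page_num):
    """Return the URL of every page under a tag (hypothetical helper)."""
    if page_num <= 1:
        # Only one page: the tag link itself is the page.
        return [tag_url]
    if '_' in tag_url:
        # Already numbered (e.g. ..._2.html): keep everything up to the last '_'.
        prefix = tag_url[:tag_url.rindex('_') + 1]
    else:
        # Not numbered yet: drop '.html' and add the '_' separator.
        prefix = tag_url.split('.html')[0] + '_'
    return [prefix + str(i) + '.html' for i in range(1, page_num + 1)]


# Made-up URLs, used only to exercise the three branches.
print(build_page_urls('http://www.meizitu.com/a/demo.html', 3))
print(build_page_urls('http://www.meizitu.com/a/demo_2.html', 2))
print(build_page_urls('http://www.meizitu.com/a/demo.html', 1))
```

Keeping this logic free of response objects means each new link shape found while debugging can be added as one more branch and checked in isolation.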

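Here are the points I found hard to understand while learning Scrapy:

What does yield do? A function containing yield is a generator function rather than an ordinary function. Calling it returns a generator, which is advanced with next(): each call resumes right after the yield where the previous call paused instead of starting again from the top. Scrapy iterates over the generator that a callback returns, scheduling every Request it yields and sending every item it yields to the pipelines, so one callback can produce many requests or items without starting over each time.
In parse, yield scrapy.Request(url=item['tag_href'], meta={'item': copy.deepcopy(item)}, callback=self.parse_page) asks Scrapy to fetch item['tag_href'] and hand the resulting response to parse_page for extraction. The meta argument carries the item data collected so far to that next callback, and the deep copy gives each request its own snapshot of the item.
item = response.meta['item'] receives the item that was passed along, so the callback can keep using it and adding to it.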
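The points above can be tried out without Scrapy. The following is a minimal sketch in plain Python: make_requests is a hypothetical stand-in for a callback, and the "requests" are ordinary dicts rather than scrapy.Request objects. It shows how next() resumes a generator, and why passing the same item object in meta makes every later callback see only the last URL.

```python
import copy


def make_requests(item, urls, deep_copy):
    """Yield one fake 'request' (a plain dict) per URL, like a Scrapy callback."""
    for url in urls:
        item['page_list'] = url  # the same item object is reused in every loop
        meta_item = copy.deepcopy(item) if deep_copy else item
        yield {'url': url, 'meta': {'item': meta_item}}


urls = ['page_1.html', 'page_2.html', 'page_3.html']

# next() runs the generator up to the next yield and pauses it there.
gen = make_requests({'tag_name': 'demo'}, urls, deep_copy=False)
first = next(gen)
before = first['meta']['item']['page_list']  # 'page_1.html' at this moment
next(gen)                                    # resumes right after the first yield
after = first['meta']['item']['page_list']   # the shared item now says 'page_2.html'
print(before, after)

# A callback usually runs after the loop has moved on, so look at the final state.
shared = list(make_requests({'tag_name': 'demo'}, urls, deep_copy=False))
copied = list(make_requests({'tag_name': 'demo'}, urls, deep_copy=True))
print([r['meta']['item']['page_list'] for r in shared])
print([r['meta']['item']['page_list'] for r in copied])
```

Without the deep copy all three requests point at one dict, so they all end up with the last URL; with it, each request keeps the URL it was created for.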

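3. Save the images

The following explanation is borrowed from another blogger:

To save images, override get_media_requests(item, info) and item_completed(results, item, info) in a custom ImagesPipeline subclass. The pipeline reads the image URLs from the item and downloads them, so get_media_requests must be overridden to return a Request for each URL. The pipeline processes these requests, and when the downloads finish the results are passed to item_completed as a list of 2-tuples of the form (success, image_info_or_failure).
success: a boolean, True if the download succeeded
image_info_or_failure: if success is True, a dict with the keys below; otherwise information about the failure
url: the original URL
path: the local storage path, relative to IMAGES_STORE
checksum: the checksum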
pipelines.py

import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


class MztImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Ask the pipeline to download the image this item points to
        img_src = item['img_src']
        yield scrapy.Request(img_src)
    
    def item_completed(self, results, item, info):
        # Keep the storage paths of the downloads that succeeded
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        return item

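That is not quite enough. You also need to set the image storage path, the minimum image size, the expiry period and so on, by adding the following to settings.py:

IMAGES_STORE = r'E:\mzt'        # where the images are stored
IMAGES_EXPIRES = 90             # days before a stored image expires
IMAGES_MIN_HEIGHT = 100         # minimum image height
IMAGES_MIN_WIDTH = 100          # minimum image width

Also add 'meizitu.pipelines.MztImagesPipeline': 301 to ITEM_PIPELINES in settings.py.

With these settings, an image whose width or height is below the minimum is not saved, and successfully downloaded images end up in E:\mzt\full.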
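For reference, the ITEM_PIPELINES entry mentioned above could look like this. It is a sketch of a settings.py fragment rather than the project's actual file; the number only sets the order relative to other pipelines, and ImagesPipeline additionally needs the Pillow library installed.

```python
# settings.py (fragment)
ITEM_PIPELINES = {
    # Lower numbers run first; values are conventionally in the 0-1000 range.
    'meizitu.pipelines.MztImagesPipeline': 301,
}
```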

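4. Run the spider

By default the spider is started from the command line with scrapy crawl spider_name. The drawback is that you cannot debug the code in an IDE, which is inconvenient. Instead, you can add a run.py to the project; running that file starts the spider.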

run.py

from scrapy import cmdline

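# cmdline.execute("scrapy crawl mzt".split())    # run directly, log goes to the console
cmdline.execute("scrapy crawl mzt -s LOG_FILE=mzt.log".split())  # write the log to mzt.log in the project

Either statement starts the spider. Pick one depending on whether you want the log saved, and comment out the other.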

Results

[Figure: show2.png, the downloaded images (mysteriously mosaicked)]

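Follow-up improvements

My original intention was to sort the images into one folder per tag and name each file after its title on the page, so the collection could be organized along those lines later.
Only while writing this post did I notice that the meizitu home page has pagination at the bottom. My browser window was not maximized while I was scraping, so I never saw it and took countless detours. If all you want is the result, study the site carefully first; if you are practicing, take the hard way and treat it as an exercise.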

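Improving the spider

Save the images into a separate folder per tag, but keep hashed file names so that images with the same title cannot collide.
By default images are saved under IMAGES_STORE\full. To change that, override the file_path method of ImagesPipeline, as follows: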
```
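import hashlib
from scrapy.utils.python import to_bytes
# Method of MztImagesPipeline; get_media_requests must yield scrapy.Request(img_src, meta={'item': item})
def file_path(self, request, response=None, info=None, *, item=None):  # Scrapy 2.4+ also passes item
    item = request.meta['item']
    image_guid = hashlib.sha1(to_bytes(item['img_src'])).hexdigest()
    # "/" as the separator works on both Windows and Linux
    path = "%s/%s.%s" % (item['tag_name'], image_guid, item['img_src'].split('.')[-1])
    return path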
```
Note that path is not the complete save path; the complete path is IMAGES_STORE\path.
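The naming scheme itself can be checked outside the pipeline. The sketch below is a standalone illustration: tagged_image_path is a hypothetical helper, the tag name and URL are made up, and str.encode is used in place of Scrapy's to_bytes so that the example has no dependencies.

```python
import hashlib


def tagged_image_path(tag_name, img_src):
    """Build '<tag>/<sha1 of the URL>.<extension>' (hypothetical helper)."""
    image_guid = hashlib.sha1(img_src.encode('utf-8')).hexdigest()
    extension = img_src.split('.')[-1]
    return '%s/%s.%s' % (tag_name, image_guid, extension)


path = tagged_image_path('demo_tag', 'http://example.com/pics/01.jpg')
print(path)
```

Because the file name is derived from the image URL, the same URL always maps to the same file, while two images that happen to share a title still get different names.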
Setting IMAGES_STORE
The earlier IMAGES_STORE was hardcoded to E:\mzt, which fails if the code runs on Linux, so we build the folder path with the os module instead:
```
import os
IMAGES_STORE = os.path.join(os.path.dirname(os.getcwd()), 'images')
```

Improving run.py

The spider may be run many times, so each run's log goes into a dedicated folder with a timestamp in the file name, which makes runs easy to trace.

#!/usr/bin/env python
# coding=utf-8

# Created by slowchen on 2018/4/13 17:36.

"""
quote:
"""
import os
import time

from scrapy import cmdline

now = time.strftime("%Y%m%d%H%M%S", time.localtime())
os.makedirs(os.path.join(os.getcwd(), 'log'), exist_ok=True)

# Show the log in the console
# cmdline.execute("scrapy crawl mzt".split())

# Save the log in the log folder, with a timestamp in the file name
cmdline.execute(("scrapy crawl mzt -s LOG_FILE=log/mzt_%s.log" % now).split())

Project repository: @码云 (Gitee)

    Original author: 慢慢慢慢热
    Original URL: https://www.jianshu.com/p/00c619939f66
