Python Crawler in Practice: Scraping Taobao MM Photos (Part 1)

June 16, 2019 | Source: PatrickZheng

Background

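This is a hands-on exercise from the Python crawler tutorial series. Taobao has since redesigned its pages, however: the Taobao MM section no longer exists and has been replaced by Tao Nvlang (淘女郎). After the redesign the page is rendered with JavaScript, so the page number can no longer be switched directly through the URL. Later parts of the tutorial series cover selenium + PhantomJS, a combination that simulates user actions such as switching pages.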

For a first introduction to this combination, see:

"Python Crawler Tools, Part 4: Using PhantomJS" (easier to follow if you know JavaScript)
"Python Crawler Tools, Part 5: Using Selenium"

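While studying, do not stop at these two articles: both list the official documentation and other learning materials. Also make good use of a search engine to find the answers you need.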

Getting started

Source URL: https://www.taobao.com/markets/mm/mmku

Page source

I use Chrome, where Ctrl+Shift+I opens the developer tools.
From there, locate the image elements in the page.


Locating the elements

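Analysis of the page source shows that all of the images sit inside the div with class='listing_cons', and each div with class='cons_li' holds the information for one MM.
Use Beautiful Soup for the DOM operations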

For an introduction to Beautiful Soup, see "Python Crawler Tools, Part 2: Using Beautiful Soup".
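The structure described above can be tried out offline first. The following is a minimal sketch, assuming a hand-written HTML fragment that only mirrors the class names mentioned in this article (listing_cons, cons_li, item_img, item_name); the sample names and example.com URLs are made up, and the real page is certainly more complex.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup that mirrors the class names described above;
# the real page contains many more attributes and nested elements.
SAMPLE_HTML = """
<div class="listing_cons">
  <div class="cons_li">
    <div class="item_img"><img src="//example.com/a.jpg"></div>
    <div class="item_name">Alice</div>
  </div>
  <div class="cons_li">
    <div class="item_img"><img data-ks-lazyload="//example.com/b.jpg"></div>
    <div class="item_name">Bella</div>
  </div>
</div>
"""

# html.parser ships with Python, so this sketch does not need lxml installed
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# select() takes a CSS selector and returns a list of the matching tags
cards = soup.select(".listing_cons .cons_li")
names = [card.select(".item_name")[0].get_text(strip=True) for card in cards]
print(len(cards), names)
```

Working against a static fragment like this makes it easy to check the selectors before pointing them at driver.page_source.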

Find the photo section

soup = BeautifulSoup(driver.page_source, 'lxml')
cons_li_list = soup.select('.cons_li')

Information for each MM

# Nickname
name = cons_li.select('.item_name')[0].get_text()
# Photo link
# Because the images are lazy-loaded by JS, the img attribute that holds the link may be data-ks-lazyload rather than src
img = cons_li.select('.item_img img')[0]
img_src = img.get('src')
if img_src is None:
    img_src = img.get('data-ks-lazyload')
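The src / data-ks-lazyload fallback can also be wrapped in a small helper and checked against static markup. This is a sketch under stated assumptions: extract_img_src and its lazy_attr parameter are hypothetical names introduced here, the example.com URLs are placeholders, and the helper treats an empty src the same as a missing one, which is slightly broader than the `is None` check in the snippet above.

```python
from bs4 import BeautifulSoup


def extract_img_src(img_tag, lazy_attr="data-ks-lazyload"):
    """Return the image URL from src, falling back to the lazy-load attribute."""
    # Tag.get() returns None when the attribute is absent
    src = img_tag.get("src")
    if not src:
        # Missing or empty src: try the attribute used for lazy loading
        src = img_tag.get(lazy_attr)
    return src


# Three cases: normal src, lazy-load attribute only, and neither attribute
html = (
    '<img src="//example.com/a.jpg">'
    '<img data-ks-lazyload="//example.com/b.jpg">'
    '<img alt="none">'
)
imgs = BeautifulSoup(html, "html.parser").find_all("img")
urls = [extract_img_src(img) for img in imgs]
print(urls)
```

The third case shows that the result can still be None when neither attribute is present, so callers may want to skip such entries before writing them out.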

First version

Read each person's nickname and image link and save them to a file.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Date : 2017-06-18 17:00:19
# @Author : kk (zwk.patrick@foxmail.com)
# @Link : blog.csdn.net/PatrickZheng
# @Version : $Id$

import io

from selenium import webdriver
from bs4 import BeautifulSoup

# Note: webdriver.PhantomJS is only available in older Selenium releases (it was removed in Selenium 4).
# If PhantomJS has been added to the PATH environment variable, executable_path can be omitted.
# The raw string keeps the backslashes in the Windows path from being read as escape sequences.
driver = webdriver.PhantomJS(executable_path=r'D:\workplace\spider\phantomjs-2.1.1-windows\phantomjs.exe')
driver.get('https://www.taobao.com/markets/mm/mmku')

soup = BeautifulSoup(driver.page_source, 'lxml')

# Each MM is displayed in a div whose class attribute is cons_li
cons_li_list = soup.select('.cons_li')
print(len(cons_li_list))

# io.open with an explicit encoding works on both Python 2 and 3;
# the with block closes the file even if an exception is raised
with io.open('mm_detail.txt', 'a', encoding='utf-8') as f:
    for cons_li in cons_li_list:
        name = cons_li.select('.item_name')[0].get_text()
        print(name)
        f.write(name + u'\n')
        img = cons_li.select('.item_img img')[0]
        img_src = img.get('src')
        if img_src is None:
            img_src = img.get('data-ks-lazyload')
        print(img_src)
        # End each link with a newline so it does not run into the next nickname
        f.write(img_src + u'\n')

driver.close()
print('done.')
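The file-writing step can be exercised on its own, without a browser. The sketch below is a variant rather than the author's format: it assumes one tab-separated nickname/link pair per line, and save_records, load_records, and the sample records are hypothetical names and data introduced for illustration.

```python
import io
import os
import tempfile


def save_records(path, records):
    """Append (name, img_src) pairs to path, one tab-separated pair per line."""
    with io.open(path, "a", encoding="utf-8") as f:
        for name, img_src in records:
            f.write(u"{}\t{}\n".format(name, img_src))


def load_records(path):
    """Read the pairs back as a list of (name, img_src) tuples."""
    with io.open(path, "r", encoding="utf-8") as f:
        return [tuple(line.rstrip("\n").split("\t", 1)) for line in f if line.strip()]


# Made-up sample data, including a non-ASCII nickname to exercise the encoding
records = [(u"Alice", u"//example.com/a.jpg"), (u"小雅", u"//example.com/b.jpg")]
path = os.path.join(tempfile.mkdtemp(), "mm_detail.txt")
save_records(path, records)
loaded = load_records(path)
print(loaded == records)
```

Keeping each pair on a single line makes the file easy to read back later, for example when a follow-up step downloads the images.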

The first-version script is available on Patrick-kk's GitHub; feel free to study it and discuss.

    Original author: PatrickZheng
    Original URL: https://blog.csdn.net/PatrickZheng/article/details/73448225