Crawling Sina Weibo with Scrapy

Project overview: most of you have probably used Sina Weibo, one of the hottest social apps today. Precisely because of that, we need to collect each Weibo user's profile information, comments, publish times, and so on to meet company requirements: daily trending topics, comment counts, like counts, and related metrics. This is the era of big data, and whoever holds the data holds the advantage. Below is a walkthrough of how to crawl Sina Weibo data.

First you need to set up the Python environment (Python 2.7 along with scrapy + selenium + phantomjs + chrome).

Part 1: Installing Python 2.7 + Scrapy + Selenium + PhantomJS

The example below is based on Python 2.7.9; other versions work the same way.
1. Download Python

wget https://www.python.org/ftp/python/2.7.9/Python-2.7.9.tgz

2. Extract, compile, and install (run the following five commands in order)

tar -zxvf Python-2.7.9.tgz
cd Python-2.7.9
./configure --prefix=/usr/local/python-2.7.9
make
make install

3. The system already ships with its own Python, so add a symlink for the newly installed version

 ln -s /usr/local/python-2.7.9/bin/python /usr/bin/python2.7.9 

4. To use this version, just run "python2.7.9", a space, and the path to your .py script

python2.7.9 ~/helloworld.py
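If pip is not yet available for this freshly built interpreter, note that Python 2.7.9 ships with the ensurepip module, so it can bootstrap its own pip before running the installs below (a sketch using the symlink created above; run it with sufficient privileges):

python2.7.9 -m ensurepip
python2.7.9 -m pip install --upgrade pip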

Installing Scrapy:

pip install scrapy
# only needed if you use the distributed setup
pip install scrapy_redis

Installing Selenium:

pip install selenium

Installing PhantomJS:

Download PhantomJS into /usr/local/src/packet/ (the directory is up to you).

Operating system: CentOS 7 64-bit

1. Install the dependencies

yum -y install wget fontconfig

2. Decompress the downloaded archive

bzip2 -d phantomjs-2.1.1-linux-x86_64.tar.bz2

3. Use tar to extract it into /usr/local/

tar xvf phantomjs-2.1.1-linux-x86_64.tar -C /usr/local/

4. Rename the directory (so the phantomjs command is easier to work with later)

mv /usr/local/phantomjs-2.1.1-linux-x86_64/ /usr/local/phantomjs

5. The last step is to create a symlink (this puts a phantomjs link in /usr/bin/; if you are not sure what /usr/bin/ is, check echo $PATH)

ln -s /usr/local/phantomjs/bin/phantomjs /usr/bin/

At this point the installation is done. Give it a quick test (thanks to the symlink above, phantomjs can now be invoked like any other command):

[root@localhost ~]# phantomjs
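Inside the interactive prompt, phantom.version and phantom.exit() (both part of PhantomJS's built-in phantom object) make for a quick sanity check; the version should come back as 2.1.1:

phantomjs> phantom.version
phantomjs> phantom.exit()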

Part 2: Installing Chrome

Note: to run Chrome on the server so that it can fetch data alongside the Selenium automation scripts, you need to set up Chrome's runtime dependencies.

Running selenium + chromedriver on a server

1. Background
I wanted to use Selenium to scrape data from websites, but PhantomJS would sometimes error out. Chrome now has a headless mode, so PhantomJS is no longer necessary.
Installing Chrome on a server threw a few errors, though, so the whole installation process is summarized here.

2. Installing Chrome on Ubuntu

# Install Google Chrome
# https://askubuntu.com/questions/79280/how-to-install-chrome-browser-properly-via-command-line
sudo apt-get install libxss1 libappindicator1 libindicator7
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo dpkg -i google-chrome*.deb  # Might show "errors", fixed by next line
sudo apt-get install -f

Chrome should now be installed. Test it with the following command:

google-chrome --headless --remote-debugging-port=9222 https://chromium.org --disable-gpu

This runs headless mode with remote debugging. Most Ubuntu servers have no GPU, so pass --disable-gpu to avoid errors.

Then open another SSH session to the server and hit its local port 9222 from the command line:

curl http://localhost:9222

If everything is installed correctly, you will see debugging info. I got an error at this point; the fix is below.

Possible error and its fix

Running the command above may complain that Chrome cannot be run as root. In that case, adjust Chrome as follows:

(1) Locate the google-chrome file. Mine is at /opt/google/chrome/.
(2) Open the google-chrome file with vi:
vi /opt/google/chrome/google-chrome

In that file, find the line

exec -a "$0" "$HERE/chrome" "$@"

(3) Append --user-data-dir --no-sandbox to it, so the full shell command becomes

exec -a "$0" "$HERE/chrome" "$@" --user-data-dir --no-sandbox

(4) Relaunch google-chrome and it will work normally.

3. Installing the Chrome driver (chromedriver)

Download chromedriver.
chromedriver exposes an API for driving Chrome; it is the bridge Selenium uses to control the browser.
Install the latest chromedriver. I initially installed an older version and got an error; the latest version worked fine. The latest releases can be found at
https://sites.google.com/a/chromium.org/chromedriver/downloads

At the time of writing, the latest version was 2.37.

wget https://chromedriver.storage.googleapis.com/2.37/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
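After unzipping, the binary may need to be marked executable and moved to wherever your script will point at it; the /home/chrome/ path below simply matches the executable_path used in the usage example further down:

chmod +x chromedriver
mkdir -p /home/chrome
mv chromedriver /home/chrome/
/home/chrome/chromedriver --version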

With that, the headless Chrome setup on the server is complete.

4. Using headless Chrome

from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument("user-agent='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'")
wd = webdriver.Chrome(chrome_options=chrome_options,executable_path='/home/chrome/chromedriver')
wd.get("https://www.163.com")
content = wd.page_source.encode('utf-8')
print content
wd.quit()

Part 3: Crawling the data

To crawl Sina Weibo we need to simulate login and save the cookies we get back. To keep accounts from being banned, use a pool of Weibo accounts (how many depends on how much data you need).
1. First, simulate the login and grab the cookies

#!/usr/bin/env python
# encoding: utf-8
import datetime
import json
import base64
from time import sleep
import pymongo
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

'''
Fill in your Weibo usernames and passwords. You can buy accounts on Taobao,
about one yuan for seven. It is best to buy a few dozen: Weibo's anti-crawling
measures are aggressive, and requesting too frequently triggers 302 redirects.
Alternatively, you can increase the interval between requests.

WeiBoAccounts = [
{'username': 'javzx61369@game.weibo.com', 'password': 'esdo77127'},
{'username': 'v640e2@163.com', 'password': 'wy539067'},
{'username': 'd3fj3l@163.com', 'password': 'af730743'},
{'username': 'oia1xs@163.com', 'password': 'tw635958'},
]
'''
WeiBoAccounts = [{'username': 'your username', 'password': 'your password'}]
cookies = []
client = pymongo.MongoClient("192.168.98.5", 27017)
db = client["Sina"]
userAccount = db["userAccount"]
def get_cookie_from_weibo(username, password):
    driver = webdriver.PhantomJS()
    driver.get('https://weibo.cn')
    print driver.title
    assert "微博" in driver.title
    login_link = driver.find_element_by_link_text('登录')
    ActionChains(driver).move_to_element(login_link).click().perform()
    login_name = WebDriverWait(driver, 10).until(
        EC.visibility_of_element_located((By.ID, "loginName"))
    )
    login_password = driver.find_element_by_id("loginPassword")
    login_name.send_keys(username)
    login_password.send_keys(password)
    login_button = driver.find_element_by_id("loginAction")
    login_button.click()
    # Pause here for 10 seconds to check whether the launched browser has logged in successfully; if not, log in manually
    sleep(10)
    cookie = driver.get_cookies()
    #print driver.page_source
    print driver.current_url
    driver.close()
    return cookie
def init_cookies():
    for cookie in userAccount.find():
        cookies.append(cookie['cookie'])
if __name__ == "__main__":
    try:
        userAccount.drop()
    except Exception as e:
        pass
    for account in WeiBoAccounts:
        cookie = get_cookie_from_weibo(account["username"], account["password"])
        userAccount.insert_one({"_id": account["username"], "cookie": cookie})
    init_cookies()

The code is straightforward: it simulates the login, grabs the cookies, and inserts them into MongoDB so that later requests can use them. The init_cookies() function is used later by the downloader middleware.
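A couple of notes: the middleware shown later imports this script as sina.cookies, so it presumably lives inside the project package as sina/cookies.py. Also, the in-code comment talks about Chrome while the script actually constructs a PhantomJS driver; if you prefer the headless Chrome set up in Part 2, the driver construction could be swapped along these lines (a sketch; the chromedriver path is the one assumed in the earlier usage example):

from selenium import webdriver

def make_headless_chrome(driver_path='/home/chrome/chromedriver'):
    # Build a headless Chrome driver as a drop-in replacement for webdriver.PhantomJS()
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    options.add_argument('--disable-gpu')
    options.add_argument('--no-sandbox')  # needed when Chrome runs as root, see Part 2
    return webdriver.Chrome(chrome_options=options, executable_path=driver_path)

With that helper, driver = webdriver.PhantomJS() inside get_cookie_from_weibo simply becomes driver = make_headless_chrome().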

The Scrapy items code is as follows:

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
from scrapy import Item, Field
class InformationItem(Item):
    """ User profile info """
    _id = Field()  # user ID
    NickName = Field()  # nickname
    Gender = Field()  # gender
    Province = Field()  # province
    City = Field()  # city
    BriefIntroduction = Field()  # bio
    Birthday = Field()  # birthday
    Num_Tweets = Field()  # number of weibo posts
    Num_Follows = Field()  # number of accounts followed
    Num_Fans = Field()  # number of fans
    SexOrientation = Field()  # sexual orientation
    Sentiment = Field()  # relationship status
    VIPlevel = Field()  # VIP membership level
    Authentication = Field()  # verification info
    URL = Field()  # homepage link
class TweetsItem(Item):
    """ Weibo post info """
    _id = Field()  # user ID - weibo ID
    ID = Field()  # user ID
    Content = Field()  # post content
    PubTime = Field()  # publish time
    Co_oridinates = Field()  # geo coordinates
    Tools = Field()  # client / platform used to post
    Like = Field()  # number of likes
    Comment = Field()  # number of comments
    Transfer = Field()  # number of reposts
    filepath = Field()  # path of the text file this post is dumped to
class RelationshipsItem(Item):
    """ User relationship; only the "following" relation is kept """
    fan_id = Field()  # follower's ID
    followed_id = Field()  # ID of the followed account

2. The initial request seeds:

#!/usr/bin/env python
# encoding: utf-8
""" 初始的待爬队列 """
weiboID = [
    #"5303798085"
    #'6033587203'
    '6092234294']
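The spider below imports this file as sina.config, so it presumably lives at sina/config.py. If the seed list grows large, one option is to keep the IDs in a plain text file, one per line, and load them at the end of config.py (a sketch; the filename weibo_ids.txt is only an example):

import os

def load_weibo_ids(path='weibo_ids.txt'):
    # Read seed user IDs from a text file, one per line, skipping blank lines
    if not os.path.exists(path):
        return []
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

# Fall back to the hard-coded seeds above if the file is missing or empty
weiboID = load_weibo_ids() or weiboID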

3. The Scrapy spider code is as follows:

This code mainly parses the data out of the fetched pages; read through it carefully.

# encoding: utf-8
import datetime
import requests
import re
from lxml import etree
from scrapy import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from sina.config import weiboID
from sina.items import TweetsItem, InformationItem, RelationshipsItem
import time
import random
def rand_num():
    number = ""
    for i in range(5):
        number += str(random.randint(0,9))
    return number
class SinaSpider(Spider):
    name = "SinaSpider"
    host = "https://weibo.cn"
    start_urls = list(set(weiboID))
    filepath = '/home/YuQing/content/'
    def start_requests(self):
        for uid in self.start_urls:
            yield Request(url="https://weibo.cn/%s/info" % uid, callback=self.parse_information)
    def parse_information(self, response):
        """ 抓取个人信息 """
        informationItem = InformationItem()
        selector = Selector(response)
        ID = re.findall('(\d+)/info', response.url)[0]
        print response.url, response.body
        try:
            text1 = ";".join(selector.xpath('body/div[@class="c"]//text()').extract())  # 获取标签里的所有text()
            nickname = re.findall('昵称;?[::]?(.*?);', text1)
            gender = re.findall('性别;?[::]?(.*?);', text1)
            place = re.findall('地区;?[::]?(.*?);', text1)
            briefIntroduction = re.findall('简介;?[::]?(.*?);', text1)
            birthday = re.findall('生日;?[::]?(.*?);', text1)
            sexOrientation = re.findall('性取向;?[::]?(.*?);', text1)
            sentiment = re.findall('感情状况;?[::]?(.*?);', text1)
            vipLevel = re.findall('会员等级;?[::]?(.*?);', text1)
            authentication = re.findall('认证;?[::]?(.*?);', text1)
            url = re.findall('互联网;?[::]?(.*?);', text1)
            informationItem["_id"] = ID
            if nickname and nickname[0]:
                informationItem["NickName"] = nickname[0].replace(u"\xa0", "")
            if gender and gender[0]:
                informationItem["Gender"] = gender[0].replace(u"\xa0", "")
            if place and place[0]:
                place = place[0].replace(u"\xa0", "").split(" ")
                informationItem["Province"] = place[0]
                if len(place) > 1:
                    informationItem["City"] = place[1]
            if briefIntroduction and briefIntroduction[0]:
                informationItem["BriefIntroduction"] = briefIntroduction[0].replace(u"\xa0", "")
            if birthday and birthday[0]:
                try:
                    birthday = datetime.datetime.strptime(birthday[0], "%Y-%m-%d")
                    informationItem["Birthday"] = birthday - datetime.timedelta(hours=8)
                except Exception:
                    informationItem['Birthday'] = birthday[0]  # may be a zodiac sign rather than a date
            if sexOrientation and sexOrientation[0]:
                if sexOrientation[0].replace(u"\xa0", "") == gender[0]:
                    informationItem["SexOrientation"] = "同性恋"
                else:
                    informationItem["SexOrientation"] = "异性恋"
            if sentiment and sentiment[0]:
                informationItem["Sentiment"] = sentiment[0].replace(u"\xa0", "")
            if vipLevel and vipLevel[0]:
                informationItem["VIPlevel"] = vipLevel[0].replace(u"\xa0", "")
            if authentication and authentication[0]:
                informationItem["Authentication"] = authentication[0].replace(u"\xa0", "")
            if url:
                informationItem["URL"] = url[0]
            try:
                urlothers = "https://weibo.cn/attgroup/opening?uid=%s" % ID
                new_ck = {}
                for ck in response.request.cookies:
                    new_ck[ck['name']] = ck['value']
                r = requests.get(urlothers, cookies=new_ck, timeout=5)
                if r.status_code == 200:
                    selector = etree.HTML(r.content)
                    texts = ";".join(selector.xpath('//body//div[@class="tip2"]/a//text()'))
                    print texts
                    if texts:
                        # num_tweets = re.findall(r'微博\[(\d+)\]', texts)
                        num_tweets = texts.split(';')[0].replace('微博[', '').replace(']','')
                        # num_follows = re.findall(r'关注\[(\d+)\]', texts)
                        num_follows = texts.split(';')[1].replace('关注[', '').replace(']','')
                        # num_fans = re.findall(r'粉丝\[(\d+)\]', texts)
                        num_fans = texts.split(';')[2].replace('粉丝[', '').replace(']','')
                        if len(num_tweets) > 0:
                            informationItem["Num_Tweets"] = int(num_tweets)
                        if num_follows:
                            informationItem["Num_Follows"] = int(num_follows)
                        if num_fans:
                            informationItem["Num_Fans"] = int(num_fans)
            except Exception as e:
                print e
        except Exception as e:
            pass
        else:
            yield informationItem
        if informationItem["Num_Tweets"] and informationItem["Num_Tweets"] < 5000:
            yield Request(url="https://weibo.cn/%s/profile?filter=1&page=1" % ID, callback=self.parse_tweets,
                          dont_filter=True)
        if informationItem["Num_Follows"] and informationItem["Num_Follows"] < 500:
            yield Request(url="https://weibo.cn/%s/follow" % ID, callback=self.parse_relationship, dont_filter=True)
        if informationItem["Num_Fans"] and informationItem["Num_Fans"] < 500:
            yield Request(url="https://weibo.cn/%s/fans" % ID, callback=self.parse_relationship, dont_filter=True)
    def parse_tweets(self, response):
        """ 抓取微博数据 """
        selector = Selector(response)
        ID = re.findall('(\d+)/profile', response.url)[0]
        divs = selector.xpath('body/div[@class="c" and @id]')
        for div in divs:
            try:
                tweetsItems = TweetsItem()
                id = div.xpath('@id').extract_first()  # weibo post ID
                content = div.xpath('div/span[@class="ctt"]//text()').extract()  # post content
                cooridinates = div.xpath('div/a/@href').extract()  # geo coordinates
                like = re.findall('赞\[(\d+)\]', div.extract())  # number of likes
                transfer = re.findall('转发\[(\d+)\]', div.extract())  # number of reposts
                comment = re.findall('评论\[(\d+)\]', div.extract())  # number of comments
                others = div.xpath('div/span[@class="ct"]/text()').extract()  # publish time and client (phone or platform)
                tweetsItems["_id"] = ID + "-" + id
                tweetsItems["ID"] = ID
                if content:
                    tweetsItems["Content"] = " ".join(content).strip('[位置]')  # 去掉最后的"[位置]"
                if cooridinates:
                    cooridinates = re.findall('center=([\d.,]+)', cooridinates[0])
                    if cooridinates:
                        tweetsItems["Co_oridinates"] = cooridinates[0]
                if like:
                    tweetsItems["Like"] = int(like[0])
                if transfer:
                    tweetsItems["Transfer"] = int(transfer[0])
                if comment:
                    tweetsItems["Comment"] = int(comment[0])
                if others:
                    others = others[0].split('来自')
                    tweetsItems["PubTime"] = others[0].replace(u"\xa0", "")
                    if len(others) == 2:
                        tweetsItems["Tools"] = others[1].replace(u"\xa0", "")
                filename = 'wb_'+time.strftime('%Y%m%d%H%M%S')+'_'+rand_num()+'.txt'
                tweetsItems["filepath"] = self.filepath + filename
                yield tweetsItems
            except Exception as e:
                print e
                self.logger.info(e)
                pass
        next_page = '下页'.decode('utf-8')
        url_next = selector.xpath('body/div[@class="pa" and @id="pagelist"]/form/div/a[text()="%s"]/@href' % next_page).extract()
        if url_next:
            yield Request(url=self.host + url_next[0], callback=self.parse_tweets, dont_filter=True)
    def parse_relationship(self, response):
        """ 打开url爬取里面的个人ID """
        selector = Selector(response)
        if "/follow" in response.url:
            ID = re.findall('(\d+)/follow', response.url)[0]
            flag = True
        else:
            ID = re.findall('(\d+)/fans', response.url)[0]
            flag = False
        he = "关注他".decode('utf-8')
        she = "关注她".decode('utf-8')
        urls = selector.xpath('//a[text()="%s" or text()="%s"]/@href' % (he, she)).extract()
        uids = re.findall('uid=(\d+)', ";".join(urls), re.S)
        for uid in uids:
            relationshipsItem = RelationshipsItem()
            relationshipsItem["fan_id"] = ID if flag else uid
            relationshipsItem["followed_id"] = uid if flag else ID
            yield relationshipsItem
            yield Request(url="https://weibo.cn/%s/info" % uid, callback=self.parse_information)
        next_page = '下页'.decode('utf-8')
        next_url = selector.xpath('//a[text()="%s"]/@href' % next_page).extract()
        if next_url:
            yield Request(url=self.host + next_url[0], callback=self.parse_relationship, dont_filter=True)
4. The Scrapy middlewares code is as follows:

# encoding: utf-8
import random
from sina.cookies import cookies, init_cookies
from sina.user_agents import agents
class UserAgentMiddleware(object):
    """ 换User-Agent """
    def process_request(self, request, spider):
        agent = random.choice(agents)
        request.headers["User-Agent"] = agent
class CookiesMiddleware(object):
    """ 换Cookie """
    def __init__(self):
        init_cookies()
    def process_request(self, request, spider):
        cookie = random.choice(cookies)
        request.cookies = cookie

This is the step that takes the cookies saved in MongoDB and attaches one of them to each request inside Scrapy's downloader middleware, so every outgoing request carries a cookie and can fetch data while logged in.
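For these two middlewares to take effect, they have to be registered in the project's settings.py. The exact module path depends on where the file is saved; assuming sina/middlewares.py, a minimal sketch looks like this (the priority numbers are arbitrary, and DOWNLOAD_DELAY is the "increase the interval" knob mentioned earlier):

# settings.py (sketch; module paths assume a project package named "sina")
DOWNLOADER_MIDDLEWARES = {
    'sina.middlewares.UserAgentMiddleware': 401,
    'sina.middlewares.CookiesMiddleware': 402,
}
DOWNLOAD_DELAY = 2  # seconds between requests; raise this if accounts start hitting 302 redirects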

5. Saving the data with Scrapy pipelines:

As everyone knows, pipelines are the pipes used for cleaning and persisting the scraped data.

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo
from sina.items import RelationshipsItem, TweetsItem, InformationItem
import time
import random
import json
class MongoDBPipeline(object):
    def __init__(self):
        client = pymongo.MongoClient("your-server-ip", 27017)
        db = client["Sina"]
        self.Information = db["Information"]
        self.Tweets = db["Tweets"]
        self.Relationships = db["Relationships"]
    def process_item(self, item, spider):
        """ 判断item的类型,并作相应的处理,再入数据库 """
        if isinstance(item, RelationshipsItem):
            try:
                self.Relationships.insert(dict(item))
            except Exception:
                pass
        elif isinstance(item, TweetsItem):
            try:
                self.Tweets.insert(dict(item))
                filename = item['filepath']
                # encode to UTF-8 so the unicode JSON string can be written safely in Python 2
                lines = (json.dumps(dict(item), ensure_ascii=False) + '\n').encode('utf-8')
                with open(filename, 'w') as f:
                    f.write(lines)
            except Exception as e:
                print e
        elif isinstance(item, InformationItem):
            try:
                self.Information.insert(dict(item))
            except Exception:
                pass
        return item
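Likewise, the pipeline only runs if it is enabled in settings.py. Assuming the file above is saved as sina/pipelines.py, a minimal sketch:

ITEM_PIPELINES = {
    'sina.pipelines.MongoDBPipeline': 300,  # the priority value is arbitrary
}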

6. For stable crawling, we also need to build a User-Agent pool and keep rotating it so the crawler masquerades as different browsers. The implementation is as follows:

#!/usr/bin/env python
# encoding: utf-8
""" User-Agents """
agents = [
    "Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; Nexus S Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Avant Browser/1.2.789rel1 (http://www.avantbrowser.com)",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
    "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB5",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729; .NET4.0E)",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110622 Firefox/6.0a2",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b4pre) Gecko/20100815 Minefield/4.0b4pre",
    "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0 )",
    "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)",
    "Mozilla/5.0 (Windows; U; Windows XP) Gecko MultiZilla/1.6.1.0a",
    "Mozilla/2.02E (Win95; U)",
    "Mozilla/3.01Gold (Win95; I)",
    "Mozilla/4.8 [en] (Windows NT 5.1; U)",
    "Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4) Gecko Netscape/7.1 (ax)",
    "HTC_Dream Mozilla/5.0 (Linux; U; Android 1.5; en-ca; Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.2; U; de-DE) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/234.40.1 Safari/534.6 TouchPad/1.0",
    "Mozilla/5.0 (Linux; U; Android 1.5; en-us; sdk Build/CUPCAKE) AppleWebkit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 1.5; en-us; htc_bahamas Build/CRB17) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 2.1-update1; de-de; HTC Desire 1.19.161.5 Build/ERE27) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 1.5; de-ch; HTC Hero Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 2.1; en-us; HTC Legend Build/cupcake) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 1.5; de-de; HTC Magic Build/PLAT-RC33) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1 FirePHP/0.3",
    "Mozilla/5.0 (Linux; U; Android 1.6; en-us; HTC_TATTOO_A3288 Build/DRC79) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 1.0; en-us; dream) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
    "Mozilla/5.0 (Linux; U; Android 1.5; en-us; T-Mobile G1 Build/CRB43) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari 525.20.1",
    "Mozilla/5.0 (Linux; U; Android 1.5; en-gb; T-Mobile_G2_Touch Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Droid Build/FRG22D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Milestone Build/ SHOLS_U2_01.03.1) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.0.1; de-de; Milestone Build/SHOLS_U2_01.14.0) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
    "Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522  (KHTML, like Gecko) Safari/419.3",
    "Mozilla/5.0 (Linux; U; Android 1.1; en-gb; dream) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
    "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-ca; GT-P1000M Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 3.0.1; fr-fr; A500 Build/HRI66) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
    "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
    "Mozilla/5.0 (Linux; U; Android 1.6; es-es; SonyEricssonX10i Build/R1FA016) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 1.6; en-us; SonyEricssonX10i Build/R1AA056) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1"]

In process_request(), the middleware picks a random entry from this user-agent list and attaches it to the outgoing request (the middleware imports the list as sina.user_agents, so save this file as sina/user_agents.py).
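With all the files in place (the imports above suggest a standard Scrapy project whose package is named sina and holds config.py, cookies.py, user_agents.py, middlewares.py, and pipelines.py), the crawl is started with the name defined on the spider class:

scrapy crawl SinaSpider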

That covers just about everything, thanks for reading! If you have any questions, leave a comment and I will reply in detail. (This is my first Jianshu post, please bear with me.) Follow-up articles on crawling Zhihu, Toutiao, and other sites are coming!

    Original author: 可爱的小虫虫
    Original article: https://www.jianshu.com/p/1890e9b3ba37