Scraping all Xici Proxy (西刺代理) data with Python!

April 19, 2019 · 318 views · Source: 萌新程序员

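Today we will scrape all the data from Xici Proxy (西刺代理) and store it in MongoDB.

First, we look at how the site's URLs are built and find the pattern. Here we can see that there are 3639 pages in total.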

Jumping to page 3639, we can see that the URL pattern is: https://www.xicidaili.com/nn/xxxx, where xxxx is the page number.
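
The URL pattern can be sketched as a small helper. This is only an illustration: the names `BASE_URL`, `LAST_PAGE` and `page_url` are hypothetical and do not appear in the original script, and the page count is the one observed above.

```python
BASE_URL = 'https://www.xicidaili.com/nn/'
LAST_PAGE = 3639  # page count observed in the article


def page_url(page):
    """Return the listing URL for a 1-based page number."""
    if not 1 <= page <= LAST_PAGE:
        raise ValueError(f'page must be between 1 and {LAST_PAGE}')
    return f'{BASE_URL}{page}'


# Build the full list of listing URLs, first page to last page
urls = [page_url(p) for p in range(1, LAST_PAGE + 1)]
print(urls[0])
print(urls[-1])
print(len(urls))
```

Note that `range(1, LAST_PAGE + 1)` is needed to include the last page, which is why the script loops over `range(1,3640)`.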

Next we analyze the page source and use a regular expression to match the data we want.
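
One way to check such a pattern without hitting the site is to run it against a saved snippet. The sketch below uses the same regular expression as the script, but `SAMPLE_HTML` is a hand-written, hypothetical row that only approximates the listing markup, so treat it as an assumption rather than the real page source.

```python
import re

# Hypothetical table row, written by hand to resemble the listing markup
SAMPLE_HTML = '''
<tr class="odd">
  <td class="country"><img src="flag.png" alt="Cn" /></td>
  <td>203.0.113.7</td>
  <td>9999</td>
  <td>
    <a href="/hubei">Hubei</a>
  </td>
  <td class="country">high-anonymity</td>
  <td>HTTPS</td>
  <td class="country"><div title="0.3s" class="bar"></div></td>
  <td class="country"><div title="0.1s" class="bar"></div></td>
  <td>1 minute</td>
  <td>19-04-19 10:20</td>
</tr>
'''

# Same pattern as in the script, written as raw strings and compiled once
ROW_PATTERN = re.compile(
    r'<td>(.*?)</td>[\s\S]*?<td>(.*?)</td>[\s\S]*?'
    r'<a.*?>(.*?)</a>[\s\S]*?<td.*?country">(.*?)'
    r'</td>[\s\S]*?<td>(.*?)</td>[\s\S]*?<td>(.*?)'
    r'</td>[\s\S]*?<td>(.*?)</td>'
)

# findall returns one 7-tuple per matched row
rows = ROW_PATTERN.findall(SAMPLE_HTML)
for row in rows:
    print(row)
```

Because the pattern only matches bare `<td>` cells (plus the one `country` cell it names explicitly), the cells that carry a `class` attribute and contain `<div>` bars are skipped.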

After debugging the regular expression, the fetch-and-match loop looks like this:

import re
import requests

for i in range(1,3640):
    # Build the URL of page i
    url='https://www.xicidaili.com/nn/'+str(i)
    # Build the request headers
    headers={
        'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'
        }
    # Request the URL and get the page source
    html=requests.get(url,headers=headers).text
    # Build the regular expression (raw strings keep \s and \S intact)
    _retext=(r'<td>(.*?)</td>[\s\S]*?<td>(.*?)</td>[\s\S]*?'
             +r'<a.*?>(.*?)</a>[\s\S]*?<td.*?country">(.*?)'
             +r'</td>[\s\S]*?<td>(.*?)</td>[\s\S]*?<td>(.*?)'
             +r'</td>[\s\S]*?<td>(.*?)</td>')
    # Extract every matching row from the page
    content=re.findall(_retext,html)

Once we have the data, we connect to MongoDB.

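import pymongo

# Connect to MongoDB
client=pymongo.MongoClient(host='localhost',port=27017)
# Select the 'xicidata' database (MongoDB creates it on the first write if it does not exist)
db=client['xicidata']
# Select the 'data' collection (also created on the first write)
t1=db['data']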

Then we write the data into the 'data' collection. This loop runs inside the page loop, hence the indentation.

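    for j in content:
        # Put the captured fields of each row into a dict
        _data={'IP地址':j[0],   # IP address
               'port':j[1],     # port
               'adresse':j[2],  # text of the <a> link
               'N':j[3],        # text of the <td class="country"> cell
               'type':j[4],     # proxy type
               'cunhuo':j[5],   # survival time ("cunhuo" is pinyin for 存活)
               'time':j[6]      # last captured cell, a time value
               }
        # Insert the dict into the collection
        t1.insert_one(_data)
        print('writing...')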
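
As a variation on the loop above (not the author's code), the tuples from `re.findall` could be turned into dicts in one step and written with a single call per page. In this sketch `FIELDS` and `rows_to_docs` are hypothetical names, the sample tuple is made up, and the MongoDB call is left as a comment so the snippet runs without a database.

```python
# Keys in the same order as the capture groups in the regular expression
FIELDS = ('IP地址', 'port', 'adresse', 'N', 'type', 'cunhuo', 'time')


def rows_to_docs(content):
    """Turn the tuples returned by re.findall into dicts keyed by FIELDS."""
    return [dict(zip(FIELDS, row)) for row in content]


# Made-up row standing in for one re.findall result
sample = [('203.0.113.7', '9999', 'Hubei', 'high-anonymity',
           'HTTPS', '1 minute', '19-04-19 10:20')]
docs = rows_to_docs(sample)
print(docs[0]['port'])

# Inside the page loop, the per-row insert_one calls could then become:
#     if docs:                  # insert_many rejects an empty list
#         t1.insert_many(docs)
```

Batching trades the per-row progress message for fewer round trips to the database; whether that matters here depends on how many rows each page yields.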

Finally, here is the complete code.

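import requests
import re
import pymongo
from time import sleep
# Connect to MongoDB
client=pymongo.MongoClient(host='localhost',port=27017)
# Select the 'xicidata' database (MongoDB creates it on the first write if it does not exist)
db=client['xicidata']
# Select the 'data' collection (also created on the first write)
t1=db['data']
for i in range(1,3640):
    # Build the URL of page i
    url='https://www.xicidaili.com/nn/'+str(i)
    # Build the request headers
    headers={
        'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'
        }
    # Request the URL and get the page source
    html=requests.get(url,headers=headers).text
    # Pause 5 seconds between requests
    sleep(5)
    # Build the regular expression (raw strings keep \s and \S intact)
    _retext=(r'<td>(.*?)</td>[\s\S]*?<td>(.*?)</td>[\s\S]*?'
             +r'<a.*?>(.*?)</a>[\s\S]*?<td.*?country">(.*?)'
             +r'</td>[\s\S]*?<td>(.*?)</td>[\s\S]*?<td>(.*?)'
             +r'</td>[\s\S]*?<td>(.*?)</td>')
    # Extract every matching row from the page
    content=re.findall(_retext,html)
    # Iterate over the matched rows
    for j in content:
        # Put the captured fields of each row into a dict
        _data={'IP地址':j[0],   # IP address
               'port':j[1],     # port
               'adresse':j[2],  # text of the <a> link
               'N':j[3],        # text of the <td class="country"> cell
               'type':j[4],     # proxy type
               'cunhuo':j[5],   # survival time ("cunhuo" is pinyin for 存活)
               'time':j[6]      # last captured cell, a time value
               }
        # Insert the dict into the collection
        t1.insert_one(_data)
        print('writing...')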
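
The script above stops at the first network error, which is a risk over several thousand requests. A minimal sketch of a retry wrapper follows. It is an addition, not part of the original code: `fetch_with_retry` and its parameters are hypothetical names, and the fetching function is passed in so the helper itself needs only the standard library.

```python
import time


def fetch_with_retry(get, url, retries=3, delay=5.0, sleep=time.sleep):
    """Call get(url) up to `retries` times, pausing `delay` seconds after a failure.

    Returns the first successful result, or None if every attempt raised.
    """
    for attempt in range(1, retries + 1):
        try:
            return get(url)
        except Exception as exc:  # broad on purpose: this is only a sketch
            print(f'attempt {attempt} for {url} failed: {exc}')
            if attempt < retries:
                sleep(delay)
    return None
```

In the script, `get` could be something like `lambda u: requests.get(u, headers=headers, timeout=10).text`, and a page whose result is `None` could simply be skipped with `continue`.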

Using a PyCharm plugin, we can see that MongoDB created the database 'xicidata' and, inside it, the collection 'data'.

The data was written successfully!

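    Original author: 萌新程序员
    Original URL: https://zhuanlan.zhihu.com/p/60886263
    This article is reposted from the web to share knowledge only. If it infringes your rights, please contact the blog owner to have it removed.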