Learning Web Scraping from Scratch (1): Scraping Second-Hand Housing Listings from Fang.com (房天下)

July 1, 2019. Source: 龙少伊

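These are my notes from learning web scraping, kept for my own reference. If they help anyone else, even better.
Learning Web Scraping from Scratch (1): Scraping Second-Hand Housing Listings from Fang.com (房天下)
Learning Web Scraping from Scratch (2): Getting Past the Limit and Crawling by Category to Collect All the Data
Learning Web Scraping from Scratch (3): Retrieving Scraped Data via a MongoDB Database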

I. Environment Setup

1. Install python-3.5.2-amd64 and pycharm-community-2016.3.2
See http://jingyan.baidu.com/article/a17d5285ed78e88098c8f222.html?st=2&net_type=&bd_page_type=1&os=0&rst=&word=www.10010
2. Install the libraries used for page parsing
Run Command Prompt as administrator, then run:
1) pip3 install beautifulsoup4
2) pip3 install requests
3) pip3 install lxml
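
To confirm that the three installs worked, a quick check along these lines may help. This is only a sketch: `check_modules` and `REQUIRED_MODULES` are names made up for illustration, and the check only tests whether each module can be found, not whether it is the right version.

```python
import importlib.util

# Import names differ from pip package names: beautifulsoup4 is imported as bs4.
REQUIRED_MODULES = ['bs4', 'requests', 'lxml']


def check_modules(names):
    """Map each module name to True if Python can locate it, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


for name, found in check_modules(REQUIRED_MODULES).items():
    print(name, 'OK' if found else 'MISSING')
```

Any module reported as MISSING can be installed again with the matching pip3 command above.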

II. Source Code

1. Import the BeautifulSoup and requests libraries

# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup
import requests

2. Write the main scraper function

def spider_1(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text,'lxml')

    titles = soup.select('dd > p.title > a')            # listing title
    hrefs = soup.select('dd > p.title > a')             # link (same <a> element as the title)
    details = soup.select('dd > p.mt12')                # building details
    courts = soup.select('dd > p:nth-of-type(3) > a')   # residential complex
    adds = soup.select('dd > p:nth-of-type(3) > span')  # address
    areas = soup.select('dd > div.area.alignR > p:nth-of-type(1)')     # floor area
    prices = soup.select('dd > div.moreInfo > p:nth-of-type(1) > span.price')  # total price
    danjias = soup.select('dd > div.moreInfo > p.danjia.alignR.mt5')    # unit price
    authors = soup.select('dd > p.gray6.mt10 > a')      # poster
    tags = soup.select('dd > div.mt8.clearfix > div.pt4.floatl')   # tags

    for title, href, detail, court, add, area, price, danjia, author, tag in zip(titles, hrefs, details, courts, adds, areas, prices, danjias, authors, tags):
        data = {
            'title': title.get_text(),
            'href': 'http://esf.xian.fang.com' + href.get('href'),
            'detail': list(detail.stripped_strings),
            'court': court.get_text(),
            'add': add.get_text(),
            'area': area.get_text(),
            'price': price.get_text(),
            'danjia': danjia.get_text(),
            'author': author.get_text(),
            'tag': list(tag.stripped_strings)
        }
        print(data)
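
One thing to watch in the function above: it builds ten parallel lists and zips them, so if any listing lacks a field (for example, no poster link), the lists get different lengths and zip silently pairs fields from different listings. A possible alternative, sketched below, is to loop over each dd element and look up the fields inside it. The sample HTML is invented for illustration and only imitates the selectors used above, `parse_listings` and `text_or_none` are hypothetical helper names, and 'html.parser' is used here only so the sketch does not depend on lxml.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'http://esf.xian.fang.com'  # same host the original code prepends

# Invented sample markup; the second listing deliberately has no poster link.
SAMPLE_HTML = """
<dl>
  <dd>
    <p class="title"><a href="/chushou/1.htm">Listing A</a></p>
    <p class="gray6 mt10"><a href="/agent/1">Agent A</a></p>
  </dd>
  <dd>
    <p class="title"><a href="/chushou/2.htm">Listing B</a></p>
  </dd>
</dl>
"""


def text_or_none(node):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return node.get_text(strip=True) if node is not None else None


def parse_listings(html):
    """Build one dict per <dd>, looking up each field inside that <dd> only."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for dd in soup.select('dd'):
        title = dd.select_one('p.title > a')
        author = dd.select_one('p.gray6.mt10 > a')
        rows.append({
            'title': text_or_none(title),
            # urljoin handles relative and absolute href values alike
            'href': urljoin(BASE_URL, title.get('href')) if title is not None else None,
            'author': text_or_none(author),
        })
    return rows


for row in parse_listings(SAMPLE_HTML):
    print(row)
```

With this layout a missing field shows up as None in its own listing instead of shifting every later value by one row.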

3. Call spider_1 to scrape a given page

spider_1('http://esf.xian.fang.com/')
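
The call above relies on requests.get with no timeout and no status check, so a hung connection or an error page would go unnoticed. A more defensive fetch might look like the sketch below. `fetch_html` and `DEFAULT_HEADERS` are hypothetical names, the User-Agent value is a placeholder, and whether the site requires any particular header is an assumption, not something the original post states.

```python
import requests

# Placeholder header value (an assumption); a real browser User-Agent string could be used instead.
DEFAULT_HEADERS = {'User-Agent': 'Mozilla/5.0 (compatible; learning-crawler)'}


def fetch_html(url, session=None, timeout=10):
    """Return the page body; raise on connection errors, timeouts, or HTTP 4xx/5xx."""
    if session is None:
        session = requests.Session()
    response = session.get(url, headers=DEFAULT_HEADERS, timeout=timeout)
    response.raise_for_status()  # turns error status codes into exceptions
    return response.text
```

Inside spider_1, the two lines that fetch and parse could then become soup = BeautifulSoup(fetch_html(url), 'lxml'). Passing one shared requests.Session across pages also reuses the underlying connection.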

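4. Loop through the pages to scrape all the listings
Each page shows only 30 listings and there are 100 pages in total, so a loop calls spider_1 on every page to scrape them all.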

# Loop over pages 2 through 100 (page 1 was scraped by the call above)
for page in range(2, 101):
    url = 'http://esf.xian.fang.com/house/i3' + str(page)
    spider_1(url)
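
As written, the loop stops at the first page that raises an exception and sends requests back to back. A possible variant, sketched below, separates URL generation from crawling, pauses between requests, and records failed pages so they can be retried. `page_urls`, `crawl`, and the one-second delay are illustrative assumptions; only the URL pattern comes from the loop above.

```python
import time

BASE_PAGE_URL = 'http://esf.xian.fang.com/house/i3'  # URL pattern from the loop above


def page_urls(first=2, last=100):
    """Yield the URLs for pages first..last, inclusive."""
    for page in range(first, last + 1):
        yield BASE_PAGE_URL + str(page)


def crawl(urls, scrape, delay=1.0):
    """Call scrape(url) for each URL and return the URLs that raised an error."""
    failed = []
    for url in urls:
        try:
            scrape(url)
        except Exception as exc:  # broad on purpose: one bad page should not end the run
            print('failed:', url, exc)
            failed.append(url)
        time.sleep(delay)  # pause between requests to go easy on the server
    return failed
```

It could be used as failed = crawl(page_urls(), spider_1), followed by a second crawl(failed, spider_1) pass for any pages that did not load the first time.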

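Fang.com's second-hand listings update in real time and are sorted by publication time by default, so listings shift between pages while the crawl is running and some get scraped more than once. As the figure below shows, 523 of the 3,000 records were duplicates (crawling the pages in reverse order might be worth trying to avoid this).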
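
Whatever the crawl order, duplicates can also be filtered out afterwards. The sketch below assumes the records have been collected into a list of dicts (spider_1 as written only prints them) and that the 'href' value identifies a listing uniquely; `dedupe` is a hypothetical helper name.

```python
def dedupe(records, key='href'):
    """Return records with repeated key values removed, keeping the first occurrence."""
    seen = set()
    unique = []
    for record in records:
        value = record[key]
        if value in seen:
            continue  # already kept a record with this key
        seen.add(value)
        unique.append(record)
    return unique


sample = [
    {'href': '/chushou/1.htm', 'title': 'Listing A'},
    {'href': '/chushou/2.htm', 'title': 'Listing B'},
    {'href': '/chushou/1.htm', 'title': 'Listing A'},
]
print(len(dedupe(sample)))
```

Keeping the first occurrence preserves the order in which listings were scraped.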

[Figure: screenshot from the original post (图片.png)]

III. Summary

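Parsing a static page and scraping the information visible on it is relatively easy. I hear some sites also have anti-scraping mechanisms... The road ahead is long, and I will keep searching high and low!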

    Original author: 龙少伊
    Original URL: https://www.jianshu.com/p/666366e03e43