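This post shows how to scrape basic information about new housing developments from Lianjia (链家): the address, the development name, and the average price.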
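1. Analyzing the rough page structure
- Site entry point
Taking Guangzhou as the example, the entry page for Lianjia new homes is https://gz.fang.lianjia.com/loupan/ (page title: 广州新房_广州买房_广州房产信息网(广州链家新房))
Elements that will probably be needed:
- Link to an individual development: likely useful later as a selector target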
<a target="_blank" href="/loupan/p_wksjcaatwy/" data-index="1" data-el="xinfang">万科山景城</a>
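- Summary information for a development: can the inner span elements be reached directly by a selector? (Yes: section 2.2.3 does this with find and a class name.)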
<div class="where">
<span class="region">黄埔-萝岗区永顺大道以南</span>
</div>
<div class="price">
<div class="average">
均价
<span class="num">26000</span>
元/平
</div>
</div>
- How pagination works:
https://gz.fang.lianjia.com/loupan/pg3/
This may involve escaping and the page-data attribute. Front-end work is not my strong suit, so I will come back to it later.
<div class="page-box house-lst-page-box" comp-module="page" data-xftrack="10139" page-url="/loupan/pg{page}/" page-data="{"totalPage":73,"curPage":3}"><a href="/loupan/pg2/" data-page="2">上一页</a><a href="/loupan/" data-page="1">1</a><a href="/loupan/pg2/" data-page="2">2</a><a class="on" href="/loupan/pg3/" data-page="3">3</a><a href="/loupan/pg4/" data-page="4">4</a><a href="/loupan/pg5/" data-page="5">5</a><span>...</span><a href="/loupan/pg73/" data-page="73">73</a><a href="/loupan/pg4/" data-page="4">下一页</a></div>
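The pager markup above carries everything needed to enumerate the list pages: page-url is a template and page-data holds the page count as JSON. A minimal Python 3 sketch using only the standard library follows. It assumes the raw HTML writes the JSON quotes as &quot; entities (the snippet above looks like a browser-inspector rendering with the entities already decoded; this was not verified against the raw response). The names PageBoxParser and page_urls are invented for the example. Page 1 is taken from the entry URL because the pager links it as /loupan/ rather than through the template.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://gz.fang.lianjia.com"

# Assumed raw form of the pager div: JSON quotes escaped as &quot; entities.
SAMPLE = (
    '<div class="page-box house-lst-page-box" page-url="/loupan/pg{page}/" '
    'page-data="{&quot;totalPage&quot;:73,&quot;curPage&quot;:3}"></div>'
)


class PageBoxParser(HTMLParser):
    """Collects the page-url and page-data attributes of the pager div."""

    def __init__(self):
        super().__init__()
        self.page_url = None
        self.page_data = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and "page-data" in attrs:
            self.page_url = attrs.get("page-url")
            # HTMLParser has already turned &quot; back into plain quotes.
            self.page_data = json.loads(attrs["page-data"])


def page_urls(html, entry_path="/loupan/"):
    """Return absolute URLs for page 1 through totalPage."""
    parser = PageBoxParser()
    parser.feed(html)
    total = parser.page_data["totalPage"]
    paths = [entry_path]
    paths += [parser.page_url.format(page=n) for n in range(2, total + 1)]
    return [urljoin(BASE_URL, path) for path in paths]


urls = page_urls(SAMPLE)
print(len(urls), urls[2])
```

Reading the page count from page-data avoids hard-coding 73, which will change as listings come and go.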
2. Scraping the data
Goal: extract each development's address and average price.
2.1 Installing the library
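The one library clearly needed so far is BeautifulSoup, so install that first. The code below also passes "lxml" as the parser name; lxml is a separate package (pip install lxml).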
pip install beautifulsoup4
2.2 Getting started
2.2.1 Requesting a web page from Python (adapted from other people's code; see the references)
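# Python 2 code: urllib2 and the print statement are not available in Python 3
import urllib2
response = urllib2.urlopen("https://gz.fang.lianjia.com/loupan/")
# Dump the raw HTML of the listing page
print response.read()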
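urllib2 exists only in Python 2. For readers on Python 3, the same request might look like the sketch below, which uses urllib.request from the standard library. Sending a User-Agent header and the timeout value are assumptions made here (whether the site requires either was not checked), and build_request and fetch are illustrative names.

```python
import urllib.request

LIST_URL = "https://gz.fang.lianjia.com/loupan/"


def build_request(url, user_agent="Mozilla/5.0 (compatible; demo-crawler)"):
    """Wrap the URL in a Request that carries an explicit User-Agent (assumed value)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Return the response body as bytes; raises urllib.error.URLError on failure."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


request = build_request(LIST_URL)
print(request.full_url)
# html_bytes = fetch(LIST_URL)  # uncomment to perform the actual network request
```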
2.2.2 Using BeautifulSoup
The raw material is ready; the next step is to feed it to BeautifulSoup.
from bs4 import BeautifulSoup
import urllib2
response = urllib2.urlopen("https://gz.fang.lianjia.com/loupan/")
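# Without the second argument BeautifulSoup emits a warning about the unspecified parser, but it still works
soup = BeautifulSoup(response.read(), "lxml")
# Page title as a unicode string; evaluating the bare name displays it in an interactive session
unicode_string = unicode(soup.title.string)
unicode_string
# Every development on the page sits in its own div.info-panel
info_panels = soup.find_all("div", class_="info-panel")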
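In Python 3, response.read() returns bytes and there is no unicode() built-in. BeautifulSoup can work out the encoding from the bytes by itself, but when the text is needed before parsing, one option is to decode with the charset named in the Content-Type response header. A minimal sketch, assuming the server sends that header and falling back to UTF-8 otherwise; decode_body is an illustrative name, and the header object is built by hand so the example runs offline.

```python
from email.message import Message


def decode_body(raw_bytes, headers, default="utf-8"):
    """Decode a response body using the charset from its Content-Type header."""
    charset = headers.get_content_charset() or default
    return raw_bytes.decode(charset, errors="replace")


# Hand-built stand-in for response.headers, which offers the same method.
headers = Message()
headers["Content-Type"] = "text/html; charset=utf-8"
raw = "<title>广州新房</title>".encode("utf-8")
print(decode_body(raw, headers))
```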
2.2.3 Parsing a development's basic information
<li data-index="0" data-id="">
<div class="pic-panel">
<a target="_blank" data-xftrack="10138" href="/loupan/p_wksjcaatwy/" data-index="1" data-el="xinfang"><span class="coverpic-type coverpic-pos-rt">效果图</span><img src="https://image1.ljcdn.com/xf-resblock/403031f8-e1f8-4a62-b7b2-6b041b801633.jpg.239x174.jpg" data-original="https://image1.ljcdn.com/xf-resblock/403031f8-e1f8-4a62-b7b2-6b041b801633.jpg.239x174.jpg" class="lj-lazy" alt="万科山景城" style="display: inline;"></a>
</div>
<div class="info-panel">
<div class="col-1">
<h2>
<a target="_blank" href="/loupan/p_wksjcaatwy/" data-index="1" data-el="xinfang">万科山景城</a>
</h2>
<div class="where">
<span class="region">黄埔-萝岗区永顺大道以南</span>
</div>
<div class="area">
5居/4居/3居/2居
- <span>建面 65~112m²</span>
</div>
<div class="other">
<span>五证齐全</span>
<span>车位充足</span>
<span>复式</span>
<span>普通住宅</span>
<span>山景地产</span>
</div>
<div class="type">
<span class="onsold">在售</span>
<span class="live">住宅</span>
<span class="allfive">五证齐全</span>
</div>
</div>
<div class="col-2">
<div class="price">
<div class="average">
均价
<span class="num">26000</span>
元/平
</div>
</div>
</div>
</div>
<div class="huxing-picture">
<i class="gap"></i>
<div class="huxing-picture-box">
<i class="pre disable-pre"></i>
<div class="huxing-container">
<ul class="huxing-picture-content" style="width: 1692px;">
<li>
<a target="_blank" href="/loupan/p_wksjcaatwy/" data-index="1" data-el="xinfang">
<img src="https://s1.ljcdn.com/xinfang/pc/asset/img/new-version/default_block.png?_v=20171113171428" data-original="https://image1.ljcdn.com/x-xf/xhdic-frame/7572941a-4be6-41c5-9a8f-635af8c38a37.jpg.163x123.jpg" class="lj-lazy " alt="万科山景城-3室2厅2卫 (建面 92m²)">
<div title="3室2厅2卫 (建面 92m²)">
3室2厅2卫 (建面 92m²)
</div>
</a>
</li>
<li>
<a target="_blank" href="/loupan/p_wksjcaatwy/" data-index="1" data-el="xinfang">
<img src="https://s1.ljcdn.com/xinfang/pc/asset/img/new-version/default_block.png?_v=20171113171428" data-original="https://image1.ljcdn.com/x-xf/xhdic-frame/f2961405-513c-4dd4-b955-c20521ea978e.jpg.163x123.jpg" class="lj-lazy " alt="万科山景城-3室2厅2卫 (建面 92m²)">
<div title="3室2厅2卫 (建面 92m²)">
3室2厅2卫 (建面 92m²)
</div>
</a>
</li>
<li>
<a target="_blank" href="/loupan/p_wksjcaatwy/" data-index="1" data-el="xinfang">
<img src="https://s1.ljcdn.com/xinfang/pc/asset/img/new-version/default_block.png?_v=20171113171428" data-original="https://image1.ljcdn.com/x-xf/xhdic-frame/7a63608e-67ec-4eb3-baba-e6cb5d7e80e0.jpg.163x123.jpg" class="lj-lazy " alt="万科山景城-4室2厅2卫 (建面 112m²)">
<div title="4室2厅2卫 (建面 112m²)">
4室2厅2卫 (建面 112m²)
</div>
</a>
</li>
<li>
<a target="_blank" href="/loupan/p_wksjcaatwy/" data-index="1" data-el="xinfang">
<img src="https://s1.ljcdn.com/xinfang/pc/asset/img/new-version/default_block.png?_v=20171113171428" data-original="https://image1.ljcdn.com/x-xf/xhdic-frame/23a95516-4077-46a6-a048-a5d3c6df0e78.jpg.163x123.jpg" class="lj-lazy " alt="万科山景城-2室2厅1卫 (建面 65m²)">
<div title="2室2厅1卫 (建面 65m²)">
2室2厅1卫 (建面 65m²)
</div>
</a>
</li>
<li>
<a target="_blank" href="/loupan/p_wksjcaatwy/" data-index="1" data-el="xinfang">
<img src="https://s1.ljcdn.com/xinfang/pc/asset/img/new-version/default_block.png?_v=20171113171428" data-original="https://image1.ljcdn.com/x-xf/xhdic-frame/52b2813e-1647-4a3f-a642-460e65288eaf.jpg.163x123.jpg" class="lj-lazy " alt="万科山景城-3室2厅1卫 (建面 84m²)">
<div title="3室2厅1卫 (建面 84m²)">
3室2厅1卫 (建面 84m²)
</div>
</a>
</li>
<li>
<a target="_blank" href="/loupan/p_wksjcaatwy/" data-index="1" data-el="xinfang">
<img src="https://s1.ljcdn.com/xinfang/pc/asset/img/new-version/default_block.png?_v=20171113171428" data-original="https://image1.ljcdn.com/x-xf/xhdic-frame/835fd792-401e-473d-8b0f-4585c2321656.jpg.163x123.jpg" class="lj-lazy " alt="万科山景城-3室2厅1卫 (建面 84m²)">
<div title="3室2厅1卫 (建面 84m²)">
3室2厅1卫 (建面 84m²)
</div>
</a>
</li>
<li>
<a target="_blank" href="/loupan/p_wksjcaatwy/" data-index="1" data-el="xinfang">
<img src="https://s1.ljcdn.com/xinfang/pc/asset/img/new-version/default_block.png?_v=20171113171428" data-original="https://image1.ljcdn.com/x-xf/xhdic-frame/0b1e3d11-7177-450c-b021-8bc5005e4c51.jpg.163x123.jpg" class="lj-lazy " alt="万科山景城-5室2厅3卫 (建面 85m²)">
<div title="5室2厅3卫 (建面 85m²)">
5室2厅3卫 (建面 85m²)
</div>
</a>
</li>
<li>
<a target="_blank" href="/loupan/p_wksjcaatwy/" data-index="1" data-el="xinfang">
<img src="https://s1.ljcdn.com/xinfang/pc/asset/img/new-version/default_block.png?_v=20171113171428" data-original="https://image1.ljcdn.com/x-xf/xhdic-frame/cdd913e6-6b69-4aa2-a3b7-32c90a75959d.jpg.163x123.jpg" class="lj-lazy " alt="万科山景城-3室2厅2卫 (建面 95m²)">
<div title="3室2厅2卫 (建面 95m²)">
3室2厅2卫 (建面 95m²)
</div>
</a>
</li>
<li>
<a target="_blank" href="/loupan/p_wksjcaatwy/" data-index="1" data-el="xinfang">
<img src="https://s1.ljcdn.com/xinfang/pc/asset/img/new-version/default_block.png?_v=20171113171428" data-original="https://image1.ljcdn.com/x-xf/xhdic-frame/2e4c8326-04bc-4cff-a5c2-cc95633b1c63.jpg.163x123.jpg" class="lj-lazy " alt="万科山景城-3室2厅2卫 (建面 95m²)">
<div title="3室2厅2卫 (建面 95m²)">
3室2厅2卫 (建面 95m²)
</div>
</a>
</li>
</ul>
</div>
<i class="after"></i>
</div>
</div>
</li>
For now, focus on the basic information, which lives in the info-panel div:
<div class="info-panel">
<div class="col-1">
<h2>
<a target="_blank" href="/loupan/p_wksjcaatwy/" data-index="1" data-el="xinfang">万科山景城</a>
</h2>
<div class="where">
<span class="region">黄埔-萝岗区永顺大道以南</span>
</div>
<div class="area">
5居/4居/3居/2居
- <span>建面 65~112m²</span>
</div>
<div class="other">
<span>五证齐全</span>
<span>车位充足</span>
<span>复式</span>
<span>普通住宅</span>
<span>山景地产</span>
</div>
<div class="type">
<span class="onsold">在售</span>
<span class="live">住宅</span>
<span class="allfive">五证齐全</span>
</div>
</div>
<div class="col-2">
<div class="price">
<div class="average">
均价
<span class="num">26000</span>
元/平
</div>
</div>
</div>
</div>
info_panels = soup.find_all("div", class_="info-panel")
# Development location
info_panels[0].find("span", class_='region').string
# Development name (the first link inside the panel)
info_panels[0].find("a").string
# Average price
info_panels[0].find("span", class_='num').string
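The three lookups above can be tried without touching the network. The following is a minimal sketch for Python 3 (the code in this post is Python 2), run against a trimmed copy of the markup shown earlier. It swaps in the built-in "html.parser" so that lxml is not required, uses select_one with CSS selectors as an alternative to find, and reads text with get_text(strip=True). The helper names text_of and parse_base_info and the SAMPLE string are made up for this illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of one info-panel from the listing page shown above.
SAMPLE = """
<div class="info-panel">
  <div class="col-1">
    <h2><a target="_blank" href="/loupan/p_wksjcaatwy/">万科山景城</a></h2>
    <div class="where"><span class="region">黄埔-萝岗区永顺大道以南</span></div>
  </div>
  <div class="col-2">
    <div class="price"><div class="average">均价 <span class="num">26000</span> 元/平</div></div>
  </div>
</div>
"""


def text_of(panel, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = panel.select_one(selector)
    return node.get_text(strip=True) if node is not None else None


def parse_base_info(panel):
    """Hypothetical helper: (region, name, price) for one info-panel div."""
    return (
        text_of(panel, "span.region"),
        text_of(panel, "h2 a"),
        text_of(panel, "span.num"),
    )


soup = BeautifulSoup(SAMPLE, "html.parser")
panels = soup.select("div.info-panel")
print([parse_base_info(panel) for panel in panels])
```

Returning None for a missing element, rather than calling .string on the result of a failed find, keeps one incomplete listing from stopping the whole run with an AttributeError.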
3. Final code
# Python 2 code (urllib2, print statement)
from bs4 import BeautifulSoup
import urllib2

def parseUrl(url):
    # Fetch the page and parse it with the lxml parser
    response = urllib2.urlopen(url)
    soup = BeautifulSoup(response.read(), "lxml")
    return soup

def parseBaseInfo(item):
    # Return (region, name, average price) for one info-panel div
    return (item.find("span", class_='region').string,
            item.find("a").string,
            item.find("span", class_='num').string)

soup = parseUrl("https://gz.fang.lianjia.com/loupan/")
info_panels = soup.find_all("div", class_="info-panel")
baseInfos = list(map(parseBaseInfo, info_panels))
for (region, name, price) in baseInfos:
    print region + '  ' + name + ' ' + price + '\n'
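The final script only reads the first page. One way to extend it to the remaining pages is to loop over the page URLs with a short pause between requests. The sketch below is written for Python 3 and takes the fetching and parsing steps as parameters, so it makes no assumption about how they are implemented. crawl, fetch, parse and delay_seconds are names invented for this illustration, and the one-second default pause is an arbitrary choice.

```python
import time


def crawl(page_urls, fetch, parse, delay_seconds=1.0):
    """Fetch every URL in order and collect the rows that parse() returns.

    fetch(url) -> page content; parse(content) -> list of (region, name, price).
    Both are supplied by the caller (hypothetical interfaces).
    """
    rows = []
    for index, url in enumerate(page_urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause between requests, not before the first
        rows.extend(parse(fetch(url)))
    return rows


# Offline demonstration with stand-in data instead of real HTTP requests.
fake_pages = {
    "https://gz.fang.lianjia.com/loupan/": [("黄埔-萝岗区永顺大道以南", "万科山景城", "26000")],
    "https://gz.fang.lianjia.com/loupan/pg2/": [],
}
rows = crawl(list(fake_pages), fetch=fake_pages.get, parse=list, delay_seconds=0)
for region, name, price in rows:
    print(region, name, price)
```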
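4. References used along the way
How to correctly parse UTF-8 encoded HTML to Unicode strings with BeautifulSoup?
Writing log:
2017.11.20 First draft, part 1: up to 2.2.1
2017.11.21 First draft, part 2: up to 2.2.2; progress is a bit slow
2017.11.22 First draft, part 3: up to 3; mostly in shape, formatting to be tidied later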