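Python: Scraping Lianjia, Part 1: Basic Information

This article shows how to scrape the basic information of new-home developments from Lianjia: the address, the development name, and the average price.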

1. A rough look at the page structure

  • Site entry point

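The entry point for Lianjia new homes in Guangzhou (used as the example here) is https://gz.fang.lianjia.com/loupan/, the page titled 广州新房_广州买房_广州房产信息网(广州链家新房).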

Some elements that will probably be useful:

  • Link to a specific development: likely useful later when writing selectors
<a target="_blank" href="/loupan/p_wksjcaatwy/" data-index="1" data-el="xinfang">万科山景城</a>
  • Summary information for a development: can the span inside be reached directly by a selector?
<div class="where">
                <span class="region">黄埔-萝岗区永顺大道以南</span>
				</div>

<div class="price">
					  <div class="average">
						  							  								  均价
								  <span class="num">26000</span>
								  元/平
							  						  					  </div>
                        						
					</div>
  • How pagination works:
https://gz.fang.lianjia.com/loupan/pg3/

This may involve escaping and the page-data attribute. My front-end skills are weak, so I'll deal with it when the time comes.

<div class="page-box house-lst-page-box" comp-module="page" data-xftrack="10139" page-url="/loupan/pg{page}/" page-data="{&quot;totalPage&quot;:73,&quot;curPage&quot;:3}"><a href="/loupan/pg2/" data-page="2">上一页</a><a href="/loupan/" data-page="1">1</a><a href="/loupan/pg2/" data-page="2">2</a><a class="on" href="/loupan/pg3/" data-page="3">3</a><a href="/loupan/pg4/" data-page="4">4</a><a href="/loupan/pg5/" data-page="5">5</a><span>...</span><a href="/loupan/pg73/" data-page="73">73</a><a href="/loupan/pg4/" data-page="4">下一页</a></div>
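
The escaping worry above is manageable: &quot; is the HTML entity for a double quote, so once the attribute is unescaped it is ordinary JSON. Below is a minimal sketch using only the standard library. The helper name build_page_urls and its parameters are hypothetical, and the attribute values are copied from the markup above. Page 1 is mapped to /loupan/ because that is where the page-1 link in the markup points.

```python
import html
import json


def build_page_urls(base, page_url, page_data, first_page="/loupan/"):
    """Return the list-page URLs described by the pager attributes.

    page_url  -- the page-url attribute, e.g. "/loupan/pg{page}/"
    page_data -- the page-data attribute, possibly still HTML-escaped
    """
    # &quot; becomes "; text that is already unescaped passes through unchanged
    info = json.loads(html.unescape(page_data))
    urls = []
    for n in range(1, info["totalPage"] + 1):
        # The sample markup links page 1 to /loupan/ rather than /loupan/pg1/
        path = first_page if n == 1 else page_url.replace("{page}", str(n))
        urls.append(base + path)
    return urls


urls = build_page_urls(
    "https://gz.fang.lianjia.com",
    "/loupan/pg{page}/",
    "{&quot;totalPage&quot;:73,&quot;curPage&quot;:3}",
)
print(len(urls), urls[0], urls[2])
```

If the attribute is read through BeautifulSoup (for example tag["page-data"]), the parser has already unescaped it, and html.unescape simply leaves it as it is.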

2. Start scraping

Goal: extract each development's address and average price

2.1 Install the libraries first

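The library we can foresee needing is BeautifulSoup. The code below also passes "lxml" as the parser, which is a separate package, so install both first

pip install beautifulsoup4 lxml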

2.2 Let's get started

2.2.1 How to request a web page in Python (adapted from someone else's code; see the references)

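# Python 3: urllib.request replaces Python 2's urllib2
from urllib.request import urlopen

response = urlopen("https://gz.fang.lianjia.com/loupan/")
# read() returns bytes; decode them, assuming the page is UTF-8
print(response.read().decode("utf-8"))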
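
A slightly more defensive variant of the request can be sketched for Python 3, where urllib2 was folded into urllib.request. Several things here are assumptions rather than facts about the site: that the page is UTF-8, and that sending a browser-like User-Agent is useful. The header value, the 10-second timeout, and the names build_request and fetch_html are illustrative only.

```python
from urllib.request import Request, urlopen

# Illustrative value only; any browser-like string could be used here
USER_AGENT = "Mozilla/5.0 (compatible; lianjia-tutorial-sketch)"


def build_request(url):
    """Build a GET request that carries a User-Agent header."""
    return Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(url, timeout=10):
    """Fetch a page and decode it, assuming the page is UTF-8."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read().decode("utf-8")


request = build_request("https://gz.fang.lianjia.com/loupan/")
print(request.full_url, request.get_header("User-agent"))
```

Calling fetch_html("https://gz.fang.lianjia.com/loupan/") performs the actual download; it is left out of the block so that the block runs offline.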

2.2.2 Using BeautifulSoup

We have the raw material; now we feed it to BeautifulSoup

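from bs4 import BeautifulSoup
from urllib.request import urlopen

response = urlopen("https://gz.fang.lianjia.com/loupan/")
# Without the second argument BeautifulSoup emits a warning, though it still works
soup = BeautifulSoup(response.read(), "lxml")
# The page title as a plain (Unicode) string
title = str(soup.title.string)
print(title)
# Every development's summary lives in a div with class "info-panel"
info_panels = soup.find_all("div", class_="info-panel")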
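
To experiment without hitting the site, the same calls can be run on an inline HTML string. This is only a sketch: the miniature page is made up (the title and the second region are placeholders), and it uses Python's built-in "html.parser" instead of "lxml" so that no extra parser is needed. Different parsers can disagree on malformed HTML, so results on the real page may differ.

```python
from bs4 import BeautifulSoup

# Made-up miniature page; only the class name "info-panel" mirrors the site
html_doc = """
<html><head><title>Example list page</title></head>
<body>
  <div class="info-panel"><span class="region">黄埔-萝岗区永顺大道以南</span></div>
  <div class="info-panel"><span class="region">placeholder region</span></div>
  <div class="other-panel"></div>
</body></html>
"""

# Naming the parser explicitly avoids the "no parser specified" warning
soup = BeautifulSoup(html_doc, "html.parser")

title = soup.title.string
# class_ has a trailing underscore because class is a Python keyword
info_panels = soup.find_all("div", class_="info-panel")

print(title, len(info_panels))
```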

2.2.3 Parsing a development's basic information

<li data-index="0" data-id="">
			<div class="pic-panel">
			  <a target="_blank" data-xftrack="10138" href="/loupan/p_wksjcaatwy/" data-index="1" data-el="xinfang"><span class="coverpic-type coverpic-pos-rt">效果图</span><img src="https://image1.ljcdn.com/xf-resblock/403031f8-e1f8-4a62-b7b2-6b041b801633.jpg.239x174.jpg" data-original="https://image1.ljcdn.com/xf-resblock/403031f8-e1f8-4a62-b7b2-6b041b801633.jpg.239x174.jpg" class="lj-lazy" alt="万科山景城" style="display: inline;"></a>
			</div>
			<div class="info-panel">
			  <div class="col-1">
		  	  <h2>
		  		<a target="_blank" href="/loupan/p_wksjcaatwy/" data-index="1" data-el="xinfang">万科山景城</a>
		  		          		  	  </h2>
				<div class="where">
                <span class="region">黄埔-萝岗区永顺大道以南</span>
				</div>
				<div class="area">
				  5居/4居/3居/2居
                   -                   <span>建面 65~112m²</span>
                </div>
          				<div class="other">
									  <span>五证齐全</span>
				  				  <span>车位充足</span>
				  				  <span>复式</span>
				  				  <span>普通住宅</span>
				  				  <span>山景地产</span>
				  				</div>
          				<div class="type">
									  <span class="onsold">在售</span>
				  				  				  <span class="live">住宅</span>
				                      						<span class="allfive">五证齐全</span>
                    				</div>
			  </div>
			  <div class="col-2">
					<div class="price">
					  <div class="average">
						  							  								  均价
								  <span class="num">26000</span>
								  元/平
							  						  					  </div>
                        						
					</div>
			  </div>
			</div>
						<div class="huxing-picture">
				<i class="gap"></i>
			  <div class="huxing-picture-box">
			  	<i class="pre disable-pre"></i>
			  	<div class="huxing-container">
					<ul class="huxing-picture-content" style="width: 1692px;">
											  <li>
					  	<a target="_blank" href="/loupan/p_wksjcaatwy/" data-index="1" data-el="xinfang">
						<img src="https://s1.ljcdn.com/xinfang/pc/asset/img/new-version/default_block.png?_v=20171113171428" data-original="https://image1.ljcdn.com/x-xf/xhdic-frame/7572941a-4be6-41c5-9a8f-635af8c38a37.jpg.163x123.jpg" class="lj-lazy " alt="万科山景城-3室2厅2卫 (建面 92m²)">
						<div title="3室2厅2卫 (建面 92m²)">
                              3室2厅2卫 (建面 92m²)
              						</div>
							</a>
					  </li>
					  					  <li>
					  	<a target="_blank" href="/loupan/p_wksjcaatwy/" data-index="1" data-el="xinfang">
						<img src="https://s1.ljcdn.com/xinfang/pc/asset/img/new-version/default_block.png?_v=20171113171428" data-original="https://image1.ljcdn.com/x-xf/xhdic-frame/f2961405-513c-4dd4-b955-c20521ea978e.jpg.163x123.jpg" class="lj-lazy " alt="万科山景城-3室2厅2卫 (建面 92m²)">
						<div title="3室2厅2卫 (建面 92m²)">
                              3室2厅2卫 (建面 92m²)
              						</div>
							</a>
					  </li>
					  					  <li>
					  	<a target="_blank" href="/loupan/p_wksjcaatwy/" data-index="1" data-el="xinfang">
						<img src="https://s1.ljcdn.com/xinfang/pc/asset/img/new-version/default_block.png?_v=20171113171428" data-original="https://image1.ljcdn.com/x-xf/xhdic-frame/7a63608e-67ec-4eb3-baba-e6cb5d7e80e0.jpg.163x123.jpg" class="lj-lazy " alt="万科山景城-4室2厅2卫 (建面 112m²)">
						<div title="4室2厅2卫 (建面 112m²)">
                              4室2厅2卫 (建面 112m²)
              						</div>
							</a>
					  </li>
					  					  <li>
					  	<a target="_blank" href="/loupan/p_wksjcaatwy/" data-index="1" data-el="xinfang">
						<img src="https://s1.ljcdn.com/xinfang/pc/asset/img/new-version/default_block.png?_v=20171113171428" data-original="https://image1.ljcdn.com/x-xf/xhdic-frame/23a95516-4077-46a6-a048-a5d3c6df0e78.jpg.163x123.jpg" class="lj-lazy " alt="万科山景城-2室2厅1卫 (建面 65m²)">
						<div title="2室2厅1卫 (建面 65m²)">
                              2室2厅1卫 (建面 65m²)
              						</div>
							</a>
					  </li>
					  					  <li>
					  	<a target="_blank" href="/loupan/p_wksjcaatwy/" data-index="1" data-el="xinfang">
						<img src="https://s1.ljcdn.com/xinfang/pc/asset/img/new-version/default_block.png?_v=20171113171428" data-original="https://image1.ljcdn.com/x-xf/xhdic-frame/52b2813e-1647-4a3f-a642-460e65288eaf.jpg.163x123.jpg" class="lj-lazy " alt="万科山景城-3室2厅1卫 (建面 84m²)">
						<div title="3室2厅1卫 (建面 84m²)">
                              3室2厅1卫 (建面 84m²)
              						</div>
							</a>
					  </li>
					  					  <li>
					  	<a target="_blank" href="/loupan/p_wksjcaatwy/" data-index="1" data-el="xinfang">
						<img src="https://s1.ljcdn.com/xinfang/pc/asset/img/new-version/default_block.png?_v=20171113171428" data-original="https://image1.ljcdn.com/x-xf/xhdic-frame/835fd792-401e-473d-8b0f-4585c2321656.jpg.163x123.jpg" class="lj-lazy " alt="万科山景城-3室2厅1卫 (建面 84m²)">
						<div title="3室2厅1卫 (建面 84m²)">
                              3室2厅1卫 (建面 84m²)
              						</div>
							</a>
					  </li>
					  					  <li>
					  	<a target="_blank" href="/loupan/p_wksjcaatwy/" data-index="1" data-el="xinfang">
						<img src="https://s1.ljcdn.com/xinfang/pc/asset/img/new-version/default_block.png?_v=20171113171428" data-original="https://image1.ljcdn.com/x-xf/xhdic-frame/0b1e3d11-7177-450c-b021-8bc5005e4c51.jpg.163x123.jpg" class="lj-lazy " alt="万科山景城-5室2厅3卫 (建面 85m²)">
						<div title="5室2厅3卫 (建面 85m²)">
                              5室2厅3卫 (建面 85m²)
              						</div>
							</a>
					  </li>
					  					  <li>
					  	<a target="_blank" href="/loupan/p_wksjcaatwy/" data-index="1" data-el="xinfang">
						<img src="https://s1.ljcdn.com/xinfang/pc/asset/img/new-version/default_block.png?_v=20171113171428" data-original="https://image1.ljcdn.com/x-xf/xhdic-frame/cdd913e6-6b69-4aa2-a3b7-32c90a75959d.jpg.163x123.jpg" class="lj-lazy " alt="万科山景城-3室2厅2卫 (建面 95m²)">
						<div title="3室2厅2卫 (建面 95m²)">
                              3室2厅2卫 (建面 95m²)
              						</div>
							</a>
					  </li>
					  					  <li>
					  	<a target="_blank" href="/loupan/p_wksjcaatwy/" data-index="1" data-el="xinfang">
						<img src="https://s1.ljcdn.com/xinfang/pc/asset/img/new-version/default_block.png?_v=20171113171428" data-original="https://image1.ljcdn.com/x-xf/xhdic-frame/2e4c8326-04bc-4cff-a5c2-cc95633b1c63.jpg.163x123.jpg" class="lj-lazy " alt="万科山景城-3室2厅2卫 (建面 95m²)">
						<div title="3室2厅2卫 (建面 95m²)">
                              3室2厅2卫 (建面 95m²)
              						</div>
							</a>
					  </li>
					  					</ul>
					</div>
					<i class="after"></i>
				</div>
			</div>
					  </li>

For now, let's concentrate on extracting the basic information

<div class="info-panel">
			  <div class="col-1">
		  	  <h2>
		  		<a target="_blank" href="/loupan/p_wksjcaatwy/" data-index="1" data-el="xinfang">万科山景城</a>
		  		          		  	  </h2>
				<div class="where">
                <span class="region">黄埔-萝岗区永顺大道以南</span>
				</div>
				<div class="area">
				  5居/4居/3居/2居
                   -                   <span>建面 65~112m²</span>
                </div>
          				<div class="other">
									  <span>五证齐全</span>
				  				  <span>车位充足</span>
				  				  <span>复式</span>
				  				  <span>普通住宅</span>
				  				  <span>山景地产</span>
				  				</div>
          				<div class="type">
									  <span class="onsold">在售</span>
				  				  				  <span class="live">住宅</span>
				                      						<span class="allfive">五证齐全</span>
                    				</div>
			  </div>
			  <div class="col-2">
					<div class="price">
					  <div class="average">
						  							  								  均价
								  <span class="num">26000</span>
								  元/平
							  						  					  </div>
                        						
					</div>
			  </div>
			</div>

info_panels = soup.find_all("div", class_="info-panel")
# Development location
info_panels[0].find("span", class_="region").string
# Development name
info_panels[0].find("a").string
# Average price
info_panels[0].find("span", class_="num").string
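
Section 1 asked whether the span inside these divs can be reached directly by a selector. With BeautifulSoup's CSS selector method select_one it can. The sketch below assumes the markup shown above, trimmed to the relevant tags, and uses the built-in "html.parser" so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the info-panel markup shown above
panel_html = """
<div class="info-panel">
  <div class="col-1">
    <h2><a target="_blank" href="/loupan/p_wksjcaatwy/">万科山景城</a></h2>
    <div class="where"><span class="region">黄埔-萝岗区永顺大道以南</span></div>
  </div>
  <div class="col-2">
    <div class="price"><div class="average">均价 <span class="num">26000</span> 元/平</div></div>
  </div>
</div>
"""

soup = BeautifulSoup(panel_html, "html.parser")
panel = soup.select_one("div.info-panel")

# Each selector walks from the panel straight down to the target tag
region = panel.select_one("div.where span.region").string
name = panel.select_one("h2 a").string
price = panel.select_one("div.average span.num").string
link = panel.select_one("h2 a")["href"]

print(region, name, price, link)
```

The selectors spell out the parent div, which makes them a little more specific than find("a") if other links are ever added to the panel.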

3. Final code

from urllib.request import urlopen

from bs4 import BeautifulSoup


def parseUrl(url):
    # Fetch the page and parse it with the lxml parser
    response = urlopen(url)
    return BeautifulSoup(response.read(), "lxml")


def parseBaseInfo(item):
    # Return (region, name, average price) for one info-panel div
    return (
        item.find("span", class_="region").string,
        item.find("a").string,
        item.find("span", class_="num").string,
    )


soup = parseUrl("https://gz.fang.lianjia.com/loupan/")
info_panels = soup.find_all("div", class_="info-panel")
baseInfos = list(map(parseBaseInfo, info_panels))
for (region, name, price) in baseInfos:
    print(region + '  ' + name + ' ' + price + '\n')
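
One fragile spot in the code above: find returns None when an element is missing, and .string is None when a tag has more than one child, so a single incomplete entry would stop the whole loop. Below is a sketch of a more tolerant extractor. The names safe_text and parse_base_info and the empty-string fallback are my own choices, and the second panel is a made-up example of an entry without a price.

```python
from bs4 import BeautifulSoup


def safe_text(item, name, class_=None):
    """Return the stripped text of the first matching tag, or "" if absent."""
    tag = item.find(name, class_=class_) if class_ else item.find(name)
    return tag.get_text(strip=True) if tag is not None else ""


def parse_base_info(item):
    # Same three fields as parseBaseInfo: (region, name, average price)
    return (
        safe_text(item, "span", "region"),
        safe_text(item, "a"),
        safe_text(item, "span", "num"),
    )


# First panel mirrors the real markup; second one is invented and has no price
sample = """
<div class="info-panel"><h2><a href="/loupan/p_wksjcaatwy/">万科山景城</a></h2>
  <span class="region">黄埔-萝岗区永顺大道以南</span><span class="num">26000</span></div>
<div class="info-panel"><h2><a href="#">placeholder name</a></h2>
  <span class="region">placeholder region</span></div>
"""

soup = BeautifulSoup(sample, "html.parser")
rows = [parse_base_info(p) for p in soup.find_all("div", class_="info-panel")]
for region, name, price in rows:
    print(region, name, price or "n/a")
```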

4. References used along the way

木公的博客 | Anyinlover Blog

Beautiful Soup 4.2.0 documentation

Python爬虫入门三之Urllib库的使用 | 静觅 (Getting Started with Python Crawlers, Part 3: Using the Urllib Library)

How to correctly parse UTF-8 encoded HTML to Unicode strings with BeautifulSoup?

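Writing log:

2017.11.20 First draft, part 1: up to 2.2.1

2017.11.21 First draft, part 2: up to 2.2.2; progress is a bit slow

2017.11.22 First draft, part 3: up to 3; basically taking shape, formatting to be fixed later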

    Original author: 潘巧林
    Original URL: https://zhuanlan.zhihu.com/p/31235592