Automatically scraping GPS coordinates for place names with Python

Key points:

1. Python

2. The Scrapy crawling framework + the MongoDB database

3. The website http://www.gpsspg.com/maps.htm

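Background:

A client recently asked for the latitude and longitude of 500 residential communities. By my estimate, looking them up by hand on the website would take about a day, and the same manual work would have to be repeated whenever the client comes back with a similar request. Fortunately, Python's Scrapy framework can fetch the coordinates for these communities and automate the whole process.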



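Analyzing the data from http://www.gpsspg.com/maps.htm:

After you type an address into the page, the search results pop up automatically. The first 10 results can be compared against the query to find the most accurate GPS match.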


Tracing the requests the page makes in the background shows the following:


Request URL analysis: the value after wd= is the URL-encoded (percent-encoded) text from the search box.

RequestURL:https://apis.map.qq.com/jsapi?qt=poi&wd=%E7%9F%B3%E5%AE%B6%E5%BA%84%E6%A1%A5%E8%A5%BF%E5%8C%BA%E7%95%99%E8%90%A5%E5%8D%8E%E8%8B%91&pn=0&rn=10&rich_source=qipao&rich=web&nj=0&c=1&key=FBOBZ-VODWU-C7SVF-B2BDI-UK3JE-YBFUS&output=jsonp&pf=jsapi&ref=jsapi&cb=qq.maps._svcb3.search_service_0
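As a rough sketch of how such a URL can be assembled in Python: the parameter names and values below are copied from the request above, while the helper name build_search_url is made up for illustration. urllib.parse.quote does the percent-encoding of the wd value.

```python
from urllib.parse import quote, unquote, urlencode

BASE_URL = "https://apis.map.qq.com/jsapi"

# wd value copied from the captured request; unquote() recovers the typed text.
CAPTURED_WD = (
    "%E7%9F%B3%E5%AE%B6%E5%BA%84%E6%A1%A5%E8%A5%BF%E5%8C%BA"
    "%E7%95%99%E8%90%A5%E5%8D%8E%E8%8B%91"
)


def build_search_url(keyword, key, callback="qq.maps._svcb3.search_service_0"):
    """Hypothetical helper: rebuild the query string seen in the captured request."""
    params = {
        "qt": "poi",
        "wd": keyword,        # search text, percent-encoded by urlencode
        "pn": 0,              # page number
        "rn": 10,             # results per page
        "rich_source": "qipao",
        "rich": "web",
        "nj": 0,
        "c": 1,
        "key": key,           # API key taken from the captured request
        "output": "jsonp",
        "pf": "jsapi",
        "ref": "jsapi",
        "cb": callback,       # JSONP callback name
    }
    # quote_via=quote encodes spaces as %20 instead of "+".
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


keyword = unquote(CAPTURED_WD)
# Re-encoding the decoded keyword gives back the captured wd value.
print(quote(keyword) == CAPTURED_WD)
```

Building the query from a dict keeps the "&" separators in one place, which is easy to get wrong when concatenating strings by hand.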

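The site returns JSON data (wrapped in a JSONP callback, since the request uses output=jsonp). Comparing the two shows that the pois field holds 10 results, which correspond exactly to the first 10 results displayed on the page.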
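A minimal sketch of unwrapping such a response, assuming the body has the form callback && callback({...}). The sample payload is invented and only mimics the field names used in this article (detail, pois, name, addr, pointx, pointy); the helper name strip_jsonp is hypothetical.

```python
import json


def strip_jsonp(text):
    """Return the JSON object wrapped inside a JSONP callback call."""
    start = text.index("(") + 1   # first "(" opens the callback's argument list
    end = text.rindex(")")        # last ")" closes it
    return json.loads(text[start:end])


# Made-up response body; every value is a placeholder, not real API output.
sample = (
    'qq.maps._svcb3.search_service_0 && qq.maps._svcb3.search_service_0('
    '{"detail": {"pois": [{"name": "Example Garden", "addr": "1 Example Road", '
    '"pointx": "114.0", "pointy": "38.0"}]}})'
)

for poi in strip_jsonp(sample)["detail"]["pois"]:
    print(poi["name"], poi["pointx"], poi["pointy"])
```

Slicing between the first "(" and the last ")" avoids hard-coding the callback name and still works if the response ends with a semicolon or newline.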


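Conclusion: the JSON returned by the site contains the GPS information we need; after extracting and processing it with Python, we can pick out the GPS coordinates that meet the requirement.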

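The Scrapy crawling framework:

Python has many crawling frameworks, Scrapy among them; for small sites I usually use Scrapy, which is fairly simple.

Project layout:

spiders/main.py is the program entry point

spiders/getpoint.py is the Scrapy spider that runs the crawl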
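The article does not show main.py itself. One common way to write such an entry point, given here only as an assumption, is to hand the usual "scrapy crawl getpoint" command to scrapy.cmdline.execute; the function names crawl_command and main are made up for this sketch.

```python
def crawl_command(spider_name="getpoint"):
    """Arguments equivalent to typing `scrapy crawl getpoint` in a shell."""
    return ["scrapy", "crawl", spider_name]


def main():
    # Imported here so the module can be loaded without Scrapy installed.
    from scrapy.cmdline import execute

    # Runs the named spider inside the current Scrapy project.
    execute(crawl_command())
```

With main() called at the bottom of the file, running the script from inside the project directory would start the crawl, which is convenient for launching or debugging from an IDE.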


Data processing logic:


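getpoint.py source:

getpoint.py reads the names of the communities to look up from source.xlsx, fetches the 10 candidate coordinates for each name directly from the site, and picks the most accurate GPS match by taking the candidate whose name is most similar to the community name (scored with difflib's similarity ratio).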

# -*- coding: utf-8 -*-
import difflib
import json
from urllib.parse import quote

import chardet
import pandas as pd
import scrapy

from gpsspg.items import PointItem


class GetpointSpider(scrapy.Spider):
    name = 'getpoint'
    # Domain names only (no scheme); the requests go to the map API host.
    allowed_domains = ['apis.map.qq.com']
    base_url = 'https://apis.map.qq.com/jsapi'
    # Columns used: 城市 (city), 县区 (district), 小区名称 (community name).
    df = pd.read_excel("source.xlsx", sheet_name='Sheet1')

    def start_requests(self):
        for i in self.df.index:
            # Search text: city + district + community name.
            searchkey = self.df.loc[i, '城市'] + self.df.loc[i, '县区'] + self.df.loc[i, '小区名称']
            # Every parameter, including pn, is separated from the previous one by "&".
            para = '?qt=poi&wd=' + quote(searchkey) + '&pn=0&rn=10&rich_source=qipao&rich=web&nj=0&c=1&key=FBOBZ-VODWU-C7SVF-B2BDI-UK3JE-YBFUS&output=jsonp&pf=jsapi&ref=jsapi&cb=qq.maps._svcb3.search_service_0'
            url = self.base_url + para
            yield scrapy.Request(url=url, method='GET', callback=self.parse,
                                 meta={'name': self.df.loc[i, '小区名称'],
                                       'city': self.df.loc[i, '城市'],
                                       'area': self.df.loc[i, '县区']})

    def parse(self, response):
        cs = chardet.detect(response.body)
        # chardet may report None for the encoding; fall back to UTF-8.
        rsp = response.body.decode(cs.get('encoding') or 'utf-8')
        # Strip the JSONP wrapper: keep what sits between the first "(" and the last ")".
        rsp = rsp[rsp.index('(') + 1:rsp.rindex(')')]
        print(rsp)
        data = json.loads(rsp)
        name = response.meta['name']
        city = response.meta['city']
        area = response.meta['area']
        ls = data['detail']['pois']
        for l in ls:
            # Score only candidates located in the expected city and district.
            if city == l['POI_PATH'][1]['cname'] and area == l['POI_PATH'][0]['cname']:
                r = difflib.SequenceMatcher(None, name, l['name']).quick_ratio()
            else:
                r = 0
            l['result'] = r
        # Most similar candidate first.
        tuple_data = sorted(ls, key=lambda x: x['result'], reverse=True)
        # Emit an item only when a candidate exists and is similar enough.
        if tuple_data and tuple_data[0]['result'] > 0.62:
            item = PointItem()
            item['city'] = city
            item['area'] = area
            item['name'] = name
            item['x'] = tuple_data[0]['pointx']
            item['y'] = tuple_data[0]['pointy']
            item['scity'] = tuple_data[0]['POI_PATH'][1]['cname']
            item['sarea'] = tuple_data[0]['POI_PATH'][0]['cname']
            item['saddr'] = tuple_data[0]['addr']
            item['sname'] = tuple_data[0]['name']
            item['sresult'] = tuple_data[0]['result']
            yield item
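The matching step in parse can be tried on its own, without Scrapy. The sketch below mirrors that logic under a few assumptions: the function name pick_best is hypothetical, and the candidate list is invented and only imitates the name and POI_PATH fields used above.

```python
import difflib


def pick_best(name, city, area, pois, threshold=0.62):
    """Return (poi, score) for the best candidate, or None if nothing is close enough."""
    best, best_score = None, 0.0
    for poi in pois:
        path = poi["POI_PATH"]
        # Candidates outside the expected city/district score 0, as in parse().
        if city == path[1]["cname"] and area == path[0]["cname"]:
            score = difflib.SequenceMatcher(None, name, poi["name"]).quick_ratio()
        else:
            score = 0.0
        if score > best_score:
            best, best_score = poi, score
    if best is not None and best_score > threshold:
        return best, best_score
    return None


# Made-up candidates; names and places are placeholders, not real API output.
west = [{"cname": "West District"}, {"cname": "Example City"}]
east = [{"cname": "East District"}, {"cname": "Example City"}]
pois = [
    {"name": "Sunrise Garden", "POI_PATH": west},
    {"name": "Sunrise Garden Hotel", "POI_PATH": west},
    {"name": "Sunrise Garden", "POI_PATH": east},
]

print(pick_best("Sunrise Garden", "Example City", "West District", pois))
```

Keep in mind that quick_ratio() is only an upper bound on ratio(), so the 0.62 threshold from the spider may accept names that the stricter ratio() would reject.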

pipelines.py source:

This module saves the extracted coordinates to a MongoDB database.

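from pymongo import MongoClient


class GpsspgPipeline(object):

    def __init__(self, settings):
        # Connection parameters come from the project's settings.py.
        self.client = MongoClient(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        db = self.client[settings['MONGODB_DB']]
        self.collection_companyinfo = db[settings['COLLECTION_POINT']]

    @classmethod
    def from_crawler(cls, crawler):
        # Current Scrapy versions pass settings through the crawler object.
        return cls(crawler.settings)

    def process_item(self, item, spider):
        print(item['name'] + ":::" + item['sname'] + ":::" + str(item['sresult']))
        # insert_one() stores a single document (PyMongo 3+).
        self.collection_companyinfo.insert_one(dict(item))
        # Return the item so any later pipeline stage still receives it.
        return item

    def close_spider(self, spider):
        self.client.close()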
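The pipeline reads four settings. A settings.py fragment along the following lines would supply them; the setting names come from the pipeline code, but every value here is a placeholder assumption, and the ITEM_PIPELINES path assumes the project package is named gpsspg, as the spider's import suggests.

```python
# settings.py (fragment): all values are placeholders, adjust them to your setup.
MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017          # MongoDB's default port
MONGODB_DB = "gpsspg"         # hypothetical database name
COLLECTION_POINT = "points"   # hypothetical collection name

# Enable the pipeline; 300 is an arbitrary priority in Scrapy's 0-1000 range.
ITEM_PIPELINES = {
    "gpsspg.pipelines.GpsspgPipeline": 300,
}
```

Without the ITEM_PIPELINES entry Scrapy never calls process_item, so the spider would run but nothing would be written to MongoDB.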

items.py source:

import scrapy
from scrapy import Field


class PointItem(scrapy.Item):
    x = Field()
    y = Field()
    scity = Field()
    sarea = Field()
    saddr = Field()
    sname = Field()
    sresult = Field()
    name = Field()
    addr = Field()
    area = Field()
    city = Field()

Results:

Contents of the MongoDB database:


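Key algorithm:

The 10 candidate locations are extracted from the JSON, and each candidate's name is compared with the community name; the name with the highest similarity determines which coordinates are used. The comparison is in the spirit of the edit distance algorithm, first proposed by the Russian scientist Levenshtein and also known as Levenshtein distance. Note, however, that the code in getpoint.py scores similarity with difflib.SequenceMatcher, which returns a ratio between 0 and 1 rather than a true Levenshtein distance.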
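For comparison, here is a small sketch that puts a textbook Levenshtein distance next to the difflib ratio that getpoint.py calls. The functions levenshtein and levenshtein_similarity are illustrative helpers written for this example; they are not part of the article's project.

```python
import difflib


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))   # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]                    # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Scale the distance to a 0..1 similarity (1.0 means identical)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


a, b = "kitten", "sitting"
matcher = difflib.SequenceMatcher(None, a, b)
print(levenshtein(a, b), levenshtein_similarity(a, b))
print(matcher.ratio(), matcher.quick_ratio())
```

The two measures usually rank candidates similarly but are not interchangeable, so a threshold tuned for one (such as 0.62 for quick_ratio) should be re-checked before switching to the other.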

Main code:


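    Original author: 不谈风月_0eb8
    Original URL: https://www.jianshu.com/p/46d0d46a3ee1
    This article is reposted from the web for the purpose of sharing knowledge; if it infringes your rights, please contact the blogger to have it removed.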