Python crawler ----> Python projects on GitHub; Python framework ----> using pymysql

  Here we crawl some high-star Python projects on GitHub as a way to learn how to use BeautifulSoup and pymysql. I always thought the mountain was the water's story, the cloud was the wind's story, and you were my story; yet I never knew whether I was your story.

 

A Python crawler for GitHub

Crawler requirement: scrape high-quality Python-related projects from GitHub. What follows is a test case; it does not crawl a large amount of data.

1. A crawler version implementing the basic functionality

From this example you can learn how to do batch inserts with pymysql, parse HTML with BeautifulSoup, and fetch data with GET requests via the requests library. For more on using pymysql, see the blog post: Python framework ----> using pymysql. A small parsing sketch is shown below, followed by the complete crawler script.
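Before the full script, here is a tiny, self-contained sketch of the BeautifulSoup calls it relies on (find_all by class, find with attrs, get_text). The HTML below is a made-up miniature of the 2017-era GitHub search markup, included only for illustration:

from bs4 import BeautifulSoup

# Made-up miniature of one search result in the 2017-era GitHub markup this crawler targets.
html = '''
<div class="repo-list-item">
  <a class="v-align-middle" href="/pallets/flask">pallets/flask</a>
  <div class="d-table-cell col-2 text-gray pt-2">Python</div>
  <a class="muted-link">31.1k</a>
  <p class="f6 text-gray mb-0 mt-2">Updated Nov 15, 2017</p>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
for project in soup.find_all('div', class_='repo-list-item'):
    href = project.find('a', attrs={'class': 'v-align-middle'})['href']   # '/pallets/flask'
    language = project.find('div', attrs={'class': 'd-table-cell col-2 text-gray pt-2'}).get_text().strip()
    stars = project.find('a', attrs={'class': 'muted-link'}).get_text().strip()
    print(href.split('/')[1], href.split('/')[2], language, stars)        # pallets flask Python 31.1k

The complete crawler follows.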

import requests
import pymysql.cursors
from bs4 import BeautifulSoup

def get_effect_data(data):
    results = list()
    soup = BeautifulSoup(data, 'html.parser')
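    # each result on the (2017-era) GitHub search results page is assumed to be a <div class="repo-list-item">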
    projects = soup.find_all('div', class_='repo-list-item')
    for project in projects:
        writer_project = project.find('a', attrs={'class': 'v-align-middle'})['href'].strip()
        project_language = project.find('div', attrs={'class': 'd-table-cell col-2 text-gray pt-2'}).get_text().strip()
        project_starts = project.find('a', attrs={'class': 'muted-link'}).get_text().strip()
        update_desc = project.find('p', attrs={'class': 'f6 text-gray mb-0 mt-2'}).get_text().strip()

        result = (writer_project.split('/')[1], writer_project.split('/')[2], project_language, project_starts, update_desc)
        results.append(result)
    return results


def get_response_data(page):
    request_url = 'https://github.com/search'
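    # q=python searches repositories, s=stars with o=desc sorts by star count descending, p is the page number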
    params = {'o': 'desc', 'q': 'python', 's': 'stars', 'type': 'Repositories', 'p': page}
    resp = requests.get(request_url, params=params)
    return resp.text


def insert_datas(data):
    connection = pymysql.connect(host='localhost',
                                 user='root',
                                 password='root',
                                 db='test',
                                 charset='utf8mb4',
                                 cursorclass=pymysql.cursors.DictCursor)
    try:
        with connection.cursor() as cursor:
            sql = 'insert into project_info(project_writer, project_name, project_language, project_starts, update_desc) VALUES (%s, %s, %s, %s, %s)'
            cursor.executemany(sql, data)
            connection.commit()
    finally:
        connection.close()  # close the connection whether or not the insert succeeded


if __name__ == '__main__':
    total_page = 2  # total number of search result pages to crawl
    datas = list()
    for page in range(total_page):
        res_data = get_response_data(page + 1)
        data = get_effect_data(res_data)
        datas += data
    insert_datas(datas)
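Note that the script assumes a project_info table already exists in the test database. The original post does not show the schema, so the following is only a sketch with assumed column types that matches the insert statement above:

# Hypothetical schema for project_info; the column names come from the insert
# statement in insert_datas(), while the types and lengths are assumptions.
import pymysql

connection = pymysql.connect(host='localhost', user='root', password='root',
                             db='test', charset='utf8mb4')
try:
    with connection.cursor() as cursor:
        cursor.execute('''
            create table if not exists project_info (
                id int primary key auto_increment,
                project_writer varchar(100),
                project_name varchar(200),
                project_language varchar(50),
                project_starts varchar(20),
                update_desc varchar(100)
            ) default charset=utf8mb4
        ''')
    connection.commit()
finally:
    connection.close()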

After the script finishes, you can see the following data in the database:

id  project_writer  project_name                  project_language  project_starts  update_desc
11  tensorflow      tensorflow                    C++               78.7k           Updated Nov 22, 2017
12  robbyrussell    oh-my-zsh                     Shell             62.2k           Updated Nov 21, 2017
13  vinta           awesome-python                Python            41.4k           Updated Nov 20, 2017
14  jakubroztocil   httpie                        Python            32.7k           Updated Nov 18, 2017
15  nvbn            thefuck                       Python            32.2k           Updated Nov 17, 2017
16  pallets         flask                         Python            31.1k           Updated Nov 15, 2017
17  django          django                        Python            29.8k           Updated Nov 22, 2017
18  requests        requests                      Python            28.7k           Updated Nov 21, 2017
19  blueimp         jQuery-File-Upload            JavaScript        27.9k           Updated Nov 20, 2017
20  ansible         ansible                       Python            26.8k           Updated Nov 22, 2017
21  justjavac       free-programming-books-zh_CN  JavaScript        24.7k           Updated Nov 16, 2017
22  scrapy          scrapy                        Python            24k             Updated Nov 22, 2017
23  scikit-learn    scikit-learn                  Python            23.1k           Updated Nov 22, 2017
24  fchollet        keras                         Python            22k             Updated Nov 21, 2017
25  donnemartin     system-design-primer          Python            21k             Updated Nov 20, 2017
26  certbot         certbot                       Python            20.1k           Updated Nov 20, 2017
27  aymericdamien   TensorFlow-Examples           Jupyter Notebook  18.1k           Updated Nov 8, 2017
28  tornadoweb      tornado                       Python            14.6k           Updated Nov 17, 2017
29  python          cpython                       Python            14.4k           Updated Nov 22, 2017
30  reddit          reddit                        Python            14.2k           Updated Oct 17, 2017
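To spot-check the stored rows from Python rather than the MySQL client, a query along these lines works (again just a sketch, reusing the connection settings from the crawler above):

# Read back and print the stored projects, ordered by id.
import pymysql.cursors

connection = pymysql.connect(host='localhost', user='root', password='root',
                             db='test', charset='utf8mb4',
                             cursorclass=pymysql.cursors.DictCursor)
try:
    with connection.cursor() as cursor:
        cursor.execute('select * from project_info order by id')
        for row in cursor.fetchall():
            print(row['project_writer'], row['project_name'], row['project_language'],
                  row['project_starts'], row['update_desc'])
finally:
    connection.close()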

 

    Original author: huhx
    Original post: https://www.cnblogs.com/huhx/p/usepythongithubspider.html