Crawler in Practice (1): Scraping Job Posting Titles from the Campus Intranet

June 16, 2019. Source: 公子强

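I have been feeling rather anxious lately and thinking about what kind of work I will do in the future. Rather than speculate in a vacuum, I decided to look at some real data, so I scraped the job postings on the campus intranet to study them.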

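Before writing a crawler, we need to think about what the crawler should do, what characteristics the target site has, and which architecture suits the volume and nature of the site's data. It is also a good idea to inspect the page structure with Chrome's developer tools first; on Windows and Linux the shortcut is F12. The result looks like this: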

[Figure: the job posting list page with Chrome developer tools open]

As we can see, the page contains a list holding 20 job postings. Select one entry, right-click, and choose Inspect to view the HTML structure of that entry, as shown below:

[Figure: HTML structure of the selected entry]

At this point, we know the following:

Each page has 20 job postings.
The posting list sits inside the tbody tag.
Each posting is in its own tr tag.
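
These observations boil down to a simple extraction rule: find the tbody, then read one title per tr. Below is a minimal sketch of that rule. It deliberately uses Python's standard-library html.parser instead of BeautifulSoup so that it runs without extra packages. The HTML fragment, the class name TitleParser, and the function extract_titles are illustrative assumptions; only the tbody > tr > td > span nesting is taken from the script in this article.

```python
from html.parser import HTMLParser

# Hypothetical fragment mimicking the observed nesting: tbody > tr > td > span.
SAMPLE_HTML = """
<table><tbody>
  <tr style="vertical-align: middle;">
    <td style="text-align: justify;"><span>Software Engineer Intern</span></td>
  </tr>
  <tr style="vertical-align: middle;">
    <td style="text-align: justify;"><span>Data Analyst</span></td>
  </tr>
</tbody></table>
"""


class TitleParser(HTMLParser):
    """Collect the text of every <span> that appears inside a <tbody>."""

    def __init__(self):
        super().__init__()
        self.in_tbody = False
        self.in_span = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "tbody":
            self.in_tbody = True
        elif tag == "span" and self.in_tbody:
            self.in_span = True
            self.titles.append("")  # start collecting a new title

    def handle_endtag(self, tag):
        if tag == "tbody":
            self.in_tbody = False
        elif tag == "span":
            self.in_span = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current title.
        if self.in_span:
            self.titles[-1] += data


def extract_titles(html):
    """Return the stripped span texts found inside the tbody of `html`."""
    parser = TitleParser()
    parser.feed(html)
    return [title.strip() for title in parser.titles]


print(extract_titles(SAMPLE_HTML))
# ['Software Engineer Intern', 'Data Analyst']
```

On a real page, the same rule would return up to 20 titles per page, matching the count observed above.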

The complete code is as follows:

# coding: utf-8

import codecs
import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Output directory on the author's machine; change it to a folder that exists locally.
os.chdir("D:\\python study\\code")

DOWNLOAD_URL = 'http://www.cc98.org/list.asp?boardid=235&page=2&action='


def download_page(url):
    """Fetch a page and return its raw bytes."""
    return requests.get(url).content


def parse_html(html):
    """Return the titles on one list page and the next page's URL (or None)."""
    soup = BeautifulSoup(html, 'html.parser')
    title_list_soup = soup.find('tbody')

    title_name_list = []

    # Each posting is a <tr>; its title is the <span> inside the justified <td>.
    for title_tr in title_list_soup.find_all('tr', attrs={'style': 'vertical-align: middle;'}):
        detail = title_tr.find('td', attrs={'style': 'text-align: justify;'})
        title_name = detail.find('span').getText()
        title_name_list.append(title_name)

    next_page = soup.find('div', attrs={'align': 'right'}).find('a', text="[下一页]")

    if next_page:
        # Resolve the href against the list URL instead of concatenating strings.
        return title_name_list, urljoin(DOWNLOAD_URL, next_page['href'])
    return title_name_list, None


def main():
    url = DOWNLOAD_URL

    with codecs.open('titles1', 'wb', encoding='utf-8') as fp:
        # Crawl at most 4 pages, stopping early when there is no next page.
        for _ in range(4):
            if url is None:
                break
            html = download_page(url)
            titles1, url = parse_html(html)
            fp.write(u'{titles1}\n'.format(titles1='\n'.join(titles1)))


if __name__ == '__main__':
    main()

Part of the output is shown below:

[Figure: part of the scraped titles]
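
A note on pagination: the href of a "next page" link is often relative, and gluing it onto a URL that already carries a query string produces an invalid address. The sketch below shows one way to resolve such a link with urljoin from the standard library. The two href values and the helper name next_page_url are hypothetical, since the article does not show the real link format on the site.

```python
from urllib.parse import urljoin

BASE_URL = "http://www.cc98.org/list.asp?boardid=235&page=2&action="


def next_page_url(base, href):
    """Resolve a (possibly relative) href against the URL of the current page."""
    return urljoin(base, href)


# Hypothetical href values; the site's actual link format is an assumption.
relative_href = "list.asp?boardid=235&page=3&action="
query_only_href = "?boardid=235&page=3&action="

print(next_page_url(BASE_URL, relative_href))
print(next_page_url(BASE_URL, query_only_href))
# Both print: http://www.cc98.org/list.asp?boardid=235&page=3&action=
```

Plain concatenation of BASE_URL and either href would instead append the link after "action=", which the server would not interpret as a new page request.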

    Original author: 公子强
    Original URL: https://blog.csdn.net/m0_37324740/article/details/78027294