Scraping an RSS feed with Python

January 5, 2024, 299 views

I'm new to Python and to programming, so please forgive me if this is a silly question.

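I have been following this tutorial on RSS scraping, but when I try to collect the link that goes with each article title being gathered, I get Python's "list index out of range" error.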

Here is my code:

from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re

source  = urlopen('http://feeds.huffingtonpost.com/huffingtonpost/raw_feed').read()

title = re.compile('<title>(.*)</title>')
link = re.compile('<link>(.*)</link>')

find_title = re.findall(title, source)
find_link = re.findall(link, source)

literate = []
literate[:] = range(1, 16)

for i in literate:
    print find_title[i]
    print find_link[i]

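It runs fine when I only tell it to retrieve the titles, but it throws the index error as soon as I try to retrieve the titles together with their corresponding links.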

Any help would be greatly appreciated.

Best answer: I think you are using the wrong regular expression to extract the links from the page.

>>> link = re.compile('<link rel="alternate" type="text/html" href=(.*)')
>>> find_link = re.findall(link, source)
>>> find_link[1].strip()
'"http://www.huffingtonpost.com/andrew-brandt/the-peyton-predicament-pa_b_1271834.html" />'
>>> len(find_link)
15
>>>

Take a look at the HTML source of your page and you will see that the links are not wrapped in a <link></link> pattern.

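In fact, the pattern is <link rel="alternate" type="text/html" href=, followed by the link itself. That is why your regular expression doesn't work: <link>(.*)</link> finds fewer matches than your loop asks for, so find_link[i] runs past the end of the list and raises the index error.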
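
Putting that together, here is a rough sketch of the fix, not a definitive implementation. It is written for Python 3 and uses a made-up two-entry snippet in place of the downloaded feed, so the sample titles and the example.com URLs are assumptions, not the real feed contents. It captures only the quoted href value and pairs titles with links using zip, which stops at the shorter list instead of indexing past the end.

```python
import re

# Hypothetical snippet standing in for the downloaded feed text;
# the titles and example.com URLs are made up for illustration.
source = '''<feed>
<entry>
<title>First story</title>
<link rel="alternate" type="text/html" href="http://example.com/first.html" />
</entry>
<entry>
<title>Second story</title>
<link rel="alternate" type="text/html" href="http://example.com/second.html" />
</entry>
</feed>'''

# Non-greedy groups keep each match inside a single tag.
title_re = re.compile(r'<title>(.*?)</title>')
# Capture only the text between the quotes of the href attribute.
link_re = re.compile(r'<link rel="alternate" type="text/html" href="(.*?)"')

titles = title_re.findall(source)
links = link_re.findall(source)

# zip stops at the shorter list, so a length mismatch cannot raise IndexError.
pairs = list(zip(titles, links))
for title, link in pairs:
    print(title)
    print(link)
```

On a real feed the feed-level title will probably match as well, so compare len(titles) and len(links) before trusting the pairing.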