目标:给定一篇PDF格式的文章,用python解析其内容,并使用正则表达式提取出其中的参考文献内容。本文中假设参考文献内容以[1] [2] 这样的索引编写。
必要条件:安装解析PDF文件的python软件包 pdfminer http://www.unixuser.org/~euske/python/pdfminer/index.html
代码:
<p><span style="font-family: Arial, Helvetica, sans-serif;">#/usr/bin/env python</span></p>#coding:utf-8
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.layout import LAParams, LTTextBox, LTTextLine, LTFigure, LTImage
from pdfminer.converter import PDFPageAggregator
import re
# Open a PDF file.
fp = open('your input pdf file', 'rb')
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
# Supply the password for initialization.
rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
document = PDFDocument(parser)
# Process each page contained in the document.
text_content = []
for page in PDFPage.create_pages(document):
interpreter.process_page(page)
layout = device.get_result()
for lt_obj in layout:
if isinstance(lt_obj, LTTextBox) or isinstance(lt_obj, LTTextLine):
text_content.append(lt_obj.get_text())
else:
pass
# text_content 中每一个元素存储了一行文字
total_text = ''.join(text_content).replace("\n","")
#从字符串中解析出参考文献
file = open("the file your want to save","w")
p = re.compile('\[\d+\][^\[\]]*\d\.')
m = p.findall(total_text)
for i in m:
#print i
if i.startswith("["):
file.write(str(i))
file.write("\n")
file.close()
注:当前版本还有一个待优化的点,即最后用正则提取参考文献时,会将正文中的一些参考部分也误提取了出来。待后续有机会再优化吧。