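Every month I need to extract some data from .pdf files to build an Excel spreadsheet.
I am able to convert the .pdf file to text, but I am not sure how to extract and save the specific pieces of information I want. This is the code I have so far: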
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    """Return the text of every page of the PDF at path as one string."""
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0      # 0 means no page limit
    caching = True
    pagenos = set()   # an empty set means all pages
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    # retstr accumulates the text of all pages, so read it once after the loop.
    fstr = retstr.getvalue()
    fp.close()
    device.close()
    retstr.close()
    return fstr


print(convert_pdf_to_txt("FA20150518.pdf"))
This is the result:
>>>
AVILA 72, VALLDOREIX
08197 SANT CUGAT DEL VALLES
(BARCELONA)
TELF: 935441851
NIF: B65512725
EMAIL: buendialogistica@gmail.com
JOSE LUIS MARTINEZ LOPEZ
AVDA. DEL ESLA, 33-D
24240 SANTA MARIA DEL PARAMO
LEON
TELF: 600871170
FECHA
17/06/15
FACTURA
20150518
CLIENTE
43000335
N.I.F.
71548163 B
PÁG.
1
Nº VIAJE
RUTA
DESTINATARIO / REFERENCIA
KG
BULTOS
IMPORTE
2015064210-08/06/15
CERDANYOLA DEL VALLES -> VINAROS
FERRER ALIMENTACION - VINAROZ
2,000.0
1
150,00
TOTAL IMP.
%
IMPORTE
BASE
150,00
150,00
%
21,00
IVA
%
REC.
TOTAL FRA.
(€)
31,50
181,50
Eur
Forma Pago:
Banco:
CONTADO
Vencimientos:
17/06/15
181,50
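OK, now I have the text returned by convert_pdf_to_txt.
I want to extract this information: customer, bill number, price, expiration date and payment method.
The customer name is always on the line below "EMAIL: buendialogistica@gmail.com"
The bill number is always on the line below "FACTURA"
The price is always two lines below "Vencimientos:"
The expiration date is always on the line below "Vencimientos:"
The payment method is always on the line below "Banco:"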
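I was thinking of doing something like this. If I could convert this text into a list, I could do the following:
Search for the customer:
lengthlist = len(listitem)
i = 0
while i < lengthlist:
    if listitem[i] == "EMAIL: buendialogistica@gmail.com":
        i += 1
        Customer = listitem[i]
        i = lengthlist  # stop the loop
    else:
        i += 1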
Search for the bill number:
lengthlist = len(listitem)
i = 0
while i < lengthlist:
    if listitem[i] == "FACTURA":
        i += 1
        BillNumber = listitem[i]
        i = lengthlist  # stop the loop
    else:
        i += 1
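I do not yet know how to save the result to Excel, but I am sure I can find examples of that on the forum. First I need to extract just this data.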
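For the Excel step, one option that needs only the standard library is to write a CSV file, which Excel can open. What follows is a minimal sketch, assuming one row per invoice. The append_invoice name, the column names, the invoices.csv file name and the semicolon delimiter are all assumptions chosen for illustration; a library such as openpyxl would be the route to a native .xlsx file.

```python
import csv
import os

# Hypothetical column order for the monthly sheet.
COLUMNS = ["customer", "bill_number", "price", "expiration_date", "payment_method"]


def append_invoice(csv_path, invoice, delimiter=";"):
    """Append one invoice (a dict keyed by COLUMNS) as a row of csv_path."""
    # Write the header only when the file is new or still empty.
    write_header = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    # newline="" lets the csv module control line endings itself.
    with open(csv_path, "a", newline="", encoding="utf-8") as fh:
        # Assumption: ";" as delimiter, which Excel typically expects in
        # locales that use a decimal comma; pass "," if that does not apply.
        writer = csv.DictWriter(fh, fieldnames=COLUMNS, delimiter=delimiter)
        if write_header:
            writer.writeheader()
        writer.writerow(invoice)


# Values taken from the sample invoice shown above.
invoice = {
    "customer": "JOSE LUIS MARTINEZ LOPEZ",
    "bill_number": "20150518",
    "price": "181,50",
    "expiration_date": "17/06/15",
    "payment_method": "CONTADO",
}
append_invoice("invoices.csv", invoice)
```

Calling append_invoice once per PDF each month would grow the same file row by row, with the header written only the first time.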
Best answer: You have the right idea.
string = convert_pdf_to_txt("FA20150518.pdf")
# Split the text into lines and drop the empty ones.
lines = list(filter(bool, string.split('\n')))
custData = {}
for i in range(len(lines)):
    # Each value sits a fixed number of lines below its label.
    if 'EMAIL:' in lines[i]:
        custData['Name'] = lines[i+1]
    elif 'FACTURA' in lines[i]:
        custData['BillNumber'] = lines[i+1]
    elif 'Vencimientos:' in lines[i]:
        custData['price'] = lines[i+2]
    elif 'Banco:' in lines[i]:
        custData['paymentType'] = lines[i+1]
print(custData)
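The loop above leaves out the expiration date that the question also asks for, and lines[i+1] would raise an IndexError if a label happened to be the last line of the text. A minimal sketch of a table-driven variant follows. The RULES table, the extract_fields name, the exact-match comparison and the shortened sample text are assumptions made for illustration, not part of the original answer.

```python
# Hypothetical rule table: field name -> (label line, lines below the label).
RULES = {
    "Name": ("EMAIL: buendialogistica@gmail.com", 1),
    "BillNumber": ("FACTURA", 1),
    "expirationDate": ("Vencimientos:", 1),
    "price": ("Vencimientos:", 2),
    "paymentType": ("Banco:", 1),
}


def extract_fields(text, rules=RULES):
    """Return a dict of the fields found in text, skipping missing ones."""
    # Strip whitespace and drop empty lines so the offsets count real lines.
    lines = [ln.strip() for ln in text.split("\n") if ln.strip()]
    found = {}
    for field, (label, offset) in rules.items():
        if label not in lines:
            continue  # label absent: leave the field out
        pos = lines.index(label) + offset  # first occurrence of the label
        if pos < len(lines):  # guard against a label at the end of the text
            found[field] = lines[pos]
    return found


# Shortened sample in the same layout as the output shown in the question.
sample = """
EMAIL: buendialogistica@gmail.com
JOSE LUIS MARTINEZ LOPEZ

FACTURA
20150518

Forma Pago:
Banco:
CONTADO
Vencimientos:
17/06/15
181,50
"""

print(extract_fields(sample))
```

Keeping the label and offset pairs in one table means a change in the invoice layout only requires editing RULES, not the loop. A missing label simply leaves its field out of the result, so the caller can check which keys are present.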