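I am working with many .txt files spread across a directory. From all of these files, how should I extract specific words or chunks of text (i.e. sentences, paragraphs and tokens defined by a regex) and put them into a pandas dataframe (i.e. a tabular format), keeping a column with the name of each file? So far I have written this function, which does the task (I know... it is not perfect):
In: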
import glob, os, re
import pandas as pd

regex = r'\<the regex>\b'
ind = 'path/dir'
out = 'path/dir'
f = 'path/redirected/output/'

def foo(ind, reg, out):
    # visit every .txt file in the input directory
    for filename in glob.glob(os.path.join(ind, '*.txt')):
        with open(filename, 'r') as file:
            stuff = re.findall(reg, file.read(), re.M)
            #my_list = [str([j.split()[0] for j in i]) for i in stuff]
            lis = [t[::2] for t in stuff]
            cont = ' '.join(map(str, lis))
            print(cont)
            # append one tab-separated line per file: name, then the matches
            with open(out, 'a') as f:
                print(filename.split('/')[-1] + '\t' + cont, file=f)

foo(ind, regex, out)
The output is then redirected to a third file:
Out:
fileName1.txt
fileName2.txt stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk
fileName3.txt stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk
....
fileNameN.txt stringOrChunk
This is how I create the dataframe from the previous file (yes, I know it is awful):
import pandas as pd
df = pd.read_csv('/path/of/f/', sep='\t', names=['file_names', 'col1'])
df.to_csv('/pathOfNewCSV.csv', index=False, sep='\t')
Finally:
file_names col1
0 fileName1.txt NaN
1 fileName2.txt stringOrChunk stringOrChunk stringOrChunk...
2 fileName3.txt stringOrChunk stringOrChunk stringOrChunk...
3 fileName4.txt stringOrChunk
.....
N fileNameN.txt stringOrChunk
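So, any idea of how to do this in a more pythonic and efficient way?
UPDATE
I uploaded a .zip with some documents as data, so if we wanted to extract all the adverbs from the documents, we would do this: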
a_regex = r"\w+ly"
directory = '/Users/user/Desktop/Docs/'
output_dir = '/Users/user/Desktop/'
foo(directory, a_regex, output_dir)
It should then create a table with all the adverbs found in the documents:
Files words
doc1.txt
doc2.txt
doc3.txt DIRECTLY PROBABLY EARLY
doc4.txt
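Any idea of how to improve the function above? Also, I do not know whether this is the best way to carry out this information extraction task (i.e. using only regular expressions). What about using a string indexer such as the Whoosh project, or nltk?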
UPDATE
For example, consider creating a dataframe that extracts all the sentences containing the word JESUITS:
Files words1 words2 words3 words4
0 doc1.txt A GOVERNMENT SPOKESMAN HAS ANNOUNCED THAT WITH... NaN NaN NaN
1 doc2.txt 11/12/98 "THERE WAS NO TORTURE OR MISTREATMENT... NaN NaN NaN
2 doc3.txt WHAT WE HAD PREDICTED HAS OCCURRED. CRISTIANI ... SO, THE QUESTION IS: WHO GAVE THE ORDER TO KIL... THE MASSACRE OF THE JESUITS WAS NOT A PERSONAL... LET US REMEMBER THAT AFTER THE MASSSACRE OF TH...
3 doc4.txt IN 11/12/98 OUR VIEW, THE ASSASSINS OF THE JES... THE ASSASSINATION OF THE JESUITS AGAIN CONFIRM... NaN NaN
Best answer: I am not entirely sure I understand the question, but here is a snippet that is my best effort at solving it with nltk.
from glob import glob
from os.path import join, split

import nltk
import pandas as pd

dir_name = '/tmp/stackovflw/Docs'
file_to_adverb_dict = {}
nltk_adverb_tags = {'RB', 'RBR', 'RBS'}  # taken from nltk.help.upenn_tagset()

for full_file_path in glob(join(dir_name, '*.txt')):
    # text mode, so that .lower() and the tokenizer receive str rather than bytes
    with open(full_file_path, 'r') as f:
        _, file_name = split(full_file_path)
        # lower -> nltk seems to behave differently when the text is uppercase - try it...
        tokens = nltk.word_tokenize(f.read().lower())
        adverbs_in_file = [token for token, tag in nltk.pos_tag(tokens) if tag in nltk_adverb_tags]
        # consider using a "set" here to remove duplicates
        # converting it back to uppercase (your input is all uppercase)
        file_to_adverb_dict[file_name] = ' '.join(adverbs_in_file).upper()

print(pd.DataFrame(list(file_to_adverb_dict.items()), columns=['file_names', 'col1']))
# file_names col1
# 0 doc4.txt PROBABLY ABROAD ALFONSO HOWEVER ALWAYS ALREADY...
# 1 doc1.txt NOT
# 2 doc3.txt DIRECTLY NOT SO SOLELY NOT PROBABLY NOT EVEN N...
# 3 doc2.txt
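If the regex-only route from the question is enough, the intermediate tab-separated file can probably be skipped by collecting one row per file and handing the rows straight to pandas. The following is only a minimal sketch: the function name `matches_to_frame` is made up for illustration, the column names are borrowed from the question, and the word boundaries in the example pattern are my own assumption (they stop `\w+ly` from matching the front part of a longer word).

```python
import re
from pathlib import Path

import pandas as pd


def matches_to_frame(directory, pattern, flags=re.IGNORECASE):
    """Return one row per .txt file, with its regex matches joined by spaces.

    Hypothetical helper, not part of pandas or of the original answer.
    """
    compiled = re.compile(pattern, flags)
    rows = []
    for path in sorted(Path(directory).glob('*.txt')):
        text = path.read_text(encoding='utf-8', errors='replace')
        # group(0) is always the whole match, even if the pattern has capture groups
        found = [m.group(0) for m in compiled.finditer(text)]
        rows.append({'file_names': path.name, 'col1': ' '.join(found)})
    # explicit columns keep the frame well-formed even when no files are found
    return pd.DataFrame(rows, columns=['file_names', 'col1'])
```

Calling `matches_to_frame('/tmp/stackovflw/Docs', r'\b\w+ly\b')` should give a frame shaped like the one above. A file without matches gets an empty string here rather than NaN.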
One more thing to note: if you only want to find the words ending in "ly" in a specific folder, grep is your friend:
$ grep -o -i -E '\w+ly' *.txt
doc3.txt:DIRECTLY
doc3.txt:SOLELY
doc3.txt:PROBABLY
doc3.txt:EARLY
doc4.txt:PROBABLY
-o prints only the match instead of the whole line
-i ignores case
-E uses extended regular expressions
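For comparison, these flags map fairly directly onto Python's `re` module: taking only the matched text plays the role of -o, and `re.IGNORECASE` plays the role of -i. There is no direct counterpart to -E, because Python's `re` has its own syntax in which `+`, `?` and `|` work unescaped. Here is a rough sketch, with the hypothetical name `grep_o`; the two regex dialects are not identical, so results may differ in edge cases.

```python
import re
from pathlib import Path


def grep_o(directory, pattern):
    """Yield 'file:match' strings, similar in spirit to grep -o -i -E over *.txt.

    Hypothetical helper written for illustration only.
    """
    compiled = re.compile(pattern, re.IGNORECASE)      # -i: ignore case
    for path in sorted(Path(directory).glob('*.txt')):
        with path.open(encoding='utf-8', errors='replace') as handle:
            for line in handle:                        # grep matches line by line
                for match in compiled.finditer(line):
                    # -o: emit only the matched text, not the whole line
                    yield f'{path.name}:{match.group(0)}'
```

For instance, `for hit in grep_o('.', r'\w+ly'): print(hit)` would print one `file:match` line per hit for the .txt files in the current directory.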
Using awk to group the matches by file name:
$ grep -o -i -E '\w+ly' *.txt | awk -F':' '{a[$1]=a[$1] " " $2}END{for( i in a ) print i,"," a[i]}'
doc4.txt , PROBABLY
doc3.txt , DIRECTLY SOLELY PROBABLY EARLY
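The second update in the question asks for a different shape: one row per file and one column per sentence containing a given word. Below is a rough sketch of one way to get there. It assumes that sentences can be split naively on `.`, `!` or `?` followed by whitespace; `nltk.sent_tokenize` would likely be a more robust choice. The name `sentences_with_word` is invented for illustration, and the column names `Files`, `words1`, `words2`, ... come from the question's table.

```python
import re
from pathlib import Path

import pandas as pd

# Naive sentence boundary (an assumption): ., ! or ? followed by whitespace
SENTENCE_END = re.compile(r'(?<=[.!?])\s+')


def sentences_with_word(directory, word):
    """Return a wide DataFrame: one row per file, one column per matching sentence."""
    word_re = re.compile(r'\b' + re.escape(word) + r'\b', re.IGNORECASE)
    per_file = {}
    for path in sorted(Path(directory).glob('*.txt')):
        text = path.read_text(encoding='utf-8', errors='replace')
        # collapse internal whitespace so each sentence sits on a single line
        sentences = [' '.join(s.split()) for s in SENTENCE_END.split(text) if s.strip()]
        per_file[path.name] = [s for s in sentences if word_re.search(s)]

    # pad shorter rows with None so that every row has the same number of columns
    width = max((len(found) for found in per_file.values()), default=0)
    columns = ['Files'] + [f'words{i}' for i in range(1, width + 1)]
    rows = [[name] + found + [None] * (width - len(found))
            for name, found in per_file.items()]
    return pd.DataFrame(rows, columns=columns)
```

With the question's data, `sentences_with_word('/Users/user/Desktop/Docs/', 'JESUITS')` would be the corresponding call. Missing cells come out as None here, which pandas treats as missing values just like NaN.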