Python – Optimal code to find the five words before and after a given point in a line

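I'm trying to write code that finds the 5 words on either side of a particular phrase. Easy enough, but I have to do this on a massive amount of data, so the code needs to be optimal!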

for file in listing:
    file2 = open('//home/user/Documents/Corpus/Files/'+file,'r')
    for line in file2:
        linetrigrams = trigram_split(line)
        for trigram in linetrigrams:
            if trigram in trigrams:
                line2 = line.replace(trigram,'###').split('###')
                window = (line2[0].split()[-5:] + line2[1].split()[:5])
                for item in window:
                    if item in mostfreq:
                        matrix[trigram][mostfreq[item]] += 1

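Any suggestions for doing this faster? I may well be using completely the wrong data structures here. trigram_split() simply returns all the trigrams in the line (the units I need to create vectors for). trigrams is essentially a list of about a million trigrams that I want to create vectors for. window takes the 5 words before and after the trigram (if that trigram is in the list) and then checks whether they are in mostfreq (a dictionary with 1000 words as keys, each mapped to an integer [0-100] as its stored value). That is then used to update matrix (a dictionary whose stored values are lists, [0] * 1000). The corresponding value in this pseudo-matrix is incremented in this way.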
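The structures described above can be sketched roughly as follows. This is a minimal illustration, not the poster's code: it assumes trigrams is held in a set rather than a list (so each membership test is a hash lookup instead of a scan), it takes the window by word index instead of replacing the trigram with a marker, and the toy data and the name count_line are hypothetical.

```python
from collections import defaultdict

# Assumption: a set rather than the list described in the question, so that
# "trigram in trigrams" does not scan a million entries on every test.
trigrams = {"quick brown fox", "over the lazy"}

# Toy stand-in for mostfreq: word -> column index in the count vector.
mostfreq = {"the": 0, "jumps": 1, "dog": 2}

# One count vector per trigram, created the first time the trigram is seen.
matrix = defaultdict(lambda: [0] * len(mostfreq))

def count_line(line, width=5):
    """Update matrix with the words found within `width` words of each trigram."""
    words = line.split()
    for i in range(len(words) - 2):
        trigram = " ".join(words[i:i + 3])
        if trigram not in trigrams:
            continue
        # Slice by word index; no marker string is inserted into the text.
        window = words[max(0, i - width):i] + words[i + 3:i + 3 + width]
        for word in window:
            column = mostfreq.get(word)
            if column is not None:
                matrix[trigram][column] += 1

count_line("the quick brown fox jumps over the lazy dog")
print(dict(matrix))
```

Slicing by index also handles a trigram that occurs more than once in a line, since every occurrence gets its own window.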

Best answer: There are several important factors to consider when weighing the different approaches:

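- Multi-line vs. single-line input
- Length of the lines
- Length of the search pattern
- Match rate of the search
- What to do if there are fewer than 5 words before/after
- How to handle non-word, non-space characters (newlines and punctuation)
- Case-insensitive?
- How to handle overlapping matches? For example, if the text is "We are the Knights who say NI! NI NI NI NI NI NI NI NI" and you search for NI, what do you return? Will this ever happen to you?
- What if ### appears in your data?
- Would you rather miss some matches or return incorrect results? There may be trade-offs here, especially with messy real-world data.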
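Several of these questions (short context, punctuation, case, overlapping matches) can be made concrete with a small token-based sketch. This is one possible set of answers, not the only reasonable one: the function name windows is hypothetical, punctuation is dropped by tokenising with \w+ (which also splits words such as "aren't" in two), matching ignores case, and every occurrence gets its own window even when the windows overlap.

```python
import re

def windows(text, target, width=5):
    """Return a (before, after) pair of word lists for every occurrence of target."""
    # Tokenise into runs of word characters; punctuation and newlines are dropped.
    words = re.findall(r"\w+", text)
    target = target.lower()
    results = []
    for i, word in enumerate(words):
        if word.lower() == target:
            # Slices simply come back shorter near the start or end of the text.
            before = words[max(0, i - width):i]
            after = words[i + 1:i + 1 + width]
            results.append((before, after))
    return results

text = "We are the Knights who say NI! NI NI NI"
hits = windows(text, "NI", width=2)
for before, after in hits:
    print(before, after)
```

Here each of the four NI tokens is reported separately, and the last one has an empty "after" list because nothing follows it.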

You could try a regular expression...

import re
zen = """Beautiful is better than ugly. \
Explicit is better than implicit. \
Simple is better than complex. \
Complex is better than complicated. \
Flat is better than nested. \
Sparse is better than dense. \
Readability counts. \
Special cases aren't special enough to break the rules. \
Although practicality beats purity. \
Errors should never pass silently. \
Unless explicitly silenced. \
In the face of ambiguity, refuse the temptation to guess. \
There should be one-- and preferably only one --obvious way to do it. \
Although that way may not be obvious at first unless you're Dutch. \
Now is better than never. \
Although never is often better than *right* now. \
If the implementation is hard to explain, it's a bad idea. \
If the implementation is easy to explain, it may be a good idea. \
Namespaces are one honking great idea -- let's do more of those!"""

searchvar = 'Dutch'
# re.escape() keeps any regex metacharacters in the search phrase literal.
dutchre = re.compile(r"""((?:\S+\s*){,5})(%s)((?:\S+\s*){,5})""" % re.escape(searchvar), re.IGNORECASE | re.MULTILINE)
print(dutchre.findall(zen))
#[("obvious at first unless you're ", 'Dutch', '. Now is better than ')]

An alternative approach, which gives worse results IMO...

def splitAndFind(text, phrase):
    # Caveat: this breaks if the text itself already contains "###".
    text2 = text.replace(phrase, "###").split("###")
    if len(text2) > 1:
        return (text2[0].split()[-5:], text2[1].split()[:5])
    return None  # phrase not found
print(splitAndFind(zen, 'Dutch'))
#(['obvious', 'at', 'first', 'unless', "you're"],
# ['.', 'Now', 'is', 'better', 'than'])

In IPython, you can time these easily:

timeit dutchre.findall(zen)
1000 loops, best of 3: 814 us per loop

timeit 'Dutch' in zen
1000000 loops, best of 3: 650 ns per loop

timeit zen.find('Dutch')
1000000 loops, best of 3: 812 ns per loop

timeit splitAndFind(zen, 'Dutch')
10000 loops, best of 3: 18.8 us per loop
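Outside IPython, a similar comparison can be run with the standard-library timeit module. The sketch below is only an illustration: the sample text and the helper split_and_find are hypothetical stand-ins (the helper uses str.partition rather than the "###" marker), and the numbers it prints depend entirely on the machine and the input.

```python
import timeit

# Hypothetical sample text; substitute a line from the real corpus.
text = "Although that way may not be obvious at first unless you're Dutch. " * 20

def split_and_find(text, phrase):
    # str.partition splits around the first occurrence without a marker string.
    before, found, after = text.partition(phrase)
    if found:
        return before.split()[-5:], after.split()[:5]
    return None

n = 1000  # number of calls timed per measurement
t_in = timeit.timeit(lambda: "Dutch" in text, number=n)
t_split = timeit.timeit(lambda: split_and_find(text, "Dutch"), number=n)

# timeit returns total seconds for n calls; convert to microseconds per call.
print("in:    %.2f us per loop" % (t_in / n * 1e6))
print("split: %.2f us per loop" % (t_split / n * 1e6))
```

Timing on representative lines from the real corpus gives a better picture than a toy string, since match rate and line length both affect the result.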