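Much faster computation time in Python

I have this code, which reorders a list of company names based on Jaccard distance. It works fine.

However, if I use this code on 30,000 company names, the computation takes far too long. For example, I started it 2 hours ago and it is still running.

How can I make this code run faster? Maybe with some library, or by changing its structure?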

def jack(a,b):
    x=a.split()
    y=b.split()
    k=float(len(set(x)&set(y)))/float(len((set(x) | set(y))))
    return k

t=['bancorp', 'bancorp', 'bancorp ali', 'bancorp puno', 'bancorp amo', 'gas eu', 'gas', 'profeta', 'bancorp america', 'uni', 'gas for', 'gas tr']

out = [] # this will be the sorted list
for index, val1 in enumerate(t): # work through each item in the original list
    if val1 not in out: # if we haven't already put this item in the new list
        out.append(val1) # put this item in the new list
    for val2 in t[index+1:]: # search the rest of the list
        if val2 not in out: # if we haven't already put this item in the new list
            if jack(val1, val2) >= 0.5: # and the new item is close to the current item
                out.append(val2) # add the new item too

The output is then:

print out

['bancorp', 'bancorp ali', 'bancorp puno', 'bancorp amo', 'bancorp america', 'gas eu', 'gas', 'gas for', 'gas tr', 'profeta', 'uni']
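
To make the 0.5 threshold concrete, here is a minimal sketch that evaluates the same Jaccard measure on a few of the sample names. The helper name `jaccard` is only illustrative; it is meant to be equivalent to `jack` above, not a different method.

```python
from __future__ import division  # true division on Python 2 as well


def jaccard(a, b):
    """Jaccard similarity of the whitespace-separated word sets of a and b."""
    x, y = set(a.split()), set(b.split())
    return len(x & y) / len(x | y)


# 1 shared word out of 2 distinct words: reaches the 0.5 threshold
print(jaccard('bancorp', 'bancorp ali'))
# 1 shared word out of 3 distinct words: below the threshold
print(jaccard('gas eu', 'gas tr'))
# no shared words at all
print(jaccard('bancorp', 'gas'))
```

This is why 'gas tr' is not pulled in directly by 'gas eu' but only later, when 'gas' becomes the current item.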

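Best answer: Following the suggestions others made above, I rolled my own improved code. As Niklas B. pointed out, the main improvement is the reduction from O(n^3) to O(n^2).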

from __future__ import division
import itertools

def jack(a,b):
    #print "jack", a, b, len(a & b) / len(a | b)
    return len(a & b) / len(a | b)

def jacksort(t):
    # precompute the word set of each input
    sts = [(i, it, set(it.split())) for i, it in enumerate(t)]
    # allow O(1) testing for 'word already in output'
    os = set()
    out = [] # this will be the sorted list
    # work through each item in the original list
    for index, val1, sval1 in sts:
        if val1 not in os:
            out.append(val1) # put this item in the new list
            os.add(val1)
        for index2, val2, sval2 in itertools.islice(sts, index+1, len(sts)):
            # search the rest of the list
            if val2 in os: continue
            if jack(sval1, sval2) >= 0.5:
                # the new item is close to the current item
                out.append(val2) # add the new item too
                os.add(val2)

    return out

def main(n=100):
    t=['bancorp', 'bancorp', 'bancorp ali', 'bancorp puno', 'bancorp amo',
        'gas eu', 'gas', 'profeta', 'bancorp america', 'uni', 'gas for',
        'gas tr']
    t += [" ".join(w.split()) for w in open("/usr/share/dict/words").read().split()]
    t = t[:n]
    jacksort(t)

where n is the input size to test. My /usr/share/dict/words is /usr/share/dict/american-english from the Debian package wamerican, version 7.1-1.

Some timings for my code:
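
As a sanity check that the faster version still produces the same ordering, the two approaches can be compared side by side on the sample list. This is a sketch under my own naming: `reorder_original` and `reorder_fast` are illustrative stand-ins that restate the question's loop and the set-based approach compactly, and neither handles empty strings (an empty word set would divide by zero).

```python
from __future__ import division


def reorder_original(t):
    """The question's approach: list membership tests, sets rebuilt per pair."""
    def jack(a, b):
        x, y = set(a.split()), set(b.split())
        return len(x & y) / len(x | y)

    out = []
    for index, val1 in enumerate(t):
        if val1 not in out:              # O(len(out)) scan of a list
            out.append(val1)
        for val2 in t[index + 1:]:
            if val2 not in out and jack(val1, val2) >= 0.5:
                out.append(val2)
    return out


def reorder_fast(t):
    """Set-based approach: word sets precomputed, a set tracks emitted items."""
    word_sets = [set(item.split()) for item in t]
    seen = set()
    out = []
    for i, val1 in enumerate(t):
        if val1 not in seen:             # average O(1) hash lookup
            out.append(val1)
            seen.add(val1)
        for j in range(i + 1, len(t)):
            val2 = t[j]
            if val2 in seen:
                continue
            a, b = word_sets[i], word_sets[j]
            if len(a & b) / len(a | b) >= 0.5:
                out.append(val2)
                seen.add(val2)
    return out


sample = ['bancorp', 'bancorp', 'bancorp ali', 'bancorp puno', 'bancorp amo',
          'gas eu', 'gas', 'profeta', 'bancorp america', 'uni', 'gas for',
          'gas tr']

assert reorder_fast(sample) == reorder_original(sample)
print(reorder_fast(sample))
```

Both functions still compare every pair of items; the saving comes only from the cheaper membership test and from not rebuilding the word sets for each comparison.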

   10: 10 loops, best of 3: 30.6 msec per loop
   20: 10 loops, best of 3: 28.9 msec per loop
   50: 10 loops, best of 3: 29.6 msec per loop
  100: 10 loops, best of 3: 31.7 msec per loop
  200: 10 loops, best of 3: 38.7 msec per loop
  500: 10 loops, best of 3: 85.1 msec per loop
 1000: 10 loops, best of 3: 261 msec per loop
 2000: 10 loops, best of 3: 1.01 sec per loop
 5000: 10 loops, best of 3: 6.16 sec per loop
10000: 10 loops, best of 3: 25.3 sec per loop

Compared with the original code, adapted to my test harness:

   10: 10 loops, best of 3: 34.1 msec per loop
   20: 10 loops, best of 3: 34.3 msec per loop
   50: 10 loops, best of 3: 33.9 msec per loop
  100: 10 loops, best of 3: 43.1 msec per loop
  200: 10 loops, best of 3: 74.9 msec per loop
  500: 10 loops, best of 3: 415 msec per loop
 1000: 10 loops, best of 3: 2.35 sec per loop
 2000: 10 loops, best of 3: 14.8 sec per loop
 5000: [did not finish while preparing this post]

The numbers were generated with timeit on the command line:

$ for i in 10 20 50 100 200 500 1000 2000 5000 10000; do printf "%5d: " $i; python -mtimeit 'import ojack as jack' 'jack.main('$i')'; done

Tested with Python version 2.7.5-5 on Debian amd64, on an i5-3320m CPU. Both functions appear to grow in execution time as the big-O notation claims.
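
One rough way to read the growth rate off the tables is the log-log slope between consecutive measurements: if time grows like n**k, then log(t2/t1) / log(n2/n1) is about k. The sketch below uses only the figures from the two tables above, for n >= 500 where startup overhead no longer dominates; the function name `local_exponents` is mine.

```python
import math

# (n, seconds per loop), copied from the timing tables above
new_code = [(500, 0.0851), (1000, 0.261), (2000, 1.01),
            (5000, 6.16), (10000, 25.3)]
original = [(500, 0.415), (1000, 2.35), (2000, 14.8)]


def local_exponents(timings):
    """Return (n1, n2, k) for consecutive rows, where k is the log-log slope."""
    return [
        (n1, n2, math.log(t2 / t1) / math.log(float(n2) / n1))
        for (n1, t1), (n2, t2) in zip(timings, timings[1:])
    ]


for label, timings in (("new", new_code), ("original", original)):
    for n1, n2, k in local_exponents(timings):
        print("%-8s %5d -> %5d: exponent ~ %.2f" % (label, n1, n2, k))
```

On these numbers the new code settles at an exponent of about 2 from n = 1000 upward, while the original is already above 2.5 between 1000 and 2000. These are single timeit figures, so treat the slopes as indicative only.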

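My input differs from yours in that my "words" are almost all single letters, though each term has 5-7 "words". I don't know whether in practice this means your input would perform worse or better, since you haven't characterised your input much. In fact, I did a single run with n = 30000 and got 393 seconds. That is considerably more than O(n^2) predicts (25.3 * 9 = 227.7), so there must be a log n lurking in there somewhere, despite the O(1) claimed for Python sets.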
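
The extrapolation can be written out as a short calculation. This is only arithmetic on the two figures quoted above (25.3 s at n = 10000, 393 s at n = 30000); since the larger figure comes from a single run, the result is an estimate rather than a measurement of the true growth rate.

```python
import math

t_10000 = 25.3    # seconds per loop at n = 10000, from the table above
t_30000 = 393.0   # seconds, the single run at n = 30000

# Pure quadratic scaling: tripling n multiplies the time by 3**2 = 9
predicted = t_10000 * (30000.0 / 10000.0) ** 2
print("predicted at n=30000: %.1f s" % predicted)
print("measured / predicted: %.2f" % (t_30000 / predicted))

# Exponent implied by these two points alone
implied = math.log(t_30000 / t_10000) / math.log(3.0)
print("implied exponent: %.2f" % implied)
```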
