python - Fastest object for iterating over the characters in a list of strings

November 8, 2023 · 223 views

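I am iterating over a list of words to find the characters that are most common across the words (i.e. in the list [hello, hank], 'h' counts as appearing twice, while 'l' counts as appearing once).
A Python list works fine, but I am also looking into NumPy (dtype arrays?) and Pandas. It looks like NumPy may be the way to go, but are there other packages to consider? How can I make this function faster?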

Code in question:

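from collections import Counter

def mostCommon(guessed, li):
    # `guessed` is unused here; it is kept so the signature matches the original.
    count = Counter()
    for words in li:
        # set() ensures each character is counted at most once per word
        for letters in set(words):
            count[letters] += 1
    return count.most_common(10)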

Thanks.

Best answer: Here is a NumPy approach that uses its concept of views:

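import numpy as np
import pandas as pd

def tabulate_occurrences(a):           # Case sensitive
    # Store the words as fixed-width ASCII bytes, then view them one byte at a time
    chars = np.asarray(a, dtype='S').view('S1')
    valid_chars = chars[chars != b'']  # drop the padding of shorter words
    unqchars, count = np.unique(valid_chars, return_counts=True)
    return pd.DataFrame({'char': unqchars.astype(str), 'count': count})

def topNchars(a, N=10):                # Case insensitive
    # Lowercase the byte strings, then view the same buffer as raw byte values
    s = np.char.lower(np.asarray(a, dtype='S')).view('uint8')
    unq, count = np.unique(s[s != 0], return_counts=True)  # 0 is padding
    sidx = count.argsort(kind='stable')[-N:][::-1]  # indices of the N largest counts
    h = unq[sidx]
    return [chr(i) for i in h]         # chr replaces Python 2's unichr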
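
To illustrate what a "view" means in this answer, the following is a minimal sketch, assuming ASCII-only words stored as fixed-width byte strings (the variable names are only for illustration). A view reinterprets the same memory buffer under a different dtype without copying it, which is why the padding of shorter words shows up as empty bytes or zeros and has to be filtered out.

```python
import numpy as np

words = ['er', 'IS', 'you']

# Assumption: ASCII-only input, stored as fixed-width byte strings.
arr = np.asarray(words, dtype='S')   # dtype 'S3': each item is padded to 3 bytes
chars = arr.view('S1')               # same buffer, read one byte at a time
codes = arr.view('uint8')            # same buffer again, as raw byte values

print(chars.tolist())  # [b'e', b'r', b'', b'I', b'S', b'', b'y', b'o', b'u']
print(codes.tolist())  # [101, 114, 0, 73, 83, 0, 121, 111, 117]
print(np.shares_memory(arr, chars))  # True: no copy was made
```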

Sample run:

In [322]: a = ['er', 'IS' , 'you', 'Is', 'is', 'er', 'IS']

In [323]: tabulate_occurrences(a) # Case sensitive
Out[323]: 
  char  count
0    I      3
1    S      2
2    e      2
3    i      1
4    o      1
5    r      2
6    s      2
7    u      1
8    y      1

In [533]: topNchars(a, 5)         # Case insensitive
Out[533]: ['s', 'i', 'r', 'e', 'y']

In [534]: topNchars(a, 10)        # Case insensitive
Out[534]: ['s', 'i', 'r', 'e', 'y', 'u', 'o']
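
Note that the NumPy functions above count every occurrence of a character, whereas the question counts each character at most once per word. For comparison, here is a minimal pure-Python sketch of the per-word variant, under the assumption that the input is a plain list of strings; `per_word_counts` is a hypothetical helper name, not part of the original answer.

```python
from collections import Counter
from itertools import chain

def per_word_counts(words):
    """Count, for each character, how many words contain it at least once."""
    # set(w) removes repeats inside a word; chain feeds all sets to Counter at once
    return Counter(chain.from_iterable(set(w) for w in words))

counts = per_word_counts(['hello', 'hank'])
print(counts['h'], counts['l'])  # 2 1
```

Whether this is faster than the explicit loop depends on the data, so it is worth timing both on a realistic word list.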