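I'm iterating over a list of words to find the characters that are most common across words (i.e. in the list [hello, hank], 'h' counts as appearing twice, while 'l' counts as appearing once).
A plain Python list works fine, but I'm also looking into NumPy (dtype arrays?) and Pandas. It looks like NumPy may be the way to go, but are there other packages to consider? How can I make this function faster?
The code in question: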
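from collections import Counter

def mostCommon(guessed, li):
    # Count each letter at most once per word (`guessed` is unused here).
    count = Counter()
    for words in li:
        for letters in set(words):
            count[letters] += 1
    return count.most_common()[:10]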
Thanks.
Best answer: here's a NumPy approach that uses its concept of views:
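import numpy as np
import pandas as pd

def tabulate_occurrences(a):  # Case sensitive
    # View the strings as single bytes; padding bytes compare equal to b''.
    chars = np.asarray(a).view('S1')
    valid_chars = chars[chars != b'']
    unqchars, count = np.unique(valid_chars, return_counts=True)
    return pd.DataFrame({'char': unqchars.astype('U1'), 'count': count})

def topNchars(a, N=10):  # Case insensitive
    # Lowercase, then view as raw bytes; zero bytes are padding (assumes ASCII text).
    s = np.char.lower(a).view('uint8')
    unq, count = np.unique(s[s != 0], return_counts=True)
    sidx = count.argsort()[-N:][::-1]  # indices of the N largest counts, descending
    h = unq[sidx]
    return [chr(i) for i in h]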
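As a rough illustration of the view trick used above (a minimal sketch that assumes ASCII-only words; the variable names are only illustrative), viewing a fixed-width string array as `uint8` exposes its raw bytes, and the zero bytes are just padding:

```python
import numpy as np

words = np.asarray(['er', 'IS', 'you'])      # fixed-width string array
raw = np.char.lower(words).view('uint8')     # reinterpret the buffer byte by byte
codes = raw[raw != 0]                        # drop the zero padding bytes
letters = codes.tobytes().decode('ascii')    # the remaining bytes are the letters
print(letters)
```

Once the characters are plain integers like this, `np.unique(..., return_counts=True)` can tally them in one call.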
Sample run:
In [322]: a = ['er', 'IS' , 'you', 'Is', 'is', 'er', 'IS']
In [323]: tabulate_occurrences(a) # Case sensitive
Out[323]:
  char  count
0    I      3
1    S      2
2    e      2
3    i      1
4    o      1
5    r      2
6    s      2
7    u      1
8    y      1
In [533]: topNchars(a, 5) # Case insensitive
Out[533]: ['s', 'i', 'r', 'e', 'y']
In [534]: topNchars(a, 10) # Case insensitive
Out[534]: ['s', 'i', 'r', 'e', 'y', 'u', 'o']
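Note that both functions above count every occurrence of a character, whereas the question counts a letter at most once per word. If that per-word behaviour is what's needed, one possible pure-Python variant (a sketch only, not benchmarked; `most_common_letters` is a hypothetical name) hands the per-word sets straight to `Counter`:

```python
from collections import Counter
from itertools import chain

def most_common_letters(words, n=10):
    """Count each letter at most once per word; return the n most frequent."""
    # set(word) removes repeats within a word; chain flattens all the sets
    # into a single stream of letters for Counter to tally.
    counts = Counter(chain.from_iterable(set(word) for word in words))
    return counts.most_common(n)

print(most_common_letters(['hello', 'hank'], 1))
```

Whether this beats the explicit loop depends on the input, so it's worth timing on real data before switching.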