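I am new to Python 3, and I have a question about different ways of solving this problem, specifically about using different data structures.
My question is how to compare the trade-offs of different sampling techniques.
I first solved the problem using a dictionary. Then I tried rewriting it using only lists. I tried to think about the benefits of sorting, but I don't know what the real difference between the two approaches is. Switching from one to the other doesn't seem to change much.
Method 1: I use a dictionary to store the histogram as key-value pairs (word and count).
Method 2: takes in a source text in string format and returns a list of lists, in which the first element of each sublist is the word and the second element is its frequency in the source text.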
# This program analyzes word frequency in a histogram and
# samples words according to their observed frequencies.
# histogram() takes in a source text in string format and returns a dictionary
# in which each key is a unique word and its value is that word's
# frequency in the source text
import sys
import re
import random
import time

def histogram(source_text):
    histogram = {}
    # strip the punctuation characters . , : ; ! - [ ] ? from each word
    for word in source_text.split():
        # '-', '[' and ']' are escaped so they are matched literally; unescaped,
        # '!-[' is read as a range that also deletes digits and capital letters
        word = re.sub(r'[.,:;!\-\[\]?]', '', word)
        if word in histogram:
            histogram[word] += 1
        else:
            histogram[word] = 1
    return histogram

def random_word(histogram):
    probability = 0
    # pick a position between 1 and the total number of words (inclusive)
    rand_index = random.randint(1, sum(histogram.values()))
    # Algorithm 1: step through every occurrence of every word
    for (key, value) in histogram.items():
        for num in range(1, value + 1):
            # count this occurrence before comparing, so that
            # rand_index == total can still match the last word
            probability += 1
            if probability == rand_index:
                # outcome_gram is the module-level dict created under __main__
                if key in outcome_gram:
                    outcome_gram[key] += 1
                else:
                    outcome_gram[key] = 1
                # return outcome_gram
                return key
    # Method 2 takes in a source text in string format and returns a list of lists
    # in which the first element in each sublist is the word and the second
    # element is its frequency in the source text
    # Algorithm 2: add each word's whole count at once
    # for word in histogram:
    #     probability += histogram[word]
    #     if probability >= rand_index:
    #         if word in outcome_gram:
    #             outcome_gram[word] += 1
    #         else:
    #             outcome_gram[word] = 1
    #         return word

if __name__ == "__main__":
    outcome_gram = {}
    # a with block closes the file automatically and avoids
    # shadowing the built-in name dict
    with open('./fish.txt', 'r') as source_file:
        text = source_file.read()
    hist_dict = histogram(text)
    for number in range(1, 100000):
        random_word(hist_dict)

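Best answer
Which one is more readable? I think the dictionary version is easier to understand. Also note that you can pass the list of 2-tuples from the second method to the dict constructor to reproduce the output of the first method. That should give you an idea of how the two implementations are roughly equivalent, at least in some respects. Unless this causes a performance problem, I wouldn't worry about it too much.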
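To make that equivalence concrete, here is a minimal sketch. The question does not show its list-based code, so `list_histogram` below is a hypothetical stand-in for the second method, and the sample sentence is made up for illustration.

```python
def dict_histogram(source_text):
    """Method 1: map each unique word to its count."""
    histogram = {}
    for word in source_text.split():
        histogram[word] = histogram.get(word, 0) + 1
    return histogram


def list_histogram(source_text):
    """Hypothetical Method 2: a list of [word, count] pairs."""
    histogram = []
    for word in source_text.split():
        for entry in histogram:
            if entry[0] == word:  # linear scan over the existing entries
                entry[1] += 1
                break
        else:
            # the loop finished without a match, so this is a new word
            histogram.append([word, 1])
    return histogram


text = "one fish two fish red fish blue fish"
pairs = list_histogram(text)
# dict() accepts any iterable of 2-item sequences, so the list form
# converts directly into the dictionary form.
assert dict(pairs) == dict_histogram(text)
```

The main practical difference is lookup cost: the dictionary finds a word in roughly constant time, while the list version has to scan its entries for every word it counts.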
The strength of Python is that you can write the same program in about five lines, and in a readable way.
import re, random
from collections import Counter

def histogram(text):
    # strip the punctuation characters . , : ; ! - [ ] ? before splitting
    clean_text = re.sub(r'[.,:;!\-\[\]?]', '', text)
    words = clean_text.split()
    return Counter(words)

def random_word(histogram):
    words, frequencies = zip(*histogram.items())
    # random.choices returns a list, so take its single element
    return random.choices(words, frequencies, k=1)[0]
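
If performance ever does become a concern, one option is to compute the running totals once instead of walking the histogram on every draw. The sketch below is only an assumption about how that might look and is not part of the original answer; `make_sampler` is a hypothetical helper name and the sample sentence is invented. It uses `itertools.accumulate` and `bisect.bisect_right` from the standard library, so each draw is a binary search over the running totals.

```python
import bisect
import itertools
import random
from collections import Counter


def make_sampler(histogram):
    """Return a function that draws words in proportion to their counts."""
    words = list(histogram)
    # Running totals, e.g. counts [1, 4, 1] become [1, 5, 6].
    cumulative = list(itertools.accumulate(histogram[w] for w in words))
    total = cumulative[-1]

    def sample():
        # Pick a position in [0, total), then find the first running
        # total that is strictly greater than that position.
        position = random.randrange(total)
        return words[bisect.bisect_right(cumulative, position)]

    return sample


hist = Counter("one fish two fish red fish blue fish".split())
sample = make_sampler(hist)
draws = Counter(sample() for _ in range(10000))
print(draws.most_common())
```

This is the same idea as Algorithm 2 in the question (add up counts until the random position is reached), except that the sums are stored up front. Whether the extra setup pays off depends on how many draws are made per histogram.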