python - Perl: efficiently counting many words in many strings

August 6, 2019, 191 views

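I often find myself needing to count how many times words appear in a number of text strings. When I do this, I want to know how many times each word appears in each individual text string.

I don't believe my approach is very efficient, and any help you could give me would be great.

Usually I write a loop that (1) pulls in the text from a txt file as a text string, (2) runs another loop over the words I want to count, using a regular expression to check how many times a given word appears and pushing the count onto an array each time, and (3) prints the comma-separated array of counts to a file.

Here is an example: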

use strict;
use warnings;

# List of words to count.
my @word_list = qw(word1 word2 word3 word4);

# Collect the names of the txt files to count in.
my $data_loc = "/data/txt_files_for_counting";
opendir(my $dir, $data_loc) or die "Can't open directory $data_loc: $!";
my @file_names = readdir($dir);
closedir($dir);

# Open the file the results are appended to.
my $out_path_name = "/output/my_counts.csv";
open(my $out_file, ">>", $out_path_name) or die "Can't open $out_path_name: $!";

# Run the loops.
foreach my $file (@file_names) {
    # Skip ".", ".." and hidden files.
    next if $file =~ /^\./;

    # Slurp the whole txt file into one string.
    my $p_file = "$data_loc/$file";
    open(my $in, "<", $p_file) or die "Can't open the file $p_file: $!";
    my $text_of_txt_file = do { local $/; <$in> };
    close($in) or die "Cannot close $p_file: $!";

    # Preserve the filename so counts are interpretable.
    print {$out_file} $file;

    foreach my $wl_word (@word_list) {
        # Search for the term without any context; in list context
        # m//g returns every match, so the array length is the count.
        my @finds_p = $text_of_txt_file =~ m/\b$wl_word\b/g;
        my $n_finds = @finds_p;
        print {$out_file} ",", $n_finds;
    }
    print {$out_file} ",\n";
}
close($out_file);

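I find this approach very inefficient (slow) as the number of txt files and the number of words I want to count grow.

Is there a more efficient way to do this?

Is there a Perl package that does this?

Would it be more efficient in Python? (For example, is there a Python package that does this?)

Thanks!

Edit: note that I don't want to count the total number of words, but rather the occurrences of certain words. So the answers to the question "What's the fastest way to count the number of words in a string in Perl?" don't quite apply, unless I'm missing something.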

Best answer

First, about what you're doing with opendir: I wouldn't do it that way, and would suggest glob instead.
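Since the question also asks about Python, here is a rough sketch of the same glob idea there. The helper name `list_input_files` and the `*.txt` pattern are assumptions made for illustration; they are not part of the original answer.

```python
import glob
import os


def list_input_files(directory, pattern="*.txt"):
    """Return the sorted paths in `directory` that match `pattern`.

    The glob module's `*` does not match names that start with a dot,
    so ".", ".." and hidden files never show up and need no filtering.
    """
    return sorted(glob.glob(os.path.join(directory, pattern)))


if __name__ == "__main__":
    # Directory taken from the question; adjust as needed.
    for path in list_input_files("/data/txt_files_for_counting"):
        print(path)
```

Compared with reading the directory by hand, this removes the explicit check for names beginning with a dot.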

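Beyond that, there is another useful trick: compile a single regex for your "words". This helps because when a regex interpolates a variable, Perl has to rebuild the pattern whenever the variable's value changes, and in your inner loop it changes on every iteration. If the pattern is static and compiled once, that work is no longer needed.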

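use strict;
use warnings;
use autodie;    # open/close die with a useful message on failure

my @words = ( "word1", "word2", "word3", "word4", "word5 word6" );

# Build one alternation of all (escaped) words and compile it once.
my $words_regex = join( "|", map { quotemeta } @words );
$words_regex = qr/\b($words_regex)\b/;

open( my $output, ">", "/output/my_counts.csv" );

# The trailing /* makes glob return the files inside the directory
# rather than the directory itself.
foreach my $file ( glob("/data/txt_files_for_counting/*") ) {
    open( my $input, "<", $file );
    my %count_of;
    # Read line by line; a single pass finds every listed word.
    while (<$input>) {
        foreach my $match (m/$words_regex/g) {
            $count_of{$match}++;
        }
    }
    print {$output} $file, "\n";
    foreach my $word (@words) {
        print {$output} $word, " => ", $count_of{$word} // 0, "\n";
    }
    close($input);
}
close($output);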

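With this approach you no longer need to "slurp" the whole file into memory to process it. (That may not be a big advantage, depending on the size of the files.)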
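The question also asks whether this could be done in Python. A minimal sketch of the same idea (one precompiled alternation, applied line by line) might look like the following. The function names `build_word_regex`, `count_words` and `csv_row` are hypothetical, and the CSV row layout (file name followed by one count per word) is an assumption based on the output format described in the question.

```python
import re
from collections import Counter


def build_word_regex(words):
    """Compile one alternation matching any of `words` as a whole word.

    re.escape plays the role of Perl's quotemeta, and re.compile the
    role of qr//: the pattern is built once and then reused.
    """
    alternation = "|".join(re.escape(word) for word in words)
    return re.compile(r"\b(?:" + alternation + r")\b")


def count_words(lines, word_regex):
    """Count matches over an iterable of lines (for example an open file)."""
    counts = Counter()
    for line in lines:
        # findall returns the matched text because the group is non-capturing.
        counts.update(word_regex.findall(line))
    return counts


def csv_row(file_name, counts, words):
    """Return the file name followed by one count per word (0 if absent)."""
    return [file_name] + [counts[word] for word in words]


if __name__ == "__main__":
    words = ["word1", "word2", "word3", "word4", "word5 word6"]
    regex = build_word_regex(words)
    sample = ["word1", "word2", "word3 word4 word5 word6 word2 word5 word4"]
    print(csv_row("sample.txt", count_words(sample, regex), words))
```

Because `count_words` accepts any iterable of lines, an open file object can be passed directly, so the file is streamed rather than read into memory at once. Whether this is faster than the Perl version is not something the original answer claims; it would need measuring on real data.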

Given this input data:

word1
word2
word3 word4 word5 word6 word2 word5 word4
word4 word5 word word 45 sdasdfasf
word5 word6 
sdfasdf
sadf

the output is:

word1 => 1
word2 => 2
word3 => 1
word4 => 3
word5 word6 => 2

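I will note, though, that if the words in your regex contain overlapping substrings, this won't work as is. It is still possible; you just need a different regex.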
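The answer does not say what that different regex would be. One possible reading, sketched here in Python under that assumption, is to give every word its own zero-width lookahead so that no match consumes text and overlapping words can each be counted. The names `build_overlap_regex` and `count_overlapping` are hypothetical.

```python
import re


def build_overlap_regex(words):
    """Build a pattern with one lookahead (and one capture group) per word.

    Each lookahead has an empty second branch, so it always succeeds;
    its group is set only when that word starts at the current position.
    """
    lookaheads = "".join(
        r"(?=(" + re.escape(word) + r"\b)|)" for word in words
    )
    return re.compile(r"\b" + lookaheads)


def count_overlapping(text, words):
    """Count every listed word, even where the words overlap each other."""
    regex = build_overlap_regex(words)
    counts = dict.fromkeys(words, 0)
    # The pattern is zero-width, so finditer visits each word boundary once.
    for match in regex.finditer(text):
        for word, hit in zip(words, match.groups()):
            if hit is not None:
                counts[word] += 1
    return counts


if __name__ == "__main__":
    words = ["word5", "word5 word6", "word6"]
    print(count_overlapping("word5 word6 word5", words))
```

With a plain alternation, "word5" would consume the start of "word5 word6" and the phrase would never be counted; here each word is tested independently at every word boundary. The trade-off is that every lookahead is tried at every boundary, so this is not necessarily faster than looping over the words, and the list is assumed to contain no duplicates.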