使用以下基准:
# Builds the 700-character test string: "gattaca" repeated 100 times.
def create_genome
  Array.new(100, "gattaca").join
end
# Benchmark body: explodes the string into characters, groups identical
# characters, and sizes each group -- repeated 100000 times. The per-iteration
# result is discarded; the method returns 100000 (the value of Integer#times).
def count_frequency_using_chars(sequence)
  100000.times do
    sequence
      .chars
      .group_by { |ch| ch }
      .map { |ch, occurrences| [ch, occurrences.count] }
  end
end
# Benchmark body: one String#count pass per nucleotide (four scans of the
# string) on every iteration. Returns 100000 (the value of Integer#times).
def count_frequency_using_count(sequence)
  100000.times do
    %w[a c g t].map { |nucleotide| sequence.count(nucleotide) }
  end
end
# Build the test genome once, then exercise both benchmark variants on it.
genome = create_genome
count_frequency_using_chars(genome)
count_frequency_using_count(genome)
我发现，在基于 C 的 Ruby（1.8 和 1.9.2 均如此）中，使用 String#count(letter) 比用 Enumerable#group_by 和 Array#count 进行分组并计数快约 50 倍。这让我有些惊讶，因为 String#count 每次迭代要把字符串完整读取四遍，而后一种方法只读取一遍。
我尝试在 ruby-prof 和 perftools.rb 下运行这段代码，两者都只表明 String#chars 占用了约 90% 的时间，但都没有进一步细分这 90% 的时间具体花在了哪里。
如果非要猜测差异的来源，我会说创建 7000 万个单字符字符串的代价很高，但我如何才能验证这一点呢？（更新：String#chars 并不是罪魁祸首——请参阅 mainly_execute_a_trivial_block 的基准测试。）
编辑:使用1.9.2补丁级别180的当前基准:
require 'pp'
require 'benchmark'
# Test fixture: the string "gattaca" repeated 100 times (700 characters).
def create_genome
  (["gattaca"] * 100).join
end
# Number of benchmark iterations per measurement.
ZILLION = 100_000
# Frequency via String#count: four full scans of the string per iteration,
# one per nucleotide. Returns ZILLION (the value of Integer#times).
def count_frequency_using_count(sequence)
  ZILLION.times do
    %w[a c g t].map { |nucleotide| sequence.count(nucleotide) }
  end
end
# Frequency via chars/group_by: one pass over the string, but allocates a
# single-character string per position on every iteration.
def count_frequency_using_chars(sequence)
  ZILLION.times do
    sequence
      .chars
      .group_by { |ch| ch }
      .map { |ch, occurrences| [ch, occurrences.count] }
  end
end
# Frequency via Enumerable#inject, accumulating into a Hash whose default
# value is 0 so unseen characters start at zero.
def count_frequency_using_inject_hash(sequence)
  ZILLION.times do
    sequence.chars.inject(Hash.new(0)) do |counts, char|
      counts[char] += 1
      counts
    end
  end
end
# Frequency via Enumerable#each_with_object -- same accumulation as the
# inject variant but without re-threading the hash through the block.
def count_frequency_using_each_with_object(sequence)
  ZILLION.times do
    sequence.chars.each_with_object(Hash.new(0)) { |ch, counts| counts[ch] += 1 }
  end
end
# Isolates the cost of String#chars plus Enumerable#group_by alone,
# without the final count/map step.
def just_group_by(sequence)
  ZILLION.times do
    sequence.chars.group_by { |ch| ch }
  end
end
# Isolates the cost of String#chars by passing it a block that does nothing.
# NOTE(review): on the 1.9.x interpreters this post benchmarks, #chars with a
# block presumably iterates the characters (like #each_char) rather than
# building an array -- confirm against the Ruby version actually used.
def just_chars_and_trivial_block(sequence)
ZILLION.times do
sequence.chars() {}
end
end
# Baseline: invoke an empty block sequence.length times per iteration,
# measuring pure block-dispatch overhead.
def mainly_execute_a_trivial_block(sequence)
  ZILLION.times do
    sequence.length.times {}
  end
end
# Baseline: a bare counting loop with no block dispatch at all -- the
# cheapest possible way to "touch" each position of the sequence.
def execute_an_empty_loop_instead(sequence)
  ZILLION.times do
    index = 0
    limit = sequence.length
    while index != limit
      index += 1
    end
  end
end
# Build the shared test string and record which interpreter produced the numbers.
sequence = create_genome
puts RUBY_VERSION
# Run every variant under Benchmark.bm. The reports are unlabeled, so the
# comments below map each report to the method it measures (matching the
# annotated results table that follows).
Benchmark.bm do |benchmark|
  benchmark.report do
    count_frequency_using_count(sequence)           # String#count
  end
  benchmark.report do
    count_frequency_using_chars(sequence)           # chars + group_by + map
  end
  benchmark.report do
    count_frequency_using_inject_hash(sequence)     # chars + inject
  end
  benchmark.report do
    count_frequency_using_each_with_object(sequence) # chars + each_with_object
  end
  benchmark.report do
    just_group_by(sequence)                         # chars + group_by only
  end
  benchmark.report do
    just_chars_and_trivial_block(sequence)          # chars only
  end
  benchmark.report do
    mainly_execute_a_trivial_block(sequence)        # empty-block dispatch
  end
  benchmark.report do
    # BUG FIX: this called execute_an_empty_for_loop_instead, which is not
    # defined anywhere and raised NameError; the method defined above (and
    # named in the results table) is execute_an_empty_loop_instead.
    execute_an_empty_loop_instead(sequence)
  end
end
结果:
user system total real
0.510000 0.000000 0.510000 ( 0.508499) # count_frequency_using_count
23.470000 0.030000 23.500000 ( 23.575487) # count_frequency_using_chars
32.770000 0.050000 32.820000 ( 32.874634) # count_frequency_using_inject_hash
31.880000 0.040000 31.920000 ( 31.942437) # count_frequency_using_each_with_object
22.950000 0.030000 22.980000 ( 22.970500) # just_group_by
13.300000 0.020000 13.320000 ( 13.314466) # just_chars_and_trivial_block
5.660000 0.000000 5.660000 ( 5.661516) # mainly_execute_a_trivial_block
1.930000 0.010000 1.940000 ( 1.934861) # execute_an_empty_loop_instead
最佳答案 它与ruby内部无关.你正在比较苹果和橘子.
在你的第一个示例中，分组这一步被放进了循环，对 700 个字符重复分组了 100000 次之后才去查计数——问题出在逻辑上：真正需要重复测量的只是计数，而不是分组。而在第二种方法中，循环里只做计数。要公平比较，两种方法的循环体都应当只包含计数。
只需更改第一个示例
# Fixed version: the characters are grouped ONCE, outside the timing loop,
# so the loop body only re-derives the counts -- an apples-to-apples match
# for the String#count variant, which also does no per-iteration grouping.
# Returns 100000 (the value of Integer#times).
def count_frequency_using_chars(sequence)
  grouped = sequence.chars.group_by { |ch| ch }
  100000.times do
    grouped.map { |letter, occurrences| [letter, occurrences.count] }
  end
end
它和你的第二个一样快
编辑
这种方法比count_frequency_using_count快3倍,检查基准测试
# Groups the characters a single time up front; the timed loop only maps the
# pre-built groups to [letter, count] pairs. Returns 100000.
def count_frequency_using_chars_with_single_group(sequence)
  buckets = sequence.chars.group_by { |ch| ch }
  100000.times do
    buckets.map { |letter, members| [letter, members.count] }
  end
end
# Baseline: four String#count scans per iteration. Returns 100000.
def count_frequency_using_count(sequence)
  100000.times do
    %w[a c g t].map { |nucleotide| sequence.count(nucleotide) }
  end
end
# Compare the single-grouping rewrite against the String#count baseline.
Benchmark.bm do |bm|
  bm.report do
    pp count_frequency_using_chars_with_single_group(sequence)
  end
  bm.report do
    pp count_frequency_using_count(sequence)
  end
end
user system total real
0.410000 0.000000 0.410000 ( 0.419100)
1.330000 0.000000 1.330000 ( 1.324431)
针对 Andrew 给你的评论——"这样相当于把 100000 个序列的字符组成各测一次，而不是把同一个序列的字符组成测 100000 次"——即便如此，你的 count 方法仍然比 group_by 方法慢得多。正如你所说的，我也对长字符串做了基准测试：
# One 70000-character sequence ("gattaca" x 10000) ...
seq = "gattaca" * 10000
# ... replicated into an array of 10 entries for the multi-sequence benchmark.
arr_seq = Array.new(10) { seq }
并改变了处理多个序列的方法
# Multi-sequence version: each sequence's characters are grouped once, then
# the per-letter counts are recomputed 100000 times. Returns the input
# array (the value of Enumerable#each).
def count_frequency_using_chars_with_single_group(sequences)
  sequences.each do |seq|
    tallied = seq.chars.group_by { |ch| ch }
    100000.times do
      tallied.map { |letter, occurrences| [letter, occurrences.count] }
    end
  end
end
# Multi-sequence String#count version: four full scans per sequence per
# iteration. Returns the input array (the value of Enumerable#each).
#
# BUG FIX: the parameter was named `sequence` while the body iterated
# `sequences`, so every call raised NameError. The parameter is renamed to
# `sequences`, matching both the body and the sibling method above.
def count_frequency_using_count(sequences)
  sequences.each do |seq|
    100000.times do
      ["a", "c", "g", "t"].map { |letter| seq.count(letter) }
    end
  end
end
# Multi-sequence showdown: single-grouping rewrite vs. String#count baseline.
Benchmark.bm do |bm|
  bm.report do
    pp count_frequency_using_chars_with_single_group(arr_seq)
  end
  bm.report do
    pp count_frequency_using_count(arr_seq)
  end
end
对于处理100000次,10个序列,每个序列长度为70000
user system total real
3.710000 0.040000 3.750000 ( 23.452889) #modified group_by approach
1039.180000 6.920000 1046.100000 (1071.374730) #simple char count approach
对于大批量的长字符串，你的简单字符计数方法比修改后的 group_by 方法慢了两个数量级以上（见上面的数字）。我只用 10 个长度为 70000 的序列运行了上述基准测试；假设有 100 或 1000 个序列，简单计数法就绝不会是一个可行的选择了，对吧？