php – 理解一致的哈希

2023年5月22日 245次阅读

在过去的几天里,我一直在研究
PHP的一致哈希算法.我希望更好地理解一致的哈希实际上是如何工作的,所以我可以在一些即将到来的项目中使用它.在我看来,
Flexihash实际上是唯一一个易于理解的纯PHP实现,所以我从中得到了一些注释.

我已经创建了一个自己的小算法来尝试理解它是如何工作的,以及如何让它尽可能快地工作.我很惊讶我的算法与Flexihash相比有多快,这让我想知道我的实现是否存在某些方面的缺陷,或者我可能没有抓住整个概念的关键部分.

在100万个连续键(0到1,000,000)的迭代中,下面列出了速度差异.将显示每个节点以显示实际散列到该特定节点的密钥数.

Flexihash:
 Time: 269 seconds
Speed: 3714 hashes/sec

 node1: 80729
 node2: 88390
 node3: 103623
 node4: 112338
 node5: 79893
 node6: 85377
 node7: 80966
 node8: 134462
 node9: 117046
node10: 117176

My implementation:
 Time: 3 seconds
Speed: 265589 hashes/sec

 node1: 100152
 node2: 100018
 node3: 99884
 node4: 99889
 node5: 100057
 node6: 100030
 node7: 99948
 node8: 99918
 node9: 100011
node10: 100093

这是我当前实现的散列算法.

class Base_Hasher_Consistent
{
    protected $_nodes;

    public function __construct($nodes=NULL)
    {
        $this->_nodes = array();

        $node_count = count($nodes);
        $hashes_per_node = (int)(PHP_INT_MAX / $node_count);

        $old_hash_count = 0;

        foreach($nodes as $node){
            if(!($node == $nodes[$node_count-1])){
                $this->_nodes[] = array(
                    'name' => $node,
                    'begin' => $old_hash_count,
                    'end' => $old_hash_count + $hashes_per_node - 1
                );

                $old_hash_count += $hashes_per_node;
            }else{
                $this->_nodes[] = array(
                    'name' => $node,
                    'begin' => $old_hash_count,
                    'end' => PHP_INT_MAX
                );
            }
        }
    }

    public function hashToNode($data_key,$count=0)
    {
        $hash_code = (int)abs(crc32($data_key));

        foreach($this->_nodes as $node){
            if($hash_code >= $node['begin']){
                if($hash_code <= $node['end']){
                    return($node['name']);
                }
            }
        }
    }
}

我错过了什么,或者算法真的比Flexihash快吗？另外,我知道Flexihash支持查找多个节点,因此我不确定它是否与它有任何关系.

我想要一些保证,我理解一致的哈希是如何工作的,或者可能链接到一些真正解释它的文章.

谢谢！

最佳答案那么你想知道crc32如何工作,或者你只是想知道如何编写一个好的“桶”实现？

你的桶实现看起来很好.如果你刚刚做了,你可以更快地做到：

$bucket_index = floor($hash_code / $hashes_per_node);
return $this->_nodes[$bucket_index]['name'];

您编写的’算法’只会使$节点数量的桶,并计算出$data_key应该在哪些桶中.你正在使用的实际哈希算法是crc32,如果你正在做桶,它可能不是一个理想的算法.

如果你想知道crc32是如何工作的以及为什么哈希是一致的..在维基百科上查找它.据我所知,没有不一致的哈希,所以所有哈希算法都是定义一致的.

散列算法的一个特征是它可以生成与类似的data_keys非常不同的散列.这对于crc32来说是正确的,因为crc32旨在检测微小的变化.它不能保证的是,由此产生的哈希值“传播良好”.由于您正在执行存储桶,因此您需要一种散列算法,该算法具有在整个频谱范围内生成散列的特定属性.对于crc32,它可能只是巧合地工作.我只是用md5.