FNV Hash Algorithm

March 19, 2019 · Source: 哈希算法 (Hash Algorithms)

Reference: http://www.isthe.com/chongo/tech/comp/fnv/

For details of the FNV hash algorithm, see the reference. This post only records how FNV hash values are distributed.

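For any given string, the FNV hash algorithm computes a deterministic unsigned integer. For a large number of random input strings, such as UUID strings, the hash values reduced by a simple modulo are distributed almost uniformly. For example, compute the FNV hash of 100,000 UUID strings and apply hashValue %= 10,000 to each result: the values are spread almost uniformly over 0 ~ 9,999. Note, however, that this is only "almost" uniform; a small bias remains.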

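On the reference page, Landon describes in detail two ways to map a hash into a target range: the lazy mod mapping method, which simply takes the remainder, and the retry method.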

The lazy mod mapping method (example: 32-bit FNV-1, target range 0~2142779559) works as follows:


#define TRUE_HASH_SIZE ((u_int32_t)2142779560) /* range top plus 1 */
#define FNV1_32_INIT ((u_int32_t)2166136261)
u_int32_t hash;
void *data;
size_t data_len;
hash = fnv_32_buf(data, data_len, FNV1_32_INIT);
hash %= TRUE_HASH_SIZE;
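
The snippet above relies on fnv_32_buf() from the reference code. As a minimal self-contained sketch of the same idea, the following defines a stand-in 32-bit FNV-1 function; the names fnv1_32 and lazy_mod_bucket are assumptions of this example, not part of Landon's code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// 32-bit FNV-1: multiply by the FNV prime first, then XOR in the next byte.
// Hypothetical stand-in for fnv_32_buf() from the reference code.
std::uint32_t fnv1_32(const void* data, std::size_t len,
                      std::uint32_t hash = 2166136261u) {
    const unsigned char* p = static_cast<const unsigned char*>(data);
    for (std::size_t i = 0; i < len; ++i) {
        hash *= 16777619u;  // FNV_32_PRIME; unsigned overflow wraps mod 2^32
        hash ^= p[i];
    }
    return hash;
}

// Lazy mod mapping: reduce the hash into [0, range) with a single modulo.
std::uint32_t lazy_mod_bucket(const std::string& key, std::uint32_t range) {
    assert(range != 0);
    return fnv1_32(key.data(), key.size()) % range;
}
```

Calling lazy_mod_bucket(key, 10000) would place a key into one of 10,000 buckets, matching the hashValue %= 10,000 example earlier in the post.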

The retry method (example: 32-bit FNV-1, target range 0~49999):

#define TRUE_HASH_SIZE ((u_int32_t)50000) /* range top plus 1 */
#define FNV_32_PRIME ((u_int32_t)16777619)
#define FNV1_32_INIT ((u_int32_t)2166136261)
#define MAX_32BIT ((u_int32_t)0xffffffff) /* largest 32 bit unsigned value */
#define RETRY_LEVEL ((MAX_32BIT / TRUE_HASH_SIZE) * TRUE_HASH_SIZE)
u_int32_t hash;
void *data;
size_t data_len;
hash = fnv_32_buf(data, data_len, FNV1_32_INIT);
while (hash >= RETRY_LEVEL) {
    hash = (hash * FNV_32_PRIME) + FNV1_32_INIT;
}
hash %= TRUE_HASH_SIZE;
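
For experimenting without the reference library, the retry step can be isolated into a small function. This is only a sketch: fnv1_32, retry_reduce and retry_bucket are hypothetical names, and the FNV-1 loop is a stand-in for fnv_32_buf().

```cpp
#include <cassert>
#include <cstdint>
#include <string>

const std::uint32_t kFnvPrime = 16777619u;    // FNV_32_PRIME
const std::uint32_t kFnv1Init = 2166136261u;  // FNV1_32_INIT

// 32-bit FNV-1 over the bytes of a string (multiply, then XOR).
std::uint32_t fnv1_32(const std::string& s) {
    std::uint32_t hash = kFnv1Init;
    for (unsigned char byte : s) {
        hash *= kFnvPrime;
        hash ^= byte;
    }
    return hash;
}

// Retry step: hash values at or above the largest multiple of `range`
// are re-scrambled, so every residue is backed by the same number of
// accepted hash values before the final modulo.
std::uint32_t retry_reduce(std::uint32_t hash, std::uint32_t range) {
    assert(range != 0);
    const std::uint32_t retry_level = (UINT32_MAX / range) * range;
    while (hash >= retry_level) {
        hash = hash * kFnvPrime + kFnv1Init;
    }
    return hash % range;
}

std::uint32_t retry_bucket(const std::string& key, std::uint32_t range) {
    return retry_reduce(fnv1_32(key), range);
}
```

With range = 50000 the retry level is 4294950000, so only the top 17,295 hash values trigger a retry; every other hash goes straight to the modulo.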

According to Landon, in the 32-bit case, taking a target range of 0~999999 as an example:

The values 0 through 967295 will be created by 4295 different 32-bit FNV hash values whereas the values 967296 through 999999 will be created by only 4294 different 32-bit FNV hash values. In other words, the values 0 through 967295 will occur ~1.0002328 times as often as the values 967296 through 999999.

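In other words, values in 0~967295 occur slightly more often (by about 0.023%) than values in 967296~999999, because 2^32 = 4294 × 1,000,000 + 967,296 is not a multiple of the range size.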
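
The counts in the quote can be reproduced with integer division. The sketch below is an illustration only; the ModBias struct and mod_bias_32 function are names invented for this example.

```cpp
#include <cassert>
#include <cstdint>

struct ModBias {
    std::uint64_t low_count;   // hash values per residue in [0, boundary)
    std::uint64_t high_count;  // hash values per residue in [boundary, range)
    std::uint64_t boundary;    // first residue that gets the smaller count
};

// How many of the 2^32 possible hash values land on each residue
// when a 32-bit hash is reduced with `hash % range`.
ModBias mod_bias_32(std::uint32_t range) {
    assert(range != 0);
    const std::uint64_t space = 1ULL << 32;
    const std::uint64_t q = space / range;
    const std::uint64_t r = space % range;
    // Residues below r receive one extra hash value; if r == 0 there is no bias.
    return ModBias{q + 1, q, r};
}
```

For range = 1000000 this gives 4295 hash values for each residue below 967296 and 4294 for the rest, a ratio of 4295/4294 ≈ 1.0002328.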

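In the 64-bit case, taking a target range of 0~9999999999999999999 (10^19 values) as an example:

The values 0 through 8446744073709551615 will be created by 2 different 64-bit FNV hash values whereas the values 8446744073709551616 through 9999999999999999999 will be created by only 1 64-bit FNV hash value.

Here the distribution is far more uneven: the lower values occur twice as often as the upper ones.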
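
The same bookkeeping works for 64-bit hashes, with one wrinkle: 2^64 itself does not fit in a 64-bit integer, so the sketch below derives the quotient and remainder from UINT64_MAX instead. The names ModBias64 and mod_bias_64 are assumptions of this example.

```cpp
#include <cassert>
#include <cstdint>

struct ModBias64 {
    std::uint64_t low_count;   // hash values per residue in [0, boundary)
    std::uint64_t high_count;  // hash values per residue in [boundary, range)
    std::uint64_t boundary;    // first residue that gets the smaller count
};

// Counts for `hash % range` over all 2^64 hash values, computed without
// ever forming 2^64: 2^64 = UINT64_MAX + 1.
ModBias64 mod_bias_64(std::uint64_t range) {
    assert(range != 0);
    std::uint64_t q = UINT64_MAX / range;
    std::uint64_t r = UINT64_MAX % range + 1;  // at most `range`, cannot overflow
    if (r == range) {  // 2^64 is an exact multiple of range: no bias
        q += 1;
        r = 0;
    }
    return ModBias64{q + 1, q, r};
}
```

For range = 10^19 the boundary is 8446744073709551616: residues below it are backed by 2 hash values each, the remaining residues by only 1.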

At the same time, Landon also notes:

NOTE: This bias issue may not be of concern to you, but we thought we should point out this issue just in case you care. Many applications should / will not care about this bias. Most applications can use the lazy mod mapping method without any problems. Your application, may vary however.

NOTE: One may substitute the FNV-1a hash for the FNV-1 hash in any of the lazy mod mapping method examples. Some people believe that FNV-1a lazy mod mapping method gives then slightly better dispersion without any impact on CPU performance. See the FNV-1a hash description for more information.

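That is, this slight non-uniformity does not matter for most applications. Landon also notes that some people believe the FNV-1a lazy mod mapping method gives slightly better dispersion than FNV-1, at no extra CPU cost.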
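
FNV-1 and FNV-1a differ only in the order of the two operations inside the loop, which is why swapping one for the other costs nothing. A minimal side-by-side sketch (function names are this example's own), checked against the hash values for the one-byte string "a":

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// FNV-1: multiply by the prime, then XOR in the byte.
std::uint32_t fnv1_32(const std::string& s) {
    std::uint32_t hash = 2166136261u;
    for (unsigned char byte : s) {
        hash *= 16777619u;
        hash ^= byte;
    }
    return hash;
}

// FNV-1a: XOR in the byte, then multiply by the prime.
std::uint32_t fnv1a_32(const std::string& s) {
    std::uint32_t hash = 2166136261u;
    for (unsigned char byte : s) {
        hash ^= byte;
        hash *= 16777619u;
    }
    return hash;
}

// Sanity check: same constants, different order, different results.
void check_known_values() {
    assert(fnv1_32("a") == 0x050c5d7eu);
    assert(fnv1a_32("a") == 0xe40c292cu);
}
```

Either function can be dropped into the lazy mod mapping by applying % range to its result.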

================================================

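Sample code for the 32-bit case follows. Note that its loop XORs each byte before multiplying, which is the FNV-1a order rather than FNV-1.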

(libuuid must be installed first, e.g. yum install libuuid-devel.x86_64, and the program must be linked with -luuid.)


#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <string>
#include <uuid/uuid.h>
using namespace std;

typedef uint32_t DWORD;  // must be exactly 32 bits for 32-bit FNV
const int range = 8;

// 32-bit FNV hash in the FNV-1a order: XOR in each byte, then multiply.
DWORD Hash4Bytes(const string &key)
{
    const char *first = key.c_str();
    DWORD length = key.size();
    DWORD result = 2166136261u;  // FNV offset basis
    for (; length > 0; --length) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        result ^= static_cast<unsigned char>(*first++);
        result *= 16777619u;  // FNV prime
    }
    return result;
}

// Map a key to a bucket index in [0, range) with a plain modulo.
int Disperse(const string &key)
{
    DWORD hash = Hash4Bytes(key);
    int index = hash % range;
    return index;
}

int main(int argc, char *argv[]) {
    long key_count = 10000;
    if (argc > 1) {
        key_count = atol(argv[1]);
    }
    uuid_t uuid;
    char str[37];  // uuid_unparse writes 36 characters plus a trailing NUL
    long stat[range];
    for (int i = 0; i < range; ++i) {
        stat[i] = 0;
    }
    for (long i = 0; i < key_count; ++i) {
        uuid_generate(uuid);
        uuid_unparse(uuid, str);
        stat[Disperse(str)]++;
    }
    cout << "Range: 0 ~ " << range - 1 << endl;
    cout << "Key count: " << key_count << endl;
    for (int i = 0; i < range; ++i) {
        cout << "Index #" << i << ": " << stat[i] << endl;
    }
    cout << endl;
    return 0;
}

Output:

[root@amons02 fnv]# ./t 1000000
Range: 0 ~ 7
Key count: 1000000
Index #0: 124995
Index #1: 125483
Index #2: 124735
Index #3: 124692
Index #4: 124920
Index #5: 124956
Index #6: 124912
Index #7: 125307

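For the code above, running ./t 1000000 hashes 1,000,000 random UUID strings into the 8 buckets 0~7. Each bucket would ideally hold 125,000 keys, and every observed count is within about 0.4% of that, so the distribution is acceptable.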
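
One way to put a number on "acceptable" is Pearson's chi-square statistic against a uniform expectation. This is an addition for illustration, not something the original test program computes, and chi_square_uniform is a hypothetical helper name.

```cpp
#include <cassert>
#include <vector>

// Pearson chi-square statistic of observed bucket counts against a
// uniform expectation (total / number of buckets).
double chi_square_uniform(const std::vector<long>& counts) {
    assert(!counts.empty());
    long total = 0;
    for (long c : counts) {
        total += c;
    }
    assert(total > 0);
    const double expected = static_cast<double>(total) / counts.size();
    double chi2 = 0.0;
    for (long c : counts) {
        const double diff = c - expected;
        chi2 += diff * diff / expected;
    }
    return chi2;
}
```

Fed the eight counts from the run above, the statistic comes out at roughly 4.07. With 7 degrees of freedom the 5% critical value is about 14.07, so this particular run gives no evidence against uniformity.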

Note:

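The return type of Hash4Bytes() in the code above must be a 32-bit unsigned integer, and so must the result variable inside it, because this is the 32-bit variant of FNV. Do not use std::size_t: on typical 64-bit platforms sizeof(std::size_t) is 8!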
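
The effect of the accumulator width can be checked directly. In the sketch below (the template and function names are assumptions of this example), the same loop runs with a 32-bit and a 64-bit accumulator. The 64-bit value differs from the 32-bit hash, although its low 32 bits still agree, so the damage is done when the wide value is fed into the modulo without truncation.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Same XOR-then-multiply loop as Hash4Bytes(), with the accumulator
// type left as a template parameter.
template <typename Acc>
Acc fnv1a_with_accumulator(const std::string& key) {
    Acc result = 2166136261u;
    for (unsigned char byte : key) {
        result ^= byte;
        result *= 16777619u;  // wraps mod 2^32 or mod 2^64 depending on Acc
    }
    return result;
}

void check_width_matters() {
    const std::string key = "a";
    const std::uint32_t narrow = fnv1a_with_accumulator<std::uint32_t>(key);
    const std::uint64_t wide = fnv1a_with_accumulator<std::uint64_t>(key);
    assert(narrow == 0xe40c292cu);                       // 32-bit FNV-1a of "a"
    assert(wide != narrow);                              // upper bits are set
    assert(static_cast<std::uint32_t>(wide) == narrow);  // low 32 bits agree
}
```

Declaring the type as a fixed-width uint32_t, rather than unsigned int or std::size_t, keeps the width from depending on the platform.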

    Original author: 哈希算法 (Hash Algorithms)
    Original URL: https://blog.csdn.net/abccheng/article/details/72650581