Understanding the [NSString hash] Method in Objective-C

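Copyright notice: this article originally appeared on Jianshu, written by 九昍. Reposting is welcome; please be sure to credit the source: http://www.jianshu.com/p/92d83bd10821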

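Our iOS SDK recently got a new requirement: when something goes wrong inside the SDK, it should report an error log to the server. These logs may contain duplicates, and uploading all of them would waste traffic, so we wanted to deduplicate them with a hash. As it happens, NSString has this property: @property (readonly) NSUInteger hash;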

The official documentation describes it as follows:

Returns an integer that can be used as a table address in a hash table structure.
If two objects are equal (as determined by the isEqual: method), they must have the same hash value. This last point is particularly important if you define hash in a subclass and intend to put instances of that subclass into a collection.
If a mutable object is added to a collection that uses hash values to determine the object’s position in the collection, the value returned by the hash method of the object must not change while the object is in the collection. Therefore, either the hash method must not rely on any of the object’s internal state information or you must make sure the object’s internal state information does not change while the object is in the collection. Thus, for example, a mutable dictionary can be put in a hash table but you must not change it while it is in there. (Note that it can be difficult to know whether or not a given object is in a collection.)

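This explanation covers roughly three points. The hash method returns an integer that can be used as an object's address in a hash table, and it follows the same contract as hashCode in Java:

  1. If two objects are equal, their hash values must be equal.
  2. If two objects have the same hash value, they are not necessarily equal.
  3. While an object is in a collection that uses hash values to position its elements, the object's hash value must not change.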

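Testing shows that if an NSMutableString is added to an NSHashTable and then modified, its hash value changes as well. So, per the documentation above, for system objects like this it is up to us to make sure the hash value does not change once the object has been added to such a collection.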
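Why a changing hash is a problem can be illustrated with a toy hash set. The C sketch below is not how NSHashTable is implemented: `TinySet`, `content_hash`, and the 8-bucket layout with no collision handling are all assumptions made up for this illustration. It only shows the general failure mode, where an element is stored under its old hash and then looked up under its new one.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define BUCKETS 8  /* hypothetical tiny table: one slot per bucket, no collision handling */

/* Content-based hash in the spirit of result = result * 257 + c. */
static size_t content_hash(const char *s) {
    size_t h = strlen(s);
    for (; *s; s++) h = h * 257 + (unsigned char)*s;
    return h;
}

typedef struct { const char *slots[BUCKETS]; } TinySet;

/* Store the pointer in the bucket chosen by the hash at insertion time. */
static void tiny_set_add(TinySet *set, const char *s) {
    set->slots[content_hash(s) % BUCKETS] = s;
}

/* Look in the bucket chosen by the hash at lookup time. */
static bool tiny_set_contains(const TinySet *set, const char *s) {
    const char *found = set->slots[content_hash(s) % BUCKETS];
    return found != NULL && strcmp(found, s) == 0;
}

/* Returns true if the key is still found after being mutated in place. */
static bool found_after_mutation(void) {
    TinySet set = {{0}};
    char key[8] = "a";
    tiny_set_add(&set, key);   /* stored under the hash of "a" */
    key[0] = 'b';              /* same object, new content, new hash */
    return tiny_set_contains(&set, key);
}
```

In this sketch `found_after_mutation()` returns false: the element is still in the table, but the lookup goes to the bucket for "b" while the pointer sits in the bucket for "a".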

So can this hash method actually be relied on? The documentation is too vague to tell, so I searched online and found the following explanation:

At least there are special circumstances for which this unreliability kicks in.
Comparing [a hash] and [b hash] of two different NSString is safe when:
the strings' length is shorter or equal to 96 characters.
[a length] is different to [b length].
the concatinated first, middle, and last 32 characters of a differ to the concatinated components of b.
Otherwise every difference between the first and the middle 32 chars, as well as every difference between the middle and the last 32 characters, are not used while producing the [NSString hash] value.

In short, [NSString hash] is safe for strings of at most 96 characters. For strings longer than 96 characters, the probability of collisions rises sharply.
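To make "first, middle, and last 32 characters" concrete, here is a small C sketch that reports whether the character at a given index contributes to the hash. The function name `index_is_hashed` is hypothetical, and the window offsets are taken from the CFString.c source quoted later in this post, so treat it as an illustration rather than an API.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if the character at index i would take part in the hash
   of a string with len characters (assumed offsets, see lead-in). */
static bool index_is_hashed(size_t len, size_t i) {
    if (i >= len) return false;         /* out of range */
    if (len <= 96) return true;         /* short strings: every character counts */
    size_t mid = (len >> 1) - 16;       /* start of the middle 32-character window */
    return i < 32                       /* first 32 characters */
        || (i >= mid && i < mid + 32)   /* middle 32 characters */
        || i >= len - 32;               /* last 32 characters */
}
```

For a 200-character string, for example, this gives the index ranges 0 to 31, 84 to 115, and 168 to 199; the other 104 positions are ignored.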

Why do only these 96 characters count? The source code gives the answer.

In CFString.c there is the following function:
CFHashCode __CFStringHash(CFTypeRef cf) {
    /* !!! We do not need an IsString assertion here, as this is called by the CFBase runtime only */
    CFStringRef str = (CFStringRef)cf;
    const uint8_t *contents = (uint8_t *)__CFStrContents(str);
    CFIndex len = __CFStrLength2(str, contents);

    if (__CFStrIsEightBit(str)) {
        contents += __CFStrSkipAnyLengthByte(str);
        return __CFStrHashEightBit(contents, len);
    } else {
        // Otherwise the string is stored as Unicode (UniChar) characters
        return __CFStrHashCharacters((const UniChar *)contents, len, len);
    }
}

__CFStrHashEightBit works in essentially the same way as __CFStrHashCharacters, so we only need to look at one of them:

#define HashEverythingLimit 96

#define HashNextFourUniChars(accessStart, accessEnd, pointer) \
    {result = result * 67503105 + (accessStart 0 accessEnd) * 16974593  + (accessStart 1 accessEnd) * 66049  + (accessStart 2 accessEnd) * 257 + (accessStart 3 accessEnd); pointer += 4;}

#define HashNextUniChar(accessStart, accessEnd, pointer) \
    {result = result * 257 + (accessStart 0 accessEnd); pointer++;}


CF_INLINE CFHashCode __CFStrHashCharacters(const UniChar *uContents, CFIndex len, CFIndex actualLen) {
    CFHashCode result = actualLen;
    // HashEverythingLimit is 96 here
    if (len <= HashEverythingLimit) {
        // Up to 96 characters: every character goes into the hash

        const UniChar *end4 = uContents + (len & ~3);
        const UniChar *end = uContents + len;
        while (uContents < end4) HashNextFourUniChars(uContents[, ], uContents);    // First count in fours
        while (uContents < end) HashNextUniChar(uContents[, ], uContents);      // Then for the last <4 chars, count in ones...
    } else {
        // More than 96 characters: only three 32-character windows are hashed

        const UniChar *contents, *end;
        // Hash the first 32 characters
        contents = uContents;
        end = contents + 32;
        while (contents < end) HashNextFourUniChars(contents[, ], contents);
        // Hash the middle 32 characters
        contents = uContents + (len >> 1) - 16;
        end = contents + 32;
        while (contents < end) HashNextFourUniChars(contents[, ], contents);
        // Hash the last 32 characters
        end = uContents + len;
        contents = end - 32;
        while (contents < end) HashNextFourUniChars(contents[, ], contents);
    }
    return result + (result << (actualLen & 31));
}
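
The constants in HashNextFourUniChars look arbitrary, but 66049 is 257^2, 16974593 is 257^3, and 67503105 is 257^4 reduced modulo 2^32. The sketch below checks this with 32-bit arithmetic, which is an assumption of the sketch: with a wider CFHashCode the two forms agree only in their low 32 bits. The function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* One-character step, same formula as HashNextUniChar. */
static uint32_t step_one(uint32_t result, uint16_t c) {
    return result * 257u + (uint32_t)c;
}

/* Four-character step, same constants as HashNextFourUniChars. */
static uint32_t step_four(uint32_t result, const uint16_t c[4]) {
    return result * 67503105u
         + (uint32_t)c[0] * 16974593u
         + (uint32_t)c[1] * 66049u
         + (uint32_t)c[2] * 257u
         + (uint32_t)c[3];
}

/* Returns true if one four-character step equals four one-character steps
   (modulo 2^32) for a sample block of characters. */
static bool steps_agree(uint32_t seed) {
    const uint16_t c[4] = {0x4e2d, 0x6587, 'a', 'b'};
    uint32_t one_by_one = seed;
    for (int i = 0; i < 4; i++) one_by_one = step_one(one_by_one, c[i]);
    return step_four(seed, c) == one_by_one;
}
```

So, at least in 32-bit arithmetic, the four-at-a-time macro is an unrolled form of `result = result * 257 + c` applied four times.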

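So for [NSString hash], once the length exceeds 96, only the first, middle, and last 32 characters take part in the hash computation. In other words, as long as those characters stay the same, changing a character at any other position leaves the hash value unchanged. Let's verify this.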

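The test result (a screenshot in the original post) confirms it: changing any of the 96 sampled characters changes the hash, while changing a character outside them leaves the hash value unchanged.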
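
The same experiment can be approximated outside Foundation with a simplified port of __CFStrHashCharacters to plain C. This is a sketch under assumptions: it uses `uint64_t` and `uint16_t` in place of CFHashCode and UniChar, the names `sketch_hash` and `change_affects_hash` are made up, and it is not guaranteed to produce the same numeric values as [NSString hash]. It only demonstrates which positions influence the result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HASH_EVERYTHING_LIMIT 96

/* Four-at-a-time step, same constants as HashNextFourUniChars. */
static const uint16_t *hash_four(uint64_t *result, const uint16_t *p) {
    *result = *result * 67503105u + (uint64_t)p[0] * 16974593u
            + (uint64_t)p[1] * 66049u + (uint64_t)p[2] * 257u + (uint64_t)p[3];
    return p + 4;
}

/* Simplified port of __CFStrHashCharacters (see the assumptions in the lead-in). */
static uint64_t sketch_hash(const uint16_t *s, size_t len) {
    uint64_t result = len;
    if (len <= HASH_EVERYTHING_LIMIT) {
        const uint16_t *end4 = s + (len & ~(size_t)3);
        const uint16_t *end = s + len;
        while (s < end4) s = hash_four(&result, s);     /* count in fours */
        while (s < end) result = result * 257u + *s++;  /* then the last <4 chars */
    } else {
        const uint16_t *p, *end;
        p = s; end = p + 32;                            /* first 32 characters */
        while (p < end) p = hash_four(&result, p);
        p = s + (len >> 1) - 16; end = p + 32;          /* middle 32 characters */
        while (p < end) p = hash_four(&result, p);
        end = s + len; p = end - 32;                    /* last 32 characters */
        while (p < end) p = hash_four(&result, p);
    }
    return result + (result << (len & 31));
}

/* Returns 1 if replacing the character at index i changes the hash of a
   string made of len copies of 'a' (len is limited to 256 here). */
static int change_affects_hash(size_t len, size_t i) {
    uint16_t buf[256];
    assert(len <= 256 && i < len);
    for (size_t k = 0; k < len; k++) buf[k] = 'a';
    uint64_t before = sketch_hash(buf, len);
    buf[i] = 'b';
    return sketch_hash(buf, len) != before;
}
```

With a 200-character string, `change_affects_hash(200, 10)` reports a change because index 10 is inside the first window, while `change_affects_hash(200, 40)` reports none because index 40 falls between the first and middle windows.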

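Back to the original requirement: we only need simple deduplication of the data. Looking at the data we upload, the first, middle, and last 32 characters (96 in total) generally differ between entries, and even when they are the same, the worst outcome is that some duplicate data gets uploaded. So [NSString hash] is good enough for our deduplication needs.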
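
One possible shape for such a dedup step is sketched below in C. Everything here is an assumption for illustration, not the SDK's actual code: `SeenLogs`, `should_upload`, the capacity of 64, and the stand-in `log_hash` (an iOS SDK would use the NSString hash value instead). The sketch compares the full message whenever two hashes match, so a collision does not cause a distinct log to be skipped.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_SEEN 64  /* hypothetical capacity of the in-memory record */

typedef struct {
    uint64_t hashes[MAX_SEEN];
    const char *messages[MAX_SEEN];  /* caller must keep these strings alive */
    size_t count;
} SeenLogs;

/* Stand-in hash (result * 257 + c); not the real NSString hash. */
static uint64_t log_hash(const char *s) {
    uint64_t h = strlen(s);
    for (; *s; s++) h = h * 257u + (unsigned char)*s;
    return h;
}

/* Returns true if msg has not been seen before and should be uploaded.
   The cheap hash comparison filters most candidates; strcmp confirms a match. */
static bool should_upload(SeenLogs *seen, const char *msg) {
    uint64_t h = log_hash(msg);
    for (size_t i = 0; i < seen->count; i++) {
        if (seen->hashes[i] == h && strcmp(seen->messages[i], msg) == 0) {
            return false;  /* duplicate: skip the upload */
        }
    }
    if (seen->count < MAX_SEEN) {  /* when full, new logs are uploaded but not recorded */
        seen->hashes[seen->count] = h;
        seen->messages[seen->count] = msg;
        seen->count++;
    }
    return true;
}
```

Once the record is full, this sketch errs on the side of uploading, which matches the tolerance for occasional duplicate uploads described above.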
