Python字典实现

2019年4月21日 270次阅读来源: Baloneo

这篇文章阐述了Python2语言中的字典的实现。

Original: https://www.laurentluce.com/posts/python-dictionary-implementation/

字典使用键作为下标，可以把字典看作是关联数组。添加3对键值对到一个字典如下：

>>> d = {'a': 1, 'b': 2}
>>> d['c'] = 3
>>> d
{'a': 1, 'b': 2, 'c': 3}

值可以使用以下的方法获取到：

>>> d['a']
1
>>> d['b']
2
>>> d['c']
3
>>> d['d']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'd'

键d不存在所以抛出了一个KeyError异常。

Hash表

Python的字典实现使用的是hash表，它是一个数组，他的索引是通过键的hash函数获得的。hash函数的目标是为了键均衡的分配，在整个数组中。一个好的hash函数的冲突（collisions）数量应该是最小的，比如：不同的键有相同的hash值。Python没有这类的hash函数。在通常情况下，最重要的hash函数(字符串, 整数)是非常有规则的。

>>> map(hash, (0, 1, 2, 3))
[0, 1, 2, 3]
>>> map(hash, ("namea", "nameb", "namec", "named"))
[-1658398457, -1658398460, -1658398459, -1658398462]

在之后的篇幅中，我们现在假设我们使用的key都是string类型的。string的hash函数在python定义看起来像这样的:

arguments: string object
returns: hash
function string_hash:
     if hash cached:
         return it
     set len to string's length
     initialize var p pointing to 1st char of string object
     set x to value pointed by p left shifted by 7 bits
     while len >= 0:
         set var x to (1000003 * x) xor value pointed by p
         increment pointer p
     set x to x xor length of string object
     cache x as the hash so we don't need to calculate it again
     return x as the hash

真实的C实现是这样的：


static long
string_hash(PyStringObject *a)
{
    register Py_ssize_t len;
    register unsigned char *p;
    register long x;

#ifdef Py_DEBUG
    assert(_Py_HashSecret_Initialized);
#endif
    if (a->ob_shash != -1)
        return a->ob_shash;
    len = Py_SIZE(a);  // 设置len的长度为传进来的字符串的长度
    /*
      We make the hash of the empty string be 0, rather than using
      (prefix ^ suffix), since this slightly obfuscates the hash secret
    */
    if (len == 0) { // 长度为0 return
        a->ob_shash = 0;
        return 0;
    }
    p = (unsigned char *) a->ob_sval;  // 设置p指向a的实际内容(字符串)
    x = _Py_HashSecret.prefix;
    x ^= *p << 7;  // 进行异或 左移操作 赋值给x
    while (--len >= 0)
        x = (1000003*x) ^ *p++;
    x ^= Py_SIZE(a);
    x ^= _Py_HashSecret.suffix;
    if (x == -1)
        x = -2;
    a->ob_shash = x;
    return x;
}

以hash(‘a’)为例，它的hash值为12416037344

97 << 7 = 12416
x = (1000003 * 12416) ^ 97 = 12416037345
x ^1 = 12416037344

如果数组的大小为x，用来存储键值对的话，我们使用一个mask值，mask=x-1,使用mask来计算对应槽索引来获得值。这让计算槽索引变得很快，改变数组大小这种机制可以提高找到槽的概率。意味着简单的计算是有意义的。如果数组的大小为8,hash(‘a’) & 7 = 0, ‘b’的index变为3,‘c’的index变为2,但是‘z’也是3,所以这里发生了冲突。

《Python字典实现》 hash.png

我们可以看到python的hash函数在连续的key中是工作的非常好，因为经常有这样类型的数据。但是，一但添加了z这个key，就出现了冲突，因为z并不连续。
我们可以使用链表来存储相同的hash值解决冲突，但是这会增加查找实现，不再是平均Ｏ（1）,下一部分会解释在Ｐython中是如何解决这样的冲突的。

开放寻址法(Open addressing)

Open addressing 是一种通过探测来解决冲突的一种方法。在z 这个key的情况下，槽的索引3已经被使用了，所以我们需要探测出一个不同的索引并且这个索引没有其它数据，添加键值对它的平均时间复杂度也是O(1)。

二次探测用于查找一个空闲的槽，代码如下：

j = (5*j) + 1 + perturb;
perturb >>= PERTURB_SHIFT;
use j % 2**i as the next table index;

在CPython字典实现中，lookdict会返回一个在字典中可以被插入的位置的指针。


static PyDictEntry *
lookdict(PyDictObject *mp, PyObject *key, register long hash)
{
    register size_t i;
    register size_t perturb;
    register PyDictEntry *freeslot;
    register size_t mask = (size_t)mp->ma_mask;
    PyDictEntry *ep0 = mp->ma_table; // ma_table初始化为一个size为8的PyDictEntry数组
    register PyDictEntry *ep;
    register int cmp;
    PyObject *startkey;

    i = (size_t)hash & mask;// mask默认为数组size-1 8-1=7
    ep = &ep0[i];  // 探测第0个 键值对所在的位置 ep=dict[0]
    if (ep->me_key == NULL || ep->me_key == key) // 第0个entry的key为null或者key已经设置过了(判断内存相等) active状态,key =int(12345) 当想更新这个的时候两个key 12345的内存是不等的 会走到下面的比较hash的逻辑
        return ep;

    if (ep->me_key == dummy) // dummy状态(这个entry以前被用过) 
        freeslot = ep;
    else {
        if (ep->me_hash == hash) { // active状态的检测 entry没有被使用过
            startkey = ep->me_key;
            Py_INCREF(startkey);
            cmp = PyObject_RichCompareBool(startkey, key, Py_EQ);
            Py_DECREF(startkey);
            if (cmp < 0)
                return NULL;
            if (ep0 == mp->ma_table && ep->me_key == startkey) {
                if (cmp > 0)
                    return ep;
            }
            else {
                /* The compare did major nasty stuff to the
                 * dict:  start over.
                 * XXX A clever adversary could prevent this
                 * XXX from terminating.
                 */
                return lookdict(mp, key, hash);
            }
        }
        freeslot = NULL;
    }
// 健被使用过 这种情况是最不可能发生的 所以放在最后， 寻找下一个可用的键值对entry(可用的内存空间)
    /* In the loop, me_key == dummy is by far (factor of 100s) the
       least likely outcome, so test for that last.  PERTURB_SHIFT=5 */
    for (perturb = hash; ; perturb >>= PERTURB_SHIFT) {
        i = (i << 2) + i + perturb + 1;
        ep = &ep0[i & mask];
        if (ep->me_key == NULL)
            return freeslot == NULL ? ep : freeslot;
        if (ep->me_key == key)
            return ep;
        if (ep->me_hash == hash && ep->me_key != dummy) {
            startkey = ep->me_key;
            Py_INCREF(startkey);
            cmp = PyObject_RichCompareBool(startkey, key, Py_EQ);
            Py_DECREF(startkey);
            if (cmp < 0)
                return NULL;
            if (ep0 == mp->ma_table && ep->me_key == startkey) {
                if (cmp > 0)
                    return ep;
            }
            else {
                /* The compare did major nasty stuff to the
                 * dict:  start over.
                 * XXX A clever adversary could prevent this
                 * XXX from terminating.
                 */
                return lookdict(mp, key, hash);
            }
        }
        else if (ep->me_key == dummy && freeslot == NULL)
            freeslot = ep;
    }
    assert(0);          /* NOT REACHED */
    return 0;
}

上面的逻辑如下:

1. 根据hash，key获取到探测链上的第一个entry/slot
1. 要查找的key对应的entry没有被使用过，返回这个没有使用的entry。
1. 要查找的key对应的entryentry使用者，并且内存地址一样，返回。用于更新键值对
1. 当entry是遗弃状态，设置一个freeslot
1. 要查找的key对应的entry的hash值是否一样，key内容是否一样，是一样的那么返回entry。
1. 当key和第一次探测的不一样的时候，发生冲突。说明要查找的key对应的entry已经被占用了，就会走后面的逻辑。进行循环探测。

一个PyDictObject存放着字典的信息，以及每一个键值对slot/entry。
现在，来看看Python内部的代码连同一个例子。

字典的C结构

在C代码实现中，结构体PyDictEntry使用来存放字典entry的，entry就是键值对。hash，key，value都会被保存。PyObject是Python的基类，所以Python的字典能存储任何对象。

typedef struct {
    /* Cached hash code of me_key.  Note that hash codes are C longs.
     * We have to use Py_ssize_t instead because dict_popitem() abuses
     * me_hash to hold a search finger.
     */
    Py_ssize_t me_hash;
    PyObject *me_key;
    PyObject *me_value;
} PyDictEntry;

下面的结构体代表一个python的字典。ma_fill 是已经使用(active)了的槽和dummy槽的和。一个槽被设置为dummy态当键值对被一处的时候。ma_used是active槽的数量，ma_mask等于hash表的大小-1,主要用来做掩码操作。ma_tabel指向的内存空间为hash表的空间，初始化时会指向一个长度为8的数组(提高效率)。

typedef struct _dictobject PyDictObject;
struct _dictobject {
    PyObject_HEAD
    Py_ssize_t ma_fill;  /* # Active + # Dummy */
    Py_ssize_t ma_used;  /* # Active */

    /* The table contains ma_mask + 1 slots, and that's a power of 2.
     * We store the mask instead of the size because the mask is more
     * frequently needed.
     */
    Py_ssize_t ma_mask;

    /* ma_table points to ma_smalltable for small tables, else to
     * additional malloc'ed memory.  ma_table is never NULL!  This rule
     * saves repeated runtime null-tests in the workhorse getitem and
     * setitem calls.
     */
    PyDictEntry *ma_table;
    PyDictEntry *(*ma_lookup)(PyDictObject *mp, PyObject *key, long hash);
    PyDictEntry ma_smalltable[PyDict_MINSIZE];
};

字典初始化

当你第一次创建字典，函数PyDict_New()会调用。我删掉了一些行数，把C转化为了伪代码来集中在几个关键概念。

返回新的字典对象
function PyDict_New:
  分配一个新的字典对象
  清空字典的table
  设置字典的ma_fill数为0
  设置字典的ma_used为0
  设置字典的ma_mask为size-1=7
  设置字典的查找方法lookdict_string
  ma_table指向ma_smalltable
  返回 字典对象

添加Items

当一个新的键值对被添加,PyDict_SetItem()会调用，这个函数接受一个字典对象和key,value值。它会检查key，如果传入来key是string，并且会计算string的hash，或者复用。insertdict()用来添加键值对，当字典的已经使用的槽和dummy槽加起来的数量比值超过整个数组的2/3,那么字典会被重新分配。
为什么是2/3? 这是为了探测链能够快速的找到。
我们会在后面看重新分配字典大小是怎么实现的。

参数: dictionary, key, value
返回: 0 if ok or -1
function PyDict_SetItem:
  if key's hash is cached:
    use cached
  else:
   calc hash
  调用insertdict方法插入键值对到字典
  如果超出2/3:调用dictresize

insertdict()会调用lookdict_string() /lookdict ()来找到空闲的槽。跟查找key的函数类似。lookdict_string()会通过hash和mask值来计算可用槽的索引。如果在index = hash & mask找不到可用的槽，会开始循环探测，后面会解释，直到找到可用的槽。在第一次尝试探测中，如果key是null，他会返回第一次查找到的dummy槽，为了更高等级利用之前删除的槽。

我们想要天际一下几个键值对，{‘a’: 1, ‘b’: 2′, ‘z’: 26, ‘y’: 25, ‘c’: 5, ‘x’: 24}，下面解释了发生了什么:

一个字典结果被分配，内部的table大小为8.

PyDict_SetItem: key = ‘a’, value = 1
- hash = hash(‘a’) = 12416037344
- insertdict
  - lookdict_string
    - slot index = hash & mask = 12416037344 & 7 = 0
    - slot 0 没有使用，所以直接返回
  - init entry at index 0 with key, value and hash
  - ma_used = 1, ma_fill = 1
PyDict_SetItem: key = ‘b’, value = 2
- hash = hash(‘b’) = 12544037731
- insertdict
  - lookdict_string
    - slot index = hash & mask = 12544037731 & 7 = 3
    - slot 3 is not used so return it
  - init entry at index 3 with key, value and hash
  - ma_used = 2, ma_fill = 2
PyDict_SetItem: key = ‘z’, value = 26
- hash = hash(‘z’) = 15616046971
- insertdict
  - lookdict_string
    - slot index = hash & mask = 15616046971 & 7 = 3
    - slot 3 is used so probe for a different slot: 5 is free
  - init entry at index 5 with key, value and hash
  - ma_used = 3, ma_fill = 3
PyDict_SetItem: key = ‘y’, value = 25
- hash = hash(‘y’) = 15488046584
- insertdict
  - lookdict_string
    - slot index = hash & mask = 15488046584 & 7 = 0
    - slot 0 is used so probe for a different slot: 1 is free
  - init entry at index 1 with key, value and hash
  - ma_used = 4, ma_fill = 4
PyDict_SetItem: key = ‘c’, value = 3
- hash = hash(‘c’) = 12672038114
- insertdict
  - lookdict_string
    - slot index = hash & mask = 12672038114 & 7 = 2
    - slot 2 is free so return it
  - init entry at index 2 with key, value and hash
  - ma_used = 5, ma_fill = 5
PyDict_SetItem: key = ‘x’, value = 24
- hash = hash(‘x’) = 15360046201
- insertdict
  - lookdict_string
    - slot index = hash & mask = 15360046201 & 7 = 1
    - slot 1 is used so probe for a different slot: 7 is free
  - init entry at index 7 with key, value and hash
  - ma_used = 6, ma_fill = 6

目前为止，我们做了如下：

《Python字典实现》 insert.png

8个槽用了6个，超过了数组容量的2/3,dictresize() 将会调用，分配一个更大的数组。这个函数还会维护旧的数据到新的数组。

dictresize() 调用会使minused=24,在我们刚才的情况中，是4 * ma_used。2 * ma_used 是当数据量非常大的时候会使用(大于50000),为什么是4倍？它减少resize步骤以及增加可用的槽。

新表是大于24的，是通过将当前大小向左移1位直到大于24来获得的。它最终变为32。e.g. 8(b1000)->16(b10000)->32(b100000)

当重新调整大小时候:一个 32大小的table被分配内存。旧的table entry会插入到新表中(使用新的mask值，31)。我们得到如下：

《Python字典实现》 resizing.png

删除Items

PyDict_DelItem()用来删除字典的entry。key的hash会被计算，并且会给查找函数作为参数，查找函数会返回要找到的entry。删除之后slot是dummy slot。

我们想删除key ‘c’ 从我们的字典中，最后，会得到下面的数组。

《Python字典实现》 delete.png

注意，删除操作不会出发重新分配数组的操作，尽管使用了的槽用的很少。然而，当添加键值对的时候，需要调整大小是基于使用的插槽数+虚拟插槽，因此它也可以缩小数组。

That’s it for now. I hope you enjoyed the article. Please write a comment if you have any feedback.

    原文作者：Baloneo
    原文地址: https://www.jianshu.com/p/b30fb4804f15
    本文转自网络文章，转载此文章仅为分享知识，如有侵权，请联系博主进行删除。