python – pandas中的对象索引：性能？好受支持？

2019年7月22日 157次阅读

我“发现”我可以使用
Python对象创建一个pandas.Index,只要对象实现了,事情似乎就可以正常工作：__ hash __,__ eq __,__ ne _,_ _ _ _ _ _.这样做会有性能影响吗？例如.我会像使用字符串或整数索引标签那样快速排序和选择工作吗？这种指数是否得到了很好的支持？是否有关于如何正确执行此操作的文档？

这是一个例子：

class MyObject(object):
  def __init__(self, name):
    self.name = name  # Expect name is a string
    self.complicated_object = lambda x: 2 * x

  def __hash__(self):
    # Allows indexing frames by name rather than question
    return hash(self.name)

  def __str__(self):
    # Makes sure DataFrames print nicely
    return self.name

  def __eq__(self, other):
    # Allows indexing frames by name rather than question
    if isinstance(other, basestring):
      return self.name == other
    else:
      return self.name == other.name

my_series = pd.Series([1, 2], index=[MyObject('cat'), MyObject('dog')])

print my_series

my_series.index[0]

这打印

猫1
狗2
dtype：int64
< __ main __.MyObject at 0x81a67d0>

最佳答案简而言之：是的,排序会受到性能影响.这是一个测试用例：

n = 10000
idx = np.random.permutation(n)
data = np.arange(n)
obj_idx = [MyObject(str(ii)) for ii in idx]
str_idx = [str(ii) for ii in idx]
int_idx = idx.tolist()

s1 = pd.Series(data, obj_idx)
s2 = pd.Series(data, str_idx)
s3 = pd.Series(data, int_idx)

排序时间：

In [1]: %%timeit s = s1.copy()
s.sort_index()
   ....: 
10 loops, best of 3: 47.6 ms per loop

In [2]: %%timeit s = s2.copy()
s.sort_index()
   ....: 
100 loops, best of 3: 6.63 ms per loop

In [3]: %%timeit s = s3.copy()
s.sort_index()
   ....: 
1000 loops, best of 3: 794 µs per loop