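I "discovered" that I can create a pandas.Index from arbitrary Python objects, and things seem to work as long as the objects implement: __hash__, __eq__, __ne__, __str__. Is there a performance cost to doing this? For example, will sorting and selection be as fast as with string or integer index labels? Is this kind of index well supported? Is there any documentation on how to do it properly?
Here is an example: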
import pandas as pd

class MyObject(object):
    def __init__(self, name):
        self.name = name  # Expect name is a string
        self.complicated_object = lambda x: 2 * x

    def __hash__(self):
        # Hash by name so frames can be indexed by name rather than by object
        return hash(self.name)

    def __str__(self):
        # Makes sure DataFrames print nicely
        return self.name

    def __eq__(self, other):
        # Compare by name so frames can be indexed by name rather than by object
        if isinstance(other, str):  # was basestring on Python 2
            return self.name == other
        else:
            return self.name == other.name

my_series = pd.Series([1, 2], index=[MyObject('cat'), MyObject('dog')])
print(my_series)
my_series.index[0]
This prints:
cat    1
dog    2
dtype: int64
<__main__.MyObject at 0x81a67d0>
Best answer: In short, yes, sorting takes a performance hit. Here is a test case:
import numpy as np
import pandas as pd

n = 10000
idx = np.random.permutation(n)  # shuffled labels 0..n-1
data = np.arange(n)
obj_idx = [MyObject(str(ii)) for ii in idx]  # custom-object labels
str_idx = [str(ii) for ii in idx]            # string labels
int_idx = idx.tolist()                       # integer labels
s1 = pd.Series(data, obj_idx)
s2 = pd.Series(data, str_idx)
s3 = pd.Series(data, int_idx)
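One caveat for anyone re-running this on Python 3: plain instances have no default ordering there, so sorting an index of MyObject labels needs the label class to define `__lt__` as well. A minimal sketch of what that could look like, assuming a hypothetical `OrderedLabel` class (not part of the question) that orders by name:

```python
import functools

import numpy as np
import pandas as pd


@functools.total_ordering
class OrderedLabel:
    """Hypothetical label: hashable, comparable with str, ordered by name."""

    def __init__(self, name):
        self.name = name

    def _key(self, other):
        # Pull a comparable string out of another label or a plain string.
        if isinstance(other, OrderedLabel):
            return other.name
        if isinstance(other, str):
            return other
        return None

    def __hash__(self):
        return hash(self.name)

    def __eq__(self, other):
        key = self._key(other)
        return NotImplemented if key is None else self.name == key

    def __lt__(self, other):
        # total_ordering derives <=, > and >= from __eq__ and __lt__.
        key = self._key(other)
        return NotImplemented if key is None else self.name < key

    def __repr__(self):
        return self.name


labels = [OrderedLabel(str(i)) for i in np.random.permutation(20)]
s = pd.Series(np.arange(20), index=labels)
sorted_names = [label.name for label in s.sort_index().index]
```

Because the names are strings, the resulting order is lexicographic ('10' sorts before '2'), the same as for a plain string index.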
Sort timings:
In [1]: %%timeit s = s1.copy()
s.sort_index()
....:
10 loops, best of 3: 47.6 ms per loop
In [2]: %%timeit s = s2.copy()
s.sort_index()
....:
100 loops, best of 3: 6.63 ms per loop
In [3]: %%timeit s = s3.copy()
s.sort_index()
....:
1000 loops, best of 3: 794 µs per loop
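The timings above cover sorting only. To check selection as well, one could time single-label lookups for the same three kinds of index. This is a rough sketch rather than a measurement: `Label` is a hypothetical stand-in for `MyObject`, `time_lookup` is an illustrative helper, and the numbers will depend on the pandas version and the machine.

```python
import timeit

import numpy as np
import pandas as pd


class Label:
    """Hypothetical stand-in for MyObject: hashes and compares by name."""

    def __init__(self, name):
        self.name = name

    def __hash__(self):
        return hash(self.name)

    def __eq__(self, other):
        if isinstance(other, Label):
            return self.name == other.name
        if isinstance(other, str):
            return self.name == other
        return NotImplemented


def time_lookup(series, key, repeat=1000):
    """Return the average time in seconds of one `series.loc[key]` lookup."""
    return timeit.timeit(lambda: series.loc[key], number=repeat) / repeat


n = 10000
idx = np.random.permutation(n)
data = np.arange(n)

obj_idx = [Label(str(i)) for i in idx]
series_by_kind = {
    "object": pd.Series(data, index=obj_idx),
    "string": pd.Series(data, index=[str(i) for i in idx]),
    "int": pd.Series(data, index=idx.tolist()),
}
# Look up the first label of each index, using the matching key type.
keys = {"object": obj_idx[0], "string": str(idx[0]), "int": int(idx[0])}

timings = {kind: time_lookup(s, keys[kind]) for kind, s in series_by_kind.items()}
for kind, seconds in timings.items():
    print(f"{kind:>6}: {seconds * 1e6:.1f} µs per lookup")
```

Each lookup of an object label calls the Python-level `__hash__` and `__eq__`, so it is reasonable to expect the object index to be the slowest of the three, but that should be confirmed by running the sketch.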