Suppose we have a list like this, showing the count of each object at specific timestamps (mm-dd-yyyy-hour-minute):
A = [
    [
        ['07-07-2012-21-04', 'orange', 1],
        ['08-16-2012-08-57', 'orange', 1],
        ['08-18-2012-03-30', 'orange', 1],
        ['08-18-2012-03-30', 'orange', 1],
        ['08-19-2012-03-58', 'orange', 1],
        ['08-19-2012-03-58', 'orange', 1],
        ['08-19-2012-04-09', 'orange', 1],
        ['08-19-2012-04-09', 'orange', 1],
        ['08-19-2012-05-21', 'orange', 1],
        ['08-19-2012-05-21', 'orange', 1],
        ['08-19-2012-06-03', 'orange', 1],
        ['08-19-2012-07-51', 'orange', 1],
        ['08-19-2012-08-17', 'orange', 1],
        ['08-19-2012-08-17', 'orange', 1]
    ],
    [
        ['07-07-2012-21-04', 'banana', 1]
    ],
    [
        ['07-07-2012-21-04', 'mango', 1],
        ['08-16-2012-08-57', 'mango', 1],
        ['08-18-2012-03-30', 'mango', 1],
        ['08-18-2012-03-30', 'mango', 1],
        ['08-19-2012-03-58', 'mango', 1],
        ['08-19-2012-03-58', 'mango', 1],
        ['08-19-2012-04-09', 'mango', 1],
        ['08-19-2012-04-09', 'mango', 1],
        ['08-19-2012-05-21', 'mango', 1],
        ['08-19-2012-05-21', 'mango', 1],
        ['08-19-2012-06-03', 'mango', 1],
        ['08-19-2012-07-51', 'mango', 1],
        ['08-19-2012-08-17', 'mango', 1],
        ['08-19-2012-08-17', 'mango', 1]
    ]
]
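What I need to do in A is fill in all the missing dates for each object (from the minimum date to the maximum date in A) with a value of 0. Once the missing dates and their corresponding values (0) are in place, I want to sum the values for each date so that no date is repeated, and I want this for each sublist.
What I have tried so far: I split the dates and values of A into two lists (named u and v), converted each sublist into a pandas Series, and assigned the matching dates as its index. So for each pair in zip(u, v):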
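def generate(values, indices):
    # flatten is a helper of mine (not shown); DatetimeIndex, Series
    # and date_range are imported from pandas
    indices = flatten(indices)
    date_index = DatetimeIndex(indices)
    ts = Series(values, index=date_index)
    # this is the call that raises; note its result is not assigned back to ts
    ts.reindex(date_range(min(date_index), max(date_index)))
    return ts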
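But here the reindexing raises an exception. What I am looking for is a pure Python approach (no pandas), based entirely on list comprehensions or perhaps on NumPy arrays.
There is also the question of hourly aggregation: if all the dates are the same and only the hours differ, I want to fill in all the missing hours of that day and then repeat the same aggregation per hour, with the missing hours filled in with 0.
Thanks in advance.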
Best answer: How about this:
from collections import defaultdict, OrderedDict
from datetime import datetime, timedelta
from itertools import chain, groupby

# Flatten the sublists of A and reduce each timestamp to its date;
# sorting puts equal (date, name) pairs next to each other for groupby.
flat = sorted((datetime.strptime(d, '%m-%d-%Y-%H-%M').date(), f, c)
              for (d, f, c) in chain(*A))
# Sum the counts for each (date, name) pair.
counts = [(d, f, sum(e[2] for e in l))
          for (d, f), l
          in groupby(flat, key=lambda t: (t[0], t[1]))]
# Let's assume that there is some data.
start = counts[0][0]
end = counts[-1][0]
# One entry per day from start to end; names absent on a day default to 0.
result = OrderedDict((start + timedelta(days=i), defaultdict(int))
                     for i in range((end - start).days + 1))
for day, data in groupby(counts, key=lambda d: d[0]):
    result[day].update((f, c) for d, f, c in data)
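The question asks for one zero-filled list per object rather than one dictionary per day. The following is a minimal self-contained sketch of that shape, not part of the original answer; the helper name fill_daily, its start/end parameters, and the small sample are all hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def fill_daily(rows, start=None, end=None):
    """Sum counts per calendar day and report 0 for days with no rows.

    rows: iterable of (timestamp_string, name, count) with timestamps in
    'mm-dd-yyyy-HH-MM' format. start and end default to the first and
    last day present in rows. (fill_daily is a hypothetical helper.)
    """
    totals = Counter()
    for stamp, _name, count in rows:
        day = datetime.strptime(stamp, '%m-%d-%Y-%H-%M').date()
        totals[day] += count
    if not totals:
        return []
    start = start or min(totals)
    end = end or max(totals)
    n_days = (end - start).days + 1
    # Counter returns 0 for days that never appeared in rows.
    return [(start + timedelta(days=i), totals[start + timedelta(days=i)])
            for i in range(n_days)]


sample = [
    ['08-16-2012-08-57', 'mango', 1],
    ['08-18-2012-03-30', 'mango', 1],
    ['08-18-2012-03-30', 'mango', 1],
]
for day, total in fill_daily(sample):
    print(day, total)
```

For the sample this prints 1 for 2012-08-16, 0 for 2012-08-17 and 2 for 2012-08-18. To fill every sublist over the same range, the overall minimum and maximum dates of A could be passed as start and end.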
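My question is: do we really need to fill in the nonexistent dates? I can easily imagine situations where that would mean a lot of data, even a dangerous amount of data... I think it is better to use a simple, general function together with a generator if you just want to list them somewhere: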
from collections import defaultdict
from datetime import datetime, timedelta
from itertools import chain, groupby

def aggregate(data, resolution='daily'):
    assert resolution in ['hourly', 'daily']
    if resolution == 'hourly':
        round_dt = lambda dt: dt.replace(minute=0, second=0, microsecond=0)
    else:
        round_dt = lambda dt: dt.date()
    # Flatten the sublists of data and round each timestamp to the resolution.
    flat = sorted((round_dt(datetime.strptime(d, '%m-%d-%Y-%H-%M')), f, c)
                  for (d, f, c) in chain(*data))
    # Sum the counts for each (rounded timestamp, name) pair.
    counts = [(d, f, sum(e[2] for e in l))
              for (d, f), l
              in groupby(flat, key=lambda t: (t[0], t[1]))]
    result = {}
    for day, group in groupby(counts, key=lambda d: d[0]):
        per_name = result[day] = defaultdict(int)
        per_name.update((f, c) for _, f, c in group)
    return result

def xaggregate(data, resolution='daily'):
    aggregated = aggregate(data, resolution)
    curr = min(aggregated.keys())
    end = max(aggregated.keys())
    interval = timedelta(days=1) if resolution == 'daily' else timedelta(hours=1)
    while curr <= end:
        # None is a sensible value in case of missing data, I think
        yield curr, aggregated.get(curr)
        curr += interval
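For the hourly case raised in the question, the same generator idea can be sketched for a single object's rows. This is only an illustration under my own assumptions: the name iter_hourly and the whole_days flag are hypothetical, and unlike the generator above it yields 0 instead of None for empty hours.

```python
from collections import Counter
from datetime import datetime, timedelta


def iter_hourly(rows, whole_days=False):
    """Yield (hour_start, total) for every hour from the first to the last row.

    With whole_days=True the range is widened to run from 00:00 of the
    first day to 23:00 of the last day. (Hypothetical helper.)
    """
    totals = Counter()
    for stamp, _name, count in rows:
        dt = datetime.strptime(stamp, '%m-%d-%Y-%H-%M')
        # The format has no seconds, so clearing the minutes is enough.
        totals[dt.replace(minute=0)] += count
    if not totals:
        return
    curr, end = min(totals), max(totals)
    if whole_days:
        curr = curr.replace(hour=0)
        end = end.replace(hour=23)
    step = timedelta(hours=1)
    while curr <= end:
        # Counter gives 0 for hours that had no rows.
        yield curr, totals[curr]
        curr += step


rows = [
    ['08-19-2012-03-58', 'orange', 1],
    ['08-19-2012-03-58', 'orange', 1],
    ['08-19-2012-06-03', 'orange', 1],
]
for hour, total in iter_hourly(rows):
    print(hour, total)
```

Because it is a generator, nothing is materialized until the caller iterates, which keeps the memory concern mentioned above under the caller's control.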
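In general, my advice is that you should not use lists as fixed-structure records (I mean ['07-07-2012-21-04', 'mango', 1]). I think tuples are much better suited to this purpose, and of course collections.namedtuple is even more desirable.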
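As a rough illustration of that last point, one row could be turned into a named tuple like this. The type name Observation, its field names, and the parse_row helper are hypothetical choices of mine, not something from the question.

```python
from collections import namedtuple
from datetime import datetime

# Hypothetical record type; the field names are illustrative only.
Observation = namedtuple('Observation', ['when', 'name', 'count'])


def parse_row(row):
    """Convert one [timestamp_string, name, count] list into an Observation."""
    stamp, name, count = row
    return Observation(datetime.strptime(stamp, '%m-%d-%Y-%H-%M'), name, count)


obs = parse_row(['07-07-2012-21-04', 'mango', 1])
print(obs.name, obs.when.date(), obs.count)
```

Fields can then be read by name (obs.count) instead of by position (row[2]), while the record still unpacks and compares like a plain tuple.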