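I have a large DataFrame from which I build the data I need with groupby. I need to turn the index of the resulting DataFrame into several separate columns.
Part of the original DataFrame looks like this: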
             code    place  vl  year  week
0   111.0002.0056  region1   1  2017    29
1   112.6500.2285  region2   1  2017    31
2   112.5600.6325  region2   1  2017    30
3   112.5600.6325  region2   1  2017    30
4   112.5600.8159  region2   1  2017    30
5   111.0002.0056  region2   1  2017    29
6   111.0002.0056  region2   1  2017    30
7   111.0002.0056  region2   1  2017    28
8   112.5600.8159  region3   1  2017    31
9   112.5600.8159  region3   1  2017    28
10  111.0002.0114  region3   1  2017    31
....
After applying groupby, it looks like this (code: df_test1 = df_test.groupby(['code', 'year', 'week', 'place'])['vl'].sum().unstack(fill_value=0)):
place                    region1  region2  region3  region4  index1
code          year week
111.0002.0006 2017 29          0        3        0        0  (111.0002.0006, 2017, 29)
                   30          0        7        0        0  (111.0002.0006, 2017, 30)
111.0002.0018 2017 29          0        0        0        0  (111.0002.0018, 2017, 29)
111.0002.0029 2017 30          0        0        0        0  (111.0002.0029, 2017, 30)
111.0002.0055 2017 28          0       33        0        8  (111.0002.0055, 2017, 28)
                   29          1      155        2       41  (111.0002.0055, 2017, 29)
                   30          0      142        1       39  (111.0002.0055, 2017, 30)
                   31          0       31        0       13  (111.0002.0055, 2017, 31)
111.0002.0056 2017 28          9       36        0        4  (111.0002.0056, 2017, 28)
                   29         20       75        2       37  (111.0002.0056, 2017, 29)
                   30         17       81        2       33  (111.0002.0056, 2017, 30)
....
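I saved the index in a separate column, index1 (code: df_test1['index1'] = df_test1.index).
From the index1 column I need to extract three separate columns: code, year and week.
The result should look like this: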
region1  region2  region3  region4           code  year  week
      0        3        0        0  111.0002.0006  2017    29
      0        7        0        0  111.0002.0006  2017    30
      0        0        0        0  111.0002.0018  2017    29
      0        0        0        0  111.0002.0029  2017    30
      0       33        0        8  111.0002.0055  2017    28
      1      155        2       41  111.0002.0055  2017    29
      0      142        1       39  111.0002.0055  2017    30
      0       31        0       13  111.0002.0055  2017    31
....
I would appreciate any suggestions!
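Best answer: Use reset_index instead of df_test1['index1'] = df_test1.index; it turns the index levels code, year and week into ordinary columns. For a clean DataFrame, also add rename_axis(None, axis=1), which removes the name of the columns axis (place):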
df_test1 = df_test.groupby(['code', 'year', 'week', 'place'])['vl'].sum() \
                  .unstack(fill_value=0) \
                  .reset_index() \
                  .rename_axis(None, axis=1)
print(df_test1)
            code  year  week  region1  region2  region3
0  111.0002.0056  2017    28        0        1        0
1  111.0002.0056  2017    29        1        1        0
2  111.0002.0056  2017    30        0        1        0
3  111.0002.0114  2017    31        0        0        1
4  112.5600.6325  2017    30        0        2        0
5  112.5600.8159  2017    28        0        0        1
6  112.5600.8159  2017    30        0        1        0
7  112.5600.8159  2017    31        0        0        1
8  112.6500.2285  2017    31        0        1        0
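For anyone who wants to run the chain above end to end, here is a minimal self-contained sketch. It assumes the input is exactly the 11 sample rows shown in the question (the real DataFrame is larger), and it wraps the chain in parentheses instead of using backslashes, which is only a style choice.

```python
import pandas as pd

# Assumption: only the 11 sample rows from the question are used here.
df_test = pd.DataFrame({
    "code": ["111.0002.0056", "112.6500.2285", "112.5600.6325", "112.5600.6325",
             "112.5600.8159", "111.0002.0056", "111.0002.0056", "111.0002.0056",
             "112.5600.8159", "112.5600.8159", "111.0002.0114"],
    "place": ["region1", "region2", "region2", "region2", "region2", "region2",
              "region2", "region2", "region3", "region3", "region3"],
    "vl": [1] * 11,
    "year": [2017] * 11,
    "week": [29, 31, 30, 30, 30, 29, 30, 28, 31, 28, 31],
})

df_test1 = (
    df_test.groupby(["code", "year", "week", "place"])["vl"].sum()
    .unstack(fill_value=0)       # one column per place, missing combinations become 0
    .reset_index()               # code, year, week move from the index into columns
    .rename_axis(None, axis=1)   # drop the leftover "place" label on the columns axis
)
print(df_test1)
```

After reset_index the rows get a plain 0..n-1 index, so no helper column such as index1 is needed.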
If necessary, change the column order at the end:
# move the key columns (all of them exist in df_test1) to the end
cols = ['code', 'year', 'week']
df_test1 = df_test1[[x for x in df_test1.columns if x not in cols] + cols]
print(df_test1)
   region1  region2  region3           code  year  week
0        0        1        0  111.0002.0056  2017    28
1        1        1        0  111.0002.0056  2017    29
2        0        1        0  111.0002.0056  2017    30
3        0        0        1  111.0002.0114  2017    31
4        0        2        0  112.5600.6325  2017    30
5        0        0        1  112.5600.8159  2017    28
6        0        1        0  112.5600.8159  2017    30
7        0        0        1  112.5600.8159  2017    31
8        0        1        0  112.6500.2285  2017    31
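If the index1 column of tuples already exists and the groupby cannot easily be rerun, the tuples can also be split directly. This is a different technique from the reset_index approach above, sketched here on a small hypothetical frame (the name df and its two rows are made up for illustration, with values taken from the question's output):

```python
import pandas as pd

# Hypothetical frame that already carries the tuples in an "index1" column.
df = pd.DataFrame({
    "region1": [0, 0],
    "region2": [3, 7],
    "index1": [("111.0002.0006", 2017, 29), ("111.0002.0006", 2017, 30)],
})

# Expand each tuple into three columns, keeping the original row labels
# so that the join below lines the rows up correctly.
parts = pd.DataFrame(
    df["index1"].tolist(),
    index=df.index,
    columns=["code", "year", "week"],
)

out = df.drop(columns="index1").join(parts)
print(out)
```

When the MultiIndex is still available, reset_index is the simpler route, since it avoids creating the tuple column in the first place.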