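I have a pandas DataFrame. I want to group it by email, take the maximum date in each group, and keep the status column, without including status in the groupby.
Example: given the following DataFrame df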
+-------+------------+----------+
| email | status     | date     |
+-------+------------+----------+
| test1 | viewed     | 01/07/18 |
| test1 | not viewed | 03/07/18 |
| test2 | not viewed | 02/07/18 |
| test2 | viewed     | 01/07/18 |
| test3 | not viewed | 03/07/18 |
| test3 | viewed     | 04/07/18 |
+-------+------------+----------+
I am using the following code, but it drops the status column and I don't know how to keep it.
df.groupby(['email']).aggregate({'date': 'max'})
Desired output:
+-------+------------+----------+
| email | status     | date     |
+-------+------------+----------+
| test1 | not viewed | 03/07/18 |
| test2 | not viewed | 02/07/18 |
| test3 | viewed     | 04/07/18 |
+-------+------------+----------+
In short, I want to group by email, get the latest date, and keep the status column from the row that has that date.
Best answer: Instead of agg, you can sort by date, group by email, and take the last row of each group (which will be the most recent one):
df['date'] = pd.to_datetime(df.date)
df.sort_values('date').groupby('email', as_index=False).last()
   email      status       date
0  test1  not viewed 2018-03-07
1  test2  not viewed 2018-02-07
2  test3      viewed 2018-04-07
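One caveat with the approach above: groupby(...).last() returns the last non-null value of each column, so if status could contain missing values, the result might combine values from different rows. A possible alternative, sketched below, selects whole rows via idxmax instead. The DataFrame is rebuilt from the question's table, and the explicit format="%m/%d/%y" is an assumption chosen to match the month-first parsing visible in the output above; if the dates are actually day-first, "%d/%m/%y" would be needed instead.

```python
import pandas as pd

# Sample data copied from the question's table.
df = pd.DataFrame({
    "email": ["test1", "test1", "test2", "test2", "test3", "test3"],
    "status": ["viewed", "not viewed", "not viewed",
               "viewed", "not viewed", "viewed"],
    "date": ["01/07/18", "03/07/18", "02/07/18",
             "01/07/18", "03/07/18", "04/07/18"],
})

# Assumption: dates are month/day/year, matching the answer's output.
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%y")

# For each email, find the index label of the row with the latest date,
# then pull those complete rows so status comes from the same row as date.
latest_idx = df.groupby("email")["date"].idxmax()
latest = df.loc[latest_idx].reset_index(drop=True)

print(latest)
```

On this sample data the selected rows should match the desired output in the question. If two rows in a group share the same latest date, idxmax picks the first of them.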