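Filtering a dataset: for records with the same customer number, queryreason, and querier, sort by querydate in descending order and delete records that fall within 14 days
Description
For records in the dataset that share the same key, sort by the date field in descending order. Then, starting from the latest date, delete every record that is within 14 days of the most recently kept record; a record further away than that is kept and becomes the new reference date.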
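The rule above can be sketched for a single key in plain Python. This is only an illustration, not the implementation used later in this post; the function name `keep_spaced_dates` and its parameters are made up for the example, and the dates are the ones belonging to key 001 in the sample data.

```python
from datetime import date


def keep_spaced_dates(dates, window_days):
    """Return the dates kept by the rule, newest first (hypothetical helper)."""
    kept = []
    for d in sorted(dates, reverse=True):
        # Keep the newest date, then any date that is more than
        # window_days older than the last date that was kept.
        if not kept or (kept[-1] - d).days > window_days:
            kept.append(d)
    return kept


dates_001 = [date(2020, 7, 1), date(2020, 7, 4), date(2020, 7, 5),
             date(2020, 7, 6), date(2020, 7, 19), date(2020, 7, 27)]
kept_001 = keep_spaced_dates(dates_001, 14)
print(kept_001)  # [datetime.date(2020, 7, 27), datetime.date(2020, 7, 6)]
```

Each date is compared with the last kept date rather than with the overall maximum, so a long run of records can leave several survivors spaced more than `window_days` apart.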
The original sample data:
    key  query_date
0   001  2020-07-01
1   002  2020-07-02
2   001  2020-07-04
3   001  2020-07-05
4   001  2020-07-06
5   001  2020-07-19
6   002  2020-07-12
7   002  2020-07-09
8   002  2020-07-23
9   002  2020-07-18
10  002  2020-07-29
11  001  2020-07-27
The data after grouping and sorting:
    key  query_date
11  001  2020-07-27
5   001  2020-07-19
4   001  2020-07-06
3   001  2020-07-05
2   001  2020-07-04
0   001  2020-07-01
10  002  2020-07-29
8   002  2020-07-23
9   002  2020-07-18
6   002  2020-07-12
7   002  2020-07-09
1   002  2020-07-02
For key 001, the latest date is 2020-07-27. The 2020-07-19 record is only 8 days earlier, which is within the window, so that record is deleted.
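The day gaps for key 001 can be checked with a few lines of standard-library Python. This is just the arithmetic behind the sentence above, measured from the latest date; the variable names are chosen for the example.

```python
from datetime import date

latest = date(2020, 7, 27)
older = [date(2020, 7, 19), date(2020, 7, 6), date(2020, 7, 5),
         date(2020, 7, 4), date(2020, 7, 1)]

# Days between the latest record and each older record of key 001.
gaps = [(latest - d).days for d in older]
print(gaps)  # [8, 21, 22, 23, 26]
```

The 2020-07-19 record is 8 days away and falls inside the window. The 2020-07-06 record is 21 days away, so it is kept and becomes the reference date for the records that follow it.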
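The filtered result (produced with a window of 8 days, the value passed to dataframe_filter in the test code; with a 14-day window, key 002 would keep only 2020-07-29 and 2020-07-12):
    key  query_date
11  001  2020-07-27
4   001  2020-07-06
10  002  2020-07-29
9   002  2020-07-18
7   002  2020-07-09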
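Code implementation:
"""
Operations performed on df:
1. Build a key for df from the querying institution, the query reason, and the customer number.
2. Group by key and sort each group by querydate in descending order. Because groups are awkward to work with here, a two-level sort is used directly instead.
"""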
import datetime
import pandas as pd


def date_diff(date1: str, date2: str) -> int:
    """Return the number of days from date1 to date2 (both 'YYYY-MM-DD' strings)."""
    d1 = datetime.datetime.strptime(date1, "%Y-%m-%d")
    d2 = datetime.datetime.strptime(date2, "%Y-%m-%d")
    return (d2 - d1).days
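

def dataframe_filter(df: pd.DataFrame, key_col: str, date_col: str, max_adjacent_day: int) -> pd.DataFrame:
    # Two-level sort: key ascending, then date descending within each key.
    df = df.sort_values([key_col, date_col], ascending=[True, False])
    curr_date = None
    curr_key = None
    drop_index = []
    for index, row in df.iterrows():
        if curr_key != row[key_col]:
            # First (latest) record of a new key: always kept, becomes the reference date.
            curr_key = row[key_col]
            curr_date = row[date_col]
            continue
        diff = date_diff(row[date_col], curr_date)
        if diff <= max_adjacent_day:
            # Within the window of the last kept record: mark the row for deletion.
            drop_index.append(index)
        else:
            # Far enough away: keep the row and use it as the new reference date.
            curr_date = row[date_col]
    # Drop by index label instead of chained assignment plus dropna(),
    # which does not reliably modify df and would also drop rows with unrelated NaNs.
    return df.drop(index=drop_index)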


if __name__ == '__main__':
    test_dict = {'key': ["001", "002", "001", "001",
                         "001", "001", "002", "002",
                         "002", "002", "002", "001"],
                 'query_date': ['2020-07-01', '2020-07-02', '2020-07-04', '2020-07-05',
                                '2020-07-06', '2020-07-19', '2020-07-12', '2020-07-09',
                                '2020-07-23', '2020-07-18', '2020-07-29', '2020-07-27'
                                ]}
    test_df = pd.DataFrame.from_dict(test_dict)
    print(test_df)
    print("-" * 30)
    # The sample result above corresponds to a window of 8 days.
    result_df = dataframe_filter(test_df, "key", "query_date", 8)
    print("Filtered result: ")
    print(result_df)
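
Step 1 of the docstring, building a key from the querying institution, the query reason, and the customer number, is not shown in the code above. One possible alternative, sketched here, is to group on the three columns directly so that no concatenated key is needed. The column names `customer_id`, `queryreason`, `querier`, and `querydate`, the function name `filter_within_window`, and the sample rows are all assumptions made for this example.

```python
import pandas as pd


def filter_within_window(df, key_cols, date_col, window_days):
    """Per key, keep the latest row and then every row that is more than
    window_days older than the previously kept row (hypothetical helper)."""
    dates = pd.to_datetime(df[date_col])
    keep = []
    # Grouping on several columns replaces the concatenated key.
    for _, group in dates.groupby([df[c] for c in key_cols]):
        last_kept = None
        for idx, d in group.sort_values(ascending=False).items():
            if last_kept is None or (last_kept - d).days > window_days:
                keep.append(idx)
                last_kept = d
    return df.loc[keep]


# Made-up rows using assumed column names.
df = pd.DataFrame({
    "customer_id": ["001", "001", "001", "002", "002"],
    "queryreason": ["01", "01", "01", "02", "02"],
    "querier": ["A", "A", "A", "B", "B"],
    "querydate": ["2020-07-27", "2020-07-19", "2020-07-06",
                  "2020-07-29", "2020-07-23"],
})
result = filter_within_window(
    df, ["customer_id", "queryreason", "querier"], "querydate", 14)
print(list(result.index))  # [0, 2, 3]
```

Converting the dates with `pd.to_datetime` also removes the need to parse each string by hand, and the original string column is left unchanged in the returned rows.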