DataFrame: filtering data by time

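Filter the raw dataset under the following condition: for records whose customer ID, queryreason, and querier are all the same, sort by querydate from latest to earliest and delete any record that falls within 14 days of the most recently kept one.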
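The three matching fields can be collapsed into a single key column before filtering. The following is only a minimal sketch: the column names `customer_id`, `queryreason`, `querier` and the sample values are assumptions made for illustration, since the original post does not show the raw column layout.

```python
import pandas as pd

# Hypothetical column names and values, for illustration only
raw = pd.DataFrame({
    "customer_id": ["C1", "C1", "C2"],
    "queryreason": ["01", "01", "02"],
    "querier": ["A", "A", "B"],
    "querydate": ["2020-07-01", "2020-07-04", "2020-07-02"],
})

# Concatenate the three fields with a separator, so rows that agree on
# all three values end up with an identical key
raw["key"] = (
    raw["customer_id"].astype(str)
    + "|" + raw["queryreason"].astype(str)
    + "|" + raw["querier"].astype(str)
)

print(raw[["key", "querydate"]])
```

The separator is there so that two different field combinations are less likely to produce the same concatenated string.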

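Functional description

For records in the dataset that share the same key, sort them by the time field from latest to earliest. Then, starting from the latest time, delete every record that lies within 14 days of the most recently kept record.

The original sample data: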

    key  query_date
0   001  2020-07-01
1   002  2020-07-02
2   001  2020-07-04
3   001  2020-07-05
4   001  2020-07-06
5   001  2020-07-19
6   002  2020-07-12
7   002  2020-07-09
8   002  2020-07-23
9   002  2020-07-18
10  002  2020-07-29
11  001  2020-07-27

The data after grouping and sorting:

    key  query_date
11  001  2020-07-27
5   001  2020-07-19
4   001  2020-07-06
3   001  2020-07-05
2   001  2020-07-04
0   001  2020-07-01
10  002  2020-07-29
8   002  2020-07-23
9   002  2020-07-18
6   002  2020-07-12
7   002  2020-07-09
1   002  2020-07-02
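The ordering above can be reproduced with a two-level `sort_values` call on the sample data. This sketch relies on the dates being `YYYY-MM-DD` strings, which sort chronologically even when compared as plain text.

```python
import pandas as pd

sample = pd.DataFrame({
    "key": ["001", "002", "001", "001", "001", "001",
            "002", "002", "002", "002", "002", "001"],
    "query_date": ["2020-07-01", "2020-07-02", "2020-07-04", "2020-07-05",
                   "2020-07-06", "2020-07-19", "2020-07-12", "2020-07-09",
                   "2020-07-23", "2020-07-18", "2020-07-29", "2020-07-27"],
})

# Key ascending, then date descending within each key
sorted_df = sample.sort_values(["key", "query_date"], ascending=[True, False])
print(sorted_df)
```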

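For key 001, the latest date is 2020-07-27. The 2020-07-19 record for 001 is only 8 days earlier, which is less than 14 days, so that record is deleted. Each later comparison is made against the most recently kept record, not against the neighboring row.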
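The day counts in this walkthrough can be checked with plain date arithmetic. A small sketch using the standard library follows; the variable names are illustrative only.

```python
from datetime import date

latest = date(2020, 7, 27)  # most recent record for key 001

# Gap between the latest record and the next two candidates
gap_0719 = (latest - date(2020, 7, 19)).days
gap_0706 = (latest - date(2020, 7, 6)).days

print(gap_0719)  # within the threshold, so this record is deleted
print(gap_0706)  # beyond the threshold, so this record is kept
```

Because 2020-07-06 is kept, it becomes the reference date for the remaining 001 records.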

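Filtered result (this output corresponds to the sample call in the code, which passes max_adjacent_day=8 rather than 14):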

    key  query_date
11  001  2020-07-27
4   001  2020-07-06
10  002  2020-07-29
9   002  2020-07-18
7   002  2020-07-09

Code implementation:

"""
Perform the following operations on df:

1. Build a key for df from the querying institution, query reason, and customer ID.
2. Group by key and sort each group by querydate in descending order.
   Grouping is awkward to work with here, so a two-level sort is used directly.
"""

import datetime
import pandas as pd


def date_diff(date1: str, date2: str) -> int:
    """Return the number of days from date1 to date2 ("YYYY-MM-DD" strings)."""
    d1 = datetime.datetime.strptime(date1, "%Y-%m-%d")
    d2 = datetime.datetime.strptime(date2, "%Y-%m-%d")
    return (d2 - d1).days


def dataframe_filter(df: pd.DataFrame, key_col: str, date_col: str,
                     max_adjacent_day: int) -> pd.DataFrame:
    # Two-level sort: key ascending, date descending within each key
    df = df.sort_values([key_col, date_col], ascending=[True, False])

    curr_date = None
    curr_key = None
    drop_index = []  # index labels of the rows to delete
    for index, row in df.iterrows():
        if curr_key != row[key_col]:
            # The first (latest) record of each key is always kept
            curr_key = row[key_col]
            curr_date = row[date_col]
            continue

        diff = date_diff(row[date_col], curr_date)
        if diff <= max_adjacent_day:
            # Too close to the last kept record: mark for deletion.
            # (The original chained assignment df.loc[index]['query_date'] = None
            # may write to a copy and leave df unchanged.)
            drop_index.append(index)
        else:
            # Far enough away: keep it and use it as the new reference date
            curr_date = row[date_col]

    return df.drop(index=drop_index)


if __name__ == '__main__':
    test_dict = {'key': ["001", "002", "001", "001",
                         "001", "001", "002", "002",
                         "002", "002", "002", "001"],
                 'query_date': ['2020-07-01', '2020-07-02', '2020-07-04', '2020-07-05',
                                '2020-07-06', '2020-07-19', '2020-07-12', '2020-07-09',
                                '2020-07-23', '2020-07-18', '2020-07-29', '2020-07-27'
                                ]}

    test_df = pd.DataFrame.from_dict(test_dict)
    print(test_df)
    print("-" * 30)

    result_df = dataframe_filter(test_df, "key", "query_date", 8)

    print("Filtered result:")
    print(result_df)
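The docstring notes that grouping was awkward, so the code falls back on a two-level sort. As an alternative, the same greedy rule can probably be written with `groupby` and parsed timestamps. This is only a sketch, not the author's method: the function name `filter_by_gap` is hypothetical, and it assumes the DataFrame index labels are unique.

```python
import pandas as pd


def filter_by_gap(df, key_col, date_col, max_adjacent_day):
    """Per key, keep the latest row, then every earlier row that is more
    than max_adjacent_day days before the last kept row."""
    dates = pd.to_datetime(df[date_col])  # parse once, not per comparison
    keep = []
    for _, group_dates in dates.groupby(df[key_col]):
        ref = None  # date of the last kept row in this key
        # Walk this key's dates from newest to oldest
        for idx, d in group_dates.sort_values(ascending=False).items():
            if ref is None or (ref - d).days > max_adjacent_day:
                keep.append(idx)
                ref = d
    return df.loc[keep]


sample = pd.DataFrame({
    "key": ["001", "002", "001", "001", "001", "001",
            "002", "002", "002", "002", "002", "001"],
    "query_date": ["2020-07-01", "2020-07-02", "2020-07-04", "2020-07-05",
                   "2020-07-06", "2020-07-19", "2020-07-12", "2020-07-09",
                   "2020-07-23", "2020-07-18", "2020-07-29", "2020-07-27"],
})

print(filter_by_gap(sample, "key", "query_date", 8))
print(filter_by_gap(sample, "key", "query_date", 14))
```

With a threshold of 8 this keeps the same rows as the listing above. With a threshold of 14, key 002 keeps 2020-07-12 instead of 2020-07-18 and 2020-07-09, which shows how strongly the threshold affects the result.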

Original author: AISeekOnline
Original URL: https://blog.csdn.net/qq_28743951/article/details/107358057