python – Finding customers who already existed in the prior 90 days with pandas

July 20, 2019

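I'm trying to classify customers in a DataFrame as new or existing, where "existing" means they already appear in the DataFrame within the 90 days before the day of the order. I'm looking for the best pandas way to do this. At the moment I build a mask based on the date and then look in the resulting Series: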

from datetime import datetime, timedelta


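def is_existing(row):
    # Orders placed no later than midnight of the day before this order
    cutoff = (row['placed_at'] - timedelta(days=1)).normalize()
    mask = df_only_90_days['placed_at'] <= cutoff
    # Use .values: `in` on a Series tests the index labels, not the data
    return row['customer_id'] in df_only_90_days.loc[mask, 'customer_id'].values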


df_only_90_days.apply(is_existing, axis=1)

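This is fine with only a few thousand records, but once it gets into millions of records it is far too slow. Apologies, I'm also new to pandas. Any ideas?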

Best answer: You can use the pandas groupby function on customer_id, and then look at each group individually.

Suppose your DataFrame looks like this:

   customer_id                  placed_at
0            1 2016-11-17 19:16:35.635774
1            2 2016-11-17 19:16:35.635774
2            3 2016-11-17 19:16:35.635774
3            4 2016-11-17 19:16:35.635774
4            5 2016-11-17 19:16:35.635774
5            5 2016-07-07 00:00:00.000000

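Customer 5 existed more than 90 days earlier, but none of the other customers did. With groupby we can create a groupby object in which each group contains all rows with a particular customer_id. We get one group for every unique customer_id in the DataFrame. When we apply a function to this groupby object, it is applied to each group.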

 groups = df.groupby("customer_id")
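To make the grouping concrete, here is a minimal sketch that rebuilds the sample table and inspects the groups. The timestamps are truncated to whole seconds for brevity, and the variable names `group_sizes` and `customer_5` are only illustrative.

```python
import pandas as pd

# Sample data mirroring the table above (timestamps truncated to seconds).
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 5],
    "placed_at": pd.to_datetime(
        ["2016-11-17 19:16:35"] * 5 + ["2016-07-07 00:00:00"]
    ),
})

groups = df.groupby("customer_id")

# One group per unique customer_id; size() counts the rows in each group.
group_sizes = groups.size()
print(group_sizes)

# get_group returns the rows that belong to a single customer.
customer_5 = groups.get_group(5)
print(customer_5)
```

Customer 5 is the only group with more than one row, which is why it is the only one that can span 90 days.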

Then we can define a function that checks a given group to see whether that customer existed 90 days earlier.

 import pandas as pd

 def existedBefore(g):
     # True if the customer's earliest and latest placed_at values are
     # at least 90 days apart, otherwise False.
     # If the group has only one row, max and min are the same,
     # so the difference is zero and the check still works.
     return g.placed_at.max() - g.placed_at.min() >= pd.Timedelta(days=90)

Now, if we run:

groups.apply(existedBefore)

we get:

customer_id
1    False
2    False
3    False
4    False
5     True

So we can see that customer 5 existed before.

The performance of this solution depends on how many unique customers you have. For more on the performance of apply with groupby, see: Pandas groupby apply performing slow
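If the number of unique customers is large, the same per-customer check can probably be expressed with built-in aggregations rather than a Python function passed to apply. The following is a sketch under that assumption, not part of the original answer; `span` and `existed_before` are illustrative names, and the sample data mirrors the table above.

```python
import pandas as pd

# Sample data mirroring the table above (timestamps truncated to seconds).
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 5],
    "placed_at": pd.to_datetime(
        ["2016-11-17 19:16:35"] * 5 + ["2016-07-07 00:00:00"]
    ),
})

# Earliest and latest order per customer, computed with built-in aggregations.
span = df.groupby("customer_id")["placed_at"].agg(["min", "max"])

# True where the first and last orders are at least 90 days apart.
existed_before = (span["max"] - span["min"]) >= pd.Timedelta(days=90)
print(existed_before)
```

This produces one boolean per customer, indexed by customer_id, like the result of groups.apply(existedBefore).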

Vectorized solution

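If you are only looking for all users who signed up at least 90 days before today, you can take a vectorized approach instead of relying on apply.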

 import datetime
 # Keep rows whose placed_at is at least 90 days before the current time
 priors = df[datetime.datetime.now() - df.placed_at >= datetime.timedelta(90)]

priors will look like this:

   customer_id  placed_at
5            5 2016-07-07

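So we find that customer 5 existed at least 90 days before today. Your original solution was very close to this; the problem is that apply is slow on large DataFrames. There are ways to improve that performance, but this vectorized approach should meet your needs.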
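The question asks for a flag per order rather than per customer. One possible vectorized sketch compares each order with the same customer's previous order: if the nearest earlier order is more than 90 days back, no earlier order can fall inside the window. This assumes any earlier order counts (including one on the same day, unlike the one-day cutoff in the question's code), and the `orders` data and `is_existing` column are hypothetical.

```python
import pandas as pd

# Hypothetical order data; column names follow the question.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3],
    "placed_at": pd.to_datetime([
        "2016-01-01 00:00:00",
        "2016-02-15 00:00:00",  # 45 days after customer 1's previous order
        "2016-01-01 00:00:00",
        "2016-06-01 00:00:00",  # 152 days after customer 2's previous order
        "2016-03-01 00:00:00",  # customer 3's only order
    ]),
})

# Sort so each customer's orders are in chronological order.
orders = orders.sort_values(["customer_id", "placed_at"]).reset_index(drop=True)

# Timestamp of the same customer's previous order (NaT for a first order).
prev_order = orders.groupby("customer_id")["placed_at"].shift()

# Gap to the previous order; a NaT gap compares as False,
# so a customer's first order is treated as "new".
gap = orders["placed_at"] - prev_order
orders["is_existing"] = gap <= pd.Timedelta(days=90)
print(orders)
```

Because the work is a sort, a grouped shift, and a column comparison, it avoids calling a Python function once per row.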