For loop: selecting a time window from a day column

February 5, 2024, 230 views

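I need to adapt some code, which works perfectly with another data frame of mine (one with a different setup), so that it selects a 2-day time window from the column Day. Specifically, I am interested in day 0 and the day before it (that is, i - 1 and i, where i is the day of interest), and the Count value of day i - 1 must be added to day 0 (i) in the Count column.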

Here is an example of my data frame:

df <- read.table(text = "
        Station   Day           Count
    1    33012  12448               4
    2    35004  12448               4
    3    35008  12448               4
    4    37006  12448               4
    5    21009   4835               3
    6    24005   4835               3
    7    27001   4835               3
    8    25005  12447               3
    9    29001  12447               3
    10   29002  12447               3
    11   29002  12446               3
    12   30001  12446               3
    13   31002  12446               3
    14   47007   4834               2
    15   49002   4834               2
    16   47004  12445               1
    17   51001  12449               1
    18   51003   4832               1
    19   52004   4836               1", header = TRUE)

My output should be:

           Station    Day           Count
        1    33012  12448               7
        2    35004  12448               7
        3    35008  12448               7
        4    37006  12448               7
        5    21009   4835               5
        6    24005   4835               5
        7    27001   4835               5
        8    29002  12446               4
        9    30001  12446               4
        10   31002  12446               4
        11   51001  12449               1
        12   51003   4832               1
        13   52004   4836               1
        14   25005  12447               0
        15   29001  12447               0
        16   29002  12447               0
        17   47007   4834               0
        18   49002   4834               0
        19   47004  12445               0

I am trying this code, but it does not work on my real data frame:

for (i in unique(df$Day)) {
    temp <- df$Count[df$Day == i]  
    if(length(temp > 0)) {  
    condition1 <- df$Day == i - 1   
    if (any(condition1)) {
       df$Count[df$Day == i] <- mean(df$Count[condition1]) + df$Count[df$Day == i]
       df$Count[condition1] <- 0
            }
         }
}

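The code looks correct and makes sense to me, but my output is wrong.

Can anyone help me?

@aichao's code works well.

If I want to consider the previous 30 days (that is, 30 days back, 29 days back, 28 days back, ..., 1 day back, and day 0), is there a quick way to do it instead of writing 30 if statements (conditions)?

Thanks again to @aichao for the help.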

Best answer

The following does what you want on the sample data you provided:

for (i in unique(df$Day)) {
  temp <- df$Count[df$Day == i]
  if (any(temp > 0)) {
    condition1 <- df$Day == i - 1
    condition1[which(df$Day == i - 1) < max(which(df$Day == i))] <- FALSE
    if (any(condition1)) {
      df$Count[df$Day == i] <- mean(df$Count[condition1]) + df$Count[df$Day == i]
      df$Count[condition1] <- 0
    }
  }
}
print(df[order(df$Count, decreasing = TRUE),])
##   Station   Day Count
##1    33012 12448     7
##2    35004 12448     7
##3    35008 12448     7
##4    37006 12448     7
##5    21009  4835     5
##6    24005  4835     5
##7    27001  4835     5
##11   29002 12446     4
##12   30001 12446     4
##13   31002 12446     4
##17   51001 12449     1
##18   51003  4832     1
##19   52004  4836     1
##8    25005 12447     0
##9    29001 12447     0
##10   29002 12447     0
##14   47007  4834     0
##15   49002  4834     0
##16   47004 12445     0

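The key requirement, which came out in the comments and is missing from your implementation, is that when determining the previous day and its count, only days further down the data frame (in rows) are considered. In other words, you are treating the rows of the data frame as if they were ordered in time, rather than treating the values in the Day column as the chronological order. So for df$Day = 12449 there is no previous day to consider, because all rows with df$Day = 12448 come before it. As a result, the count for df$Day = 12449 stays at 1 and, more importantly, the counts of all rows with df$Day = 12448 are not zeroed out after df$Day = 12449 is processed.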

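To achieve this, we need to filter condition1 further: every row where df$Day == i - 1 (the previous day) that lies above the last row where df$Day == i (the day of interest) is set to FALSE, using this line: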

condition1[which(df$Day == i - 1) < max(which(df$Day == i))] <- FALSE
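
One detail about the line above: the logical index inside the brackets has one element per previous-day row, not one per data frame row, so R recycles it across condition1. That gives the intended result when the previous-day rows are either all above or all below the day of interest, as in the sample data. A minimal sketch of a mask built from row numbers instead, which does not rely on recycling, is shown here. The helper name previous.day.below and the small data frame are hypothetical and only for illustration.

```r
# Hypothetical helper: TRUE for rows that belong to day (this.day - lag)
# AND sit below the last row of this.day in the data frame.
# Assumes this.day occurs at least once in df$Day.
previous.day.below <- function(df, this.day, lag = 1) {
  rows <- seq_len(nrow(df))
  last.row <- max(which(df$Day == this.day))
  df$Day == this.day - lag & rows > last.row
}

# Small made-up example in the same shape as the question's data
toy <- data.frame(Day   = c(12448, 12448, 12447, 12447, 12449),
                  Count = c(4, 4, 3, 3, 1))

mask.48 <- previous.day.below(toy, 12448)  # rows 3 and 4 are below day 12448
mask.49 <- previous.day.below(toy, 12449)  # day 12448 is above, so nothing
```

Here mask.48 selects the two rows of day 12447, while mask.49 is FALSE everywhere because all rows of day 12448 come before day 12449.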

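Note that this solution assumes that equal values of the Day column are grouped together as blocks of rows, as they are in the sample data. Otherwise, your for loop over unique(df$Day) needs to be rethought entirely and replaced with a loop over rows, so that it can keep track of the rows of the current day of interest in the data frame.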
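
The block assumption can be checked before running the loop. The following is a small sketch using base R's rle(); the function name is.blocked is made up for this example. A Day value that starts more than one run means the same day appears in separate places in the data frame.

```r
# Hypothetical check: TRUE if every distinct value in `days`
# forms a single contiguous run of rows.
is.blocked <- function(days) {
  runs <- rle(days)                  # consecutive runs of equal values
  anyDuplicated(runs$values) == 0    # no value starts more than one run
}

blocked   <- is.blocked(c(12448, 12448, 4835, 4835, 12447))
scattered <- is.blocked(c(12448, 4835, 12448))
```

If is.blocked(df$Day) is FALSE, the loop over unique(df$Day) should not be trusted for that data.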

Also, there is a small bug in your code:

if(length(temp > 0)) {

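The intent is to check whether any row on the day of interest has a Count greater than 0. However, comparison operators in R are vectorized, so temp > 0 returns a logical vector of the same length as its input temp. Therefore length(temp > 0) always returns a positive number unless temp itself has length 0 (i.e. is empty). To get what you intended, change the line to: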

if(any(temp > 0)) {
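
To make the difference concrete, here is a short sketch with a made-up temp vector whose counts have all been zeroed already:

```r
temp <- c(0, 0, 0)             # counts already zeroed by an earlier iteration

n.compared <- length(temp > 0) # length of the logical vector, not a test
has.positive <- any(temp > 0)  # TRUE only if at least one count is positive

# if (n.compared) would run its body, because any non-zero number is
# treated as TRUE; if (has.positive) would skip it.
```

With this input n.compared is 3, so the original condition passes, while has.positive is FALSE.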

Update: new requirement for multiple previous days

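The simplest way to handle the new requirement is to move the body of the if (any(temp > 0)) {...} block into a function, call it accumulate.mean.count, and apply that function over the set of previous days with sapply. The modification is: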

accumulate.mean.count <- function(this.day, lag) {
  condition1 <- df$Day == this.day - lag
  condition1[which(df$Day == this.day - lag) < max(which(df$Day == this.day))] <- FALSE
  if (any(condition1)) {
    df$Count[df$Day == this.day] <<- mean(df$Count[condition1]) + df$Count[df$Day == this.day]
    df$Count[condition1] <<- 0
  }
}

lags <- seq_len(30)

for (i in unique(df$Day)) {
  temp <- df$Count[df$Day == i]
  if (any(temp > 0)) {
    sapply(lags, accumulate.mean.count, this.day=i)
  }
}

print(df[order(df$Count, decreasing = TRUE),])

Notes:

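> lag is the number of days before the current day (i.e. the lag). lag = 1 means the previous day, lag = 2 means two days before, and so on. lags is the collection of these. Here, lags <- seq_len(30) is the sequence from 1 to 30 over which accumulate.mean.count is applied, which is what you asked for. For an excellent overview of the *apply family of R functions, see this. Note that lags does not have to be a sequence: it can be any collection of integers, such as c(1, 5, 10) for the previous day, 5 days before and 10 days before. It does not even have to be positive if you want to roll up future days, but it should not be zero.
> Because of the lexical scoping rule of R, setting df$Count inside the function accumulate.mean.count, where df is a variable outside the scope of accumulate.mean.count, requires <<- instead of <-. For an explanation, see this, and note the dangers of using <<- mentioned there.
> I do not have enough data to test lags <- seq_len(30), but with seq_len(1) I recover the original result, and with seq_len(2) I get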

##   Station   Day Count
##1    33012 12448    10
##2    35004 12448    10
##3    35008 12448    10
##4    37006 12448    10
##5    21009  4835     5
##6    24005  4835     5
##7    27001  4835     5
##16   47004 12445     1
##17   51001 12449     1
##18   51003  4832     1
##19   52004  4836     1
##8    25005 12447     0
##9    29001 12447     0
##10   29002 12447     0
##11   29002 12446     0
##12   30001 12446     0
##13   31002 12446     0
##14   47007  4834     0
##15   49002  4834     0

I think this is what you want.
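
Regarding the note on <<-: one way to avoid it is to pass the data frame into the function and return the modified copy. The following is only a sketch of that idea, not the code from the answer. The names accumulate.pure and roll.up are made up, the row filter is written with row numbers rather than the which() comparison, and the data frame is a compact rebuild of the sample data.

```r
# Hypothetical variant of accumulate.mean.count without `<<-`:
# takes df as an argument and returns the updated copy.
accumulate.pure <- function(df, this.day, lag) {
  rows <- seq_len(nrow(df))
  last.row <- max(which(df$Day == this.day))
  # rows of the lagged day that lie below the last row of this.day
  condition1 <- df$Day == this.day - lag & rows > last.row
  if (any(condition1)) {
    is.this.day <- df$Day == this.day
    df$Count[is.this.day] <- mean(df$Count[condition1]) + df$Count[is.this.day]
    df$Count[condition1] <- 0
  }
  df
}

# Hypothetical driver: same loop as in the answer, reassigning df each step.
roll.up <- function(df, lags) {
  for (i in unique(df$Day)) {
    if (any(df$Count[df$Day == i] > 0)) {
      for (lag in lags) df <- accumulate.pure(df, i, lag)
    }
  }
  df
}

# Sample data from the question, rebuilt compactly
block.sizes <- c(4, 3, 3, 3, 2, 1, 1, 1, 1)
df <- data.frame(
  Station = c(33012, 35004, 35008, 37006, 21009, 24005, 27001, 25005, 29001,
              29002, 29002, 30001, 31002, 47007, 49002, 47004, 51001, 51003,
              52004),
  Day   = rep(c(12448, 4835, 12447, 12446, 4834, 12445, 12449, 4832, 4836),
              times = block.sizes),
  Count = rep(c(4, 3, 3, 3, 2, 1, 1, 1, 1), times = block.sizes)
)

res1 <- roll.up(df, seq_len(1))
res2 <- roll.up(df, seq_len(2))
print(res2[order(res2$Count, decreasing = TRUE), ])
```

On the sample data, res1 and res2 should carry the same counts as the two outputs shown in the answer, and the original df is left unchanged because nothing is assigned outside the functions.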