I have the data frame below:
DF <- read.table(text = "
Date ID
2018-04-01 K-1
2018-04-01 K-1
2018-04-01 K-8
2018-04-02 K-2
2018-04-02 K-2
2018-04-03 K-2
2018-04-03 K-2
2018-04-03 K-2
2018-04-04 K-3
2018-05-01 K-5
2018-05-01 K-5
2018-05-02 K-6
2018-05-02 K-7", header = TRUE, stringsAsFactors = FALSE)
Using the data frame above, I want to compute the following metrics:
Date Unique_count Duplicate_Count Overall_Duplicate
2018-04-01 2 1 0
2018-04-02 1 1 0
2018-04-03 0 0 3
2018-04-04 1 0 0
2018-05-01 1 1 0
2018-05-02 2 0 0
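Where:
> Unique_count - the number of distinct IDs created on a given date; an ID must not match any ID from an earlier date.
> Duplicate_count - the number of additional occurrences of the same ID on a given date (e.g. if K-1 appears twice, Duplicate_count should be 1); again, the ID must not match any ID from an earlier date.
> Overall_Duplicate - the number of rows on a given date whose ID was already generated on an earlier date.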
I have the code below, but I am not sure how to get Overall_Duplicate:
library(dplyr)
DF2 <- DF %>%
  group_by(Date) %>%
  summarise(Unique_Count = n_distinct(ID),
            Duplicate_Count = sum(table(ID) > 1))
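Best answer: If you first group by ID and find the first date on which each ID occurs, you can set every later occurrence of that ID (on any date after its first) to NA, and then a few counts give what you need.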
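DF %>%
  group_by(ID) %>%
  # first date on which each ID appears (yyyy-mm-dd strings sort chronologically)
  mutate(first_time = min(Date)) %>%
  ungroup() %>%
  # keep the ID only on its first date; occurrences on later dates become NA
  mutate(ID = ifelse(Date == first_time, ID, NA)) %>%
  group_by(Date) %>%
  summarise(Unique_Count = n_distinct(ID, na.rm = TRUE),  # new IDs on this date
            Overall_Duplicate = sum(is.na(ID)),           # rows repeating an earlier ID
            Duplicate_Count = n() - Unique_Count - Overall_Duplicate)  # extra rows of new IDs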
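As a cross-check, the same logic can be sketched in base R without dplyr. This is only an illustration, not part of the original answer. The helper names (first_time, is_new, count_rows, result) are made up for this example, and it assumes Date stays a character column in yyyy-mm-dd form so that min() and sort() order it chronologically.

```r
DF <- read.table(text = "
Date ID
2018-04-01 K-1
2018-04-01 K-1
2018-04-01 K-8
2018-04-02 K-2
2018-04-02 K-2
2018-04-03 K-2
2018-04-03 K-2
2018-04-03 K-2
2018-04-04 K-3
2018-05-01 K-5
2018-05-01 K-5
2018-05-02 K-6
2018-05-02 K-7", header = TRUE, stringsAsFactors = FALSE)

# First date on which each ID appears, repeated for every row of that ID.
first_time <- ave(DF$Date, DF$ID, FUN = min)

# TRUE for rows that fall on the ID's first date, FALSE for later reappearances.
is_new <- DF$Date == first_time

dates <- sort(unique(DF$Date))

# Hypothetical helper: number of rows on date d that satisfy a logical mask.
count_rows <- function(d, mask) sum(DF$Date == d & mask)

unique_count <- vapply(dates, function(d) length(unique(DF$ID[DF$Date == d & is_new])),
                       integer(1), USE.NAMES = FALSE)
overall_dup  <- vapply(dates, count_rows, integer(1), mask = !is_new, USE.NAMES = FALSE)
total_rows   <- vapply(dates, count_rows, integer(1), mask = TRUE, USE.NAMES = FALSE)

result <- data.frame(
  Date              = dates,
  Unique_Count      = unique_count,
  # Rows left over once distinct new IDs and repeats of earlier IDs are removed.
  Duplicate_Count   = total_rows - unique_count - overall_dup,
  Overall_Duplicate = overall_dup
)
print(result)
```

If the rows match the desired table in the question, the NA-masking approach and this row-by-row count agree for this data set.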