Warm tip: This article is reproduced from stackoverflow.com, please click
grouping id merge r reduce

Merge rows with the same ID but with overlapping variables

发布于 2020-04-15 10:08:25

I have data in r that has over 6000 observations and 96 variables.

The data relates to groups of individuals and their activities etc. If a group returned the Group ID number was recorded again and a new observation was made. I need to merge the rows by ID so that the # of individuals take the highest number recorded, but the activities etc are a combination of both observations.

The data contains, #of individuals, activities, impacts, time of arrival etc. The issue is that some of the observations were split across 2 lines, so there may be activities which were recorded for the same group in another line. The Group ID for both observations is the same, but one may have the #of individuals recorded and some activity records or impacts, but the second may be incomplete and only have Group ID and then Impacts (which are additional to those in the 1st record). The #of individuals in the group never changes, so I need some way to combine them so that activities are additive, but #visitors takes the value that is highest, time of arrival needs to be the earliest recorded and time of departure needs to be the later of the 2 observations.

Does anyone know how to merge observations based on Group ID but vary the merging protocol based on the variable.

enter image description here

Questioner
Andrew Torsney
Viewed
51
Em Laskey 2020-02-04 21:58

I'm not sure if this actually is what you want, but to combine rows of a data frame based on multiple conditions you can use the dplyr package and its summarise()function. I generated some data to use in R directly, you would have to modify the code according to your needs.

# generate data
ID<-rep(1:20,2)
visitors<-sample(1:50, 40, replace=TRUE)
impact<-sample(rep(c("a", "b", "c", "d", "e"), 8))
arrival<-sample(rep(8:15, 5))
departure <- sample(rep(16:23, 5))

df<-data.frame(ID, visitors, impact, arrival, departure)
df$impact<-as.character(df$impact)

# summarise rows with identical ID
df_summary <- df %>%
  group_by(ID) %>%
  summarise(visitors = max(visitors), arrival = min(arrival), 
            departure = max(departure), impact = paste0(impact, collapse =", "))

Hope this helps!