Effcient shifting in R data.table with missing values

发布于 2020-12-01 15:16:14

I have a data.table with the following three variables

receptionist = id of the reception person
week.no = this week's number (1 means first week)
absent.thisweek = number of days absent this week

Note that we only have data for the weeks receptionist has come for work, if a week is missing means that person was absent that week

For instance, we have two receptionsit 1 and 2 with this work pattern:

dt <- data.table(receptionist =  c(1,1,1,2,2,2,2), week.no = c(1,3,4,5,8,10,11), absent.thisweek = c(1,2,3,4,5,6,6))

>dt receptionist week.no absent.thisweek
1:            1       1               1
2:            1       3               2
3:            1       4               3
4:            2       5               4
5:            2       8               5
6:            2      10               6
7:            2      11               6

Step 1: I need to find out the number of days a receptionist is absent next week, for this I shift the absent.thisweek by one (week) if that information is available in the data, note for the weeks recepetionist has not come, this value will be NA, my dataset has 1 million rows and this is the most inefficient part.

dt[order(receptionist, week.no), absent.nextweek := dt$absent.thisweek[dt$receptionist==receptionist & dt$week.no==week.no+1], by = .(receptionist, week.no)]

>dt receptionist week.no absent.thisweek absent.nextweek
1:            1       1               1              NA
2:            1       3               2               3
3:            1       4               3              NA
4:            2       5               4              NA
5:            2       8               5              NA
6:            2      10               6               6
7:            2      11               6              NA

Step 2: If data is missing it has to be 7 days of absent. For the last week, we will have the absent.nextweek = NA

dt[is.na(absent.nextweek) & week.no != max(week.no, na.rm=T), absent.nextweek:=7]
   
>dt receptionist week.no absent.thisweek absent.nextweek
1:            1       1               1               7
2:            1       3               2               3
3:            1       4               3               7
4:            2       5               4               7
5:            2       8               5               7
6:            2      10               6               6
7:            2      11               6              NA

I am facing two problems: -In Step 1, it is very inefficient and taking long time (about an hr) to run -In Step 2, row 3 should have NA in the last column, which is not the case

Any suggestion to improve the efficiency (preferably data.table) will be very helpful as well as pointing out the mistake in step 2.

The correct answer would be:

>dt receptionist week.no absent.thisweek absent.nextweek
1:            1       1               1               7
2:            1       3               2               3
3:            1       4               3               NA
4:            2       5               4               7
5:            2       8               5               7
6:            2      10               6               6
7:            2      11               6              NA

Questioner

vivek

Viewed

Original

# We assume data is ordered by week. Otherwise you can run # setorder(dt, receptionist, week.no) dt[, absent.nextweek := fifelse(week.no+1L == shift(week.no, -1L), shift(absent.thisweek, -1L), 7), by = receptionist] # receptionist week.no absent.thisweek absent.nextweek # 1: 1 1 1 7 # 2: 1 3 2 3 # 3: 1 4 3 NA # 4: 2 5 4 7 # 5: 2 8 5 7 # 6: 2 10 6 6 # 7: 2 11 6 NA

vivek 2020-12-03 03:24:10

I checked the solution it works perfectly as well as efficiently. One more question that I wrote -- why is the max() not working and I am getting 7 in the column 3 using my code? @sindri_baldur

sindri_baldur 2020-12-03 08:01:58

I seem it is because you are referring to the max overall not by group.

vivek 2020-12-04 02:51:53

Thank you. I have accepted the answer. Btw, writing the code by = .(receptionist) does not change the outcome with max. @sindri_baldur

Effcient shifting in R data.table with missing values

热门帖子

热门github