Warm tip: This article is reproduced from serverfault.com, please click

Effcient shifting in R data.table with missing values

发布于 2020-12-01 15:16:14

I have a data.table with the following three variables

  1. receptionist = id of the reception person
  2. week.no = this week's number (1 means first week)
  3. absent.thisweek = number of days absent this week

Note that we only have data for the weeks receptionist has come for work, if a week is missing means that person was absent that week

For instance, we have two receptionsit 1 and 2 with this work pattern:

dt <- data.table(receptionist =  c(1,1,1,2,2,2,2), week.no = c(1,3,4,5,8,10,11), absent.thisweek = c(1,2,3,4,5,6,6))
>dt receptionist week.no absent.thisweek
1:            1       1               1
2:            1       3               2
3:            1       4               3
4:            2       5               4
5:            2       8               5
6:            2      10               6
7:            2      11               6

Step 1: I need to find out the number of days a receptionist is absent next week, for this I shift the absent.thisweek by one (week) if that information is available in the data, note for the weeks recepetionist has not come, this value will be NA, my dataset has 1 million rows and this is the most inefficient part.

dt[order(receptionist, week.no), absent.nextweek := dt$absent.thisweek[dt$receptionist==receptionist & dt$week.no==week.no+1], by = .(receptionist, week.no)]
>dt receptionist week.no absent.thisweek absent.nextweek
1:            1       1               1              NA
2:            1       3               2               3
3:            1       4               3              NA
4:            2       5               4              NA
5:            2       8               5              NA
6:            2      10               6               6
7:            2      11               6              NA

Step 2: If data is missing it has to be 7 days of absent. For the last week, we will have the absent.nextweek = NA

dt[is.na(absent.nextweek) & week.no != max(week.no, na.rm=T), absent.nextweek:=7]
   
>dt receptionist week.no absent.thisweek absent.nextweek
1:            1       1               1               7
2:            1       3               2               3
3:            1       4               3               7
4:            2       5               4               7
5:            2       8               5               7
6:            2      10               6               6
7:            2      11               6              NA

I am facing two problems: -In Step 1, it is very inefficient and taking long time (about an hr) to run -In Step 2, row 3 should have NA in the last column, which is not the case

Any suggestion to improve the efficiency (preferably data.table) will be very helpful as well as pointing out the mistake in step 2.

The correct answer would be:

>dt receptionist week.no absent.thisweek absent.nextweek
1:            1       1               1               7
2:            1       3               2               3
3:            1       4               3               NA
4:            2       5               4               7
5:            2       8               5               7
6:            2      10               6               6
7:            2      11               6              NA
Questioner
vivek
Viewed
0
sindri_baldur 2020-12-02 16:54:39

Thinking about the problem from scratch, maybe you could do this directly with fifelse() and shift()?

# We assume data is ordered by week. Otherwise you can run 
# setorder(dt, receptionist, week.no)

dt[, 
   absent.nextweek := 
     fifelse(week.no+1L == shift(week.no, -1L),  shift(absent.thisweek, -1L), 7),
   by = receptionist]


#    receptionist week.no absent.thisweek absent.nextweek
# 1:            1       1               1               7
# 2:            1       3               2               3
# 3:            1       4               3              NA
# 4:            2       5               4               7
# 5:            2       8               5               7
# 6:            2      10               6               6
# 7:            2      11               6              NA