Fill rows by default values if limit for data columns is defined

ALollz 2019-07-04 23:22

`get_dummies` + `interpolate`

This requires your columns to be sorted in chronological order, and ideally every Start and Finish value should appear among the column names.

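import numpy as np
import pandas as pd

df = df.set_index(['ID', 'Car', 'Start', 'Finish'])

# Mark the Start column of each row with 1 and everything else with NaN.
# dtype=float keeps the dummies numeric so the zeros can be replaced.
s1 = (pd.get_dummies(df.index.get_level_values('Start'), dtype=float)
        .reindex(df.columns, axis=1)
        .replace(0, np.nan))
# Same for the Finish column.
s2 = (pd.get_dummies(df.index.get_level_values('Finish'), dtype=float)
        .reindex(df.columns, axis=1)
        .replace(0, np.nan))

# Fill only the gaps between the two markers, then zero out the rest.
res = (s1.combine_first(s2)
         .interpolate(axis=1, limit_area='inside')
         .fillna(0)
         .astype(int))
res.index = df.index
res = res.reset_index()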

Output `res`:

   ID       Car  Start Finish  Jan17  Jun18  Dec18  Apr19
0   0    Nissan  Jun18  Dec18      0      1      1      0
1   1   Porsche  Jan17  Apr19      1      1      1      1
2   2      Golf  Jun18  Apr19      0      1      1      1
3   3    Toyota  Jan17  Apr19      1      1      1      1
4   4     Mazda  Dec18  Apr19      0      0      1      1
5   5  Mercedes  Apr19  Apr19      0      0      0      1
6   6    Passat  Jun18  Jun18      0      1      0      0
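
To see why `limit_area='inside'` matters, here is a minimal sketch on two hand-made rows (the Series below are made up for illustration, not taken from the question). With the default settings, `interpolate` also carries the last valid value forward past the final marker, which is what produces a run of 1s to the last column when Start and Finish coincide.

```python
import numpy as np
import pandas as pd

cols = ["Jan17", "Jun18", "Dec18", "Apr19"]

# Hypothetical row where Start and Finish differ: markers at Jan17 and Dec18.
spanned = pd.Series([1.0, np.nan, 1.0, np.nan], index=cols)
# Hypothetical row where Start == Finish: a single marker at Jun18.
single = pd.Series([np.nan, 1.0, np.nan, np.nan], index=cols)

# limit_area='inside' fills only NaNs that sit between two valid values.
spanned_inside = spanned.interpolate(limit_area="inside").fillna(0).astype(int)
single_inside = single.interpolate(limit_area="inside").fillna(0).astype(int)

# Without limit_area, trailing NaNs are filled forward from the last valid value.
single_default = single.interpolate().fillna(0).astype(int)

print(spanned_inside.tolist())  # [1, 1, 1, 0]
print(single_inside.tolist())   # [0, 1, 0, 0]
print(single_default.tolist())  # [0, 1, 1, 1]
```

So a row with a single marker stays a single 1 only when the fill is restricted to the inside area.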

If Start and Finish were already derived from the data itself (they appear to be the first and last non-zero columns), you can skip the dummies entirely and use `where` on the original DataFrame instead.

df = df.set_index(['ID', 'Car', 'Start', 'Finish'])
res = (df.where(df.ne(0))                          # zeros -> NaN
         .clip(1, 1)                               # any non-zero value -> 1
         .interpolate(axis=1, limit_area='inside') # fill gaps between the first and last 1
         .fillna(0)
         .astype(int)
         .reset_index())
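
A small worked sketch of the `where` variant, assuming a made-up input frame (the values 2.0, 5.0 and 3.0 are hypothetical; only the column layout follows the question):

```python
import pandas as pd

month_cols = ["Jan17", "Jun18", "Dec18", "Apr19"]

# Hypothetical input: non-zero values mark the first and last active month.
df = pd.DataFrame({
    "ID": [2, 6],
    "Car": ["Golf", "Passat"],
    "Start": ["Jun18", "Jun18"],
    "Finish": ["Apr19", "Jun18"],
    "Jan17": [0.0, 0.0],
    "Jun18": [2.0, 3.0],
    "Dec18": [0.0, 0.0],
    "Apr19": [5.0, 0.0],
}).set_index(["ID", "Car", "Start", "Finish"])

masked = df.where(df.ne(0))   # zeros become NaN, non-zero values are kept
ones = masked.clip(1, 1)      # every remaining value becomes 1, NaN stays NaN
filled = ones.interpolate(axis=1, limit_area="inside")  # fill interior gaps only

res = filled.fillna(0).astype(int).reset_index()
print(res[month_cols].values.tolist())  # [[0, 1, 1, 1], [0, 1, 0, 0]]
```

The Golf row gets Dec18 filled because it sits between two non-zero months, while the Passat row keeps its single 1.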

Scott Boston 2019-07-03 23:28:07

Now, that is a cool solution. +1. I wouldn't have used interpolate, but pd.date_range. Great solution. I think this is better than my first thoughts.

ALollz 2019-07-03 23:28:48

Yeah, I think the "safer" alternative is to convert everything to datetime; that way you can deal with missing dates properly. But it seems like Start and Finish may be derived from the data to begin with, in which case this will work, albeit slowly because of the interpolate.
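
For reference, a rough sketch of what the datetime route could look like. This is not code from the answer: the sample frame, the `%b%y` format for labels like `Jun18`, and the names `month_cols`, `flags` and `res_dates` are all assumptions made for illustration. It compares dates directly, so it does not rely on every Start and Finish appearing as a column, and it avoids `interpolate`.

```python
import pandas as pd

month_cols = ["Jan17", "Jun18", "Dec18", "Apr19"]

# Hypothetical sample rows using Start/Finish labels from the output above.
df = pd.DataFrame({
    "ID": [0, 1, 6],
    "Car": ["Nissan", "Porsche", "Passat"],
    "Start": ["Jun18", "Jan17", "Jun18"],
    "Finish": ["Dec18", "Apr19", "Jun18"],
})

# Parse the labels as dates (assumes English month abbreviations, e.g. Jun18).
col_dates = pd.to_datetime(month_cols, format="%b%y").to_numpy()
start = pd.to_datetime(df["Start"], format="%b%y").to_numpy()[:, None]
finish = pd.to_datetime(df["Finish"], format="%b%y").to_numpy()[:, None]

# Broadcast rows against columns: 1 where Start <= column date <= Finish.
mask = (col_dates >= start) & (col_dates <= finish)
flags = pd.DataFrame(mask.astype(int), index=df.index, columns=month_cols)

res_dates = pd.concat([df, flags], axis=1)
print(flags.values.tolist())  # [[0, 1, 1, 0], [1, 1, 1, 1], [0, 1, 0, 0]]
```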

Cindy 2019-07-04 23:13:25

@ALollz, yes, you are right: I need to fill between the start and end dates and also set anything that was non-zero to 1. Now I have the problem that when Start and Finish are the same, the code interpolates from Start all the way to the last existing column, but it should replace only one value with 1 and stop. For example, I added row #6 to the question. In this case 3.0 should be replaced with 1 only in the Jun18 column, without continuing to the Apr19 column. Thanks