I need to fill rows by default value =1
in DataFrame based on information in Start
and Finish
columns.
So, limit for filling separated columns in row is defined based on ['Start', 'Finish']
.
DataFrame, df1
is:
ID Car Jan17 Jun18 Dec18 Apr19 Start Finish
0 Nissan 0.0 1.7 3.7 0.0 Jun18 Dec18
1 Porsche 10.0 0.0 2.8 3.5 Jan17 Apr19
2 Golf 0.0 1.7 3.0 2.0 Jun18 Apr19
3 Toyota 1.0 0.0 3.0 5.2 Jan17 Apr19
4 Mazda 0.0 0.0 3.0 4.2 Dec18 Apr19
5 Mercedes 0.0 0.0 0.0 7.2 Apr19 Apr19
6 Passat 0.0 3.0 0.0 0.0 Jun18 Jun18
For instance, if there is a row #0:
Start = Jun18
and Finish = Dec18
.
Values in row#0 should be filled by 1
for columns, started from Jun18
till to Dec18
.
I tried to use numpy.sign()
function, but there is wrong result in case, if 0.0
is between of two non-zero values.
Expected result is df2
:
ID Car Jan17 Jun18 Dec18 Apr19 Start Finish
0 Nissan 0.0 1.0 1.0 0.0 Jun18 Dec18
1 Porsche 1.0 1.0 1.0 1.0 Jan17 Apr19
2 Golf 0.0 1.0 1.0 1.0 Jun18 Apr19
3 Toyota 1.0 1.0 1.0 1.0 Jan17 Apr19
4 Mazda 0.0 0.0 1.0 1.0 Dec18 Apr19
5 Mercedes 0.0 0.0 0.0 1.0 Apr19 Apr19
6 Passat 0.0 1.0 0.0 0.0 Jun18 Jun18
get_dummies
+ interpolate
This requires your columns to be sorted in time order, and for Start and Finish to ideally always exist in the column names.
df = df.set_index(['ID', 'Car', 'Start', 'Finish'])
s1 = (pd.get_dummies(df.index.get_level_values('Start'))
.reindex(df.columns, axis=1)
.replace(0, np.NaN))
s2 = (pd.get_dummies(df.index.get_level_values('Finish'))
.reindex(df.columns, axis=1)
.replace(0, np.NaN))
res = s1.combine_first(s2).interpolate(axis=1, limit_area='inside').fillna(0, downcast='infer')
res.index = df.index
res = res.reset_index()
res
: ID Car Start Finish Jan17 Jun18 Dec18 Apr19
0 0 Nissan Jun18 Dec18 0 1 1 0
1 1 Porsche Jan17 Apr19 1 1 1 1
2 2 Golf Jun18 Apr19 0 1 1 1
3 3 Toyota Jan17 Apr19 1 1 1 1
4 4 Mazda Dec18 Apr19 0 0 1 1
5 5 Mercedes Apr19 Apr19 0 0 0 1
6 6 Passat Jun18 Jun18 0 1 0 0
In the case where Start
and Finish
were already derived from the data itself (seems to be the first and last non-zero columns), you can skip all of the dummies and use where
instead on the original DataFrame.
df = df.set_index(['ID', 'Car', 'Start', 'Finish'])
res = (df.where(df.ne(0))
.clip(1,1)
.interpolate(axis=1, limit_area='inside')
.fillna(0, downcast='infer')
.reset_index())
Now, that is a cool solution. +1 I wouldn't used interpolate but pd.date_range. Great solution. I think this is better that my first thoughts.
Yeah, I think the "safer" alternative is to convert everything to
datetime
that way you can deal with missing dates properly. But it seems likeStart
andFinish
may be derived from the data to begin with, in which case this will work, albeit slowly because of the interpolate.@ALollz, yes, you are right I need to fill between the start and end dates and also keep anything that was non-zero to
1
. Now I have the problem that whenStart
andFinish
are the same, code returns interpolation for theStart
till to the last existing column, but it should be replace only one value to 1 and stop. For example, I added line#6 in the question.. And in this case3.0
should be replace to 1 only forJun18
column and don't continue toApr19
column. Thanks@Cindy good catch. Please see the update. We need to add the
limit_area='inside'
argument to the interpolation@ALollz, yes, after update it works as expected. Thanks a lot!