How to vectorize converting a non-overlapping dataframe into an overlapping, shifted dataframe?

Posted on 2020-12-06 10:10:14

I would like to transform a regular dataframe into a multi-index dataframe of overlapping batches, each shifted by one row from the previous one.

For example, the input dataframe is created by this sample code:

import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.arange(0, 12).reshape(-1, 2), columns=['d1', 'd2'], dtype=float)
df.index.name = 'idx'
print(df)

Output:

       d1    d2
idx            
0     0.0   1.0
1     2.0   3.0
2     4.0   5.0
3     6.0   7.0
4     8.0   9.0
5    10.0  11.0

The output I want: split the rows into overlapping batches, shifting by one row each time, and add a batchid index level that labels each shift. For example, with batchsize=4:

               d1    d2
idx batchid            
0   0         0.0   1.0
1   0         2.0   3.0
2   0         4.0   5.0
3   0         6.0   7.0
1   1         2.0   3.0
2   1         4.0   5.0
3   1         6.0   7.0
4   1         8.0   9.0
2   2         4.0   5.0
3   2         6.0   7.0
4   2         8.0   9.0
5   2        10.0  11.0

My work so far: I can make it work by iterating over the batches and concatenating them, but this takes a lot of time.

batchsize = 4
ds, ids = [], []
idx = df.index.values
# Collect the index labels of each overlapping window.
for bi in range(len(df) - batchsize + 1):
    ids.append(idx[bi:bi + batchsize])
# Slice out each window and tag it with its batch number.
for k, batch_idx in enumerate(ids):
    di = df.loc[batch_idx, :].copy()
    di['batchid'] = k
    ds.append(di)
res = pd.concat(ds).fillna(0)
res.set_index('batchid', inplace=True, append=True)

Is there a way to vectorize and accelerate this process?

Thanks.

Questioner: Patrick Lee
Answer by piterbarg, 2020-12-06 18:33:24

First we create a 'mask' that tells us which rows go into which batch id:

nrows = len(df)
batchsize = 4
mask_columns = {i:np.pad([1]*batchsize,(i,nrows-batchsize-i)) for i in range(nrows-batchsize+1)}
mask_df = pd.DataFrame(mask_columns)
df = df.join(mask_df)

This adds one 0/1 column per batch to df:


  idx    d1    d2    0    1    2
-----  ----  ----  ---  ---  ---
    0     0     1    1    0    0
    1     2     3    1    1    0
    2     4     5    1    1    1
    3     6     7    1    1    1
    4     8     9    0    1    1
    5    10    11    0    0    1
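
The same mask can also be built in one step with NumPy broadcasting instead of padding one column at a time. This is only a minimal sketch, assuming the same nrows and batchsize as above; the names rows, starts and nbatches are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

nrows, batchsize = 6, 4               # same sizes as the example above
nbatches = nrows - batchsize + 1      # number of overlapping batches

rows = np.arange(nrows)[:, None]      # shape (nrows, 1): row positions
starts = np.arange(nbatches)[None, :] # shape (1, nbatches): first row of each batch

# Row r belongs to batch b when b <= r < b + batchsize.
mask = ((rows >= starts) & (rows < starts + batchsize)).astype(int)
mask_df = pd.DataFrame(mask)          # columns 0..nbatches-1, one per batch
```

mask_df can then be joined to df exactly as before.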

This now looks like a dataframe with 'dummy' (indicator) columns, so we need to 'reverse' the dummies:

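df2 = df.set_index(['d1','d2'], drop=True)
# Keep only the cells equal to 1, then stack the batch columns into rows.
# dropna() guards against pandas versions whose stack() does not drop NaN itself.
df2[df2==1].stack().dropna().reset_index().drop(columns=0).sort_values('level_2').rename(columns = {'level_2':'batchid'})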

This produces:

      d1    d2    batchid
--  ----  ----  ---------
 0     0     1          0
 1     2     3          0
 3     4     5          0
 6     6     7          0
 2     2     3          1
 4     4     5          1
 7     6     7          1
 9     8     9          1
 5     4     5          2
 8     6     7          2
10     8     9          2
11    10    11          2
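
Note that the result above no longer carries the original idx level. If the (idx, batchid) multi-index from the question is needed, one alternative is to skip the mask and compute the row positions of every batch directly. This is a different technique from the mask approach (positional index arithmetic), shown here as a hedged sketch; pos and nbatches are illustrative names, and the sample dataframe is rebuilt so the block runs on its own.

```python
import numpy as np
import pandas as pd

# Sample dataframe from the question.
df = pd.DataFrame(np.arange(12).reshape(-1, 2), columns=['d1', 'd2'], dtype=float)
df.index.name = 'idx'

batchsize = 4
nbatches = len(df) - batchsize + 1

# Batch b takes positional rows b, b+1, ..., b+batchsize-1.
# Broadcasting gives an (nbatches, batchsize) grid, flattened batch by batch.
pos = (np.arange(nbatches)[:, None] + np.arange(batchsize)[None, :]).ravel()
batchid = np.repeat(np.arange(nbatches), batchsize)

res = df.iloc[pos].copy()                    # one fancy-indexing call, no Python loop
res['batchid'] = batchid
res = res.set_index('batchid', append=True)  # index levels: idx, batchid
```

Because the rows are selected with a single positional lookup, the cost of the Python-level loop and the repeated concat is avoided; the trade-off is that the result holds nbatches * batchsize rows in memory at once.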