How to get around slow groupby for a sparse matrix?

发布于 2020-03-29 21:02:46

I have a large matrix (~200 million rows) describing a list of actions that occurred every day (there are ~10000 possible actions). My final goal is to create a co-occurrence matrix showing which actions happen during the same days.

Here is an example dataset:

data = {'date':   ['01', '01', '01', '02','02','03'],
        'action': [100, 101, 989855552, 100, 989855552, 777]}
df = pd.DataFrame(data, columns = ['date','action'])

I tried to create a sparse matrix with pd.get_dummies, but unravelling the matrix and using groupby on it is extremely slow, taking 6 minutes for just 5000 rows.

# Create a sparse matrix of dummies
dum = pd.get_dummies(df['action'], sparse = True)
df = df.drop(['action'], axis = 1)
df = pd.concat([df, dum], axis = 1)

# Use groupby to get a single row for each date, showing whether each action occurred.
# The groupby command here is the bottleneck.
cols = list(df.columns)
del cols[0]
df = df.groupby('date')[cols].max()

# Create a co-occurrence matrix by using dot-product of sparse matrices
cooc = df.T.dot(df)

I've also tried:

getting the dummies in non-sparse format;
using groupby for aggregation;
going to sparse format before matrix multiplication.

But I fail in step 1, since there is not enough RAM to create such a large matrix.

I would greatly appreciate your help.

Questioner

Dudelstein

Viewed

Chinese

Original

## Get unique values for date and action date_c = CategoricalDtype(sorted(df.date.unique()), ordered=True) action_c = CategoricalDtype(sorted(df.action.unique()), ordered=True) ## Add an auxiliary variable df['count'] = 1 ## Define a sparse matrix row = df.date.astype(date_c).cat.codes col = df.action.astype(action_c).cat.codes sparse_matrix = csr_matrix((df['count'], (row, col)), shape=(date_c.categories.size, action_c.categories.size)) ## Compute dot product with sparse matrix cooc_sparse = sparse_matrix.T.dot(sparse_matrix) ## Unravel co-occurrence matrix into dense shape cooc = pd.DataFrame(cooc_sparse.todense(), index = action_c.categories, columns = action_c.categories)

How to get around slow groupby for a sparse matrix?

Related issues