Warm tip: This article is reproduced from stackoverflow.com, please click
pandas python sparse-matrix

How to get around slow groupby for a sparse matrix?

发布于 2020-03-29 21:02:46

I have a large matrix (~200 million rows) describing a list of actions that occurred every day (there are ~10000 possible actions). My final goal is to create a co-occurrence matrix showing which actions happen during the same days.

Here is an example dataset:

data = {'date':   ['01', '01', '01', '02','02','03'],
        'action': [100, 101, 989855552, 100, 989855552, 777]}
df = pd.DataFrame(data, columns = ['date','action'])

I tried to create a sparse matrix with pd.get_dummies, but unravelling the matrix and using groupby on it is extremely slow, taking 6 minutes for just 5000 rows.

# Create a sparse matrix of dummies
dum = pd.get_dummies(df['action'], sparse = True)
df = df.drop(['action'], axis = 1)
df = pd.concat([df, dum], axis = 1)

# Use groupby to get a single row for each date, showing whether each action occurred.
# The groupby command here is the bottleneck.
cols = list(df.columns)
del cols[0]
df = df.groupby('date')[cols].max()

# Create a co-occurrence matrix by using dot-product of sparse matrices
cooc = df.T.dot(df)

I've also tried:

  1. getting the dummies in non-sparse format;
  2. using groupby for aggregation;
  3. going to sparse format before matrix multiplication.

But I fail in step 1, since there is not enough RAM to create such a large matrix.

I would greatly appreciate your help.

Questioner
Dudelstein
Viewed
64
Dudelstein 2020-01-31 18:51

I came up with an answer using only sparse matrices based on this post. The code is fast, taking about 10 seconds for 10 million rows (my previous code took 6 minutes for 5000 rows and was not scalable).

The time and memory savings come from working with sparse matrices until the very last step when it is necessary to unravel the (already small) co-occurrence matrix before export.

## Get unique values for date and action
date_c = CategoricalDtype(sorted(df.date.unique()), ordered=True)
action_c = CategoricalDtype(sorted(df.action.unique()), ordered=True)

## Add an auxiliary variable
df['count'] = 1

## Define a sparse matrix
row = df.date.astype(date_c).cat.codes
col = df.action.astype(action_c).cat.codes
sparse_matrix = csr_matrix((df['count'], (row, col)),
                shape=(date_c.categories.size, action_c.categories.size))

## Compute dot product with sparse matrix
cooc_sparse = sparse_matrix.T.dot(sparse_matrix)

## Unravel co-occurrence matrix into dense shape
cooc = pd.DataFrame(cooc_sparse.todense(), 
       index = action_c.categories, columns = action_c.categories)