Warm tip: This article is reproduced from serverfault.com, please click

Is there a way to have fast boolean operations on scipy.sparse matrices?

发布于 2020-12-01 14:19:15

I have to solve a XOR operation on very high dimensional (~30'000) vectors to compute the Hamming distance. For example, I need to compute the XOR operation between one vector full of False with 16 sparsily located True with each row of a 50'000x30'000 matrix.

As of now, the quickest way I found is to not use scipy.sparse but the simple ^ operation on each row.

This:

l1distances=(self.hashes[index,:]^self.hashes[all_points,:]).sum(axis=1)

Happens to be ten times faster than this:

sparse_hashes = scipy.sparse.csr_matrix((self.hashes)).astype('bool')
for i in range(all_points.shape[0]):
    l1distances[0,i]=(sparse_hashes[index]-sparse_hashes[all_points[i]]).sum()

But ten times faster is still quite slow since, theoretically, having a sparse vector with 16 activations should make the computation the same as having a 16 dimension one.

Is there any solution? I'm really struggling here, thanks for the help!

Questioner
olivier clerc
Viewed
0
CJR 2020-12-02 05:09:07

If your vector is highly sparse (like 16/30000) I'd probably just skip fiddling with sparse xor entirely.

from scipy import sparse
import numpy as np
import numpy.testing as npt

matrix_1 = sparse.random(10000, 100, density=0.1, format='csc')
matrix_1.data = np.ones(matrix_1.data.shape, dtype=bool)

matrix_2 = sparse.random(1, 100, density=0.1, format='csc', dtype=bool)
vec = matrix_2.A.flatten()

# Pull out the part of the sparse matrix that matches the vector and sum it after xor
matrix_xor = (matrix_1[:, vec].A ^ np.ones(vec.sum(), dtype=bool)[np.newaxis, :]).sum(axis=1)

# Sum the part that doesnt match the vector and add it
l1distances = matrix_1[:, ~vec].sum(axis=1).A.flatten() + matrix_xor

# Double check that I can do basic math
npt.assert_array_equal(l1distances, (matrix_1.A ^ vec[np.newaxis, :]).sum(axis=1))