Hello, and many thanks for this incredible framework. I've been working on rebuilding the core of a scientific software project to use Jax instead of PyTorch, in part to take advantage of the (IMO) more intuitive implicit differentiation. However, I'm having some difficulty figuring out the best way to vectorise a sparse-sparse matrix multiplication (e.g., bcoo_dot_general). To provide some context, I have 2 BCOO matrices that I'd like to multiply, each of which has 2 sparse dimensions and might have one or more dense dimensions. One example of a situation where this kind of matrix structure arises is batches of graphs.
The problem I'm having is that simply applying

vmap(partial(
    jax.experimental.sparse.bcoo_dot_general,
    dimension_numbers=(((-3,), (-3,)), ((), ()))
), in_axes=(-1, -1))

to a sparse matrix results in an error. Since I wasn't able to get it working with vmap, I put together a naive implementation of my own:

import jax
import numpy as np
import jax.numpy as jnp
from jax.experimental.sparse import BCOO
def random_sparse(key, shape, density=0.1):
    """
    Generate a random sparse matrix.
    """
    n = jnp.prod(jnp.array(shape))
    nse = int(density * n)
    k1, k2 = jax.random.split(key)
    indices = jax.random.choice(k1, a=n, shape=(nse,), replace=False)
    indices = jnp.stack(jnp.unravel_index(indices, shape), axis=-1)
    data = jax.random.normal(k2, (nse,))
    return BCOO((data, indices), shape=shape).sum_duplicates()

def to_batch(matrices):
    """
    Convert a sequence of sparse matrices to a batch of matrices using the
    batch-final, common-index COO format.

    .. note::
        This function is not intended to be compatible with JIT compilation.
    """
    batch_size = len(matrices)
    total_nse = sum([m.data.shape[0] for m in matrices])
    remaining_shape = matrices[0].data.shape[1:]
    indices = jnp.concatenate([m.indices for m in matrices], axis=0)
    data = jnp.zeros((total_nse, *remaining_shape, batch_size))
    start = 0
    for i, matrix in enumerate(matrices):
        end = start + matrix.data.shape[0]
        data = data.at[start:end, ..., i].set(matrix.data)
        start = end
    return BCOO(
        (data, indices),
        shape=(*matrices[0].shape, batch_size)
    ).sum_duplicates()
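
# For example (hypothetical shapes): batching two (10, 8) matrices this way
# yields a BCOO of shape (10, 8, 2) with n_sparse=2 and n_dense=1, whose data
# array carries one trailing column per batch element.
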
def _get_dense_dim_mm(lhs, rhs):
    """
    Get the dense dimensions of the output of the matrix multiplication.
    """
    lhs_dims = lhs.data.shape[1:]
    rhs_dims = rhs.data.shape[1:]
    # we don't check for broadcastability here
    return [max(l, r) for l, r in zip(lhs_dims, rhs_dims)]

def spspmm(lhs, rhs, inner_dims=(0, 0), outer_dims=(1, 1)):
    """
    Sparse-sparse matrix multiplication with vectorisation over dense
    dimensions.
    """
    # only support 2D sparse for now
    assert lhs.n_sparse == rhs.n_sparse == 2
    dense_dim_out = _get_dense_dim_mm(lhs, rhs)
    out_shape = (
        lhs.shape[outer_dims[0]],
        rhs.shape[outer_dims[1]],
        *dense_dim_out
    )
    out_nse = lhs.nse * rhs.nse  # memory use scales as product of NSEs
    # broadcast every stored element of lhs against every stored element of rhs
    lhs_data = lhs.data[None, ...]
    rhs_data = rhs.data[:, None, ...]
    lhs_contract_dim, rhs_contract_dim = inner_dims
    lhs_contract_idx = lhs.indices[:, lhs_contract_dim][None, :]
    rhs_contract_idx = rhs.indices[:, rhs_contract_dim][:, None]
    # a pair of stored elements contributes to the output only if their
    # contracting indices coincide
    out_nonzero = (lhs_contract_idx == rhs_contract_idx)
    extra_idx = [None] * len(dense_dim_out)
    out_nonzero = out_nonzero[tuple([...] + extra_idx)]
    out_data = jnp.where(out_nonzero, lhs_data * rhs_data, 0.)
    # output index of each pair: (lhs outer index, rhs outer index)
    lhs_indices = jnp.ones_like(lhs.indices).at[:, -2].set(
        lhs.indices[:, outer_dims[0]])
    rhs_indices = jnp.ones_like(rhs.indices).at[:, -1].set(
        rhs.indices[:, outer_dims[1]])
    out_indices = (lhs_indices[None, ...] * rhs_indices[:, None, ...])
    out_indices = out_indices.reshape(out_nse, -1)
    out_data = out_data.reshape(out_nse, *dense_dim_out)
    return BCOO((out_data, out_indices), shape=out_shape)

shape_A = (10, 8)
shape_B = (10, 18)
batch_size = 5
density = 0.1
batch_A = [
    random_sparse(
        jax.random.PRNGKey(np.random.randint(2 ** 32)),
        shape_A,
        density=density)
    for _ in range(batch_size)]
batch_B = [
    random_sparse(
        jax.random.PRNGKey(np.random.randint(2 ** 32)),
        shape_B,
        density=density)
    for _ in range(batch_size)]
A = to_batch(batch_A)
B = to_batch(batch_B)
out = spspmm(A, B).todense()
ref = np.stack([(a.T @ b).todense() for a, b in zip(batch_A, batch_B)], axis=-1)
assert out.shape == (shape_A[1], shape_B[1], batch_size)
assert np.allclose(out, ref)

But this naive implementation suffers from a critical flaw: the memory use scales as the product of the numbers of stored elements (nse) of the two factors. I think that what I'd ideally like is a way to simultaneously (i) "promise" the compiler exactly what indices should appear in the output, (ii) vectorise the matrix multiplication over the dense dimensions, and (iii) evaluate it only at the specified indices so as to save on memory (and do this all in a differentiable way). (I suppose I'm really asking more than one question here — sorry!) Is there a way to do this, or something like it, with Jax? I saw there was a … Any help or leads would be appreciated — thanks in advance!
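To make (i) and (iii) a bit more concrete, here is a rough sketch of the kind of access pattern I have in mind, ignoring input sparsity for the moment (sampled_matmul and its arguments are hypothetical names, and the operands are assumed dense purely for illustration):

import jax.numpy as jnp

def sampled_matmul(lhs, rhs, out_rows, out_cols):
    """
    Contract lhs and rhs over their leading dimension (as in lhs.T @ rhs for
    2D operands), but evaluate only at pre-specified output coordinates.
    lhs: (k, m, *dense); rhs: (k, n, *dense); out_rows, out_cols: (out_nse,).
    Returns the requested entries with shape (out_nse, *dense).
    """
    # gather only the slices that each requested output entry touches ...
    lhs_gathered = lhs[:, out_rows]   # (k, out_nse, *dense)
    rhs_gathered = rhs[:, out_cols]   # (k, out_nse, *dense)
    # ... and contract over the shared dimension, so memory scales with
    # k * out_nse rather than with the product of the input NSEs
    return jnp.sum(lhs_gathered * rhs_gathered, axis=0)

Something with these semantics, but taking BCOO operands, is essentially what I'm after.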
From your example, it's hard for me to understand the exact goal of your question (e.g. "2 BCOO matrices... each of which have 2 sparse dimensions and might have one or more dense dimensions"). Are you saying the matrices are structured something like this?
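For instance, something along these lines (a minimal sketch with made-up shapes, assuming a single trailing dense dimension):

import jax.numpy as jnp
from jax.experimental import sparse

# made-up example: a 10x8 matrix with a trailing length-5 dense dimension,
# i.e. n_sparse=2, n_dense=1
x = jnp.zeros((10, 8, 5)).at[0, 3].set(1.).at[2, 7].set(2.)
mat = sparse.BCOO.fromdense(x, n_dense=1)
print(mat.n_sparse, mat.n_dense)           # 2 1
print(mat.data.shape, mat.indices.shape)   # (2, 5) (2, 2)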
And then you're hoping to apply vmap over the last dimension? If so, then I wonder if you could restructure your problem to use n_batch=1 rather than n_dense=1: the reason for the existence of batch dimensions is to enable vmapping over sparse matrix operations (rough sketch below).

Side note: to this point, we haven't done much with the (trailing) dense dimensions, because they haven't seemed very useful in any practical application we've so far come across.
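To illustrate the n_batch=1 suggestion above (made-up shapes again; the leading axis becomes a true batch dimension rather than a trailing dense one):

import jax.numpy as jnp
from jax.experimental import sparse

# the same kind of data, but stored batch-leading with n_batch=1:
# data and indices both carry the batch axis up front
x = jnp.zeros((5, 10, 8)).at[:, 0, 3].set(1.).at[:, 2, 7].set(2.)
mat = sparse.BCOO.fromdense(x, n_batch=1)
print(mat.n_batch, mat.n_sparse, mat.n_dense)   # 1 2 0
print(mat.data.shape, mat.indices.shape)        # (5, 2) (5, 2, 2)

This batch-leading layout is what the batching rules for sparse operations are built around, which is what makes mapping over that axis possible in a way that a trailing dense dimension doesn't.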