How to write a batching rule for tridiagonal solver? #10339

packquickly · 2022-04-18T17:03:11Z

packquickly
Apr 18, 2022

Hi! I've been trying to write a tridiagonal solver in JAX which I need to be able to forward differentiate through. #6843 implemented a primitive for the solver itself, with no forward differentiation rule. I tried using custom_linear_solve to fix this issue:

import jax
import jax.numpy as jnp
from jax import lax

def intermediate_solve(a, b):
    # Extract the three diagonals of the dense matrix a, zero-padding the
    # sub- and super-diagonals to the length of the main diagonal.
    zero = jnp.zeros(1, dtype=a.dtype)
    ld = jnp.concatenate((zero, jnp.diag(a, -1)))
    d = jnp.diag(a)
    ud = jnp.concatenate((jnp.diag(a, 1), zero))
    return lax.linalg.tridiagonal_solve(ld, d, ud, b)

def trisolve(a, b):
    def solve(matvec, x):
        return intermediate_solve(a, x)
    matvec = jax.tree_util.Partial(jnp.dot, a)
    return lax.custom_linear_solve(matvec, b, solve)

but when I call this (for some trivial 4x4 tridiagonal matrix a and vector b):
jacfwd(trisolve)(a, b)
I get a NotImplementedError for the batching rule. I tried adding a somewhat sketchy implementation of this batching rule, based on what I observed in _src.linalg:

from jax.interpreters import batching
def tridiagonal_solve_add_batch(vals_in, dims_in, m, n, ldb, t):
    print(vals_in)
    x1, x2, x3, x4 = vals_in
    bd1, bd2, bd3, bd4 = dims_in

    # Move the batch axis of each batched operand to the front.
    if bd1 is not None:
        x1 = batching.moveaxis(x1, bd1, 0)
    if bd2 is not None:
        x2 = batching.moveaxis(x2, bd2, 0)
    if bd3 is not None:
        x3 = batching.moveaxis(x3, bd3, 0)
    if bd4 is not None:
        x4 = batching.moveaxis(x4, bd4, 0)
    outs = lax.linalg.tridiagonal_solve_p.bind(x1, x2, x3, x4, m=m, n=n, ldb=ldb, t=t)
    return outs, (0)

batching.primitive_batchers[lax.linalg.tridiagonal_solve_p] = tridiagonal_solve_add_batch

and from here I am running into the error
scan got values with different leading axis sizes: 16, 4, 4, 4.
which I have not figured out how to fix. I haven't been able to find much in the way of documentation for implementing these custom batch rules so I apologise if this is blatantly wrong. But could someone help me understand exactly what the issue is with this batching rule?
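For context on why a batching rule is needed here at all: jax.jacfwd is built by applying jax.vmap to a JVP over basis vectors, so every primitive reached during a forward-mode Jacobian needs a batching rule as well as a JVP rule. Below is a minimal sketch of that equivalence. The function f and the helper jacfwd_by_hand are hypothetical stand-ins for illustration, not part of the question's code.

```python
import jax
import jax.numpy as jnp

def f(x):
    # Stand-in function from R^3 to R^3; any differentiable function would do.
    return jnp.sin(x) * x[0]

def jacfwd_by_hand(fun, x):
    # Pushing one basis vector through the JVP yields one Jacobian column.
    def pushforward(v):
        return jax.jvp(fun, (x,), (v,))[1]

    basis = jnp.eye(x.size, dtype=x.dtype)
    # vmap stacks the columns along axis 0, so transpose to get J[i, j] = df_i/dx_j.
    return jax.vmap(pushforward)(basis).T

x = jnp.array([0.5, 1.0, 2.0])
print(jnp.allclose(jacfwd_by_hand(f, x), jax.jacfwd(f)(x)))
```

The vmap call is the step that fails when a primitive inside fun has no batching rule registered.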

Answered by jakevdp

Apr 19, 2022


YouJiacheng · 2022-04-19T05:14:45Z

YouJiacheng
Apr 19, 2022

Since lax.linalg.tridiagonal_solve has a pure JAX implementation, I think you can simply copy the implementation source code without declaring it as a primitive.
Note that this is not the implementation used on GPU, which uses cuSPARSE and which I think can be batched trivially (maybe).
Since your error message mentions scan, I think you are actually using the pure JAX implementation.
https://github.com/google/jax/blob/6b99ee6e48dfe2f4b3febc9283325ca615662883/jax/_src/lax/linalg.py#L1465-L1490
You can just copy that code as your solver; the AD and batching rules are then defined appropriately.
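To illustrate the idea of a solver written only with ordinary JAX operations, here is a minimal sketch of the Thomas algorithm using lax.scan. It is written from scratch for this example and is not the code at the linked commit; the name thomas_solve and the test values are assumptions. It takes a 1-D right-hand side and assumes the system needs no pivoting (for example, a diagonally dominant matrix).

```python
import jax
import jax.numpy as jnp
from jax import lax

def thomas_solve(dl, d, du, b):
    """Solve a tridiagonal system; dl[0] and du[-1] are ignored.

    dl, d, du, b all have shape (m,). No pivoting is performed.
    """
    zero = jnp.zeros((), dtype=b.dtype)

    def forward(carry, row):
        c_prev, r_prev = carry
        dl_i, d_i, du_i, b_i = row
        denom = d_i - dl_i * c_prev          # eliminate the sub-diagonal entry
        c_i = du_i / denom                   # modified super-diagonal
        r_i = (b_i - dl_i * r_prev) / denom  # modified right-hand side
        return (c_i, r_i), (c_i, r_i)

    _, (c, r) = lax.scan(forward, (zero, zero), (dl, d, du, b))

    def backward(x_next, row):
        c_i, r_i = row
        x_i = r_i - c_i * x_next             # back substitution
        return x_i, x_i

    _, x = lax.scan(backward, zero, (c, r), reverse=True)
    return x

dl = jnp.array([0.0, 1.0, 1.0, 1.0])
d = jnp.array([4.0, 4.0, 4.0, 4.0])
du = jnp.array([1.0, 1.0, 1.0, 0.0])
b = jnp.array([1.0, 2.0, 3.0, 4.0])

x = thomas_solve(dl, d, du, b)
# Forward-mode Jacobian with respect to b, with no custom rules needed.
jac_b = jax.jacfwd(thomas_solve, argnums=3)(dl, d, du, b)
```

Because scan already has differentiation and batching rules, jacfwd and vmap work on this function without registering anything; the trade-off is that it does not call a native tridiagonal kernel.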

5 replies

packquickly Apr 19, 2022
Author

I did try this, and it certainly works, but it is slower than what I have above (about 1.5-2x) and is even slower than jnp.linalg.solve when it comes to calculating gradients.

YouJiacheng Apr 19, 2022

Is this slower than lax.linalg.tridiagonal_solve when only calculating values, without gradients?
But your error message indicates that lax.linalg.tridiagonal_solve is implemented by this on your platform.
Do you use jit?

packquickly Apr 19, 2022
Author

Yes, so to break everything down a bit (I'll refer to this direct code above as tridiag_scan rather than _tridiagonal_solve_jax):
Solve with JIT (no jacfwd):
jnp.linalg.solve ~ 145 µs
trisolve ~ 145 µs
tridiag_scan ~ 250 µs

Solve without JIT (no jacfwd):
jnp.linalg.solve ~ 10 µs
trisolve ~ 70 ms
tridiag_scan ~ 125 ms
(no idea why jnp.linalg.solve performs so well here)

Jacfwd with JIT:
jnp.linalg.solve ~ 1.85 ms
tridiag_scan ~ 1.95 ms

Jacfwd without JIT:
jnp.linalg.solve ~ 2 ms
tridiag_scan ~ 325 ms

So there is definitely a speedup using trisolve vs tridiag_scan. I'm noticing that in more complex applications with JIT the tridiag implemented above performs best (better than a similar C implementation) but obviously can't backprop.
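When comparing timings like these, note that JAX dispatches work asynchronously, so a timer that does not wait for the result can under-report, and the first call to a jitted function includes compilation. Whether either effect explains the numbers above is not known. The sketch below shows one way to measure that avoids both; the helper bench, the problem size, and the matrix values are assumptions for illustration only.

```python
import time

import jax
import jax.numpy as jnp

def bench(fn, *args, repeats=50):
    """Return mean wall-clock seconds per call, excluding compilation."""
    jax.block_until_ready(fn(*args))  # warm-up call: triggers tracing/compilation
    start = time.perf_counter()
    for _ in range(repeats):
        jax.block_until_ready(fn(*args))  # wait for the result on every call
    return (time.perf_counter() - start) / repeats

n = 256
# Diagonally dominant tridiagonal test matrix, stored densely (illustrative values).
a = 4.0 * jnp.eye(n) + jnp.eye(n, k=1) + jnp.eye(n, k=-1)
b = jnp.ones(n)

dense_solve = jax.jit(jnp.linalg.solve)
seconds = bench(dense_solve, a, b)
print(f"jnp.linalg.solve, n={n}: {seconds * 1e6:.1f} µs per call")
```

The same bench helper can be applied to any of the solvers discussed in this thread to compare them on equal terms.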

YouJiacheng Apr 19, 2022

Can you solve a bigger problem? It seems that the problem is too small.

packquickly Apr 19, 2022
Author

I must apologise; yes, it seems that when run under JIT on a larger problem, tridiag_scan runs only slightly slower than trisolve above and is significantly faster than jnp.linalg.solve.

I'm still quite curious about how to implement these batching rules in general, so that I can do it in the future, but I think this implementation will probably work fine for my application.

jakevdp · 2022-04-19T16:15:42Z

jakevdp
Apr 19, 2022
Maintainer

The reason that other batch rules in the file look like the one you created is that those primitives are closed under batching: that is, the primitive itself can handle batched inputs, so in order to compute the batched results, you must simply ensure the inputs are laid out in the expected way and then call the original primitive. tridiagonal_solve, on the other hand, is not closed under batching: that is, you cannot use the primitive directly to compute batched results, so the batch rule is going to have to do something other than call back into the primitive.
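As a small illustration of a primitive that is closed under batching, the sketch below uses lax.linalg.cholesky, which accepts inputs with leading batch dimensions. The helper spd and its values are assumptions made up for this example. The vmapped call and an explicit Python loop over the batch should agree.

```python
import jax
import jax.numpy as jnp
from jax import lax

def spd(scale):
    # Small symmetric positive-definite matrix (illustrative values).
    return scale * jnp.eye(3) + jnp.ones((3, 3))

mats = jnp.stack([spd(2.0), spd(5.0)])  # shape (2, 3, 3)

# vmap only has to arrange the batch axis and call the same primitive once.
batched = jax.vmap(lax.linalg.cholesky)(mats)
# Reference: call the primitive separately on each matrix.
looped = jnp.stack([lax.linalg.cholesky(m) for m in mats])
print(jnp.allclose(batched, looped, atol=1e-5))

# Inspect what vmap traced for the batched call.
jaxpr = jax.make_jaxpr(jax.vmap(lax.linalg.cholesky))(mats)
print(jaxpr)
```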

Long term, the best way to support batched tridiagonal solves would be to make the primitive closed under batching. This would involve changes at the XLA or maybe C++ level.

In the meantime, you could probably implement a working batch rule by calling back into the python implementation with vmap. I've not tested this, but it might look something like this:

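from functools import partial
from jax import vmap
from jax._src.lax.linalg import _tridiagonal_solve_jax  # private helper; its location may differ between JAX versions

def _tridiagonal_solve_batch_rule(vals_in, dims_in, **kw):
    return vmap(partial(_tridiagonal_solve_jax, **kw), dims_in)(*vals_in), (0,)

batching.primitive_batchers[lax.linalg.tridiagonal_solve_p] = _tridiagonal_solve_batch_rule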

Note that the performance of this will likely not be that good, since it's calling the JAX implementation rather than the native implementation. An alternative would be to write a batch rule that explicitly loops over the batch dimensions, calling tridiagonal_solve_p.bind multiple times on the unbatched inputs. Then you get the performance benefit of calling into the native implementation, the downside being that you incur the overhead of many such calls.
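The looping alternative can be sketched as follows. This is an untested-in-production illustration that swaps in a different mechanism: instead of registering into batching.primitive_batchers, it uses the public jax.custom_batching.custom_vmap decorator and calls lax.linalg.tridiagonal_solve once per batch element. The names tridiag_solve and _tridiag_solve_vmap and the example values are assumptions, and the right-hand side is taken to have shape (m, n).

```python
import jax
import jax.numpy as jnp
from jax import lax
from jax.custom_batching import custom_vmap

@custom_vmap
def tridiag_solve(dl, d, du, b):
    # Unbatched call: dl, d, du have shape (m,), b has shape (m, n).
    return lax.linalg.tridiagonal_solve(dl, d, du, b)

@tridiag_solve.def_vmap
def _tridiag_solve_vmap(axis_size, in_batched, dl, d, du, b):
    args = (dl, d, du, b)
    outs = []
    for i in range(axis_size):
        # Batched operands carry the batch on axis 0; unbatched ones are shared.
        sliced = [x[i] if is_b else x for x, is_b in zip(args, in_batched)]
        outs.append(lax.linalg.tridiagonal_solve(*sliced))
    # Stack the per-element results; True marks the output as batched on axis 0.
    return jnp.stack(outs), True

# Two systems sharing the same super-diagonal (illustrative values).
dl = jnp.array([[0.0, 1.0, 1.0, 1.0], [0.0, 2.0, 2.0, 2.0]])
d = jnp.array([[4.0, 4.0, 4.0, 4.0], [6.0, 6.0, 6.0, 6.0]])
du = jnp.array([1.0, 1.0, 1.0, 0.0])
b = jnp.arange(1.0, 9.0).reshape(2, 4, 1)

xs = jax.vmap(tridiag_solve, in_axes=(0, 0, None, 0))(dl, d, du, b)
```

The Python loop is unrolled at trace time, so the number of solver calls grows with the batch size, which is the overhead mentioned above.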

Does that answer your question?

0 replies
