Is there a way to get the jax.jit/XLA pipeline to optimize the order of matrix multiplications? I'm currently working on a project that requires lots of chained linear transformations on large coordinate grids, i.e. of the form

```python
A_1 @ A_2 @ A_3 @ ... @ A_n @ coords  # A_n: [3 x 3], coords: [3 x 100000]
```

However, due to dynamic (user-defined) behaviour at compile time, the code will actually look more like

```python
coords = <initial_coordinates>
coords = A_n @ coords
...
coords = A_2 @ coords
coords = A_1 @ coords
```

Clearly, doing this naively is much less efficient than the left-associative evaluation, which combines the small [3 x 3] matrices first and touches the large coords array only once. I was hoping that just wrapping my code in a jax.jit would make XLA reorder the contractions automatically; however, that doesn't appear to be the case. Is this not done for numerical reasons, or will it be supported at some point?

Here's a MWE of what I'm trying to do:

```python
import jax
import jax.numpy as jnp


def make_fn(shortcut=False):
    def fn(theta):
        # Create some coordinates and a simple rotation matrix
        coords = jnp.mgrid[-3000:3000, -3000:3000].reshape(2, -1) / 100
        rot = jnp.array([[jnp.cos(theta), jnp.sin(theta)], [-jnp.sin(theta), jnp.cos(theta)]])
        if shortcut:
            transform = rot @ -rot @ rot @ -rot @ rot @ -rot @ rot @ -rot
            coords = transform @ coords
        else:
            coords = rot @ coords
            coords = -rot @ coords
            coords = rot @ coords
            coords = -rot @ coords
            coords = rot @ coords
            coords = -rot @ coords
            coords = rot @ coords
            coords = -rot @ coords
        return coords

    return fn


fn_naive = make_fn(shortcut=False)
fn_quick = make_fn(shortcut=True)
```

Timing the functions with and without the "manual" shortcut makes quite the difference:
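A rough comparison can be timed along these lines (a sketch; make_fn is the function from the MWE above, and block_until_ready accounts for JAX's asynchronous dispatch):

```python
import timeit

import jax

fn_naive_jit = jax.jit(make_fn(shortcut=False))
fn_quick_jit = jax.jit(make_fn(shortcut=True))

# Warm-up calls so compilation time is excluded from the measurement.
fn_naive_jit(0.1).block_until_ready()
fn_quick_jit(0.1).block_until_ready()

print("naive:", timeit.timeit(lambda: fn_naive_jit(0.1).block_until_ready(), number=10))
print("quick:", timeit.timeit(lambda: fn_quick_jit(0.1).block_until_ready(), number=10))
```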
Looking at the generated HLO via
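One way to dump both the lowered program and the HLO after XLA's optimization passes is the ahead-of-time jit API (a sketch, assuming a recent JAX release):

```python
import jax

fn_naive_jit = jax.jit(make_fn(shortcut=False))
lowered = fn_naive_jit.lower(0.1)

print(lowered.as_text())            # what JAX hands to XLA
print(lowered.compile().as_text())  # HLO after XLA's optimization passes
```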
Replies: 1 comment
I don't think XLA does optimization of matmul orderings, but JAX does expose `jnp.linalg.multi_dot` for this purpose.
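For a dynamically built chain like the one in the question, the per-step transforms could be collected in a list and contracted with `jnp.linalg.multi_dot`, which picks an efficient multiplication order from the (static) shapes. A minimal sketch based on the MWE above, with a hypothetical apply_chain wrapper:

```python
import jax
import jax.numpy as jnp


@jax.jit
def apply_chain(theta, coords):
    # Stand-in for the user-defined sequence of transforms.
    rot = jnp.array([[jnp.cos(theta), jnp.sin(theta)], [-jnp.sin(theta), jnp.cos(theta)]])
    transforms = [rot, -rot, rot, -rot, rot, -rot, rot, -rot]

    # multi_dot chooses the association order, so the small 2x2 matrices are
    # combined before anything touches the large coordinate array.
    return jnp.linalg.multi_dot(transforms + [coords])


coords = jnp.mgrid[-3000:3000, -3000:3000].reshape(2, -1) / 100
rotated = apply_chain(0.1, coords)
```

Because the shapes are known at trace time, this recovers the cost of the manual shortcut without hard-coding the association order.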