Selecting arrays by index in batched computations #21010
I have an application (a sparse mixture-of-experts model) where I need to select a matrix from a stack of matrices to use in a batched matrix multiplication. For example: I have a batch of inputs xs and, for each input, an index into a stack of weight matrices. Basically my question is: is there some way to convert the vmapped implementation below so that it does not instantiate the large gathered weight array?

Example code:

import jax
import jax.numpy as jnp
weights = jnp.zeros((8, 128, 512))
xs = jnp.zeros((256, 128))
idx = jnp.arange(len(xs)) % len(weights)
def broadcast(weights, xs, idx):
    # Gather one weight matrix per input, then do a single batched matmul.
    return jnp.squeeze(xs[:, None] @ weights[idx], -2)

def vmapped(weights, xs, idx):
    # Map over the batch; each input indexes its own weight matrix.
    return jax.vmap(lambda xi, i: xi @ weights[i])(xs, idx)

def scanned(weights, xs, idx):
    # Sequential version via jax.lax.map (implemented with lax.scan).
    return jax.lax.map(lambda a: a[0] @ weights[a[1]], (xs, idx))
print("<===== broadcasted =====>")
print(jax.make_jaxpr(broadcast)(weights, xs, idx))
print()
print("<======= vmapped =======>")
print(jax.make_jaxpr(vmapped)(weights, xs, idx))
print()
print("<======= scanned =======>")
print(jax.make_jaxpr(scanned)(weights, xs, idx))
print()

Output jaxpr:
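For concreteness, the shapes involved look like this (the large intermediate in question would be the gathered per-input weight stack; this is an illustrative sketch using the arrays defined above):

print(weights[idx].shape)                  # (256, 128, 512) -- one weight matrix gathered per input
print(broadcast(weights, xs, idx).shape)   # (256, 512) -- the final result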
You may be able to rely on the compiler to fuse operations. Even if the jaxpr indicates an intermediate value of a particular shape, it doesn't necessarily mean that the compiled operation will instantiate that intermediate value. For example, here's the compiled HLO produced by your broadcast function on a T4 GPU:

print(jax.jit(broadcast).lower(weights, xs, idx).compile().as_text())

I believe this means the reduction will be fused, so that the implied large intermediate array is never actually instantiated.
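One way to check how this plays out in practice is a rough micro-benchmark of the three jitted variants (a sketch only; timings and fusion decisions depend on the backend and the JAX/XLA version):

import timeit

for name, fn in [("broadcast", broadcast), ("vmapped", vmapped), ("scanned", scanned)]:
    jitted = jax.jit(fn)
    jitted(weights, xs, idx).block_until_ready()  # compile and warm up outside the timing loop
    t = timeit.timeit(lambda: jitted(weights, xs, idx).block_until_ready(), number=100)
    print(f"{name}: {t / 100 * 1e6:.1f} us per call")

# Where available, jax.jit(fn).lower(weights, xs, idx).compile().memory_analysis()
# also reports buffer sizes, which can indicate whether the large gathered
# intermediate is ever materialized (support varies by backend).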