Using XLA to construct mini-batches for "ragged" batches? #7618
Coming from PyTorch, I am used to manually constructing a batch of inputs and feeding it into my model at each training step. I am wondering whether XLA's Dead Code Elimination could let me skip that step: instead of first constructing a minibatch, I would feed the entire dataset into the model together with an array of mask values marking which entries belong to the current batch. If I apply such a mask, will XLA avoid doing the work for the masked-out entries?
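
(A minimal sketch of the setup being described, for concreteness — the names loss_fn, full_data, full_labels, and batch_mask are illustrative, not from the discussion: the whole dataset is fed in with a boolean mask, and masked-out rows are zeroed in the loss. The question is whether XLA would then skip the work for the masked rows, or still compute over the full array.)

import jax
import jax.numpy as jnp

def loss_fn(params, full_data, full_labels, batch_mask):
    # Predictions are computed for every row of the dataset...
    preds = full_data @ params                  # hypothetical linear model
    per_example = (preds - full_labels) ** 2
    # ...and rows outside the current "batch" are zeroed out by the mask.
    masked = jnp.where(batch_mask, per_example, 0.0)
    return masked.sum() / batch_mask.sum()

# Toy data, just to make the sketch runnable.
key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(key, 3)
full_data = jax.random.normal(k1, (1000, 8))
full_labels = jax.random.normal(k2, (1000,))
params = jax.random.normal(k3, (8,))

# Mark the first 32 rows as the current "batch"; everything else is masked out.
batch_mask = jnp.arange(1000) < 32

print(jax.jit(loss_fn)(params, full_data, full_labels, batch_mask))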
Replies: 1 comment 1 reply
Hi - thanks for the question! When you mention Dead Code Elimination, are you talking about eliminating computations over masked sections of the array? I'm not sure whether this is possible.

To strip this down and be concrete, I think essentially what you're asking is whether you can write a program like this:

import jax
import jax.numpy as jnp

key1, key2 = jax.random.split(jax.random.PRNGKey(1701))
x = jax.random.uniform(key1, (10,))
# a random boolean mask marking which entries are "active"
mask = jax.random.randint(key2, (10,), 0, 2).astype(bool)

def f(x, mask):
    return jnp.where(mask, jnp.sin(x), 0)

f(x, mask)

and when running it depend on XLA not computing jnp.sin on the entries that will eventually be zeroed out. Is that correct?

I took a look at the HLO generated for this, and it doesn't look like XLA eliminates operations within an array like this:

print(jax.xla_computation(f)(x, mask).as_hlo_module().to_string())

It certainly looks like sine computations are being done on the full length-10 array. There may be a way to do this that I'm not aware of (hopefully someone else will chime in!) but I don't think XLA will do this automatically in the general case.
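
(One way to check this empirically — a sketch, not part of the original reply; the helper bench and the sizes n and k below are arbitrary — is to time the masked computation against an explicit slice of the unmasked entries. If XLA skipped the work on masked entries, the two would be expected to take similar time; consistent with the HLO above, the masked version is instead expected to scale with the full array size.)

import time
import jax
import jax.numpy as jnp

n, k = 10_000_000, 1_000   # full array size vs. number of unmasked entries

key = jax.random.PRNGKey(0)
x = jax.random.uniform(key, (n,))
mask = jnp.arange(n) < k   # only the first k entries are "active"

masked_f = jax.jit(lambda x, mask: jnp.where(mask, jnp.sin(x), 0.0))
sliced_f = jax.jit(lambda x: jnp.sin(x[:k]))   # does the work on k entries only

def bench(fn, *args):
    fn(*args).block_until_ready()              # compile and warm up
    start = time.perf_counter()
    for _ in range(10):
        fn(*args).block_until_ready()
    return (time.perf_counter() - start) / 10

print("masked over full array:    ", bench(masked_f, x, mask))
print("explicit slice of k entries:", bench(sliced_f, x))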