I'm not sure what you mean by "Evaluate the model using the lookup table at x", and the answer to your question will depend on what operations that implies. In short, the answer is that …
Hi, I am doing a batch of optimizations in parallel using vmap, and each model evaluation uses a large lookup table that is a significant fraction of available GPU memory. In code I am using the following pattern:
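Roughly like this (a simplified sketch; lookup_table, evaluate_model, and the shapes here are placeholders standing in for the real code):

```python
import jax
import jax.numpy as jnp

# Placeholder for the real table, which takes up a significant fraction of GPU memory.
lookup_table = jnp.zeros((1_000_000, 64))

def evaluate_model(params, x):
    # Evaluate the model using the lookup table at x (placeholder logic).
    idx = jnp.clip(x.astype(jnp.int32), 0, lookup_table.shape[0] - 1)
    return jnp.sum(lookup_table[idx] * params)

# One model evaluation per batch element; the lookup table is closed over,
# so vmap broadcasts it rather than mapping over it.
batched_eval = jax.jit(jax.vmap(evaluate_model, in_axes=(0, 0)))

params = jnp.ones((128, 64))   # batch of parameter vectors
xs = jnp.arange(128)           # batch of inputs
values = batched_eval(params, xs)
```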
My main question is pretty general: what is going on under the hood when the lookup table is folded into the vmapped function? For example, are there competing reads of the lookup table across batch elements that could be causing slowdowns? I have not seen any particular speedup from vmap over jax.lax.map with this pattern, so I'd like to understand better what is actually happening.
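For reference, the jax.lax.map version I compared against looks roughly like this (same placeholder names as in the sketch above):

```python
# Sequential version for comparison: lax.map evaluates one batch element
# at a time instead of vectorizing across the batch.
def eval_one(args):
    params_i, x_i = args
    return evaluate_model(params_i, x_i)

sequential_eval = jax.jit(lambda p, x: jax.lax.map(eval_one, (p, x)))
values_seq = sequential_eval(params, xs)
```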
(I'll also add that any general tips for a situation like this, e.g. performance optimization or profiling, would be great.)
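For concreteness, the kind of profiling setup I have in mind is something like this (assuming jax.profiler; the trace directory is arbitrary):

```python
import jax.profiler

# Capture a trace that can be inspected in TensorBoard or Perfetto.
with jax.profiler.trace("/tmp/jax-trace"):
    out = batched_eval(params, xs)
    out.block_until_ready()  # make sure async dispatch finishes inside the trace
```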