This is not only a question, but also an opportunity to discuss ways to do SPMD efficiently in jax (especially for TPU backends).
My personal motivation is to understand why `pmap` with 8 replicas on 8 devices is consistently slower than `jit` with 1 replica on 1 device, even in the best case of no inter-replica communication (e.g. no `pmean`, `psum`, etc.).

So, imagine a simple training loop similar to the MNIST example, where `params` and `batch` have the leading dimension equal to the number of devices:
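Something along these lines (a minimal sketch rather than the exact code from the example; the model, loss, shapes, and learning rate are placeholders):

```python
import numpy as np
import jax
import jax.numpy as jnp

n_dev = jax.local_device_count()

# Replicated params and a per-device batch: leading dim == number of devices.
params = np.zeros((n_dev, 784, 10), dtype=np.float32)
images = np.random.rand(n_dev, 128, 784).astype(np.float32)
labels = np.random.randint(0, 10, size=(n_dev, 128))

@jax.pmap
def train_step(w, x, y):
    def loss_fn(w):
        logits = x @ w
        onehot = jax.nn.one_hot(y, 10)
        return -jnp.mean(jnp.sum(onehot * jax.nn.log_softmax(logits), axis=-1))
    grads = jax.grad(loss_fn)(w)
    return w - 0.1 * grads  # no psum/pmean: the replicas are fully independent

for _ in range(100):
    # images/labels are plain np.arrays, so every step pays a host->device transfer.
    params = train_step(params, images, labels)
```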
Now, pushing the `batch` onto the devices with something like the function below (from the Flax library) recovers the single-device performance:
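I don't remember the exact helper, but it is along these lines (a sketch in the spirit of the Flax utilities; `push_to_devices` is a placeholder name, and I'm using `jax.device_put_sharded` to place each per-device slice on its device up front):

```python
import jax

def push_to_devices(tree):
    # Split the leading (device) axis of every array in the pytree and place
    # each slice on its device, so pmap receives ShardedDeviceArrays instead
    # of plain np.arrays.
    devices = jax.local_devices()
    return jax.tree_map(
        lambda x: jax.device_put_sharded(list(x), devices), tree)

# Done once, outside the loop, so the timed steps pay no host->device transfer.
images, labels = push_to_devices((images, labels))
```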
Now, of course this is wrong, but it makes it clear that the process is I/O bound and the bottleneck is indeed going from `np.array` to `jax.ShardedDeviceArray`.
So, the first question is: why, and is this expected? I'm not the first to experience this behavior (see e.g. #6631, #2459, #6626, #8281), which might indicate either that we are doing something wrong or that there is a problem with sharded arrays.
Where are the sharded arrays stored?
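To make that question concrete, this is how I have been inspecting placement (a sketch, assuming the `device_buffers` attribute that `ShardedDeviceArray` exposes; `out` is the result of the pmapped step sketched above):

```python
out = train_step(params, images, labels)
print(type(out))                                     # a ShardedDeviceArray
print([buf.device() for buf in out.device_buffers])  # one buffer per local device
```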
I can follow up with an issue, but I think this is the best place to discuss it.
More abstractly, what I would like to do is to put each piece of the data only on the device that will eventually use it. Apparently this is what is implemented in the class `GlobalDeviceArray`: citing from the doc, "A GlobalDeviceArray (GDA) can be thought of as a view into a single logical array sharded across processes [...]. Each process can only directly access the shards of the global array data stored on its local devices".

This requires working with meshes, partitions, `pjit`... and honestly I'm now completely lost.
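For concreteness, this is roughly the shape of the `pjit` approach as I understand it from the docs (a minimal sketch; `Mesh`, `PartitionSpec`, and the `in_axis_resources`/`out_axis_resources` arguments live under `jax.experimental` and may move or change between versions):

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.experimental.maps import Mesh
from jax.experimental.pjit import pjit, PartitionSpec

def train_step_single(w, x, y):
    # Same toy step as above, written for a single logical replica:
    # no pmap, and the inputs keep their *global* shapes (no leading device axis).
    def loss_fn(w):
        logits = x @ w
        onehot = jax.nn.one_hot(y, 10)
        return -jnp.mean(jnp.sum(onehot * jax.nn.log_softmax(logits), axis=-1))
    return w - 0.1 * jax.grad(loss_fn)(w)

# One-dimensional mesh over all local devices; the axis is named 'data'.
mesh = Mesh(np.asarray(jax.local_devices()), ('data',))

# Replicate the params (None), shard the batch along its leading axis ('data').
p_train_step = pjit(
    train_step_single,
    in_axis_resources=(None, PartitionSpec('data'), PartitionSpec('data')),
    out_axis_resources=None)

params = np.zeros((784, 10), dtype=np.float32)
images = np.random.rand(1024, 784).astype(np.float32)
labels = np.random.randint(0, 10, size=(1024,))

with mesh:
    params = p_train_step(params, images, labels)
```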
Can you guys shed some light on the state of SPMD in jax and how to achieve it efficiently?