pmap vs pjit in SPMD mnist example #12881

lukaemon · 2022-10-20T05:42:20Z

I'm working on spmd_mnist_classifier_fromscratch.py in /examples.

The pmap data-parallel version works as expected, but when I try to reproduce it with pjit, training doesn't converge. I suspect my understanding of the sharding semantics is incomplete or wrong.

With pmap, the parameter update function is:

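@partial(jax.pmap, axis_name='batch', devices=devices)
def spmd_update(params, batch):
    # Each device computes gradients on its own shard of the batch.
    grads = jax.grad(loss)(params, batch)
    # Average the per-device gradients across the 'batch' axis.
    grads = jax.tree_map(lambda x: jax.lax.pmean(x, 'batch'), grads)

    # Plain SGD step; every device applies the same averaged gradients.
    return jax.tree_map(lambda p, g: p - step_size * g, params, grads)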
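
As a sanity check on what the pmean step computes, here is a minimal sketch, not taken from the example file. It assumes a hypothetical toy linear model and mean-squared-error `loss`, and it uses `jax.vmap` with `axis_name='batch'` as a single-device stand-in for `pmap`. Under those assumptions, and with equally sized shards, averaging the per-shard gradients should match the gradient of the mean loss over the whole batch:

```python
import jax
import jax.numpy as jnp

def loss(params, batch):
    # Hypothetical toy loss (not the MNIST loss): MSE of a linear model.
    x, y = batch
    return jnp.mean((x @ params - y) ** 2)

def per_shard_grad(params, batch):
    # Same pattern as spmd_update: local gradient, then mean over 'batch'.
    g = jax.grad(loss)(params, batch)
    return jax.lax.pmean(g, 'batch')

key_x, key_y, key_p = jax.random.split(jax.random.PRNGKey(0), 3)
num_shards, per_shard = 4, 8
x = jax.random.normal(key_x, (num_shards, per_shard, 3))
y = jax.random.normal(key_y, (num_shards, per_shard))
params = jax.random.normal(key_p, (3,))

# vmap with axis_name stands in for pmap so this runs on one device.
# params are shared (in_axes=None); the batch is split along axis 0.
sharded = jax.vmap(per_shard_grad, in_axes=(None, 0), axis_name='batch')(
    params, (x, y))

# Reference: gradient computed on the full, unsharded batch.
full = jax.grad(loss)(params, (x.reshape(-1, 3), y.reshape(-1)))

# Every shard ends up holding the same averaged gradient.
print(jnp.allclose(sharded, jnp.broadcast_to(full, sharded.shape), atol=1e-4))
```

Note the leading shard axis on `x` and `y`: this per-device layout is what `pmap` expects for its mapped arguments.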

I thought the pjit version should be like this:

@partial(
    pjit,
    in_axis_resources=(None, PartitionSpec('dp')),  # params replicated, batch sharded over 'dp'
    out_axis_resources=None)  # updated params replicated
def spmd_update(params, batch):
    grads = jax.grad(loss)(params, batch)
    return jax.tree_map(lambda p, g: p - step_size * g, params, grads)

# in training loop
for epoch in range(num_epochs):
    for _ in range(num_batches):
        with maps.Mesh(np.array(devices), ('dp',)):
            params = spmd_update(params, next(batches))

However, I can't wrap my head around how model params that are replicated along the dp axis aggregate what each shard learns. I can't use pmean or psum, since pjit is supposed to insert the parallel ops automatically.

What's wrong with my understanding? Can someone point me in a direction to move forward? A paper, code, blog post, discussion thread, anything would help. Much appreciated.
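
For reference, a minimal sketch of the global-view model that pjit follows. It is written against the newer `jax.jit` plus `jax.sharding.NamedSharding` API rather than the `in_axis_resources` form used above, and the toy linear model, `loss`, shapes, and `step_size` value are all hypothetical. The function is written as if it ran on one device over the whole batch. When the batch is sharded along `'dp'` and the params are replicated, the compiler is expected to insert the cross-device reduction, so no explicit pmean or psum appears:

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

step_size = 0.1  # hypothetical value

def loss(params, batch):
    # Hypothetical toy loss: mean over the *global* batch axis.
    x, y = batch
    return jnp.mean((x @ params - y) ** 2)

def update(params, batch):
    # No collectives here: written as single-device code.
    grads = jax.grad(loss)(params, batch)
    return jax.tree_util.tree_map(lambda p, g: p - step_size * g, params, grads)

devices = np.array(jax.devices())
mesh = Mesh(devices, ('dp',))
replicated = NamedSharding(mesh, P())        # full copy on every device
data_sharded = NamedSharding(mesh, P('dp'))  # leading axis split over 'dp'

global_batch = 8 * len(devices)  # divisible by the number of devices
kx, ky, kp = jax.random.split(jax.random.PRNGKey(0), 3)
x = jax.random.normal(kx, (global_batch, 3))
y = jax.random.normal(ky, (global_batch,))
params = jax.random.normal(kp, (3,))

# Place the inputs; jit then follows the sharding of its arguments.
params_repl = jax.device_put(params, replicated)
batch_sharded = jax.device_put((x, y), data_sharded)

new_sharded = jax.jit(update)(params_repl, batch_sharded)
new_single = update(params, (x, y))  # unsharded reference

print(jnp.allclose(new_sharded, new_single, atol=1e-4))
```

One thing worth checking under this model: the batch has a flat global shape `(global_batch, ...)`, not the `(num_devices, per_device_batch, ...)` layout that pmap takes, so a loss that reduces over a fixed axis index may behave differently if the pmap-shaped batches are passed through unchanged.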

Replies: 0 comments
