The choice of sharding APIs #31597
Replies: 2 comments 1 reply
-
There are a lot of questions here, so I'll do my best to answer as much as I can. Before, when we had only Auto and Manual modes (no Explicit), there was no way to just control sharding propagation and leave partitioning to the compiler. But now, with Explicit mode, you can take control over sharding propagation while leaving partitioning of the computation to XLA. For FSDP, this ends up being enough without needing to use shard_map, but there is nothing wrong with dropping into full shmap mode if you want full control. There are complicated FSDP cases where shmap helps more, e.g. if you want to take over control of communication/compute overlap and schedule them as you please. These kinds of decisions are very subjective and up to the taste of the user, which is why JAX doesn't enforce any opinion here. We do have a bias towards using Explicit and Manual modes only, and dropping into Auto mode where required, rather than being Auto by default. So I would say: use what works for you and what you are comfortable with :) You can mix and match all 3 modes as you please.
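To make the contrast concrete, here is a minimal sketch, assuming 8 faked host devices via `XLA_FLAGS` and using the long-standing `Mesh`/`NamedSharding`/`shard_map` APIs (not the newer Explicit-mode ones): the first version only pins the input sharding and lets the compiler partition the computation, the second drops into `shard_map` so each device runs on its local block.

```python
import os
# Fake 8 devices on a single CPU host, for demonstration only.
os.environ["XLA_FLAGS"] = "--xla_force_host_platform_device_count=8"

from functools import partial
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P
from jax.experimental.shard_map import shard_map

mesh = Mesh(np.array(jax.devices()), axis_names=("data",))

# Input committed to a sharding; under jit, XLA propagates it and
# decides how to partition the computation and place collectives.
x = jax.device_put(jnp.arange(32.0).reshape(8, 4),
                   NamedSharding(mesh, P("data", None)))

@jax.jit
def scaled(v):
    return v * 2.0

# Manual mode: shard_map hands each device its local (1, 4) block; we
# own any cross-device communication (none is needed here).
@jax.jit
@partial(shard_map, mesh=mesh,
         in_specs=P("data", None), out_specs=P("data", None))
def scaled_manual(v):
    return v * 2.0

assert jnp.allclose(scaled(x), scaled_manual(x))
```

Both versions compute the same thing; the only difference is who is responsible for the partitioning decisions.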
Yes, that's correct.
Maybe we should. That might just be a typo/bug. Does this help?
-
Yes, yes! I know that there is a much cleaner mental model for mixing explicit and shard_map. But as I said, I couldn't find any examples demonstrating the best practices for that. I like the
Agreed, but I am not saying you should enforce it on everyone. All I am asking for is your opinion on the simple example I provided above.
Yeah, maybe we should rewrite that example, because it does not make sense to shard the arrays in one way (especially when you know what you want to do) and then flip the PartitionSpec in shmap. It will end up confusing a lot of people. Also, can you please comment on the 8-way mesh-product sharding in that example, which is a bit different from the normal FSDP+TP?
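For readers following along, here is a small sketch of what "mesh-product" sharding means (device count and axis names are hypothetical): a single array axis is laid out over the *product* of two mesh axes, so a 4x2 mesh shards that one axis 8 ways instead of the usual "batch over one axis, features over the other" layout.

```python
import os
# Fake 8 devices on a single CPU host, for demonstration only.
os.environ["XLA_FLAGS"] = "--xla_force_host_platform_device_count=8"

import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Hypothetical 4x2 mesh; axis names chosen for illustration.
mesh = Mesh(np.array(jax.devices()).reshape(4, 2), ("batch", "feats"))

x = jnp.arange(16.0).reshape(16, 1)
# P(('batch', 'feats')) shards axis 0 over BOTH mesh axes jointly,
# i.e. 8 ways: each device ends up with a (2, 1) block.
y = jax.device_put(x, NamedSharding(mesh, P(("batch", "feats"), None)))
shard_shapes = {s.data.shape for s in y.addressable_shards}
print(shard_shapes)  # {(2, 1)}
```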
-
I think it is great that we now have a choice of explicit and manual sharding APIs, but one thing has left me confused. Every time someone asks me which API they should choose for their JAX workflows, I almost always think of the caveats, and as a result I lean towards `shard_map`, because it is as transparent as it gets. But `shard_map` also comes with a lot of verbosity for simple workflows, and it is not always clear what the best strategy to opt for is, especially if you are running things at a small scale but the workflow has to be scaled up for the final run. To elaborate on that last point, let us look at an example (directly from the JAX docs) demonstrating the FSDP + TP workflow. We will modify this example into a classification problem, but on a toy dataset before we run this pipeline at scale. For the toy dataset, you can consider any dataset; MNIST and CIFAR-10 are not something people really work on anymore, but they serve the demonstration purpose:
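Something like the following sketch of what I have in mind (layer sizes, mesh shape, and axis names are placeholders, not the docs' actual code): a toy MLP classifier whose hidden weights are sharded FSDP+TP style with `NamedSharding`, biases replicated.

```python
import os
# Fake 8 devices on a single CPU host, for demonstration only.
os.environ["XLA_FLAGS"] = "--xla_force_host_platform_device_count=8"

import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Hypothetical 4x2 mesh: 'data' for FSDP, 'model' for tensor parallelism.
mesh = Mesh(np.array(jax.devices()).reshape(4, 2), ("data", "model"))

def init_params(key, sizes):
    params = []
    for d_in, d_out in zip(sizes[:-1], sizes[1:]):
        key, sub = jax.random.split(key)
        params.append((jax.random.normal(sub, (d_in, d_out)) * 0.02,
                       jnp.zeros((d_out,))))
    return params

# Toy "MNIST-like" problem: 64 flattened input features, 10 classes.
params = init_params(jax.random.PRNGKey(0), [64, 128, 128, 10])

# Hidden-layer weights: colwise then rowwise over 'model' (TP), plus
# 'data' for FSDP. Biases and the final layer stay replicated.
weight_specs = [P("data", "model"), P("model", "data"), P(None, None)]
params = [(jax.device_put(w, NamedSharding(mesh, s)),
           jax.device_put(b, NamedSharding(mesh, P(None))))
          for (w, b), s in zip(params, weight_specs)]

@jax.jit
def predict(params, x):
    for w, b in params[:-1]:
        x = jax.nn.relu(x @ w + b)
    w, b = params[-1]
    return x @ w + b  # final (replicated) layer

x = jax.device_put(jnp.ones((32, 64)),
                   NamedSharding(mesh, P("data", None)))
logits = predict(params, x)
print(logits.shape)  # (32, 10)
```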
In this setup, everything except the out-features of the final layer can be sharded easily. We will shard only the weights, not the biases, of all the hidden layers, and we will keep the last layer replicated for simplicity. What's the ideal strategy here? Should one use `shard_map` for the hidden layers and separate out the final layer's forward pass? Or should we use explicit sharding for it?
A few other comments related to the original example:
- The example uses an 8-way mesh-product sharding instead of the general colwise-rowwise pattern. I haven't done the math for the collectives on paper, so I can't say much about the communication time. Is there any special reason behind this partitioning?
- The arrays are sharded with `P(('batch', 'feats'))`, but `in_specs` is set to `P(('feats', 'batch'))`. Does that mean that, irrespective of the original sharding of an array, the blocks will always be laid out according to `in_specs`? If not, then why change the PartitionSpec?