Enable non-dim-0 FSDP sharding of MoE experts when ep=1 #2668
Open
aws-ritikadm wants to merge 2 commits into pytorch:main from
Conversation
tianyu-l approved these changes on Mar 23, 2026

tianyu-l (Contributor) left a comment:
thanks for the fix, please address nit comments
```python
experts_fsdp_config = fsdp_config.copy()
experts_fsdp_config["mesh"] = edp_mesh
assert edp_mesh is not None
fsdp_size = edp_mesh["efsdp"].size() * ep_degree
```
tianyu-l (Contributor):
Suggested change:

```diff
-fsdp_size = edp_mesh["efsdp"].size() * ep_degree
+efsdp_ep_size = edp_mesh["efsdp"].size() * ep_degree
```
```python
    fsdp_size = edp_mesh["efsdp"].size() * ep_degree
else:
    experts_fsdp_config = fsdp_config
    fsdp_size = fsdp_config["mesh"].size()
```
tianyu-l (Contributor):
Suggested change:

```diff
-fsdp_size = fsdp_config["mesh"].size()
+efsdp_ep_size = fsdp_config["mesh"].size()
```
```diff
-    edp_mesh["efsdp"].size() * ep_degree
-    > transformer_block.moe.experts.num_experts
-):
+if fsdp_size > transformer_block.moe.experts.num_experts:
```
tianyu-l (Contributor):
Suggested change:

```diff
-if fsdp_size > transformer_block.moe.experts.num_experts:
+if efsdp_ep_size > transformer_block.moe.experts.num_experts:
```
```python
# inefficiency due to padding, so we shard on dim-1 (hidden_dim) instead.
if transformer_block.moe_enabled:
    if ep_degree > 1:
        experts_fsdp_config = fsdp_config.copy()
```
tianyu-l (Contributor):
can also change this to efsdp_config to be concise and consistent, but up to you.
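For readability, here is one way the touched block could read with the three suggested renames applied. This is a sketch assembled from the hunks above: indentation is inferred, the body of the final `if` is not shown in the hunks and is elided, and the optional `experts_fsdp_config` → `efsdp_config` rename is not applied.

```python
if transformer_block.moe_enabled:
    if ep_degree > 1:
        experts_fsdp_config = fsdp_config.copy()
        experts_fsdp_config["mesh"] = edp_mesh
        assert edp_mesh is not None
        efsdp_ep_size = edp_mesh["efsdp"].size() * ep_degree
    else:
        experts_fsdp_config = fsdp_config
        efsdp_ep_size = fsdp_config["mesh"].size()

    # Dim-0 (num_experts) sharding would need padding once the sharding
    # degree exceeds the number of experts, so this check gates the
    # dim-1 (hidden_dim) placement; the body is elided here.
    if efsdp_ep_size > transformer_block.moe.experts.num_experts:
        ...
```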
Summary
Previously, routed experts in MoE layers were only separately wrapped with `fully_shard` when `ep_degree > 1`. When `ep_degree == 1`, experts were sharded only as part of the outer TransformerBlock FSDP group, so the `Shard(1)` placement optimization (sharding on hidden_dim instead of num_experts) was never applied.

This PR extends the separate expert FSDP wrapping to also apply when `ep_degree == 1`. When the FSDP degree exceeds `num_experts`, experts are sharded on dim 1 (hidden_dim) to avoid the padding inefficiency of dim-0 sharding, the same optimization that was already in place for `ep > 1`.
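For intuition, a minimal sketch of that placement decision is below. It assumes FSDP2's `fully_shard` with its `shard_placement_fn` keyword; the helper name, argument names, and config dict are illustrative rather than the PR's actual code.

```python
# Minimal sketch of the dim-1 sharding decision, assuming FSDP2's
# fully_shard() and its shard_placement_fn keyword; the helper name and
# the experts_fsdp_config dict are illustrative, not the PR's code.
import torch.nn as nn
from torch.distributed.fsdp import fully_shard
from torch.distributed.tensor import Shard


def shard_routed_experts(
    experts: nn.Module,
    num_experts: int,
    fsdp_size: int,
    experts_fsdp_config: dict,
) -> None:
    """Wrap the routed experts in their own FSDP group.

    When the FSDP sharding degree exceeds the number of experts, dim-0
    (num_experts) sharding would require padding, so shard every expert
    parameter on dim 1 (hidden_dim) instead.
    """
    if fsdp_size > num_experts:
        fully_shard(
            experts,
            **experts_fsdp_config,
            shard_placement_fn=lambda param: Shard(1),
        )
    else:
        fully_shard(experts, **experts_fsdp_config)
```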
Validation

The three tests: