
[FSDP2] enable per-param mesh FSDP2 for MoE#2281

Open
weifengpy wants to merge 2 commits into pytorch:main from weifengpy:per-param-mesh

Conversation

@weifengpy
Contributor

@weifengpy weifengpy commented Jan 28, 2026

command: NGPU=8 MODULE=deepseek_v3 CONFIG=deepseek_v3_16b ./run_train.sh --training.steps 20 --parallelism.expert-parallel-degree 4

FSDP2 now supports per-param mesh: pytorch/pytorch#173509

This PR applies fully_shard on each transformer_block, sharding expert params on edp_mesh and all other params on dp_mesh. The FSDPModule schedules 2 all-gathers sequentially: the 1st for the transformer block's non-expert params, the 2nd for the experts.

def _shard_placement_fn(param: nn.Parameter) -> ShardPlacementResult:
    if param in expert_params:
        # Expert parameters: use Shard(1) on edp_mesh
        return ShardPlacementResult(
            placement=Shard(1), mesh_info=edp_mesh_info
        )
    else:
        # Non-expert parameters: use Shard(0) on dp_mesh
        return ShardPlacementResult(
            placement=Shard(0), mesh_info=dp_mesh_info
        )
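For readers without the pytorch PR applied, the routing decision above can be modeled without torch. Below is a minimal, dependency-free sketch: the `Shard` / `ShardPlacementResult` classes are simplified stand-ins for the real types in pytorch/pytorch#173509 (which carry a `Placement` and an FSDP mesh-info object, not strings), and the param names and mesh labels are made-up examples.

```python
from dataclasses import dataclass

# Simplified stand-ins for the real FSDP2 types from pytorch/pytorch#173509;
# the actual classes live under torch.distributed.fsdp.
@dataclass(frozen=True)
class Shard:
    dim: int

@dataclass(frozen=True)
class ShardPlacementResult:
    placement: Shard
    mesh_info: str  # the real code carries a mesh-info object, not a string

def make_shard_placement_fn(expert_param_names: set[str]):
    """Return a per-param routing fn: expert params -> Shard(1) on edp_mesh,
    everything else -> Shard(0) on dp_mesh."""
    def _shard_placement_fn(param_name: str) -> ShardPlacementResult:
        if param_name in expert_param_names:
            return ShardPlacementResult(placement=Shard(1), mesh_info="edp_mesh")
        return ShardPlacementResult(placement=Shard(0), mesh_info="dp_mesh")
    return _shard_placement_fn

fn = make_shard_placement_fn({"moe.experts.w1", "moe.experts.w2"})
print(fn("moe.experts.w1"))  # experts route to Shard(1) on edp_mesh
print(fn("attention.wq"))    # everything else to Shard(0) on dp_mesh
```

In the real API the callback receives an `nn.Parameter` (hence the `param in expert_params` membership test in the PR body) rather than a name string; the string keys here are only to keep the sketch self-contained.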

This makes it possible to apply torch.compile on each transformer_block. I didn't enable per-block compile yet because there is still a gap in torch.compile + AC + MoE: #2341

AG order in forward is exactly the same before and after this change.

AG order in backward is different, but better:

Explicit Backward AllGather Order
  layers.7       @ 118.83ms   (attention/ffn params)
  layers.6       @ 121.52ms   (attention/ffn params)
  layers.6.moe   @ 122.04ms   (MoE expert params)
  layers.7.moe   @ 125.81ms   (MoE expert params)  ← delayed!

Per-param Backward AllGather Order
  layers.7       @ 114.30ms   (first FSDP unit)
  layers.7       @ 115.14ms   (second FSDP unit, includes MoE)
  layers.6       @ 117.42ms   (first FSDP unit)
  layers.6       @ 117.89ms   (second FSDP unit, includes MoE)

Numerics remain bitwise identical with and without this change:

Loss Comparison
  ┌──────┬───────────────┬───────────────┬───────┐
  │ Step │ Old (0d93c63) │ New (e1c47c8) │ Match │
  ├──────┼───────────────┼───────────────┼───────┤
  │ 1    │ 8.01151657    │ 8.01151657    │ ✓     │
  │ 5    │ 3.85572004    │ 3.85572004    │ ✓     │
  │ 10   │ 3.15517211    │ 3.15517211    │ ✓     │
  │ 15   │ 3.07873583    │ 3.07873583    │ ✓     │
  │ 20   │ 2.92206621    │ 2.92206621    │ ✓     │
  │ 25   │ 2.89102936    │ 2.89102936    │ ✓     │
  │ 30   │ 2.81201696    │ 2.81201696    │ ✓     │
  │ 35   │ 2.84123349    │ 2.84123349    │ ✓     │
  │ 40   │ 2.76206398    │ 2.76206398    │ ✓     │
  │ 45   │ 2.82969308    │ 2.82969308    │ ✓     │
  │ 50   │ 2.77560568    │ 2.77560568    │ ✓     │
  │ 55   │ 2.75578761    │ 2.75578761    │ ✓     │
  │ 60   │ 2.75143075    │ 2.75143075    │ ✓     │
  │ 65   │ 2.74203372    │ 2.74203372    │ ✓     │
  │ 70   │ 2.71638918    │ 2.71638918    │ ✓     │
  │ 75   │ 2.74999237    │ 2.74999237    │ ✓     │
  │ 80   │ 2.75584078    │ 2.75584078    │ ✓     │
  │ 85   │ 2.74837303    │ 2.74837303    │ ✓     │
  │ 90   │ 2.72101045    │ 2.72101045    │ ✓     │
  │ 95   │ 2.73645735    │ 2.73645735    │ ✓     │
  │ 100  │ 2.70604038    │ 2.70604038    │ ✓     │
  └──────┴───────────────┴───────────────┴───────┘
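The bitwise comparison above can be reproduced mechanically by diffing the loss columns of two training logs as strings. A small sketch, assuming each run emits lines like `step: 10  loss: 3.15517211` (this log format is an assumption for illustration, not torchtitan's exact output):

```python
import re

# Assumed log line shape: "step: <n>  loss: <value>"
STEP_LOSS = re.compile(r"step:\s*(\d+)\s+loss:\s*([\d.]+)")

def parse_losses(log_text: str) -> dict[int, str]:
    """Map step -> loss string; compare strings rather than floats so the
    check is bitwise, not approximate."""
    return {int(step): loss for step, loss in STEP_LOSS.findall(log_text)}

def losses_match(old_log: str, new_log: str) -> bool:
    """True iff every step present in both logs has an identical loss string."""
    old, new = parse_losses(old_log), parse_losses(new_log)
    common = old.keys() & new.keys()
    return bool(common) and all(old[s] == new[s] for s in common)

old = "step: 1  loss: 8.01151657\nstep: 5  loss: 3.85572004"
new = "step: 1  loss: 8.01151657\nstep: 5  loss: 3.85572004"
print(losses_match(old, new))  # True
```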

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jan 28, 2026
@weifengpy weifengpy marked this pull request as draft January 28, 2026 20:44
@weifengpy weifengpy changed the title [FSDP2] enable per-param mesh FSDP2 for MoE and per-layer compile [WIP][FSDP2] enable per-param mesh FSDP2 for MoE and per-layer compile Jan 28, 2026
@weifengpy weifengpy force-pushed the per-param-mesh branch 6 times, most recently from 3c36e53 to 3bf7e27 Compare February 7, 2026 01:47
@weifengpy weifengpy changed the title [WIP][FSDP2] enable per-param mesh FSDP2 for MoE and per-layer compile [FSDP2] enable per-param mesh FSDP2 for MoE and per-layer compile Feb 7, 2026
@weifengpy weifengpy changed the title [FSDP2] enable per-param mesh FSDP2 for MoE and per-layer compile [FSDP2] enable per-param mesh FSDP2 for MoE Feb 7, 2026
weifengpy added a commit to pytorch/pytorch that referenced this pull request Feb 9, 2026
This PR applies fully_shard on each transformer_block, sharding expert params on edp_mesh and all other params on dp_mesh. The FSDPModule schedules 2 all-gathers sequentially: the 1st for the transformer block's non-expert params, the 2nd for the experts.

see torchtitan for AG/RS schedules and numeric experiments: pytorch/torchtitan#2281

existing FSDP2 callsites won't be affected because _shard_placement_fn -> ShardPlacementResult is a new code path

[ghstack-poisoned]
weifengpy added a commit to pytorch/pytorch that referenced this pull request Feb 10, 2026
This PR applies fully_shard on each transformer_block, sharding expert params on edp_mesh and all other params on dp_mesh. The FSDPModule schedules 2 all-gathers sequentially: the 1st for the transformer block's non-expert params, the 2nd for the experts.

see torchtitan for AG/RS schedules and numeric experiments: pytorch/torchtitan#2281

existing FSDP2 callsites won't be affected because _shard_placement_fn -> ShardPlacementResult is a new code path

checked backward compatibility:
* pytorch: fsdp2_mem_tracker.py is affected, but only if people use it with per-param mesh. I don't think it's a hard blocker
* torchtitan: no usages of _fsdp_param_group (singular). Safe.
* torchao: no usages of _fsdp_param_group (singular). Safe.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx kadeng chauhang amjames Lucaskabela jataylo

[ghstack-poisoned]
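The "no usages" claims above are the kind of check a grep can do; one subtlety is rejecting the plural `_fsdp_param_groups` while matching the singular. A sketch of one way to scan a source tree (the scanner function and its use of `*.py` globbing are illustrative assumptions, not a tool from this PR):

```python
import re
from pathlib import Path

# \b after "group" rejects the plural "_fsdp_param_groups".
PATTERN = re.compile(r"_fsdp_param_group\b")

def find_usages(root: str) -> list[tuple[str, int, str]]:
    """Return (file, line_no, line) for each use of the singular attribute
    in Python files under root."""
    hits = []
    for path in Path(root).rglob("*.py"):
        for i, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if PATTERN.search(line):
                hits.append((str(path), i, line.strip()))
    return hits

# The word boundary does the filtering: singular matches, plural does not.
print(bool(PATTERN.search("group = module._fsdp_param_group")))      # True
print(bool(PATTERN.search("for g in state._fsdp_param_groups:")))   # False
```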
weifengpy added a commit to pytorch/pytorch that referenced this pull request Feb 10, 2026
this PR applies fully_shard on transformer_block, sharding experts on edp_mesh, and other params on dp_mesh. FSDPModule schedule 2 all-gather sequentially: 1st on transformer blocks, 2nd on experts

see torchtitan for AG/RS schedules and numeric experiments: pytorch/torchtitan#2281

existing fsdp2 callsite won't be affected because _shard_placement_fn -> ShardPlacementResult is a new code path

checked backward-compatiblibility
* pytorch: fsdp2_mem_tracker.py is affected, but only if people use it with per-param mesh. I don't think it's a hard blocker
* torchtitan: No usages of _fsdp_param_group (singular). Safe.                                                                                                                                                                              
* torchao: No usages of _fsdp_param_group (singular). Safe.                                                                                                                                                                                 




cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx kadeng chauhang amjames Lucaskabela jataylo

[ghstack-poisoned]
weifengpy added a commit to pytorch/pytorch that referenced this pull request Feb 10, 2026
this PR applies fully_shard on transformer_block, sharding experts on edp_mesh, and other params on dp_mesh. FSDPModule schedule 2 all-gather sequentially: 1st on transformer blocks, 2nd on experts

see torchtitan for AG/RS schedules and numeric experiments: pytorch/torchtitan#2281

existing fsdp2 callsite won't be affected because _shard_placement_fn -> ShardPlacementResult is a new code path

checked backward-compatiblibility
* pytorch: fsdp2_mem_tracker.py is affected, but only if people use it with per-param mesh. I don't think it's a hard blocker
* torchtitan: No usages of _fsdp_param_group (singular). Safe.                                                                                                                                                                              
* torchao: No usages of _fsdp_param_group (singular). Safe.                                                                                                                                                                                 




cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx kadeng chauhang amjames Lucaskabela jataylo

[ghstack-poisoned]
weifengpy added a commit to pytorch/pytorch that referenced this pull request Feb 10, 2026
this PR applies fully_shard on transformer_block, sharding experts on edp_mesh, and other params on dp_mesh. FSDPModule schedule 2 all-gather sequentially: 1st on transformer blocks, 2nd on experts

see torchtitan for AG/RS schedules and numeric experiments: pytorch/torchtitan#2281

existing fsdp2 callsite won't be affected because _shard_placement_fn -> ShardPlacementResult is a new code path

checked backward-compatiblibility
* pytorch: fsdp2_mem_tracker.py is affected, but only if people use it with per-param mesh. I don't think it's a hard blocker
* torchtitan: No usages of _fsdp_param_group (singular). Safe.                                                                                                                                                                              
* torchao: No usages of _fsdp_param_group (singular). Safe.                                                                                                                                                                                 




cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx kadeng chauhang amjames Lucaskabela jataylo

[ghstack-poisoned]
weifengpy added a commit to pytorch/pytorch that referenced this pull request Feb 10, 2026
this PR applies fully_shard on transformer_block, sharding experts on edp_mesh, and other params on dp_mesh. FSDPModule schedule 2 all-gather sequentially: 1st on transformer blocks, 2nd on experts

see torchtitan for AG/RS schedules and numeric experiments: pytorch/torchtitan#2281

existing fsdp2 callsite won't be affected because _shard_placement_fn -> ShardPlacementResult is a new code path

checked backward-compatiblibility
* pytorch: fsdp2_mem_tracker.py is affected, but only if people use it with per-param mesh. I don't think it's a hard blocker
* torchtitan: No usages of _fsdp_param_group (singular). Safe.                                                                                                                                                                              
* torchao: No usages of _fsdp_param_group (singular). Safe.                                                                                                                                                                                 




cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx kadeng chauhang amjames Lucaskabela jataylo

[ghstack-poisoned]
weifengpy added a commit to pytorch/pytorch that referenced this pull request Feb 10, 2026
this PR applies fully_shard on transformer_block, sharding experts on edp_mesh, and other params on dp_mesh. FSDPModule schedule 2 all-gather sequentially: 1st on transformer blocks, 2nd on experts

see torchtitan for AG/RS schedules and numeric experiments: pytorch/torchtitan#2281

existing fsdp2 callsite won't be affected because _shard_placement_fn -> ShardPlacementResult is a new code path

checked backward-compatiblibility
* pytorch: fsdp2_mem_tracker.py is affected, but only if people use it with per-param mesh. I don't think it's a hard blocker
* torchtitan: No usages of _fsdp_param_group (singular). Safe.                                                                                                                                                                              
* torchao: No usages of _fsdp_param_group (singular). Safe.                                                                                                                                                                                 




cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx kadeng chauhang amjames Lucaskabela jataylo

[ghstack-poisoned]
weifengpy added a commit to pytorch/pytorch that referenced this pull request Feb 10, 2026
this PR applies fully_shard on transformer_block, sharding experts on edp_mesh, and other params on dp_mesh. FSDPModule schedule 2 all-gather sequentially: 1st on transformer blocks, 2nd on experts

see torchtitan for AG/RS schedules and numeric experiments: pytorch/torchtitan#2281

existing fsdp2 callsite won't be affected because _shard_placement_fn -> ShardPlacementResult is a new code path

checked backward-compatiblibility
* pytorch: fsdp2_mem_tracker.py is affected, but only if people use it with per-param mesh. I don't think it's a hard blocker
* torchtitan: No usages of _fsdp_param_group (singular). Safe.                                                                                                                                                                              
* torchao: No usages of _fsdp_param_group (singular). Safe.                                                                                                                                                                                 




cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx kadeng chauhang amjames Lucaskabela jataylo

[ghstack-poisoned]
weifengpy added a commit to pytorch/pytorch that referenced this pull request Feb 10, 2026
this PR applies fully_shard on transformer_block, sharding experts on edp_mesh, and other params on dp_mesh. FSDPModule schedule 2 all-gather sequentially: 1st on transformer blocks, 2nd on experts

see torchtitan for AG/RS schedules and numeric experiments: pytorch/torchtitan#2281

existing fsdp2 callsite won't be affected because _shard_placement_fn -> ShardPlacementResult is a new code path

checked backward-compatiblibility
* pytorch: fsdp2_mem_tracker.py is affected, but only if people use it with per-param mesh. I don't think it's a hard blocker
* torchtitan: No usages of _fsdp_param_group (singular). Safe.                                                                                                                                                                              
* torchao: No usages of _fsdp_param_group (singular). Safe.                                                                                                                                                                                 




cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx kadeng chauhang amjames Lucaskabela jataylo

[ghstack-poisoned]
weifengpy added a commit to pytorch/pytorch that referenced this pull request Feb 10, 2026
this PR applies fully_shard on transformer_block, sharding experts on edp_mesh, and other params on dp_mesh. FSDPModule schedule 2 all-gather sequentially: 1st on transformer blocks, 2nd on experts

see torchtitan for AG/RS schedules and numeric experiments: pytorch/torchtitan#2281

existing fsdp2 callsite won't be affected because _shard_placement_fn -> ShardPlacementResult is a new code path

checked backward-compatiblibility
* pytorch: fsdp2_mem_tracker.py is affected, but only if people use it with per-param mesh. I don't think it's a hard blocker
* torchtitan: No usages of _fsdp_param_group (singular). Safe.                                                                                                                                                                              
* torchao: No usages of _fsdp_param_group (singular). Safe.                                                                                                                                                                                 




cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx kadeng chauhang amjames Lucaskabela jataylo

[ghstack-poisoned]
weifengpy added a commit to pytorch/pytorch that referenced this pull request Feb 10, 2026
@weifengpy
Copy link
Contributor Author

those CI errors are because we need to land pytorch/pytorch#173509 first

Copy link
Contributor

@tianyu-l tianyu-l left a comment


SGTM, is there a plan for solving #2341?

Also, not sure if it's blocked by the issue, but we should modify apply_compile to remove the previous fine-grained compilation workaround code.

@xmfan
Copy link
Member

xmfan commented Feb 10, 2026

@tianyu-l We're still gonna need the fine-grained workarounds at least for mxfp8, due to #2250 (comment). Until that issue is fixed (dynamo bwd tracing of autograd functions), we will trace the wrong bwd graph.

@tianyu-l
Copy link
Contributor

@xmfan oh, I didn't know. I marked it as high priority for now.

@weifengpy
Copy link
Contributor Author

Also, not sure if it's blocked by the issue, but we should modify apply_compile to remove the previous workaround fine-grained compilation code.

I didn't change apply_compile because of #2341. #2250 (comment) is new to me

Copy link
Contributor

@tianyu-l tianyu-l left a comment


sgtm


Labels

ciflow/8gpu CLA Signed This label is managed by the Meta Open Source bot.

Projects

Status: Todo
