feat: support muon optimizer#3984

Open
xxyux wants to merge 1 commit into PaddlePaddle:develop from xxyux:feature/add-muon-optimizer

Conversation

xxyux (Contributor) commented Mar 3, 2026

Before submitting

  • Lint code. If there are lint issues, please format the code first.

    ```shell
    # Install and register `pre-commit` in the project folder
    pip install pre-commit && pre-commit install

    # Run the hooks on changed code files individually
    pre-commit run --files XXXX.py
    ```

  • Add test cases into the tests folder. If there are codecov issues, please add test cases first.

PR types

New features

PR changes

Others

Description

Integrate the Muon optimizer into PaddleFormers trainer and add ShardingV3
distributed training support.

Muon optimizer integration

  • trainer.py: create paddle.optimizer.Muon when optim=muon; annotate
    fused QKV weights with per-head metadata (needs_qkv_split, head_num,
    kv_head_num) for per-head orthogonalisation; handle Muon's
    _moment_acc_str (vs AdamW's _moment1_acc_str) in optimizer state
    save/restore
  • trainer_utils.py: add OptimizerNames.MUON and Muon optimizer
    construction logic with default hyperparameters
  • training_args.py: register muon as a valid optimizer choice
  • offload_optimizer.py: monkey-patch Muon's _muon_update and
    _apply_optimize for CPU offload support
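The per-head metadata annotation can be sketched as follows. This is an illustrative reconstruction, not the PR's actual code: the helper name `annotate_qkv_params`, the `qkv_proj` substring heuristic, and the example head counts are all assumptions; only the metadata keys (`needs_qkv_split`, `head_num`, `kv_head_num`) come from the PR description.

```python
# Hypothetical sketch: tag fused QKV weights with per-head metadata so a
# Muon optimizer could orthogonalise each attention head separately.
def annotate_qkv_params(named_params, head_num, kv_head_num):
    """Return a mapping from fused-QKV parameter names to the metadata
    the PR description says the trainer attaches for Muon."""
    annotations = {}
    for name, _param in named_params:
        # Heuristic (assumed): fused QKV projections carry "qkv_proj"
        # in their parameter name.
        if "qkv_proj" in name:
            annotations[name] = {
                "needs_qkv_split": True,
                "head_num": head_num,
                "kv_head_num": kv_head_num,
            }
    return annotations

# Example usage with hypothetical parameter names and head counts.
params = [
    ("layers.0.self_attn.qkv_proj.weight", None),
    ("layers.0.mlp.up_proj.weight", None),
]
meta = annotate_qkv_params(params, head_num=32, kv_head_num=8)
```

Only the fused QKV weight is annotated; other parameters (here the MLP projection) are left untouched and would fall through to Muon's default whole-matrix update.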

ShardingV3 support

  • training_args.py: add sharding_v3 boolean argument, propagated via
    FLAGS_sharding_v3 environment variable
  • trainer_utils.py: DygraphShardingOptimizerV3 initialisation path
  • reshard/sharding_v3.py (new): V3-specific checkpoint reshard logic for
    save/restore with full-parameter ownership model
  • reshard/common.py: add SHARDING_STRATEGY_V3 constant
  • sharding_io.py: adapt optimizer state unwrapping for V3
  • zero_cost_checkpoint.py: adapt EMA and buffer handling for V3
  • moe_hybrid_parallel_optimizer.py: V3 optimizer routing for MoE
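The flag propagation in the first bullet can be sketched as below. This is a minimal illustration of the pattern (training argument exported through an environment variable before the distributed engine initialises); the helper name `propagate_sharding_v3` is hypothetical, while `FLAGS_sharding_v3` is the variable named in the PR.

```python
import os

def propagate_sharding_v3(sharding_v3: bool) -> None:
    """Export FLAGS_sharding_v3 so downstream components can select the
    DygraphShardingOptimizerV3 code path (sketch of the PR's approach)."""
    os.environ["FLAGS_sharding_v3"] = "1" if sharding_v3 else "0"

# Example: a training_args-style boolean is turned into the env flag.
propagate_sharding_v3(True)
```

Using an environment variable here lets framework code that cannot see the `TrainingArguments` object (e.g. the hybrid-parallel optimizer internals) still branch on the V3 strategy.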

Tests

  • tests/muon/test_muon_smoke.py: smoke tests exercising both ShardingV2 and
    ShardingV3 code paths on 2 GPUs with AMP O2, validating loss is finite
    across 3 training steps
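The core assertion of the smoke test can be sketched in isolation. The 2-GPU distributed run and AMP O2 setup are not reproduced here; `all_losses_finite` and the example loss values are hypothetical stand-ins for the per-step losses the real test records.

```python
import math

def all_losses_finite(step_losses) -> bool:
    """True if no training step produced a NaN or Inf loss, the
    validity check the smoke tests apply after 3 steps."""
    return all(math.isfinite(loss) for loss in step_losses)

# A healthy run: losses are finite (and here, decreasing).
ok = all_losses_finite([10.4, 9.8, 9.1])
```

A finiteness check is a deliberately weak but robust criterion for a smoke test: it catches NaN/Inf blow-ups from the optimizer update or mixed precision without depending on exact loss values, which vary across hardware and parallelism configurations.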

paddle-bot commented Mar 3, 2026

Thanks for your contribution!

xxyux (Contributor, Author) commented Mar 4, 2026

/re-run all-failed

xxyux force-pushed the feature/add-muon-optimizer branch 2 times, most recently from 054248e to 546b763 on March 4, 2026 at 12:04
xxyux (Contributor, Author) commented Mar 9, 2026

/re-run all-failed

codecov-commenter commented Mar 9, 2026

Codecov Report

❌ Patch coverage is 14.91228% with 194 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@ed15c99). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| paddleformers/trainer/utils/reshard/sharding_v3.py | 12.06% | 51 Missing ⚠️ |
| paddleformers/trainer/trainer_utils.py | 9.09% | 50 Missing ⚠️ |
| paddleformers/trainer/utils/offload_optimizer.py | 0.00% | 35 Missing ⚠️ |
| paddleformers/trainer/trainer.py | 28.00% | 18 Missing ⚠️ |
| paddleformers/trainer/utils/sharding_io.py | 22.22% | 14 Missing ⚠️ |
| ...ddleformers/utils/moe_hybrid_parallel_optimizer.py | 0.00% | 8 Missing ⚠️ |
| paddleformers/trainer/training_args.py | 12.50% | 7 Missing ⚠️ |
| paddleformers/trainer/utils/reshard/common.py | 41.66% | 7 Missing ⚠️ |
| ...addleformers/trainer/utils/zero_cost_checkpoint.py | 50.00% | 4 Missing ⚠️ |
Additional details and impacted files
```
@@            Coverage Diff             @@
##             develop    #3984   +/-   ##
==========================================
  Coverage           ?   33.69%
==========================================
  Files              ?      453
  Lines              ?    86319
  Branches           ?        0
==========================================
  Hits               ?    29081
  Misses             ?    57238
  Partials           ?        0
```

☔ View full report in Codecov by Sentry.

xxyux force-pushed the feature/add-muon-optimizer branch from 546b763 to 1678630 on March 10, 2026 at 13:18. Commit message:
Muon optimizer integration:
- Create Muon optimizer in trainer when `optim=muon`, with per-head
  QKV metadata annotation for fused QKV weight orthogonalisation
- Handle Muon's `_moment_acc_str` (vs AdamW's `_moment1_acc_str`)
  in optimizer state save/restore
- Add Muon `_muon_update`/`_apply_optimize` offload support in
  `offload_optimizer.py`

ShardingV3 support:
- Add `sharding_v3` training argument and `FLAGS_sharding_v3`
  environment variable dispatch
- Implement `DygraphShardingOptimizerV3` init path in
  `trainer_utils.py`
- Add V3 reshard logic (`reshard/sharding_v3.py`) for checkpoint
  save/restore
- Adapt `sharding_io.py`, `zero_cost_checkpoint.py`, and
  `moe_hybrid_parallel_optimizer.py` for V3 optimizer unwrapping

Tests:
- Add Muon smoke tests (`tests/muon/`) exercising both V2 and V3
  sharding paths on 2 GPUs with AMP O2
xxyux force-pushed the feature/add-muon-optimizer branch from 1678630 to 7db2abc on March 10, 2026 at 13:33