Thanks for your contribution!

Contributor (Author): /re-run all-failed
Codecov Report

```
@@           Coverage Diff            @@
##           develop    #3984   +/-  ##
=========================================
  Coverage         ?   33.69%
=========================================
  Files            ?      453
  Lines            ?    86319
  Branches         ?        0
=========================================
  Hits             ?    29081
  Misses           ?    57238
  Partials         ?        0
=========================================
```
Muon optimizer integration:
- Create the Muon optimizer in the trainer when `optim=muon`, with per-head QKV metadata annotation for fused QKV weight orthogonalisation
- Handle Muon's `_moment_acc_str` (vs AdamW's `_moment1_acc_str`) in optimizer state save/restore
- Add Muon `_muon_update`/`_apply_optimize` offload support in `offload_optimizer.py`

ShardingV3 support:
- Add a `sharding_v3` training argument and `FLAGS_sharding_v3` environment variable dispatch
- Implement the `DygraphShardingOptimizerV3` init path in `trainer_utils.py`
- Add V3 reshard logic (`reshard/sharding_v3.py`) for checkpoint save/restore
- Adapt `sharding_io.py`, `zero_cost_checkpoint.py`, and `moe_hybrid_parallel_optimizer.py` for V3 optimizer unwrapping

Tests:
- Add Muon smoke tests (`tests/muon/`) exercising both V2 and V3 sharding paths on 2 GPUs with AMP O2
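The `_moment_acc_str` vs `_moment1_acc_str` difference noted above can be sketched in plain Python. This is a hypothetical illustration, not the PR's code: the key names, the `collect_state` helper, and the state-dict shape are all assumptions modelled on the accumulator-suffix idea.

```python
# Hypothetical sketch of per-optimizer accumulator-key handling in state
# save/restore: Muon keeps a single momentum buffer while AdamW keeps two,
# so the save/restore path must look up different suffixes per optimizer.
# All identifiers below are illustrative, not PaddleFormers' actual names.

ADAMW_MOMENT_KEYS = ("moment1", "moment2")  # mirrors AdamW's _moment1_acc_str etc.
MUON_MOMENT_KEYS = ("moment",)              # mirrors Muon's _moment_acc_str

def moment_keys_for(optim_name):
    """Return the accumulator suffixes a given optimizer owns."""
    return MUON_MOMENT_KEYS if optim_name == "muon" else ADAMW_MOMENT_KEYS

def collect_state(optim_name, param_names, state_dict):
    """Gather only the accumulator entries this optimizer actually stores."""
    out = {}
    for p in param_names:
        for key in moment_keys_for(optim_name):
            full = f"{p}_{key}"
            if full in state_dict:
                out[full] = state_dict[full]
    return out
```

With a state dict holding `w_moment`, `w_moment1`, and `w_moment2`, the `muon` path picks only the single momentum buffer while the AdamW path picks the two moment buffers, so neither optimizer trips over keys it never created.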
Before submitting
- New tests go under the `tests` folder. If there are codecov issues, please add test cases first.

PR types
New features

PR changes
Others

Description
Integrate the Muon optimizer into the PaddleFormers trainer and add ShardingV3 distributed training support.
Muon optimizer integration
- Create `paddle.optimizer.Muon` when `optim=muon`; annotate fused QKV weights with per-head metadata (`needs_qkv_split`, `head_num`, `kv_head_num`) for per-head orthogonalisation; handle Muon's `_moment_acc_str` (vs AdamW's `_moment1_acc_str`) in optimizer state save/restore
- Add `OptimizerNames.MUON` and the Muon optimizer construction logic with default hyperparameters
- Register `muon` as a valid optimizer choice
- Add `_muon_update` and `_apply_optimize` for CPU offload support
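Muon orthogonalises 2-D weight updates, and a fused QKV projection stacks many attention heads into one matrix, which is why the per-head metadata above matters. A minimal sketch of how `head_num`/`kv_head_num` could translate into per-head row slices follows; the stacking order ([Q heads | K heads | V heads] along dim 0) and uniform `head_dim` are assumptions, and this is not the PR's implementation:

```python
# Illustrative sketch (not the PR's actual code): derive per-head row slices
# of a fused QKV weight so each head's block can be orthogonalised separately.
# Assumes the fused weight stacks [Q heads | K heads | V heads] along dim 0
# with a uniform head_dim; the real layout is an assumption here.

def qkv_head_slices(hidden_size, head_num, kv_head_num):
    """Return (start, end) row ranges, one per Q/K/V head, in stacking order."""
    head_dim = hidden_size // head_num
    slices = []
    offset = 0
    for section_heads in (head_num, kv_head_num, kv_head_num):  # Q, then K, then V
        for _ in range(section_heads):
            slices.append((offset, offset + head_dim))
            offset += head_dim
    return slices
```

For example, with `hidden_size=8`, `head_num=4`, `kv_head_num=2` (a GQA-style layout), each head spans 2 rows and the fused weight yields 8 slices: four for Q, two for K, two for V.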
ShardingV3 support
- Add a `sharding_v3` boolean argument, propagated via the `FLAGS_sharding_v3` environment variable
- Implement the `DygraphShardingOptimizerV3` initialisation path
- Checkpoint save/restore with a full-parameter ownership model
- Add the `SHARDING_STRATEGY_V3` constant
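The flag hand-off described above can be sketched with the standard library. `FLAGS_sharding_v3` and the optimizer class names come from the PR; the surrounding functions and the string return values are illustrative placeholders, not the real dispatch code:

```python
import os

# Minimal sketch of the flag propagation: the training argument sets an
# environment variable, and the optimizer-init path reads it back to choose
# between the V2 and V3 sharding optimizers. Placeholder strings stand in
# for the DygraphShardingOptimizerV2/V3 classes.

def propagate_sharding_flag(sharding_v3):
    """Mirror the sharding_v3 training argument into the environment."""
    os.environ["FLAGS_sharding_v3"] = "1" if sharding_v3 else "0"

def pick_sharding_optimizer():
    """Dispatch on the flag the way an init path might."""
    if os.environ.get("FLAGS_sharding_v3", "0") == "1":
        return "DygraphShardingOptimizerV3"
    return "DygraphShardingOptimizerV2"
```

Routing through an environment variable (rather than a plain function argument) lets deeply nested distributed-init code read the choice without threading a parameter through every call site.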
Tests
- `tests/muon/test_muon_smoke.py`: smoke tests exercising both the ShardingV2 and ShardingV3 code paths on 2 GPUs with AMP O2, validating that the loss stays finite across 3 training steps
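The real smoke test drives a 2-GPU AMP-O2 trainer; the finite-loss assertion pattern it describes looks roughly like the sketch below, where `dummy_train_step` is an invented stand-in for a full optimizer step:

```python
import math

# Toy stand-in for the smoke test's core check: run a few steps and assert
# every reported loss is finite (no NaN/Inf from AMP or the optimizer).
# dummy_train_step is a synthetic placeholder, not a real trainer call.

def dummy_train_step(step):
    """Pretend to run one training step; return a synthetic decreasing loss."""
    return 1.0 / (step + 1)

def run_smoke(num_steps=3):
    """Collect losses over num_steps steps and assert each is finite."""
    losses = [dummy_train_step(s) for s in range(num_steps)]
    for loss in losses:
        assert math.isfinite(loss), f"non-finite loss: {loss}"
    return losses
```

Checking `math.isfinite` on every step (rather than only the last) catches a transient NaN that a later step might mask, which is the usual point of a smoke test over multiple steps.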