[megatron] feat: enable Megatron FSDP for SFT training #5854
yxs wants to merge 2 commits into verl-project:main
Conversation
Code Review
This pull request adds support for Megatron FSDP (ZeRO-style sharding) by adding configuration fields, updating checkpointing to skip unsupported states, and introducing deferred model wrapping logic. Feedback points out a logic error in make_megatron_module where FSDP wrapping can be bypassed when wrap_with_ddp is disabled, and recommends a refactor so the configuration is correctly initialized whenever FSDP is enabled.
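The bypass the review flags can be reduced to the guard conditions alone. A minimal, self-contained sketch (using hypothetical stand-in config objects, not verl's real classes) of when FSDP wrapping is silently skipped:

```python
from types import SimpleNamespace

def defers_fsdp_original(cfg):
    # Guard as submitted: FSDP wrapping also requires wrap_with_ddp
    use_fsdp = getattr(cfg, "use_megatron_fsdp", False)
    return use_fsdp and cfg.wrap_with_ddp

def defers_fsdp_suggested(cfg):
    # Reviewer's suggestion: FSDP wrapping depends only on the FSDP flag
    return getattr(cfg, "use_megatron_fsdp", False)

# forward_only-style config: FSDP requested, DDP wrapping off
cfg = SimpleNamespace(use_megatron_fsdp=True, wrap_with_ddp=False)
print(defers_fsdp_original(cfg))   # False: FSDP silently not applied
print(defers_fsdp_suggested(cfg))  # True: FSDP still applied
```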
```diff
-        ddp_config = None
-        if wrap_config.wrap_with_ddp:
-            ddp_config_dict = {
-                "use_distributed_optimizer": wrap_config.use_distributed_optimizer,
-            }
-            if override_ddp_config is not None:
-                ddp_config_dict.update(override_ddp_config)
-            ddp_config = ddp_config_dict
-
-        model = bridge.get_model(
-            post_model_creation_callbacks=post_model_creation_callbacks,
-            wrap_with_ddp=wrap_config.wrap_with_ddp,
-            fp16=tf_config.fp16,
-            bf16=tf_config.bf16,
-            ddp_config=ddp_config,
-        )
+        ddp_config = _build_ddp_config_dict(wrap_config, override_ddp_config)
+
+        use_fsdp = hasattr(wrap_config, "use_megatron_fsdp") and wrap_config.use_megatron_fsdp
+        if use_fsdp and not HAVE_MEGATRON_FSDP:
+            raise ImportError(
+                "engine.use_megatron_fsdp=True requires megatron-fsdp package. "
+                "Install from Megatron-LM dev branch with FSDP support."
+            )
+        if use_fsdp and wrap_config.wrap_with_ddp:
+            # FSDP wrapping deferred to after weight loading (mbridge can't parse FSDP structure)
+            model = bridge.get_model(
+                post_model_creation_callbacks=post_model_creation_callbacks,
+                wrap_with_ddp=False,
+                fp16=tf_config.fp16,
+                bf16=tf_config.bf16,
+                ddp_config=None,
+            )
+            pending_fsdp_config = ddp_config
+        else:
+            model = bridge.get_model(
+                post_model_creation_callbacks=post_model_creation_callbacks,
+                wrap_with_ddp=wrap_config.wrap_with_ddp,
+                fp16=tf_config.fp16,
+                bf16=tf_config.bf16,
+                ddp_config=ddp_config,
+            )
+            pending_fsdp_config = None
```
There's a potential issue where FSDP is silently not applied in certain configurations. Specifically, if use_megatron_fsdp is true but wrap_with_ddp is false (e.g., in forward_only mode), the current logic will not build the FSDP configuration, and the model will not be wrapped with FSDP.
This can lead to unexpected behavior where FSDP is enabled in the config but not actually used.
To fix this, the logic should be adjusted to build the ddp_config if either DDP or FSDP is enabled, and then decide on wrapping based on whether FSDP is being used. This ensures FSDP is correctly applied.
```python
use_fsdp = hasattr(wrap_config, "use_megatron_fsdp") and wrap_config.use_megatron_fsdp
ddp_config = None
if wrap_config.wrap_with_ddp or use_fsdp:
    ddp_config = _build_ddp_config_dict(wrap_config, override_ddp_config)
if use_fsdp and not HAVE_MEGATRON_FSDP:
    raise ImportError(
        "engine.use_megatron_fsdp=True requires megatron-fsdp package. "
        "Install from Megatron-LM dev branch with FSDP support."
    )
if use_fsdp:
    # FSDP wrapping deferred to after weight loading (mbridge can't parse FSDP structure)
    model = bridge.get_model(
        post_model_creation_callbacks=post_model_creation_callbacks,
        wrap_with_ddp=False,
        fp16=tf_config.fp16,
        bf16=tf_config.bf16,
        ddp_config=None,
    )
    pending_fsdp_config = ddp_config
else:
    model = bridge.get_model(
        post_model_creation_callbacks=post_model_creation_callbacks,
        wrap_with_ddp=wrap_config.wrap_with_ddp,
        fp16=tf_config.fp16,
        bf16=tf_config.bf16,
        ddp_config=ddp_config,
    )
    pending_fsdp_config = None
```
Skip FSDP when not training: wrap_with_ddp=False only applies to ref models and forward-only inference, which have no backward pass or optimizer state to shard.
verl/workers/megatron_workers.py
Outdated
```diff
         share_embeddings_and_output_weights=self.share_embeddings_and_output_weights,
         wrap_with_ddp=True,
-        use_distributed_optimizer=self.config.actor.megatron.use_distributed_optimizer,
+        use_distributed_optimizer=megatron_config.use_distributed_optimizer,
```
megatron_workers.py has been deprecated, please do not modify it.
Enable Megatron-LM's native FullyShardedDataParallel in verl's Megatron engine, allowing ZeRO-3 parameter/gradient/optimizer state sharding via engine.use_megatron_fsdp=True. Uses deferred FSDP wrapping to maintain compatibility with mbridge weight loading. Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sync auto-generated config with new FSDP fields in megatron.yaml. Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
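The "deferred FSDP wrapping" the first commit message refers to comes down to ordering: build the plain module, let the weight loader walk it, and only then shard. A minimal sketch of that ordering, with hypothetical callables standing in for the real mbridge and megatron-fsdp calls:

```python
def build_model_deferred(create_model, load_weights, fsdp_wrap, fsdp_config):
    """Create the unwrapped module first so the weight loader can walk a
    plain module tree, then apply FSDP sharding afterwards (if enabled)."""
    model = create_model()      # unwrapped: the loader can parse this structure
    load_weights(model)         # checkpoint goes into plain parameters
    if fsdp_config is not None:
        model = fsdp_wrap(model, fsdp_config)  # shard only after loading
    return model

# Dummy callables that just record the order of operations
events = []
model = build_model_deferred(
    create_model=lambda: {"wrapped": False},
    load_weights=lambda m: events.append("load"),
    fsdp_wrap=lambda m, cfg: (events.append("wrap"), {**m, "wrapped": True})[1],
    fsdp_config={"zero_stage": 3},
)
print(events)            # ['load', 'wrap'] — weights loaded before sharding
print(model["wrapped"])  # True
```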
Force-pushed from 4552820 to 3a6b47f (compare)
@yxs CI failed, please fix it.
@wuxibin89 All 6 CI failures are unrelated to this PR.
Hold until the CI run with the new engine work passes.
What does this PR do?
Enable Megatron-LM's native FullyShardedDataParallel (FSDP) in verl's Megatron engine for SFT training. This allows ZeRO-style parameter/gradient/optimizer state sharding across data-parallel ranks, reducing per-GPU memory usage for large model training.

Related issue: #5836 (Q2 Roadmap — Megatron FSDP enabling)
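As a back-of-envelope illustration of why ZeRO-style sharding reduces per-GPU memory (assuming bf16 params/grads plus fp32 master weights and Adam moments, roughly 16 bytes per parameter; these numbers are illustrative, not measurements from this PR):

```python
def per_gpu_state_gb(n_params, dp_size, zero_stage=0):
    # bytes per parameter: 2 (bf16 param) + 2 (bf16 grad)
    # + 12 (fp32 master weight + Adam first/second moments)
    param, grad, optim = 2.0, 2.0, 12.0
    if zero_stage >= 1:
        optim /= dp_size   # ZeRO-1: shard optimizer state
    if zero_stage >= 2:
        grad /= dp_size    # ZeRO-2: also shard gradients
    if zero_stage >= 3:
        param /= dp_size   # ZeRO-3: also shard parameters
    return n_params * (param + grad + optim) / 1e9

# 1.7B-parameter model across 8 data-parallel ranks
print(per_gpu_state_gb(1.7e9, 8, zero_stage=0))  # ~27.2 GB, fully replicated
print(per_gpu_state_gb(1.7e9, 8, zero_stage=3))  # ~3.4 GB, everything sharded
```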
Checklist Before Starting
- Title follows [{modules}] {type}: {description} (checked by CI). {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off, e.g. [megatron, fsdp, doc].
- {type} is in feat, fix, refactor, chore, test.
- Prepend [BREAKING] to the title for breaking API changes, e.g. [BREAKING][fsdp, megatron] feat: dynamic batching.
Test
Tested on 8×H100 80GB with Qwen3-1.7B-Base, GSM8K SFT dataset, 1 epoch (77 steps).
DDP baseline vs FSDP ZeRO-3:
API and Usage Example
```shell
torchrun --nproc_per_node=8 --nnodes=1 -m verl.trainer.sft_trainer \
    engine=megatron \
    engine.use_mbridge=True \
    engine.use_megatron_fsdp=True \
    engine.megatron_fsdp_zero_stage=3 \
    engine.tensor_model_parallel_size=1 \
    engine.pipeline_model_parallel_size=1 \
    engine.dtype=bfloat16 \
    model.path=<your_model_path> \
    data.train_files=<your_data.parquet> \
    data.train_batch_size=96 \
    data.micro_batch_size_per_gpu=2 \
    data.max_length=2048 \
    optim=megatron \
    optim.lr=2e-5 \
    trainer.total_epochs=1
```

New config fields in engine:
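Going only by the flags in the launch command above, the two new engine fields could be modeled as follows (a sketch; the defaults and validation shown are assumptions, not taken from megatron.yaml):

```python
from dataclasses import dataclass

@dataclass
class MegatronFSDPFields:
    # Field names taken from the launch command; defaults are assumptions.
    use_megatron_fsdp: bool = False    # opt in to Megatron FSDP wrapping
    megatron_fsdp_zero_stage: int = 3  # sharding stage used when FSDP is on

    def validate(self) -> None:
        if self.megatron_fsdp_zero_stage not in (1, 2, 3):
            raise ValueError("megatron_fsdp_zero_stage must be 1, 2, or 3")

fields = MegatronFSDPFields(use_megatron_fsdp=True)
fields.validate()  # passes with the default ZeRO stage
```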
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
```shell
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
```