Commit 0ca4a88

Update MoE training in example (#251)
1 parent 49ff522 commit 0ca4a88

File tree

2 files changed: +24, -4 lines

docs/sphinx_doc/source/tutorial/example_megatron.md

Lines changed: 17 additions & 4 deletions
@@ -140,6 +140,10 @@ actor_rollout_ref:
   # Use mBridge for parameter import/export (optional)
   use_mbridge: false
 
+  # Use Megatron checkpoint
+  use_dist_checkpointing: false
+  dist_checkpointing_path: null
+
   # Recomputation settings (helps save memory during training)
   override_transformer_config:
     recompute_granularity: full
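For quick reference, here is a standalone sketch of the three keys this hunk documents, with the defaults the tutorial now lists; the exact nesting level inside the actor config is not visible in the stripped diff, so treat the layout as illustrative only:

```yaml
# Defaults as documented in the tutorial (nesting level is illustrative)
use_mbridge: false              # use mBridge for parameter import/export (optional)
use_dist_checkpointing: false   # load a Megatron (MCore) distributed checkpoint
dist_checkpointing_path: null   # path to the converted checkpoint when enabled
```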
@@ -155,6 +159,8 @@ actor_rollout_ref:
   grad_offload: false
   optimizer_offload: false
   use_mbridge: false
+  use_dist_checkpointing: false
+  dist_checkpointing_path: null
   override_transformer_config:
     recompute_granularity: full
     recompute_method: uniform
@@ -171,6 +177,8 @@ critic:
   grad_offload: false
   optimizer_offload: false
   use_mbridge: false
+  use_dist_checkpointing: false
+  dist_checkpointing_path: null
   override_transformer_config:
     recompute_granularity: full
     recompute_method: uniform
@@ -182,9 +190,14 @@ critic:
 
 ### Training Mixture-of-Experts (MoE) Models
 
-If you're training an MoE model like **Qwen/Qwen3-30B-A3B**, you have two options:
+If you're training an MoE model like **Qwen/Qwen3-30B-A3B**, choose one of the following two approaches:
+
+1. **Use mBridge (recommended)**:
+   Set `use_mbridge: true` in your configuration file. This enables the necessary MoE support directly.
 
-1. **Enable mBridge**: Set `use_mbridge: true` in the config.
-2. **Convert the model first**: Use the [Hugging Face to MCore converter](https://github.com/volcengine/verl/blob/main/scripts/converter_hf_to_mcore.py) from the **verl** to convert your model before training.
+2. **Convert the model manually**:
+   If you prefer not to use mBridge, set `use_mbridge: false`. Before training, convert your Hugging Face model to the MCore format with the [Hugging Face to MCore converter](https://github.com/volcengine/verl/blob/main/scripts/converter_hf_to_mcore.py) from the **verl** repository. After conversion, update your config with:
+   - `use_dist_checkpointing: true`
+   - `dist_checkpointing_path: /PATH/TO/CONVERTED/MODEL/`
 
-> ⚠️ Without one of these steps, MoE models may not load or train correctly.
+> ⚠️ Important: If you skip both steps, the MoE model may fail to load or train correctly. Follow one of the two options above.
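Putting the two documented options into concrete config form, a minimal sketch based only on the keys added in this commit (the path is a placeholder, and the snippets show just the relevant keys, not their enclosing sections):

```yaml
# Option 1: let mBridge provide the required MoE support directly
use_mbridge: true
```

```yaml
# Option 2: convert the HF model to MCore format with verl's converter_hf_to_mcore.py,
# then point training at the converted checkpoint
use_mbridge: false
use_dist_checkpointing: true
dist_checkpointing_path: /PATH/TO/CONVERTED/MODEL/
```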

examples/ppo_countdown_megatron/train_countdown.yaml

Lines changed: 7 additions & 0 deletions
@@ -18,6 +18,9 @@ actor_rollout_ref:
   optimizer_offload: false
   # whether to use mbridge to import/export parameters
   use_mbridge: false
+  # Use Megatron checkpoint
+  use_dist_checkpointing: false
+  dist_checkpointing_path: null
   # recompute settings
   override_transformer_config:
     recompute_granularity: full
@@ -48,6 +51,8 @@ actor_rollout_ref:
   grad_offload: false
   optimizer_offload: false
   use_mbridge: false
+  use_dist_checkpointing: false
+  dist_checkpointing_path: null
   override_transformer_config:
     recompute_granularity: full
     recompute_method: uniform
@@ -67,6 +72,8 @@ critic:
   grad_offload: false
   optimizer_offload: false
   use_mbridge: false
+  use_dist_checkpointing: false
+  dist_checkpointing_path: null
   override_transformer_config:
     recompute_granularity: full
     recompute_method: uniform
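Taken together, the example config now carries the same pair of keys in every Megatron section that already had `use_mbridge`. A rough sketch of the resulting slice of train_countdown.yaml, with intermediate keys elided and the nesting inferred from the hunk headers (treat the exact layout as an assumption):

```yaml
actor_rollout_ref:
  # ... (actor and ref settings elided; the same keys appear in both sections)
  use_mbridge: false
  # Use Megatron checkpoint
  use_dist_checkpointing: false
  dist_checkpointing_path: null

critic:
  # ... (other settings elided)
  use_mbridge: false
  use_dist_checkpointing: false
  dist_checkpointing_path: null
```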
