
Commit 6c130eb: [megatron] support gpt-oss (#6823)

1 parent: 0679066

File tree: 22 files changed (+263, -56 lines)

docs/source/Instruction/Command-line-parameters.md

Lines changed: 1 addition & 1 deletion
@@ -454,7 +454,7 @@ Vera uses the three parameters `target_modules`, `target_regex`, and `modules_to_save`,
- 🔥create_checkpoint_symlink: Creates additional checkpoint symlinks to facilitate writing automated training scripts. The symlink paths for best_model and last_model are f'{output_dir}/best' and f'{output_dir}/last' respectively.
- 🔥packing: Packs data samples of varying lengths into samples of uniform length, achieving load balancing across nodes and processes during training (preventing long texts from slowing down short-text training), thereby improving GPU utilization and keeping memory usage stable. When using `--attn_impl flash_attn`, different sequences within a packed sample remain independent and invisible to each other. Defaults to `False`; currently supports CPT/SFT/DPO/KTO/GKD. Note: **packing reduces the number of dataset samples, so adjust gradient accumulation steps and learning rate accordingly**.
- packing_length: The length used for packing. Defaults to None, in which case it is set to max_length.
-- packing_num_proc: Number of processes for packing, default is 1. Note that different values of `packing_num_proc` produce different packed datasets. (This parameter does not take effect during streaming packing.)
+- packing_num_proc: Number of processes for packing, default is 1. Note that different values of `packing_num_proc` produce different packed datasets. (This parameter does not take effect during streaming packing.) There is usually no need to change this value, since packing is much faster than tokenization.
- lazy_tokenize: Whether to use lazy tokenization. If set to False, all dataset samples are tokenized before training (for multimodal models this includes reading images from disk). Defaults to None: False for LLM training and True for MLLM training, to save memory.
  - Note: To perform image data augmentation, set lazy_tokenize (or streaming) to True and modify the encode method of the Template class.
- use_logits_to_keep: Passes logits_to_keep in `forward` based on labels, avoiding the computation and storage of unnecessary logits, which reduces memory usage and speeds up training. Defaults to None (automatic selection).

docs/source/Instruction/Supported-models-and-datasets.md

Lines changed: 2 additions & 2 deletions
@@ -336,8 +336,8 @@
|[01ai/Yi-Coder-1.5B-Chat](https://modelscope.cn/models/01ai/Yi-Coder-1.5B-Chat)|yi_coder|yi_coder|-|✔|coding|[01-ai/Yi-Coder-1.5B-Chat](https://huggingface.co/01-ai/Yi-Coder-1.5B-Chat)|
|[01ai/Yi-Coder-9B-Chat](https://modelscope.cn/models/01ai/Yi-Coder-9B-Chat)|yi_coder|yi_coder|-|✔|coding|[01-ai/Yi-Coder-9B-Chat](https://huggingface.co/01-ai/Yi-Coder-9B-Chat)|
|[SUSTC/SUS-Chat-34B](https://modelscope.cn/models/SUSTC/SUS-Chat-34B)|sus|sus|-|✔|-|[SUSTech/SUS-Chat-34B](https://huggingface.co/SUSTech/SUS-Chat-34B)|
-|[openai-mirror/gpt-oss-20b](https://modelscope.cn/models/openai-mirror/gpt-oss-20b)|gpt_oss|gpt_oss|transformers>=4.55|✘|-|[openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b)|
-|[openai-mirror/gpt-oss-120b](https://modelscope.cn/models/openai-mirror/gpt-oss-120b)|gpt_oss|gpt_oss|transformers>=4.55|✘|-|[openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b)|
+|[openai-mirror/gpt-oss-20b](https://modelscope.cn/models/openai-mirror/gpt-oss-20b)|gpt_oss|gpt_oss|transformers>=4.55|✔|-|[openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b)|
+|[openai-mirror/gpt-oss-120b](https://modelscope.cn/models/openai-mirror/gpt-oss-120b)|gpt_oss|gpt_oss|transformers>=4.55|✔|-|[openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b)|
|[ByteDance-Seed/Seed-OSS-36B-Instruct](https://modelscope.cn/models/ByteDance-Seed/Seed-OSS-36B-Instruct)|seed_oss|seed_oss|transformers>=4.56|✘|-|[ByteDance-Seed/Seed-OSS-36B-Instruct](https://huggingface.co/ByteDance-Seed/Seed-OSS-36B-Instruct)|
|[ByteDance-Seed/Seed-OSS-36B-Base](https://modelscope.cn/models/ByteDance-Seed/Seed-OSS-36B-Base)|seed_oss|seed_oss|transformers>=4.56|✘|-|[ByteDance-Seed/Seed-OSS-36B-Base](https://huggingface.co/ByteDance-Seed/Seed-OSS-36B-Base)|
|[ByteDance-Seed/Seed-OSS-36B-Base-woSyn](https://modelscope.cn/models/ByteDance-Seed/Seed-OSS-36B-Base-woSyn)|seed_oss|seed_oss|transformers>=4.56|✘|-|[ByteDance-Seed/Seed-OSS-36B-Base-woSyn](https://huggingface.co/ByteDance-Seed/Seed-OSS-36B-Base-woSyn)|

docs/source/Megatron-SWIFT/Command-line-parameters.md

Lines changed: 5 additions & 1 deletion
@@ -161,13 +161,17 @@
- num_attention_heads: Number of transformer attention heads, default is None.
- group_query_attention: Default is None. If `num_query_groups > 1`, group_query_attention is set to True, otherwise False.
- num_query_groups: Default is 1.
+- softmax_type: The softmax type used in the attention mechanism. Supports both fixed-offset and learnable-offset variants. Options are 'vanilla', 'off-by-one', and 'learnable'; default is 'vanilla'.
- max_position_embeddings: Maximum length of positional embeddings, default is None.
- position_embedding_type: Type of positional embedding, options are 'learned_absolute', 'rope', 'mrope', 'relative', and 'none'. Default is 'rope'.
- rotary_base: Default is 10000.
- rotary_percent: Default is 1.
- normalization: Options are 'LayerNorm', 'RMSNorm'. Default is RMSNorm.
- norm_epsilon: Default is 1e-5.
- swiglu: Uses swiglu instead of the default gelu. Default is True.
+- quick_geglu: Use the quick geglu activation instead of the default gelu. Default is False.
+- activation_func_clamp_value: Clamps the output value range of linear_fc1 in the activation function. Only used when `activation_func` is `quick_gelu`. Default is None.
+- glu_linear_offset: Offset term in the GLU activation function: `activation_func(x[0]) * (x[1] + offset)`. Only used when gated_linear_unit is True. Default is 0.
- untie_embeddings_and_output_weights: Unties embedding and output weights. Default is True.
- disable_bias_linear: Disables bias in linear layers. Default is True.
- add_qkv_bias: Adds bias only to QKV linear layers. Default is True.
@@ -282,7 +286,7 @@ Megatron training parameters are inherited from Megatron parameters and basic parameters (**shared with ms-swift
- gradient_checkpointing_kwargs: Arguments passed to `torch.utils.checkpoint`, e.g. `--gradient_checkpointing_kwargs '{"use_reentrant": false}'`. Defaults to None. This parameter only takes effect when `vit_gradient_checkpointing` is enabled.
- 🔥packing: Packs data samples of varying lengths into samples of uniform length, achieving load balancing across nodes and processes during training (preventing long texts from slowing down short-text training), thereby improving GPU utilization and keeping memory usage stable. When using `--attention_backend flash`, different sequences within a packed sample remain independent and invisible to each other (except for Qwen3-Next, which contains linear attention). Defaults to `False`. All Megatron-SWIFT training tasks support this parameter. Note: **packing reduces the number of dataset samples, so adjust gradient accumulation steps and learning rate accordingly**.
- packing_length: The length used for packing. Defaults to None, in which case it is set to max_length.
-- packing_num_proc: Number of processes for packing, default is 1. Note that different values of `packing_num_proc` produce different packed datasets. (This parameter does not take effect during streaming packing.)
+- packing_num_proc: Number of processes for packing, default is 1. Note that different values of `packing_num_proc` produce different packed datasets. (This parameter does not take effect during streaming packing.) There is usually no need to change this value, since packing is much faster than tokenization.
- streaming: Stream data loading and processing, default is False. (The shuffling of streaming datasets is not thorough, which may cause large loss fluctuations.)
  - Note: Since the length of a streaming dataset cannot be determined, the `--train_iters` parameter must be set. Also set `max_epochs` to ensure training exits after the specified number of epochs and that the weights are validated and saved.
  - Note: Streaming datasets can skip the preprocessing wait by overlapping preprocessing with training. Preprocessing for streaming datasets is performed only on rank 0 and synchronized to the other processes via data distribution, **which is generally less efficient than the data-sharding approach used by non-streaming datasets**. When the training world_size is large, preprocessing and data distribution can become a training bottleneck.

docs/source_en/Instruction/Command-line-parameters.md

Lines changed: 1 addition & 1 deletion
@@ -463,7 +463,7 @@ Training arguments include the [base arguments](#base-arguments), [Seq2SeqTraine
- 🔥create_checkpoint_symlink: Creates additional checkpoint symlinks to facilitate writing automated training scripts. The symlink paths for `best_model` and `last_model` are `f'{output_dir}/best'` and `f'{output_dir}/last'` respectively.
- 🔥packing: Packs data samples of varying lengths into samples of uniform length, achieving load balancing across nodes and processes during training (preventing long texts from slowing down short text training), thereby improving GPU utilization and maintaining stable memory usage. When using `--attn_impl flash_attn`, it ensures that different sequences within packed samples remain independent and invisible to each other. This parameter defaults to `False` and currently supports CPT/SFT/DPO/KTO/GKD. Note: **packing will reduce the number of dataset samples, please adjust gradient accumulation steps and learning rate accordingly**.
- packing_length: the length to use for packing. Defaults to None, in which case it is set to max_length.
-- packing_num_proc: Number of processes for packing, default is 1. Note that different values of `packing_num_proc` will result in different packed datasets. (This parameter does not take effect during streaming packing.)
+- packing_num_proc: Number of processes for packing, default is 1. Note that different values of `packing_num_proc` will result in different packed datasets. (This parameter does not take effect during streaming packing.) There is usually no need to modify this value, as packing is much faster than tokenization.
- lazy_tokenize: Whether to use lazy tokenization. If set to `False`, all dataset samples will be tokenized (and for multimodal models, images will be loaded from disk) before training begins. Default is `None`: in LLM training, it defaults to `False`; in MLLM training, it defaults to `True` to save memory.
  - Note: If you want to perform image data augmentation, you need to set `lazy_tokenize` (or `streaming`) to True and modify the `encode` method in the Template class.
- use_logits_to_keep: Pass `logits_to_keep` in the `forward` method based on labels to reduce the computation and storage of unnecessary logits, thereby reducing memory usage and accelerating training. The default is `None`, which enables automatic selection.
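
As a quick illustration of the packing flags documented above, the following is a minimal sketch of an ms-swift SFT command. It is not part of this commit; the model and dataset are arbitrary placeholders, and only the packing-related flags are the point:

# Minimal sketch (not from this commit): enabling packing in `swift sft`.
# Model and dataset below are illustrative placeholders.
CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset 'AI-ModelScope/alpaca-gpt4-data-en#2000' \
    --train_type lora \
    --attn_impl flash_attn \
    --packing true \
    --packing_num_proc 4 \
    --max_length 4096

Because packing reduces the number of dataset samples, the gradient accumulation steps and learning rate may need adjusting, as noted above.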

docs/source_en/Instruction/Supported-models-and-datasets.md

Lines changed: 2 additions & 2 deletions
@@ -336,8 +336,8 @@ The table below introduces the models integrated with ms-swift:
|[01ai/Yi-Coder-1.5B-Chat](https://modelscope.cn/models/01ai/Yi-Coder-1.5B-Chat)|yi_coder|yi_coder|-|✔|coding|[01-ai/Yi-Coder-1.5B-Chat](https://huggingface.co/01-ai/Yi-Coder-1.5B-Chat)|
|[01ai/Yi-Coder-9B-Chat](https://modelscope.cn/models/01ai/Yi-Coder-9B-Chat)|yi_coder|yi_coder|-|✔|coding|[01-ai/Yi-Coder-9B-Chat](https://huggingface.co/01-ai/Yi-Coder-9B-Chat)|
|[SUSTC/SUS-Chat-34B](https://modelscope.cn/models/SUSTC/SUS-Chat-34B)|sus|sus|-|✔|-|[SUSTech/SUS-Chat-34B](https://huggingface.co/SUSTech/SUS-Chat-34B)|
-|[openai-mirror/gpt-oss-20b](https://modelscope.cn/models/openai-mirror/gpt-oss-20b)|gpt_oss|gpt_oss|transformers>=4.55|✘|-|[openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b)|
-|[openai-mirror/gpt-oss-120b](https://modelscope.cn/models/openai-mirror/gpt-oss-120b)|gpt_oss|gpt_oss|transformers>=4.55|✘|-|[openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b)|
+|[openai-mirror/gpt-oss-20b](https://modelscope.cn/models/openai-mirror/gpt-oss-20b)|gpt_oss|gpt_oss|transformers>=4.55|✔|-|[openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b)|
+|[openai-mirror/gpt-oss-120b](https://modelscope.cn/models/openai-mirror/gpt-oss-120b)|gpt_oss|gpt_oss|transformers>=4.55|✔|-|[openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b)|
|[ByteDance-Seed/Seed-OSS-36B-Instruct](https://modelscope.cn/models/ByteDance-Seed/Seed-OSS-36B-Instruct)|seed_oss|seed_oss|transformers>=4.56|✘|-|[ByteDance-Seed/Seed-OSS-36B-Instruct](https://huggingface.co/ByteDance-Seed/Seed-OSS-36B-Instruct)|
|[ByteDance-Seed/Seed-OSS-36B-Base](https://modelscope.cn/models/ByteDance-Seed/Seed-OSS-36B-Base)|seed_oss|seed_oss|transformers>=4.56|✘|-|[ByteDance-Seed/Seed-OSS-36B-Base](https://huggingface.co/ByteDance-Seed/Seed-OSS-36B-Base)|
|[ByteDance-Seed/Seed-OSS-36B-Base-woSyn](https://modelscope.cn/models/ByteDance-Seed/Seed-OSS-36B-Base-woSyn)|seed_oss|seed_oss|transformers>=4.56|✘|-|[ByteDance-Seed/Seed-OSS-36B-Base-woSyn](https://huggingface.co/ByteDance-Seed/Seed-OSS-36B-Base-woSyn)|

docs/source_en/Megatron-SWIFT/Command-line-parameters.md

Lines changed: 5 additions & 1 deletion
@@ -172,13 +172,17 @@ For guidance on selecting parallelization strategies, please refer to the [Train
- num_attention_heads: Number of transformer attention heads, default is None.
- group_query_attention: Default is None. If `num_query_groups > 1`, group_query_attention is set to True, otherwise False.
- num_query_groups: Default is 1.
+- softmax_type: The softmax type used in the attention mechanism. Supports both fixed-offset and learnable-offset variants. Options are 'vanilla', 'off-by-one', and 'learnable'; default is 'vanilla'.
- max_position_embeddings: Maximum length of positional embeddings, default is None.
- position_embedding_type: Type of positional embedding, options are 'learned_absolute', 'rope', 'mrope', 'relative', and 'none'. Default is 'rope'.
- rotary_base: Default is 10000.
- rotary_percent: Default is 1.
- normalization: Options are 'LayerNorm', 'RMSNorm'. Default is RMSNorm.
- norm_epsilon: Default is 1e-5.
- swiglu: Uses swiglu instead of the default gelu. Default is True.
+- quick_geglu: Use the quick geglu activation instead of the default gelu. Default is False.
+- activation_func_clamp_value: Clamps the output value range of linear_fc1 in the activation function. Only used when `activation_func` is `quick_gelu`. Default is None.
+- glu_linear_offset: Offset term in the GLU activation function: `activation_func(x[0]) * (x[1] + offset)`. Only used when gated_linear_unit is True. Default is 0.
- untie_embeddings_and_output_weights: Unties embedding and output weights. Default is True.
- disable_bias_linear: Disables bias in linear layers. Default is True.
- add_qkv_bias: Adds bias only to QKV linear layers. Default is True.
@@ -301,7 +305,7 @@ Megatron training parameters are inherited from Megatron parameters and basic pa
- gradient_checkpointing_kwargs: Arguments passed to `torch.utils.checkpoint`. For example: set `--gradient_checkpointing_kwargs '{"use_reentrant": false}'`. Defaults to `None`. This parameter only takes effect when `vit_gradient_checkpointing` is enabled.
- 🔥packing: Packs data samples of varying lengths into samples of uniform length, achieving load balancing across nodes and processes during training (preventing long texts from slowing down short text training), thereby improving GPU utilization and maintaining stable memory usage. When using `--attention_backend flash`, it ensures that different sequences within packed samples remain independent and invisible to each other (except for Qwen3-Next, which contains linear attention). This parameter defaults to `False`. All training tasks in Megatron-SWIFT support this parameter. Note: **packing will reduce the number of dataset samples, please adjust gradient accumulation steps and learning rate accordingly**.
- packing_length: the length to use for packing. Defaults to None, in which case it is set to max_length.
-- packing_num_proc: Number of processes for packing, default is 1. Note that different values of `packing_num_proc` will result in different packed datasets. (This parameter does not take effect during streaming packing.)
+- packing_num_proc: Number of processes for packing, default is 1. Note that different values of `packing_num_proc` will result in different packed datasets. (This parameter does not take effect during streaming packing.) There is usually no need to modify this value, as packing is much faster than tokenization.
- streaming: Stream data loading and processing, default is False. (The shuffling of streaming datasets is not thorough, which may lead to severe loss fluctuations.)
  - Note: Since the length of a streaming dataset cannot be determined, the `--train_iters` parameter must be set. Also set the `max_epochs` parameter to ensure training exits after the specified number of epochs, and to validate and save the model weights accordingly.
  - Note: Streaming datasets can skip preprocessing wait time by overlapping preprocessing with training. Preprocessing for streaming datasets is performed only on rank 0 and then synchronized to other processes via data distribution, **which is generally less efficient than the data-sharding approach used by non-streaming datasets**. When the training world_size is large, preprocessing and data distribution can become a training bottleneck.
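
The new activation and softmax arguments above are ordinary command-line flags; the sketch below shows how they might be passed explicitly. The numeric values are illustrative assumptions only and are not taken from this commit (for an integrated model such as gpt-oss, Megatron-SWIFT normally derives these from the model config, so they usually do not need to be set by hand):

# Illustrative sketch: flag names come from the documentation above; the values are assumptions.
NPROC_PER_NODE=2 \
megatron sft \
    --model openai-mirror/gpt-oss-20b \
    --dataset 'swift/self-cognition#500' \
    --train_type lora \
    --softmax_type learnable \
    --quick_geglu true \
    --activation_func_clamp_value 7.0 \
    --glu_linear_offset 1.0 \
    --expert_model_parallel_size 2 \
    --max_length 2048 \
    --save megatron_output/gpt-oss-20b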

examples/models/gpt_oss/mcore.sh

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
# 2 * 40GiB
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
megatron sft \
    --model openai-mirror/gpt-oss-20b \
    --load_safetensors true \
    --save_safetensors true \
    --merge_lora true \
    --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \
              'AI-ModelScope/alpaca-gpt4-data-en#500' \
              'swift/self-cognition#500' \
    --train_type lora \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --expert_model_parallel_size 2 \
    --moe_permute_fusion true \
    --moe_grouped_gemm true \
    --moe_shared_expert_overlap true \
    --moe_aux_loss_coeff 1e-6 \
    --micro_batch_size 8 \
    --global_batch_size 16 \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --max_epochs 1 \
    --finetune true \
    --cross_entropy_loss_fusion true \
    --lr 1e-4 \
    --lr_warmup_fraction 0.05 \
    --min_lr 1e-5 \
    --save megatron_output/gpt-oss-20b \
    --eval_interval 100 \
    --save_interval 100 \
    --max_length 2048 \
    --num_workers 4 \
    --dataset_num_proc 4 \
    --no_save_optim true \
    --no_save_rng true \
    --sequence_parallel true \
    --padding_free false \
    --attention_backend unfused \
    --model_author swift \
    --model_name swift-robot

# CUDA_VISIBLE_DEVICES=0 \
# swift infer \
#     --model megatron_output/gpt-oss-20b/vx-xxx/checkpoint-xxx \
#     --stream true
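
A note on usage: since the script sets `--merge_lora true`, the saved checkpoint should already contain the merged LoRA weights, so the commented `swift infer` block at the end can point directly at the saved directory (the `vx-xxx/checkpoint-xxx` placeholder stands for the actual run folder created under `--save`). The choice of `--attention_backend unfused` and `--padding_free false` is presumably related to gpt-oss's learnable softmax offsets (see the new `softmax_type` option above), which the fused/flash attention paths may not yet support; this is an inference, not something stated in the commit.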
