
Commit bfa782d: [model] fix qwen3_moe_thinking template (#5116)
Parent: b610cfa

File tree: 9 files changed, +33 -10 lines changed


docs/source/Instruction/命令行参数.md

Lines changed: 3 additions & 2 deletions

@@ -55,11 +55,12 @@
 - dataset_shuffle: Whether to shuffle the dataset. Defaults to True.
   - Note: Shuffling in CPT/SFT consists of two parts: dataset shuffling, controlled by `dataset_shuffle`, and shuffling in the train_dataloader, controlled by `train_dataloader_shuffle`.
 - val_dataset_shuffle: Whether to shuffle the val_dataset. Defaults to False.
-- 🔥streaming: Stream reading and processing of the dataset, defaults to False. Typically set to True when handling large datasets.
+- 🔥streaming: Stream reading and processing of the dataset, defaults to False.
   - Note: `--max_steps` must be set explicitly, since a streaming dataset has no obtainable length. You can achieve training equivalent to `--num_train_epochs` by setting `--save_strategy epoch` together with a sufficiently large max_steps. Alternatively, set `max_epochs` to make training exit after the corresponding number of epochs, validating and saving the weights at that point.
+  - Note: Streaming datasets can skip the preprocessing wait by overlapping preprocessing with training. Their preprocessing runs only on rank 0 and is synchronized to the other processes via data distribution, which is usually less efficient than the sharded reading used by non-streaming datasets. When the training world_size is large, preprocessing and data distribution can become the training bottleneck.
 - interleave_prob: Defaults to None. When combining multiple datasets, the `concatenate_datasets` function is used by default; if this parameter is set, the `interleave_datasets` function is used instead. It is typically used when combining streaming datasets and is passed to `interleave_datasets`.
 - stopping_strategy: Either "first_exhausted" or "all_exhausted", defaulting to "first_exhausted". Passed to the `interleave_datasets` function.
-- shuffle_buffer_size: Specifies the shuffle buffer size for streaming datasets. Defaults to 1000.
+- shuffle_buffer_size: Specifies the shuffle buffer size for streaming datasets. Defaults to 1000. Only effective when `dataset_shuffle` is set to true.
 - download_mode: Dataset download mode, one of `reuse_dataset_if_exists` and `force_redownload`. Defaults to reuse_dataset_if_exists.
 - columns: Column mapping for the dataset so that it matches a format AutoPreprocessor can handle; see [here](../Customization/自定义数据集.md) for details. You can pass a JSON string, e.g. `'{"text1": "query", "text2": "response"}'`, mapping "text1" in the dataset to "query" and "text2" to "response"; the query-response format can be processed by AutoPreprocessor. Defaults to None.
 - strict: If True, any problematic row in the dataset raises an error immediately; otherwise erroneous samples are discarded. Defaults to False.
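The interplay of `interleave_prob` and `stopping_strategy` described above can be sketched in pure Python. This is a toy analogue of Hugging Face `datasets.interleave_datasets`, not the library's or ms-swift's actual code; the function name and details here are simplifications for illustration.

```python
import random


def interleave(sources, probabilities, stopping_strategy="first_exhausted", seed=0):
    """Draw each next sample from a source chosen according to `probabilities`.

    Toy analogue of `interleave_datasets`: "first_exhausted" stops as soon as
    any source runs out; "all_exhausted" restarts finished sources
    (oversampling them) until every source has been fully consumed once.
    """
    rng = random.Random(seed)
    iterators = [iter(s) for s in sources]
    exhausted = [False] * len(sources)
    while True:
        idx = rng.choices(range(len(sources)), weights=probabilities, k=1)[0]
        try:
            yield next(iterators[idx])
        except StopIteration:
            exhausted[idx] = True
            if stopping_strategy == "first_exhausted" or all(exhausted):
                return
            # "all_exhausted": restart the finished source and keep sampling.
            iterators[idx] = iter(sources[idx])
            yield next(iterators[idx])


# With "all_exhausted", every element of every source appears at least once.
mixed = list(interleave([["a1", "a2"], ["b1"]], [0.5, 0.5],
                        stopping_strategy="all_exhausted", seed=0))
```

With "first_exhausted" the shorter source bounds the output length, which is why that default suits mixing a small high-quality dataset into a large one without repeating it.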

docs/source/Instruction/支持的模型和数据集.md

Lines changed: 2 additions & 2 deletions

@@ -214,11 +214,11 @@
 |[swift/Qwen3-235B-A22B-AWQ](https://modelscope.cn/models/swift/Qwen3-235B-A22B-AWQ)|qwen3_moe|qwen3|transformers>=4.51|✘|-|[cognitivecomputations/Qwen3-235B-A22B-AWQ](https://huggingface.co/cognitivecomputations/Qwen3-235B-A22B-AWQ)|
 |[Qwen/Qwen3-235B-A22B-Instruct-2507](https://modelscope.cn/models/Qwen/Qwen3-235B-A22B-Instruct-2507)|qwen3_moe|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-235B-A22B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507)|
 |[Qwen/Qwen3-235B-A22B-Instruct-2507-FP8](https://modelscope.cn/models/Qwen/Qwen3-235B-A22B-Instruct-2507-FP8)|qwen3_moe|qwen3|transformers>=4.51|✘|-|[Qwen/Qwen3-235B-A22B-Instruct-2507-FP8](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507-FP8)|
-|[Qwen/Qwen3-235B-A22B-Thinking-2507](https://modelscope.cn/models/Qwen/Qwen3-235B-A22B-Thinking-2507)|qwen3_moe|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-235B-A22B-Thinking-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507)|
-|[Qwen/Qwen3-235B-A22B-Thinking-2507-FP8](https://modelscope.cn/models/Qwen/Qwen3-235B-A22B-Thinking-2507-FP8)|qwen3_moe|qwen3|transformers>=4.51|✘|-|[Qwen/Qwen3-235B-A22B-Thinking-2507-FP8](https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507-FP8)|
 |[swift/Qwen3-235B-A22B-Instruct-2507-AWQ](https://modelscope.cn/models/swift/Qwen3-235B-A22B-Instruct-2507-AWQ)|qwen3_moe|qwen3|transformers>=4.51|✘|-|-|
 |[Qwen/Qwen3-Coder-480B-A35B-Instruct](https://modelscope.cn/models/Qwen/Qwen3-Coder-480B-A35B-Instruct)|qwen3_moe|qwen3|transformers>=4.51|✔|coding|[Qwen/Qwen3-Coder-480B-A35B-Instruct](https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct)|
 |[Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8](https://modelscope.cn/models/Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8)|qwen3_moe|qwen3|transformers>=4.51|✘|coding|[Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8](https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8)|
+|[Qwen/Qwen3-235B-A22B-Thinking-2507](https://modelscope.cn/models/Qwen/Qwen3-235B-A22B-Thinking-2507)|qwen3_moe_thinking|qwen3_thinking|transformers>=4.51|✔|-|[Qwen/Qwen3-235B-A22B-Thinking-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507)|
+|[Qwen/Qwen3-235B-A22B-Thinking-2507-FP8](https://modelscope.cn/models/Qwen/Qwen3-235B-A22B-Thinking-2507-FP8)|qwen3_moe_thinking|qwen3_thinking|transformers>=4.51|✘|-|[Qwen/Qwen3-235B-A22B-Thinking-2507-FP8](https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507-FP8)|
 |[Qwen/Qwen3-Embedding-0.6B](https://modelscope.cn/models/Qwen/Qwen3-Embedding-0.6B)|qwen3_emb|qwen3_emb|-|✘|-|[Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B)|
 |[Qwen/Qwen3-Embedding-4B](https://modelscope.cn/models/Qwen/Qwen3-Embedding-4B)|qwen3_emb|qwen3_emb|-|✘|-|[Qwen/Qwen3-Embedding-4B](https://huggingface.co/Qwen/Qwen3-Embedding-4B)|
 |[Qwen/Qwen3-Embedding-8B](https://modelscope.cn/models/Qwen/Qwen3-Embedding-8B)|qwen3_emb|qwen3_emb|-|✘|-|[Qwen/Qwen3-Embedding-8B](https://huggingface.co/Qwen/Qwen3-Embedding-8B)|

docs/source_en/Instruction/Command-line-parameters.md

Lines changed: 3 additions & 2 deletions

@@ -55,11 +55,12 @@ Hints:
 - dataset_shuffle: Whether to shuffle the dataset. Defaults to True.
   - Note: The shuffling in CPT/SFT consists of two parts: dataset shuffling, controlled by `dataset_shuffle`; and shuffling in the train_dataloader, controlled by `train_dataloader_shuffle`.
 - val_dataset_shuffle: Whether to perform shuffling on the val_dataset. Default is False.
-- 🔥streaming: Stream reading and processing of the dataset, default is False. It is typically set to True when handling large datasets.
+- 🔥streaming: Stream reading and processing of the dataset, default is False.
   - Note: You need to set `--max_steps` explicitly, as the streaming dataset does not have a defined length. You can achieve training equivalent to `--num_train_epochs` by setting `--save_strategy epoch` and specifying a sufficiently large `max_steps`. Alternatively, you can set `max_epochs` to ensure training exits after the corresponding number of epochs, at which point the model weights will be validated and saved.
+  - Note: Streaming datasets can skip preprocessing wait time by overlapping preprocessing with training. Preprocessing for streaming datasets is performed only on rank 0 and then synchronized to other processes via data distribution. This approach is generally less efficient than the data sharding and reading method used by non-streaming datasets. When the world size is large, preprocessing and data distribution can become a training bottleneck.
 - interleave_prob: Defaults to None. When combining multiple datasets, the `concatenate_datasets` function is used by default. If this parameter is set, the `interleave_datasets` function will be used instead. This parameter is typically used when combining streaming datasets and is passed to the `interleave_datasets` function.
 - stopping_strategy: Can be either "first_exhausted" or "all_exhausted", with the default being "first_exhausted". This parameter is passed to the `interleave_datasets` function.
-- shuffle_buffer_size: This parameter is used to specify the shuffle buffer size for streaming datasets. Defaults to 1000.
+- shuffle_buffer_size: This parameter is used to specify the shuffle buffer size for streaming datasets. Defaults to 1000. This parameter is only effective when `dataset_shuffle` is set to true.
 - download_mode: Dataset download mode, including `reuse_dataset_if_exists` and `force_redownload`, default is reuse_dataset_if_exists.
 - columns: Used for column mapping of the dataset to ensure that the dataset conforms to the format that AutoPreprocessor can handle. For more details, see [here](../Customization/Custom-dataset.md). You can pass in a JSON string, for example: `'{"text1": "query", "text2": "response"}'`, which means mapping "text1" in the dataset to "query" and "text2" to "response". The query-response format can be processed by the AutoPreprocessor. The default value is None.
 - strict: If set to True, any row with an issue in the dataset will throw an error immediately, otherwise, erroneous data samples will be discarded. Default is False.
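`shuffle_buffer_size` trades memory for shuffle quality: a streaming dataset cannot be shuffled globally, so a fixed-size buffer is kept and a random buffered element is emitted as each new one arrives. A minimal pure-Python sketch of the idea, not the `datasets` library implementation:

```python
import random


def buffered_shuffle(stream, buffer_size=1000, seed=0):
    """Approximately shuffle a stream with O(buffer_size) memory.

    Fill a buffer; once full, emit a uniformly random buffered element for
    each incoming one, then drain the remainder in random order. Larger
    buffers give a better shuffle at the cost of more memory.
    """
    rng = random.Random(seed)
    buf = []
    for item in stream:
        buf.append(item)
        if len(buf) >= buffer_size:
            # Swap a random element to the end and pop it, keeping the
            # buffer at a constant size.
            i = rng.randrange(len(buf))
            buf[i], buf[-1] = buf[-1], buf[i]
            yield buf.pop()
    rng.shuffle(buf)  # drain whatever is left once the stream ends
    yield from buf


shuffled = list(buffered_shuffle(range(10), buffer_size=4, seed=1))
```

Every input element is emitted exactly once; only the order is randomized, and an element can move at most roughly `buffer_size` positions, which is why a buffer of 1000 is only a local shuffle on very large datasets.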

docs/source_en/Instruction/Supported-models-and-datasets.md

Lines changed: 2 additions & 2 deletions

@@ -214,11 +214,11 @@ The table below introduces the models integrated with ms-swift:
 |[swift/Qwen3-235B-A22B-AWQ](https://modelscope.cn/models/swift/Qwen3-235B-A22B-AWQ)|qwen3_moe|qwen3|transformers>=4.51|✘|-|[cognitivecomputations/Qwen3-235B-A22B-AWQ](https://huggingface.co/cognitivecomputations/Qwen3-235B-A22B-AWQ)|
 |[Qwen/Qwen3-235B-A22B-Instruct-2507](https://modelscope.cn/models/Qwen/Qwen3-235B-A22B-Instruct-2507)|qwen3_moe|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-235B-A22B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507)|
 |[Qwen/Qwen3-235B-A22B-Instruct-2507-FP8](https://modelscope.cn/models/Qwen/Qwen3-235B-A22B-Instruct-2507-FP8)|qwen3_moe|qwen3|transformers>=4.51|✘|-|[Qwen/Qwen3-235B-A22B-Instruct-2507-FP8](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507-FP8)|
-|[Qwen/Qwen3-235B-A22B-Thinking-2507](https://modelscope.cn/models/Qwen/Qwen3-235B-A22B-Thinking-2507)|qwen3_moe|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-235B-A22B-Thinking-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507)|
-|[Qwen/Qwen3-235B-A22B-Thinking-2507-FP8](https://modelscope.cn/models/Qwen/Qwen3-235B-A22B-Thinking-2507-FP8)|qwen3_moe|qwen3|transformers>=4.51|✘|-|[Qwen/Qwen3-235B-A22B-Thinking-2507-FP8](https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507-FP8)|
 |[swift/Qwen3-235B-A22B-Instruct-2507-AWQ](https://modelscope.cn/models/swift/Qwen3-235B-A22B-Instruct-2507-AWQ)|qwen3_moe|qwen3|transformers>=4.51|✘|-|-|
 |[Qwen/Qwen3-Coder-480B-A35B-Instruct](https://modelscope.cn/models/Qwen/Qwen3-Coder-480B-A35B-Instruct)|qwen3_moe|qwen3|transformers>=4.51|✔|coding|[Qwen/Qwen3-Coder-480B-A35B-Instruct](https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct)|
 |[Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8](https://modelscope.cn/models/Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8)|qwen3_moe|qwen3|transformers>=4.51|✘|coding|[Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8](https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8)|
+|[Qwen/Qwen3-235B-A22B-Thinking-2507](https://modelscope.cn/models/Qwen/Qwen3-235B-A22B-Thinking-2507)|qwen3_moe_thinking|qwen3_thinking|transformers>=4.51|✔|-|[Qwen/Qwen3-235B-A22B-Thinking-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507)|
+|[Qwen/Qwen3-235B-A22B-Thinking-2507-FP8](https://modelscope.cn/models/Qwen/Qwen3-235B-A22B-Thinking-2507-FP8)|qwen3_moe_thinking|qwen3_thinking|transformers>=4.51|✘|-|[Qwen/Qwen3-235B-A22B-Thinking-2507-FP8](https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507-FP8)|
 |[Qwen/Qwen3-Embedding-0.6B](https://modelscope.cn/models/Qwen/Qwen3-Embedding-0.6B)|qwen3_emb|qwen3_emb|-|✘|-|[Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B)|
 |[Qwen/Qwen3-Embedding-4B](https://modelscope.cn/models/Qwen/Qwen3-Embedding-4B)|qwen3_emb|qwen3_emb|-|✘|-|[Qwen/Qwen3-Embedding-4B](https://huggingface.co/Qwen/Qwen3-Embedding-4B)|
 |[Qwen/Qwen3-Embedding-8B](https://modelscope.cn/models/Qwen/Qwen3-Embedding-8B)|qwen3_emb|qwen3_emb|-|✘|-|[Qwen/Qwen3-Embedding-8B](https://huggingface.co/Qwen/Qwen3-Embedding-8B)|

swift/llm/model/constant.py

Lines changed: 1 addition & 0 deletions

@@ -14,6 +14,7 @@ class LLMModelType:
     qwq = 'qwq'
     qwen3 = 'qwen3'
     qwen3_moe = 'qwen3_moe'
+    qwen3_moe_thinking = 'qwen3_moe_thinking'
     qwen3_emb = 'qwen3_emb'
     qwen3_reranker = 'qwen3_reranker'

swift/llm/model/model/qwen.py

Lines changed: 15 additions & 2 deletions

@@ -555,8 +555,6 @@ def _get_cast_dtype(self) -> torch.dtype:
         ModelGroup([
             Model('Qwen/Qwen3-235B-A22B-Instruct-2507', 'Qwen/Qwen3-235B-A22B-Instruct-2507'),
             Model('Qwen/Qwen3-235B-A22B-Instruct-2507-FP8', 'Qwen/Qwen3-235B-A22B-Instruct-2507-FP8'),
-            Model('Qwen/Qwen3-235B-A22B-Thinking-2507', 'Qwen/Qwen3-235B-A22B-Thinking-2507'),
-            Model('Qwen/Qwen3-235B-A22B-Thinking-2507-FP8', 'Qwen/Qwen3-235B-A22B-Thinking-2507-FP8'),
             # awq
             Model('swift/Qwen3-235B-A22B-Instruct-2507-AWQ'),
         ]),
@@ -572,6 +570,21 @@ def _get_cast_dtype(self) -> torch.dtype:
         requires=['transformers>=4.51'],
     ))
 
+register_model(
+    ModelMeta(
+        LLMModelType.qwen3_moe_thinking,
+        [
+            ModelGroup([
+                Model('Qwen/Qwen3-235B-A22B-Thinking-2507', 'Qwen/Qwen3-235B-A22B-Thinking-2507'),
+                Model('Qwen/Qwen3-235B-A22B-Thinking-2507-FP8', 'Qwen/Qwen3-235B-A22B-Thinking-2507-FP8'),
+            ]),
+        ],
+        TemplateType.qwen3_thinking,
+        get_model_tokenizer_with_flash_attn,
+        architectures=['Qwen3MoeForCausalLM'],
+        requires=['transformers>=4.51'],
+    ))
+
 
 def patch_qwen_vl_utils(vision_process):
     if hasattr(vision_process, '_patch'):
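The `register_model` call above follows a registry pattern: a `ModelMeta` binds a model-type string to its model IDs, template type, and loader. A stripped-down sketch of the idea, with hypothetical, simplified fields (swift's real `ModelMeta` also carries the tokenizer loader, architectures, and more):

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Global registry mapping model_type -> metadata, mirroring the pattern
# in swift/llm/model (a simplification, not the actual classes).
MODEL_REGISTRY: Dict[str, "ModelMeta"] = {}


@dataclass
class ModelMeta:
    model_type: str
    model_ids: List[str]
    template: str
    requires: List[str] = field(default_factory=list)


def register_model(meta: ModelMeta) -> None:
    if meta.model_type in MODEL_REGISTRY:
        raise ValueError(f"model_type {meta.model_type!r} already registered")
    MODEL_REGISTRY[meta.model_type] = meta


register_model(
    ModelMeta(
        "qwen3_moe_thinking",
        ["Qwen/Qwen3-235B-A22B-Thinking-2507",
         "Qwen/Qwen3-235B-A22B-Thinking-2507-FP8"],
        template="qwen3_thinking",
        requires=["transformers>=4.51"],
    ))
```

Splitting the Thinking-2507 checkpoints into their own `model_type` is what lets the registry resolve them to the new `qwen3_thinking` template instead of the generic `qwen3` one.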

swift/llm/template/constant.py

Lines changed: 1 addition & 0 deletions

@@ -13,6 +13,7 @@ class LLMTemplateType:
     qwen2_5_math = 'qwen2_5_math'
     qwen2_5_math_prm = 'qwen2_5_math_prm'
     qwen3 = 'qwen3'
+    qwen3_thinking = 'qwen3_thinking'
     qwen3_emb = 'qwen3_emb'
     qwen3_reranker = 'qwen3_reranker'
     qwq_preview = 'qwq_preview'

swift/llm/template/template/qwen.py

Lines changed: 5 additions & 0 deletions

@@ -52,6 +52,11 @@ class Qwen2_5MathTemplateMeta(QwenTemplateMeta):
 # '<think>\n\n</think>\n\n'
 register_template(QwenTemplateMeta(LLMTemplateType.qwen3, default_system=None, template_cls=ThinkingTemplate))
 
+register_template(
+    QwenTemplateMeta(
+        LLMTemplateType.qwen3_thinking, default_system=None, response_prefix='<think>\n',
+        template_cls=ThinkingTemplate))
+
 
 class Qwen3RerankerTemplate(Template):
     instruction = 'Given a web search query, retrieve relevant passages that answer the query'
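The new `qwen3_thinking` template differs from `qwen3` mainly in `response_prefix='<think>\n'`: the Thinking-2507 models always emit a reasoning block, so the prefix seeds generation inside an open `<think>` tag. A toy ChatML-style builder showing where that prefix lands (an illustration of the concept, not swift's actual `Template` class):

```python
def render_prompt(query: str, response_prefix: str = "") -> str:
    """Build a minimal ChatML-style generation prompt.

    `response_prefix` is appended after the assistant header, so the model
    continues from inside it -- e.g. from an already-open <think> block.
    """
    return (f"<|im_start|>user\n{query}<|im_end|>\n"
            f"<|im_start|>assistant\n{response_prefix}")


default_prompt = render_prompt("2+2?")
thinking_prompt = render_prompt("2+2?", response_prefix="<think>\n")
```

With the prefix in place, the model's first generated tokens are reasoning content rather than a fresh `<think>` tag, matching how the Thinking-2507 chat template starts the assistant turn.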

swift/megatron/model/gpt/__init__.py

Lines changed: 1 addition & 0 deletions

@@ -37,6 +37,7 @@
     ModelType.qwen3,
     ModelType.qwen2_moe,
     ModelType.qwen3_moe,
+    ModelType.qwen3_moe_thinking,
     ModelType.internlm3,
     ModelType.mimo,
     ModelType.mimo_rl,
