Commit ad90779: Support Qwen3 series (#4029)
1 parent 8d95f8c

File tree: 13 files changed, +136 -14 lines

docs/source/Instruction/Megatron-SWIFT训练.md (1 addition, 1 deletion)

@@ -1,7 +1,7 @@
 # Megatron-SWIFT Training
 
-SWIFT incorporates Megatron's parallelization techniques to accelerate the training of large models, including data parallelism, tensor parallelism, pipeline parallelism, sequence parallelism, and context parallelism. For models that support Megatron training, please refer to the [Supported Models and Datasets documentation](./支持的模型和数据集.md).
+SWIFT incorporates Megatron's parallelization techniques to accelerate the training of large models, including data parallelism, tensor parallelism, pipeline parallelism, sequence parallelism, context parallelism, and expert parallelism. It supports the pre-training and fine-tuning of models such as Qwen3, Qwen3-MoE, Llama3, and the Deepseek-R1 distillation series. For the complete list of supported models, please refer to the [Supported Models and Datasets documentation](./支持的模型和数据集.md).
 
 ## Environment Setup
 To use Megatron-SWIFT, in addition to the swift dependencies, you also need to install the following:

docs/source/Instruction/支持的模型和数据集.md (16 additions, 0 deletions)

@@ -182,6 +182,22 @@
 |[Qwen/QwQ-32B-Preview](https://modelscope.cn/models/Qwen/QwQ-32B-Preview)|qwq_preview|qwq_preview|transformers>=4.37|✔|-|[Qwen/QwQ-32B-Preview](https://huggingface.co/Qwen/QwQ-32B-Preview)|
 |[Qwen/QwQ-32B](https://modelscope.cn/models/Qwen/QwQ-32B)|qwq|qwq|transformers>=4.37|✔|-|[Qwen/QwQ-32B](https://huggingface.co/Qwen/QwQ-32B)|
 |[Qwen/QwQ-32B-AWQ](https://modelscope.cn/models/Qwen/QwQ-32B-AWQ)|qwq|qwq|transformers>=4.37|✘|-|[Qwen/QwQ-32B-AWQ](https://huggingface.co/Qwen/QwQ-32B-AWQ)|
+|[Qwen/Qwen3-0.6B-Base](https://modelscope.cn/models/Qwen/Qwen3-0.6B-Base)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-0.6B-Base](https://huggingface.co/Qwen/Qwen3-0.6B-Base)|
+|[Qwen/Qwen3-1.7B-Base](https://modelscope.cn/models/Qwen/Qwen3-1.7B-Base)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-1.7B-Base](https://huggingface.co/Qwen/Qwen3-1.7B-Base)|
+|[Qwen/Qwen3-4B-Base](https://modelscope.cn/models/Qwen/Qwen3-4B-Base)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-4B-Base](https://huggingface.co/Qwen/Qwen3-4B-Base)|
+|[Qwen/Qwen3-8B-Base](https://modelscope.cn/models/Qwen/Qwen3-8B-Base)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-8B-Base](https://huggingface.co/Qwen/Qwen3-8B-Base)|
+|[Qwen/Qwen3-14B-Base](https://modelscope.cn/models/Qwen/Qwen3-14B-Base)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-14B-Base](https://huggingface.co/Qwen/Qwen3-14B-Base)|
+|[Qwen/Qwen3-32B-Base](https://modelscope.cn/models/Qwen/Qwen3-32B-Base)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-32B-Base](https://huggingface.co/Qwen/Qwen3-32B-Base)|
+|[Qwen/Qwen3-0.6B](https://modelscope.cn/models/Qwen/Qwen3-0.6B)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)|
+|[Qwen/Qwen3-1.7B](https://modelscope.cn/models/Qwen/Qwen3-1.7B)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B)|
+|[Qwen/Qwen3-4B](https://modelscope.cn/models/Qwen/Qwen3-4B)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B)|
+|[Qwen/Qwen3-8B](https://modelscope.cn/models/Qwen/Qwen3-8B)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)|
+|[Qwen/Qwen3-14B](https://modelscope.cn/models/Qwen/Qwen3-14B)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-14B](https://huggingface.co/Qwen/Qwen3-14B)|
+|[Qwen/Qwen3-32B](https://modelscope.cn/models/Qwen/Qwen3-32B)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B)|
+|[Qwen/Qwen3-30B-A3B-Base](https://modelscope.cn/models/Qwen/Qwen3-30B-A3B-Base)|qwen3_moe|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-30B-A3B-Base](https://huggingface.co/Qwen/Qwen3-30B-A3B-Base)|
+|[Qwen/Qwen3-235B-A22B-Base](https://modelscope.cn/models/Qwen/Qwen3-235B-A22B-Base)|qwen3_moe|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-235B-A22B-Base](https://huggingface.co/Qwen/Qwen3-235B-A22B-Base)|
+|[Qwen/Qwen3-30B-A3B](https://modelscope.cn/models/Qwen/Qwen3-30B-A3B)|qwen3_moe|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B)|
+|[Qwen/Qwen3-235B-A22B](https://modelscope.cn/models/Qwen/Qwen3-235B-A22B)|qwen3_moe|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B)|
 |[iic/gte_Qwen2-1.5B-instruct](https://modelscope.cn/models/iic/gte_Qwen2-1.5B-instruct)|qwen2_gte|dummy|-|✘|-|[Alibaba-NLP/gte-Qwen2-1.5B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct)|
 |[iic/gte_Qwen2-7B-instruct](https://modelscope.cn/models/iic/gte_Qwen2-7B-instruct)|qwen2_gte|dummy|-|✘|-|[Alibaba-NLP/gte-Qwen2-7B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct)|
 |[codefuse-ai/CodeFuse-QWen-14B](https://modelscope.cn/models/codefuse-ai/CodeFuse-QWen-14B)|codefuse_qwen|codefuse|-|✘|coding|[codefuse-ai/CodeFuse-QWen-14B](https://huggingface.co/codefuse-ai/CodeFuse-QWen-14B)|

docs/source_en/Instruction/Megatron-SWIFT-Training.md (1 addition, 1 deletion)

@@ -1,7 +1,7 @@
 # Megatron-SWIFT Training
 
-SWIFT incorporates Megatron's parallelization techniques to accelerate the training of large models, including data parallelism, tensor parallelism, pipeline parallelism, sequence parallelism, and context parallelism. For models that support Megatron training, please refer to the [Supported Models and Datasets documentation](./Supported-models-and-datasets.md).
+SWIFT incorporates Megatron's parallelization techniques to accelerate the training of large models, including data parallelism, tensor parallelism, pipeline parallelism, sequence parallelism, context parallelism, and expert parallelism. It supports the pre-training and fine-tuning of models such as Qwen3, Qwen3-MoE, Llama3, and the Deepseek-R1 distillation series. For a complete list of supported models, please refer to the [Supported Models and Datasets documentation](./Supported-models-and-datasets.md).
 
 ## Environment Setup

docs/source_en/Instruction/Supported-models-and-datasets.md (16 additions, 0 deletions)

@@ -182,6 +182,22 @@ The table below introduces the models integrated with ms-swift:
 |[Qwen/QwQ-32B-Preview](https://modelscope.cn/models/Qwen/QwQ-32B-Preview)|qwq_preview|qwq_preview|transformers>=4.37|✔|-|[Qwen/QwQ-32B-Preview](https://huggingface.co/Qwen/QwQ-32B-Preview)|
 |[Qwen/QwQ-32B](https://modelscope.cn/models/Qwen/QwQ-32B)|qwq|qwq|transformers>=4.37|✔|-|[Qwen/QwQ-32B](https://huggingface.co/Qwen/QwQ-32B)|
 |[Qwen/QwQ-32B-AWQ](https://modelscope.cn/models/Qwen/QwQ-32B-AWQ)|qwq|qwq|transformers>=4.37|✘|-|[Qwen/QwQ-32B-AWQ](https://huggingface.co/Qwen/QwQ-32B-AWQ)|
+|[Qwen/Qwen3-0.6B-Base](https://modelscope.cn/models/Qwen/Qwen3-0.6B-Base)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-0.6B-Base](https://huggingface.co/Qwen/Qwen3-0.6B-Base)|
+|[Qwen/Qwen3-1.7B-Base](https://modelscope.cn/models/Qwen/Qwen3-1.7B-Base)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-1.7B-Base](https://huggingface.co/Qwen/Qwen3-1.7B-Base)|
+|[Qwen/Qwen3-4B-Base](https://modelscope.cn/models/Qwen/Qwen3-4B-Base)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-4B-Base](https://huggingface.co/Qwen/Qwen3-4B-Base)|
+|[Qwen/Qwen3-8B-Base](https://modelscope.cn/models/Qwen/Qwen3-8B-Base)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-8B-Base](https://huggingface.co/Qwen/Qwen3-8B-Base)|
+|[Qwen/Qwen3-14B-Base](https://modelscope.cn/models/Qwen/Qwen3-14B-Base)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-14B-Base](https://huggingface.co/Qwen/Qwen3-14B-Base)|
+|[Qwen/Qwen3-32B-Base](https://modelscope.cn/models/Qwen/Qwen3-32B-Base)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-32B-Base](https://huggingface.co/Qwen/Qwen3-32B-Base)|
+|[Qwen/Qwen3-0.6B](https://modelscope.cn/models/Qwen/Qwen3-0.6B)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)|
+|[Qwen/Qwen3-1.7B](https://modelscope.cn/models/Qwen/Qwen3-1.7B)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B)|
+|[Qwen/Qwen3-4B](https://modelscope.cn/models/Qwen/Qwen3-4B)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B)|
+|[Qwen/Qwen3-8B](https://modelscope.cn/models/Qwen/Qwen3-8B)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)|
+|[Qwen/Qwen3-14B](https://modelscope.cn/models/Qwen/Qwen3-14B)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-14B](https://huggingface.co/Qwen/Qwen3-14B)|
+|[Qwen/Qwen3-32B](https://modelscope.cn/models/Qwen/Qwen3-32B)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B)|
+|[Qwen/Qwen3-30B-A3B-Base](https://modelscope.cn/models/Qwen/Qwen3-30B-A3B-Base)|qwen3_moe|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-30B-A3B-Base](https://huggingface.co/Qwen/Qwen3-30B-A3B-Base)|
+|[Qwen/Qwen3-235B-A22B-Base](https://modelscope.cn/models/Qwen/Qwen3-235B-A22B-Base)|qwen3_moe|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-235B-A22B-Base](https://huggingface.co/Qwen/Qwen3-235B-A22B-Base)|
+|[Qwen/Qwen3-30B-A3B](https://modelscope.cn/models/Qwen/Qwen3-30B-A3B)|qwen3_moe|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B)|
+|[Qwen/Qwen3-235B-A22B](https://modelscope.cn/models/Qwen/Qwen3-235B-A22B)|qwen3_moe|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B)|
 |[iic/gte_Qwen2-1.5B-instruct](https://modelscope.cn/models/iic/gte_Qwen2-1.5B-instruct)|qwen2_gte|dummy|-|✘|-|[Alibaba-NLP/gte-Qwen2-1.5B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct)|
 |[iic/gte_Qwen2-7B-instruct](https://modelscope.cn/models/iic/gte_Qwen2-7B-instruct)|qwen2_gte|dummy|-|✘|-|[Alibaba-NLP/gte-Qwen2-7B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct)|
 |[codefuse-ai/CodeFuse-QWen-14B](https://modelscope.cn/models/codefuse-ai/CodeFuse-QWen-14B)|codefuse_qwen|codefuse|-|✘|coding|[codefuse-ai/CodeFuse-QWen-14B](https://huggingface.co/codefuse-ai/CodeFuse-QWen-14B)|
New file (37 additions, 0 deletions)

@@ -0,0 +1,37 @@
+# ZeRO3: 91.2s/it; 16 * 80GiB
+# Megatron-LM: 9.6s/it; 16 * 60GiB
+# Launch using Alibaba Cloud DLC
+# ref: https://github.com/modelscope/ms-swift/blob/main/examples/train/multi-node/dlc/train.sh
+NNODES=$WORLD_SIZE \
+NODE_RANK=$RANK \
+megatron sft \
+--load Qwen3-30B-A3B-Base-mcore \
+--dataset 'liucong/Chinese-DeepSeek-R1-Distill-data-110k-SFT' \
+--tensor_model_parallel_size 2 \
+--expert_model_parallel_size 8 \
+--moe_grouped_gemm true \
+--moe_shared_expert_overlap true \
+--moe_aux_loss_coeff 0.01 \
+--micro_batch_size 1 \
+--global_batch_size 16 \
+--packing true \
+--recompute_granularity full \
+--recompute_method uniform \
+--recompute_num_layers 1 \
+--train_iters 2000 \
+--eval_iters 50 \
+--finetune true \
+--cross_entropy_loss_fusion true \
+--lr 1e-5 \
+--lr_warmup_iters 100 \
+--min_lr 1e-6 \
+--save megatron_output/Qwen3-30B-A3B-Base \
+--eval_interval 200 \
+--save_interval 200 \
+--max_length 8192 \
+--num_workers 8 \
+--dataset_num_proc 8 \
+--no_save_optim true \
+--no_save_rng true \
+--sequence_parallel true \
+--use_flash_attn true
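As a rough cross-check of the flags above: the header comments put the run on 16 GPUs, tensor parallelism is 2, expert parallelism is 8, and the batch settings follow the usual Megatron convention global_batch_size = micro_batch_size * data_parallel_size * gradient_accumulation_steps. A minimal sketch of that arithmetic, assuming pipeline and context parallelism stay at their default of 1:

# Sanity-check sketch for the launch flags above (assumes pipeline/context parallel = 1).
world_size = 16                    # "16 * 60GiB" in the header comments
tp, ep = 2, 8                      # --tensor_model_parallel_size / --expert_model_parallel_size
micro_bs, global_bs = 1, 16        # --micro_batch_size / --global_batch_size

dp = world_size // tp              # 8 data-parallel ranks
grad_accum = global_bs // (micro_bs * dp)   # 2 gradient-accumulation steps
# In the usual Megatron-LM layout, expert-parallel groups are formed inside the
# data-parallel dimension, so ep is expected to divide dp (8 % 8 == 0 here).
assert dp % ep == 0
print(f'dp={dp}, grad_accum={grad_accum}')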

swift/llm/dataset/dataset/llm.py (11 additions, 1 deletion)

@@ -829,9 +829,19 @@ def preprocess(self, row: Dict[str, Any]) -> Dict[str, Any]:
         return super().preprocess(row)
 
 
+class ThinkSelfCognitionPreprocessor(SelfCognitionPreprocessor):
+
+    def preprocess(self, row: Dict[str, Any]) -> Dict[str, Any]:
+        row['response'] = '<think>\n\n</think>\n\n' + row['response']
+        return super().preprocess(row)
+
+
 register_dataset(
     DatasetMeta(
         ms_dataset_id='swift/self-cognition',
         hf_dataset_id='modelscope/self-cognition',
-        preprocess_func=SelfCognitionPreprocessor(),
+        subsets=[
+            SubsetDataset(preprocess_func=SelfCognitionPreprocessor()),
+            SubsetDataset('think', preprocess_func=ThinkSelfCognitionPreprocessor()),
+        ],
         tags=['chat', 'self-cognition', '🔥']))
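The new 'think' subset differs from the default one only in that an empty think block is prepended to every response, presumably so that the self-cognition data matches the empty-reasoning output format of Qwen3-style chat templates. A minimal sketch of the effect on a single row (the row contents are made up for illustration):

# What ThinkSelfCognitionPreprocessor adds before the usual SelfCognitionPreprocessor logic runs.
row = {'response': 'I am an AI assistant.'}      # made-up example row
row['response'] = '<think>\n\n</think>\n\n' + row['response']
print(row['response'])
# <think>
#
# </think>
#
# I am an AI assistant.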

swift/llm/model/model/qwen.py (20 additions, 4 deletions)

@@ -494,10 +494,22 @@ def _get_cast_dtype(self) -> torch.dtype:
         LLMModelType.qwen3,
         [
             ModelGroup([
-                # Model('Qwen/Qwen3-0.6B-Base', 'Qwen/Qwen3-0.6B-Base'),
+                Model('Qwen/Qwen3-0.6B-Base', 'Qwen/Qwen3-0.6B-Base'),
+                Model('Qwen/Qwen3-1.7B-Base', 'Qwen/Qwen3-1.7B-Base'),
+                Model('Qwen/Qwen3-4B-Base', 'Qwen/Qwen3-4B-Base'),
+                Model('Qwen/Qwen3-8B-Base', 'Qwen/Qwen3-8B-Base'),
+                Model('Qwen/Qwen3-14B-Base', 'Qwen/Qwen3-14B-Base'),
+                Model('Qwen/Qwen3-32B-Base', 'Qwen/Qwen3-32B-Base'),
+                # instruct
+                Model('Qwen/Qwen3-0.6B', 'Qwen/Qwen3-0.6B'),
+                Model('Qwen/Qwen3-1.7B', 'Qwen/Qwen3-1.7B'),
+                Model('Qwen/Qwen3-4B', 'Qwen/Qwen3-4B'),
+                Model('Qwen/Qwen3-8B', 'Qwen/Qwen3-8B'),
+                Model('Qwen/Qwen3-14B', 'Qwen/Qwen3-14B'),
+                Model('Qwen/Qwen3-32B', 'Qwen/Qwen3-32B'),
             ]),
         ],
-        TemplateType.qwen,
+        TemplateType.qwen3,
         get_model_tokenizer_with_flash_attn,
         architectures=['Qwen3ForCausalLM'],
         requires=['transformers>=4.51'],

@@ -508,10 +520,14 @@ def _get_cast_dtype(self) -> torch.dtype:
         LLMModelType.qwen3_moe,
         [
             ModelGroup([
-                # Model('Qwen/Qwen3-15B-A2B-Base', 'Qwen/Qwen3-15B-A2B-Base'),
+                Model('Qwen/Qwen3-30B-A3B-Base', 'Qwen/Qwen3-30B-A3B-Base'),
+                Model('Qwen/Qwen3-235B-A22B-Base', 'Qwen/Qwen3-235B-A22B-Base'),
+                # instruct
+                Model('Qwen/Qwen3-30B-A3B', 'Qwen/Qwen3-30B-A3B'),
+                Model('Qwen/Qwen3-235B-A22B', 'Qwen/Qwen3-235B-A22B'),
             ]),
         ],
-        TemplateType.qwen,
+        TemplateType.qwen3,
         get_model_tokenizer_with_flash_attn,
         architectures=['Qwen3MoeForCausalLM'],
         requires=['transformers>=4.51'],
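Both groups register the Qwen3ForCausalLM / Qwen3MoeForCausalLM architectures with the new qwen3 template and require transformers>=4.51. A hedged sanity-check sketch, loading the smallest newly listed checkpoint directly with the standard transformers API (the model ID comes from the tables above; hardware and download are up to the reader):

# Sketch: confirm a Qwen3 checkpoint loads under transformers>=4.51.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = 'Qwen/Qwen3-0.6B'       # smallest instruct checkpoint from the tables above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype='auto')
print(model.config.architectures)  # expected: ['Qwen3ForCausalLM']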

swift/llm/model/patcher.py (14 additions, 1 deletion)

@@ -1,4 +1,5 @@
 # Copyright (c) Alibaba, Inc. and its affiliates.
+import os
 from contextlib import contextmanager
 from functools import wraps
 from types import MethodType

@@ -7,9 +8,9 @@
 import accelerate
 import torch
 import torch.nn as nn
-import torch.nn.functional as F
 import transformers
 from accelerate.utils import find_device
+from packaging import version
 from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
 from torch.nn.parallel import DistributedDataParallel as DDP
 from transformers import PreTrainedModel, dynamic_module_utils, trainer

@@ -343,3 +344,15 @@ def new_get_cached_module_file(pretrained_model_name_or_path, *args, **kwargs):
         yield
     finally:
         dynamic_module_utils.get_cached_module_file = origin_get_cached_module_file
+
+
+@contextmanager
+def patch_tp_plan():
+    if not is_mp_ddp() or version.parse(transformers.__version__) < version.parse('4.50'):
+        yield
+        return
+    WORLD_SIZE = os.environ.get('WORLD_SIZE')
+    os.environ['_PATCH_WORLD_SIZE'] = WORLD_SIZE
+    os.environ.pop('WORLD_SIZE')
+    yield
+    os.environ['WORLD_SIZE'] = WORLD_SIZE
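patch_tp_plan stashes WORLD_SIZE in _PATCH_WORLD_SIZE and removes it while the model is being constructed, then restores it afterwards; presumably this keeps the WORLD_SIZE-driven tensor-parallel ("tp plan") logic in transformers>=4.50 from kicking in when ms-swift runs MP + DDP. A self-contained sketch of the same hide-and-restore pattern, without the swift-specific is_mp_ddp() and version checks:

import os
from contextlib import contextmanager

@contextmanager
def hide_world_size():
    # Stash WORLD_SIZE, hide it for the duration of the block, then put it back.
    world_size = os.environ.pop('WORLD_SIZE', None)
    try:
        yield
    finally:
        if world_size is not None:
            os.environ['WORLD_SIZE'] = world_size

os.environ['WORLD_SIZE'] = '4'
with hide_world_size():
    print('WORLD_SIZE' in os.environ)   # False inside the block
print(os.environ['WORLD_SIZE'])         # '4' again afterwards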

swift/llm/model/register.py (2 additions, 2 deletions)

@@ -21,7 +21,7 @@
 from swift.utils import get_dist_setting, get_logger, is_mp, is_unsloth_available, patch_getattr, use_torchacc
 from .constant import ModelType
 from .patcher import (patch_automodel, patch_automodel_for_sequence_classification, patch_get_dynamic_module,
-                      patch_mp_ddp)
+                      patch_mp_ddp, patch_tp_plan)
 from .utils import AttnImpl, HfConfigFactory, ModelInfo, safe_snapshot_download
 
 GetModelTokenizerFunction = Callable[..., Tuple[Optional[PreTrainedModel], PreTrainedTokenizerBase]]

@@ -567,7 +567,7 @@ def get_model_tokenizer(
     kwargs['attn_impl'] = attn_impl
     kwargs['rope_scaling'] = rope_scaling
     kwargs['model_meta'] = model_meta
-    with patch_get_dynamic_module():
+    with patch_get_dynamic_module(), patch_tp_plan():
         model, processor = get_function(model_dir, model_info, model_kwargs, load_model, **kwargs)
 
     if not isinstance(processor, PreTrainedTokenizerBase) and hasattr(processor, 'tokenizer'):

swift/llm/template/constant.py (1 addition, 0 deletions)

@@ -12,6 +12,7 @@ class LLMTemplateType:
     qwen2_5 = 'qwen2_5'
     qwen2_5_math = 'qwen2_5_math'
     qwen2_5_math_prm = 'qwen2_5_math_prm'
+    qwen3 = 'qwen3'
     qwq_preview = 'qwq_preview'
     qwq = 'qwq'
     marco_o1 = 'marco_o1'
