Releases · modelscope/ms-swift

Megatron-SWIFT
a. 支持更多模型架构：Qwen3-VL, Qwen3-Omni, Qwen3-Next, Kimi-VL, InternVL3.5-HF等。完整的模型支持情况，参考支持的模型文档：https://swift.readthedocs.io/en/latest/Instruction/Supported-models-and-datasets.html
b. 支持KTO训练，包括全参数/LoRA/MoE/多模态/Packing等训练技术等支持。感谢招商银行技术团队@kevssim 的贡献。训练脚本参考：https://github.com/modelscope/ms-swift/tree/main/examples/megatron/rlhf/kto
c. 支持RM训练，包括全参数/LoRA/MoE/多模态/Packing等训练技术等支持。训练脚本参考：https://github.com/modelscope/ms-swift/tree/main/examples/megatron/rlhf/rm
d. 支持序列分类模型架构，包括三种任务：regression、single_label_classification、multi_label_classification。训练脚本参考：https://github.com/modelscope/ms-swift/tree/main/examples/megatron/seq_cls
e. 支持VPP并行技术，减少PP并行的计算空泡，提高GPU利用率，但会略微提高通信量。支持异构PP并行 pipeline_model_parallel_layout，自定义流水线并行（PP/VPP）布局。
f. DPO等RLHF技术中的ref_model不初始化 main_grad 降低显存占用。
训练
a. 序列并行优化，ulysses 和 ring-attention 支持混合使用，实现更长的序列处理能力。支持纯文本和多模态模型的SFT/DPO/GRPO训练。训练脚本参考：https://github.com/modelscope/ms-swift/blob/main/examples/train/sequence_parallel/sequence_parallel.sh
b. 纯文本及多模态模型Embedding/Reranker/序列分类任务训练支持使用 padding_free 以节约显存资源并加速训练。
c. Embedding和Reranker训练数据集格式重构，具体参考文档：https://swift.readthedocs.io/en/latest/BestPractices/Embedding.html, https://swift.readthedocs.io/en/latest/BestPractices/Reranker.html
d. Agent template支持更多模型：deepseek_v3_1, qwen3_coder。（感谢@gakkiri ,@ray075hl 的贡献）
e. load_from_cache_file 默认值从True改成False，避免因缓存原因导致的未知问题。
RLHF
a. GRPO支持CHORD算法，在GRPO训练中混合SFT训练，参考文档：https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/CHORD.html
b. KTO支持padding free和packing以节约显存资源并加速训练。
c. GRPO训练 padding_free重构，更好支持多模态模型。
d. GRPO vLLM 支持PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"环境变量，减小显存碎片。
推理
a. 支持Reranker任务的推理/部署 (pt/vllm)，以及序列分类任务的推理部署（pt/vllm）。脚本参考：https://github.com/modelscope/ms-swift/tree/main/examples/deploy/reranker, https://github.com/modelscope/ms-swift/tree/main/examples/deploy/seq_cls

新模型

纯文本模型
a. Qwen/Qwen3-Next-80B-A3B-Instruct系列，训练脚本参考：https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_next
b. ZhipuAI/GLM-4.6
c. inclusionAI/Ling-mini-2.0; inclusionAI/Ring-mini-2.0系列
d. iic/Tongyi-DeepResearch-30B-A3B
e. ByteDance-Seed/Seed-OSS-36B-Instruct系列（感谢@hpsun1109 的贡献）
f. deepseek-ai/DeepSeek-V3.1-Terminus
g. PaddlePaddle/ERNIE-4.5-21B-A3B-Thinking
h. google/embeddinggemma-300m（embedding模型）
多模态模型
a. Qwen/Qwen3-VL-30B-A3B-Instruct系列，训练脚本参考：https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_vl
b. Qwen/Qwen3-Omni-30B-A3B-Instruct系列，训练脚本参考：https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_omni
c. Kwai-Keye/Keye-VL-1_5-8B（感谢@hellopahe 的贡献）
d. OpenGVLab/InternVL3_5-1B-HF系列
e. BytedanceDouyinContent/SAIL-VL2-2B系列
f. stepfun-ai/Step-Audio-2-mini（感谢@CJack812 的贡献）

English Version

New Features

Megatron-SWIFT
a. More model architecture support: Qwen3-VL, Qwen3-Omni, Qwen3-Next, Kimi-VL, InternVL3.5-HF, etc. For a complete list of supported models, please refer to the Supported Models documentation: https://swift.readthedocs.io/en/latest/Instruction/Supported-models-and-datasets.html
b. KTO training support, including full-parameter, LoRA, MoE, multimodal, and Packing training techniques. Special thanks to @kevssim from China Merchants Bank’s technical team for their contribution. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/megatron/rlhf/kto
c. Reward Model training support, including full-parameter, LoRA, MoE, multimodal, and Packing training techniques. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/megatron/rlhf/rm
d. Sequence classification model architecture support, covering three task types: regression, single_label_classification, and multi_label_classification. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/megatron/seq_cls
e. Support for VPP (Virtual Pipeline Parallelism): reduces pipeline bubbles in PP (Pipeline Parallelism), improving GPU utilization at the cost of slightly increased communication overhead. Supports heterogeneous PP via pipeline_model_parallel_layout for custom PP/VPP pipeline layouts.
f. In RLHF techniques such as DPO, the ref_model no longer initializes main_grad, reducing GPU memory consumption.
Training
a. Sequence parallelism optimization: Ulysses and Ring Attention can now be used together, enabling processing of even longer sequences. Supports SFT/DPO/GRPO training for both text-only and multimodal models. Training script: https://github.com/modelscope/ms-swift/blob/main/examples/train/sequence_parallel/sequence_parallel.sh
b. Padding-free training is now supported for embedding, reranker, and sequence classification tasks on both text-only and multimodal models, saving GPU memory and accelerating training.
c. Restructured dataset formats for embedding and reranker training. For details, refer to the documentation: https://swift.readthedocs.io/en/latest/BestPractices/Embedding.html, https://swift.readthedocs.io/en/latest/BestPractices/Reranker.html
d. Agent templates support more models: deepseek_v3_1, qwen3_coder. (Thanks to contributions from @gakkiri and @ray075hl)
e. Default value of load_from_cache_file changed from True to False to avoid unexpected issues caused by caching.
RLHF
a. GRPO now supports the CHORD algorithm, enabling mixed SFT training during GRPO. Documentation: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/CHORD.html
b. KTO supports padding-free and packing, reducing memory usage and accelerating training.
c. Padding-free implementation in GRPO has been refactored for better multimodal model support.
d. GRPO with vLLM now supports the environment variable PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True" to reduce GPU memory fragmentation.
Inference
a. Inference and deployment support for Reranker tasks (PyTorch/vLLM) and sequence classification tasks (PyTorch/vLLM). Example scripts: https://github.com/modelscope/ms-swift/tree/main/examples/deploy/reranker, https://github.com/modelscope/ms-swift/tree/main/examples/deploy/seq_cls

New Models

Text-only Models
a. Qwen/Qwen3-Next-80B-A3B-Instruct series. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_next
b. ZhipuAI/GLM-4.6
c. inclusionAI/Ling-mini-2.0; inclusionAI/Ring-mini-2.0 series
d. iic/Tongyi-DeepResearch-30B-A3B
e. ByteDance-Seed/Seed-OSS-36B-Instruct series (Thanks to @hpsun1109 for the contribution)
f. deepseek-ai/DeepSeek-V3.1-Terminus
g. PaddlePaddle/ERNIE-4.5-21B-A3B-Thinking
h. google/embeddinggemma-300m (embedding model)
Multimodal Models
a. Qwen/Qwen3-VL-30B-A3B-Instruct series. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_vl
b. Qwen/Qwen3-Omni-30B-A3B-Instruct series. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_omni
c. Kwai-Keye/Keye-VL-1_5-8B (Thanks to @hellopahe for the contribution)
d. OpenGVLab/InternVL3_5-1B-HF series
e. BytedanceDouyinContent/SAIL-VL2-2B series
f. stepfun-ai/Step-Audio-2-mini (Thanks to @CJack812 for the contribution)

What's Changed

Merge ulysses and ring-attention by @tastelikefeet in #5522
[bugfix] fix text_position_ids by @Jintao-Huang in #5692
[grpo] support CHORD algorithm by @hjh0119 in #5680
[doc] update chord doc by @hjh0119 in #5701
[bugfix]: use GCD to robustly configure sp and rp dimensions for any world_size by @0russwest0 in #5698
[megatron] Fix SP & LoRA by @Jintao-Huang in #5704
[megatron] Support ovis2.5 by @Jintao-Huang in #5719
[template] update get_env_args & load_from_cache_file by @Jintao-Huang in #5730
[bugfix] fix qwen3 swift pt by @Jintao-Huang in #5741
fix sp grpo by @tastelikefeet in #5744
Fix multiple input issue and more_params for web-ui by @slin000111 in #5739
[bugfix] set default padding side to left for generative reranker by @0russwest0 in #5751
[bugfix] correct multi-GPU reranker evaluation metric calculation by @0russwest0 in #5755
wrap base_model into get_llm_model by @tastelikefeet in #5749
[bugfix] fix forward_context by @Jintao-Huang in #5757
[bugfix] update use_barrier -> True by @Jintao-Huang in #5763
support Seed-OSS-36B-Instruct by @hpsun1109 in #5761
[bugfix] fix megatron model_type by @Jintao-Huang in #5767
Refactor grpo padding free by @tastelikefeet in #5769
Update seed.py by @hpsun1109 in #5725
[model] Support qwen3_next (transformers) by @Jintao-Huang in #5782
[megatron] fix text_position_ids by @Jintao-Huang in #5783
[model] su...

Contributors

hpsun1109, kiritoxkiriko, and 14 other contributors

Assets 2

01 Oct 14:01

Jintao-Huang

v3.8.3

35b37cb

Patch release v3.8.3

Full Changelog: v3.8.2...v3.8.3

Assets 2

23 Sep 15:18

Jintao-Huang

v3.8.2

a5e9c7d

Patch release v3.8.2

Full Changelog: v3.8.1...v3.8.2

Assets 2

15 Sep 15:55

Jintao-Huang

v3.8.1

8ec599e

Patch release v3.8.1

Full Changelog: v3.8.0...v3.8.1

Assets 2

09 Sep 02:38

Jintao-Huang

v3.8.0

a690057

v3.8.0

中文版

新特性

Megatron-SWIFT
a. 支持多模态模型训练，包含LoRA/全参训练（CPT/SFT/DPO）。目前支持了Qwen2-VL、Qwen2.5-VL、Qwen2.5-Omni、InternVL3、InternVL3.5、GLM-4.5V、Ovis2.5系列模型。文档参考：https://swift.readthedocs.io/zh-cn/latest/Megatron-SWIFT/%E5%A4%9A%E6%A8%A1%E6%80%81%E6%A8%A1%E5%9E%8B.html 。训练脚本参考：https://github.com/modelscope/ms-swift/tree/main/examples/megatron/multimoda
b. 支持Merge-LoRA，方便使用LoRA进行SFT后，Merge-LoRA进行DPO训练。文档参考：https://swift.readthedocs.io/zh-cn/latest/Megatron-SWIFT/LoRA%E8%AE%AD%E7%BB%83.html#merge-lora
c. 支持channel loss，使用--enable_channel_loss参数开启。在数据集中准备"channel"字段，ms-swift会根据该字段分组统计loss。数据集准备参考文档：https://swift.readthedocs.io/zh-cn/latest/Customization/%E8%87%AA%E5%AE%9A%E4%B9%89%E6%95%B0%E6%8D%AE%E9%9B%86.html#channel-loss
d. 支持对MoE的router部分进行LoRA训练。设置--target_modules all-router all-linear即可。
e. 支持deepspeed launcher启动训练。
f. 支持在权重转换时，将显存无法存放的hf_model部分offload到cpu。
GRPO
a. GRPO多轮重构，支持自由度更高的多轮训练，详见多轮训练文档：https://swift.readthedocs.io/zh-cn/latest/Instruction/GRPO/DeveloperGuide/%E5%A4%9A%E8%BD%AE%E8%AE%AD%E7%BB%83.html
b. --truncation_strategy delete参数支持跳过encode失败的数据。
训练
a. 支持了DFT loss, 在SFT训练中设置参数--enable_dft_loss true使用该功能。（含Megatron-SWIFT），实验结果参考此PR：#5355
b. 数据集中支持通过增加"loss"字段，控制每一轮对话是否计算损失。文档参考：https://swift.readthedocs.io/zh-cn/latest/Customization/%E8%87%AA%E5%AE%9A%E4%B9%89%E6%95%B0%E6%8D%AE%E9%9B%86.html#id4
c. 在混合思考模型的训练时，自动填充 no_think_prefix（当样本不含思考部分时），例如对Qwen/Qwen3-30B-A3B模型自动填充<think>\n\n</think>\n\n。
d. 支持early_stop_interval参数，确保best_metric在early_stop_interval个周期内没有提升时终止训练。
e. MoE训练参数router_aux_loss_coef默认值从config.json中读取修改为默认值为0。（Megatron-SWIFT同步修改）
f. channel loss重构，删除--channels参数，使用--enable_channel_loss参数替代。
g. 新增ROOT_IMAGE_DIR环境变量，指定图像（多模态）资源的根目录。
h. 支持DLRover flash checkpoint进行权重异步持久化（暂不支持safetensors格式），感谢招商银行技术团队的贡献。
i. Qwen2.5-Omni支持序列分类任务；并支持单样本中含混合模态数据训练。
j. 支持target_parameters参数，该特性需要安装"peft>=0.17.0"。
k. 支持GLM4.5 agent template。
RLHF
a. LD-DPO支持，使用ld_alpha参数对超出公共前缀部分的logps加权，抑制长度偏好。
b. DPO支持packing，提升训练吞吐量。（含Megatron-SWIFT）训练脚本参考：https://github.com/modelscope/ms-swift/blob/main/examples/train/packing/dpo.sh ；https://github.com/modelscope/ms-swift/blob/main/examples/megatron/rlhf/dpo/packing.sh
c. 数据集"rejected_messages"格式支持，提供比"rejected_response"格式更大的可拓展性（例如多模态/Agent场景）。文档参考：https://swift.readthedocs.io/zh-cn/latest/Customization/%E8%87%AA%E5%AE%9A%E4%B9%89%E6%95%B0%E6%8D%AE%E9%9B%86.html#rlhf
d. ref_adapters参数支持，方便在LoRA SFT之后衔接DPO/KTO/GRPO的场景。（在Megatron-SWIFT中，该参数名为ref_adapter_load）
e. DPO训练参数rpo_alpha默认值从1修改为None，与TRL参数默认值对齐。（Megatron-SWIFT同步修改）
全链路能力
a. swift eval模块升级使用"evalscope>=1.0"。
b. 推理RequestConfig支持return_details参数返回图像在template中缩放后的尺寸，方便grounding任务推理时画目标框。例子参考：https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo_grounding.py
c. vllm支持更多多模型模型：ovis2.5、interns1、internvl3.5。
d. vllm新增disable_cascade_attn参数支持。

新模型

纯文本模型：
a. deepseek-ai/DeepSeek-V3.1（含Megatron-SWIFT）
b. moonshotai/Kimi-K2-Instruct-0905
c. ByteDance-Seed/Seed-OSS-36B-Instruct
d. meituan-longcat/LongCat-Flash-Chat
e. google/gemma-3-270m-it 系列
多模态模型：
a. AIDC-AI/Ovis2.5-2B 系列（支持padding_free/packing），训练脚本参考：https://github.com/modelscope/ms-swift/blob/main/examples/models/ovis2/train.sh
b. OpenGVLab/InternVL3_5-1B 系列（含Megatron-SWIFT，支持混合模态数据集训练和padding_free/packing），训练脚本参考：https://github.com/modelscope/ms-swift/blob/main/examples/megatron/multimodal/moe/lora.sh
c. ZhipuAI/GLM-4.5V（含Megatron-SWIFT，支持混合模态数据集训练和padding_free/packing），训练脚本参考：https://github.com/modelscope/ms-swift/blob/main/examples/megatron/multimodal/moe/glm4_5v.sh
d. OpenBMB/MiniCPM-V-4_5
e. rednote-hilab/dots.ocr
f. Shanghai_AI_Laboratory/Intern-S1-mini 系列
g. mispeech/midashenglm-7b

English Version

New Features

Megatron-SWIFT
a. Supports multimodal model training including LoRA/full-parameter training (CPT/SFT/DPO). Currently supports Qwen2-VL, Qwen2.5-VL, Qwen2.5-Omni, InternVL3, InternVL3.5, GLM-4.5V, and Ovis2.5 series models. Documentation: https://swift.readthedocs.io/en/latest/Megatron-SWIFT/Multimodal-Model.html . Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/megatron/multimodal
b. Supports Merge-LoRA, enabling LoRA-based SFT followed by DPO training via merged LoRA weights. Documentation: https://swift.readthedocs.io/en/latest/Megatron-SWIFT/LoRA-Training.html#merge-lora
c. Supports channel loss via the --enable_channel_loss flag. Include a "channel" field in your dataset; ms-swift will group and compute loss accordingly. Dataset preparation guide: https://swift.readthedocs.io/en/latest/Customization/Custom-dataset.html#channel-loss
d. Supports LoRA training on MoE router components. Set --target_modules all-router all-linear.
e. Supports training launched via DeepSpeed launcher.
f. During weight conversion, supports offloading parts of the Hugging Face model that exceed GPU memory to CPU.
GRPO
a. GRPO multi-round refactoring enables higher flexibility in multi-round training. See detailed documentation: https://swift.readthedocs.io/en/latest/Instruction/GRPO/DeveloperGuide/multi_turn.html
b. The --truncation_strategy delete parameter skips data that fails encoding.
Training
a. Supports DFT loss. Enable with --enable_dft_loss true during SFT training (including Megatron-SWIFT). See experimental results in this PR: #5355
b. Datasets now support a "loss" field to control whether loss is computed for each conversation turn. Documentation: https://swift.readthedocs.io/en/latest/Customization/Custom-dataset.html#supervised-fine-tuning
c. Automatically fills no_think_prefix during mixed-thinking model training (when samples lack reasoning segments), e.g., inserts <think>\n\n</think>\n\n for Qwen/Qwen3-30B-A3B models.
d. Supports early_stop_interval parameter to terminate training if best_metric does not improve over early_stop_interval epochs.
e. Default value of MoE training parameter router_aux_loss_coef changed from config.json to 0 (also updated in Megatron-SWIFT).
f. Refactored channel loss: removed --channels parameter, replaced with --enable_channel_loss.
g. Added ROOT_IMAGE_DIR environment variable to specify root directory for image (multimodal) resources.
h. Supports DLRover flash checkpoint for asynchronous weight persistence (safetensors format not yet supported). Thanks to contributions from China Merchants Bank's tech team.
i. Qwen2.5-Omni now supports sequence classification tasks and training with mixed-modality data within a single sample.
j. Supports target_parameters feature (requires "peft>=0.17.0").
k. Supports GLM4.5 agent template.
RLHF
a. Supports LD-DPO: use ld_alpha to weight logps beyond the common prefix, suppressing length bias.
b. DPO supports packing to improve training throughput (including Megatron-SWIFT). Training script references:
https://github.com/modelscope/ms-swift/blob/main/examples/train/packing/dpo.sh ; https://github.com/modelscope/ms-swift/blob/main/examples/megatron/rlhf/dpo/packing.sh
c. Supports "rejected_messages" dataset format, offering greater extensibility than "rejected_response" (e.g., for multimodal/Agent scenarios). Documentation: https://swift.readthedocs.io/en/latest/Customization/Custom-dataset.html#dpo-orpo-cpo-simpo-rm
d. Supports ref_adapters parameter for seamless transition from LoRA SFT to DPO/KTO/GRPO (named ref_adapter_load in Megatron-SWIFT).
e. Default value of DPO training parameter rpo_alpha changed from 1 to None, aligning with TRL defaults (also updated in Megatron-SWIFT).
End-to-End Capabilities
a. Upgraded swift eval module to use "evalscope>=1.0".
b. Inference RequestConfig now supports return_details to return resized image dimensions after template processing, aiding bounding box drawing in grounding tasks. Example: https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo_grounding.py
c. vLLM now supports more multimodal models: ovis2.5, interns1, internvl3.5.
d. vLLM adds disable_cascade_attn parameter support.

New Models

Text-only Models:
a. deepseek-ai/DeepSeek-V3.1 (including Megatron-SWIFT)
b. moonshotai/Kimi-K2-Instruct-0905
c. ByteDance-Seed/Seed-OSS-36B-Instruct
d. meituan-longcat/LongCat-Flash-Chat
e. google/gemma-3-270m-it series
Multimodal Models:
a. AIDC-AI/Ovis2.5-2B series (supports padding-free/packing). Training script: https://github.com/modelscope/ms-swift/blob/main/examples/models/ovis2/train.sh
b. OpenGVLab/InternVL3_5-1B series (including Megatron-SWIFT, supports mixed-modality datasets and padding-free/packing). Training script: https://github.com/modelscope/ms-swift/blob/main/examples/megatron/multimodal/moe/lora.sh
c. ZhipuAI/GLM-4.5V (including Megatron-SWIFT, supports mixed-modality datasets and padding-free/packing). Training script: https://github.com/modelscope/ms-swift/blob/main/examples/megatron/multimodal/moe/glm4_5v.sh
d. OpenBMB/MiniCPM-V-4_5
e. rednote-hilab/dots.ocr
f. Shanghai_AI_Laboratory/Intern-S1-mini series
g. mispeech/midashenglm-7b