Releases: modelscope/ms-swift
Patch release v3.7.3
Full Changelog: v3.7.2...v3.7.3
Patch release v3.7.2
Full Changelog: v3.7.1...v3.7.2
Patch release v3.7.1
Full Changelog: v3.7.0...v3.7.1
v3.7.0
English Version
New Features
- GRPO
a. Added support for the GSPO algorithm. Use --importance_sampling_level sequence during GRPO training. Docs: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/GSPO.html
b. GRPO server mode now supports multi-node rollout; pass in multiple vllm_server_host/port values. Example script: https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/multi_node/server_multi_node.sh
c. GRPO rollout is now compatible with the GYM environment specification (thanks to contributor Mouse). Docs: https://swift.readthedocs.io/en/latest/Instruction/GRPO/DeveloperGuide/gym_env.html
d. Added entropy_mask for filtering low-entropy tokens out of the loss computation; the logger now also tracks entropy dynamics. Docs: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/entropy_mask.html
e. Added support for training with the multi-turn DeepEyes algorithm. Docs: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/deepeyes.html
f. GRPO supports --truncation_strategy delete: samples whose input length exceeds max_length are removed and resampled.
- Megatron-SWIFT
a. Added LoRA training (currently CPT/SFT/DPO), which significantly accelerates MoE training.
- Docs: https://swift.readthedocs.io/en/latest/Instruction/Megatron-SWIFT-Training.html#lora-training
- Scripts: https://github.com/modelscope/ms-swift/tree/main/examples/train/megatron/lora
b. Added loss-scaling to simplify Agent training. Script: https://github.com/modelscope/ms-swift/blob/main/examples/train/megatron/lora/loss_scale.sh
c. Default megatron-core version upgraded to 0.13.
d. Added the bshd tensor format to make a custom attention_mask easier.
e. Logging improvements: GPU memory and estimated remaining training time are now printed, and training logs are written to logging.jsonl.
f. Faster model loading and conversion, plus a progress bar for model loading.
- Training
a. Added Flash-Attention-3 support (including Megatron-SWIFT). Scripts: https://github.com/modelscope/ms-swift/tree/main/examples/train/flash_attention_3
b. New --new_special_tokens flag for adding special tokens. Scripts: https://github.com/modelscope/ms-swift/tree/main/examples/train/new_special_tokens
c. New --cached_dataset flag for offline tokenization in CPT/SFT. Scripts: https://github.com/modelscope/ms-swift/tree/main/examples/export/cached_dataset
d. Re-implemented the sequence-packing module: packing is faster, and disk storage for multimodal packing is optimized.
e. Qwen2.5-VL mixed-modality data (multiple modalities in a single sample) is now supported with DeepSpeed training.
f. Multimodal training now supports loss_scale.
g. rope_scaling now accepts a dict; in addition, setting max_model_len can automatically adjust the rope_scaling factor.
h. Added DeepSpeed-AutoTP (not compatible with LoRA).
i. Multimodal packing is compatible with transformers ≥ 4.53; sequence parallelism is compatible with transformers ≥ 4.52.
j. With resume_only_model, data skipping is now performed by default; control it via ignore_data_skip.
k. MoE training supports router_aux_loss_coef.
l. The template gets a max_length truncation safeguard: image/video tokens are never truncated.
m. tuner_backend unsloth now supports MoE models, device_map, and DDP.
n. Embedding training supports liger_kernel.
- RLHF
a. Added MPO training. Script: https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/mpo.sh
b. Multimodal DPO now supports rejected image inputs: add a rejected_images column to the dataset.
- Inference & Deployment
a. Added inference and deployment for embedding models across the pt/vllm/sglang back-ends (infer_backend). Scripts: https://github.com/modelscope/ms-swift/tree/main/examples/deploy/embedding
b. InferEngine supports return_details to output prompt_token_ids and token_ids.
c. vLLM back-end now supports more multimodal models: ovis2, glm4_1v, keye-vl, kimi-vl, glm4v, phi4-multimodal, llama4.
d. vLLM arguments refactored: parameter names now start with the vllm_ prefix. The GRPO module reuses the same vLLM arguments.
- Export
a. QLoRA now supports Merge-LoRA. Scripts: https://github.com/modelscope/ms-swift/tree/main/examples/train/qlora
b. Added FP8 / BNB quantization for MoE and multimodal models. Scripts: https://github.com/modelscope/ms-swift/tree/main/examples/export/quantize
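The --importance_sampling_level sequence option listed under GRPO above switches the importance ratio from one value per token to one value per completion. The following is a minimal sketch of that idea in plain Python, assuming per-token log-probabilities are already available; the function names are hypothetical and this is not the ms-swift implementation, only an illustration of the length-normalized sequence ratio described in the GSPO documentation.

```python
import math
from typing import List, Sequence


def token_level_ratios(new_logps: Sequence[float],
                       old_logps: Sequence[float]) -> List[float]:
    """Token-level view: one importance ratio per generated token."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]


def sequence_level_ratio(new_logps: Sequence[float],
                         old_logps: Sequence[float]) -> float:
    """Sequence-level view: a single length-normalized ratio.

    Equals (pi_new(y|x) / pi_old(y|x)) ** (1 / len(y)), computed in log space
    as the exponential of the mean per-token log-ratio.
    """
    if len(new_logps) != len(old_logps) or len(new_logps) == 0:
        raise ValueError("log-prob lists must be non-empty and equally long")
    mean_log_ratio = sum(n - o for n, o in zip(new_logps, old_logps)) / len(new_logps)
    return math.exp(mean_log_ratio)


def clipped_objective(ratio: float, advantage: float, eps: float) -> float:
    """PPO-style clipped surrogate applied to one ratio (eps is illustrative)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    # Hypothetical per-token log-probs for one three-token completion.
    new = [-1.0, -2.0, -0.5]
    old = [-1.2, -1.8, -0.5]
    print("token ratios:", token_level_ratios(new, old))
    print("sequence ratio:", sequence_level_ratio(new, old))
```

With the sequence-level ratio, every token of a completion shares the same weight and is clipped together, whereas the token-level ratios can move in opposite directions inside one completion.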
New Models
- Text-only
a. Qwen/Qwen3-235B-A22B-[Instruct/Thinking]-2507, Qwen/Qwen3-Coder-480B-A35B-Instruct, and Qwen/Qwen3-4B-[Instruct/Thinking]-2507 (Megatron-SWIFT supported). Training script: #5033
b. openai-mirror/gpt-oss-20b family. Best-practice: #5277
c. ZhipuAI/GLM-4.5 family (Megatron-SWIFT supported). Training script: https://github.com/modelscope/ms-swift/blob/main/examples/train/megatron/lora/glm4_5_106b.sh
d. Hunyuan-7B-Instruct family. Best-practice: #5236
e. mistralai/Devstral-Small-2505
- Multimodal
a. OpenBMB/MiniCPM-V-4. Training script: https://github.com/modelscope/ms-swift/blob/main/examples/models/minicpmv/train.sh
What's Changed
- [grpo] fix server arg check by @hjh0119 in #4865
- [SP] clean up imports by @hjh0119 in #4878
- fix loss_scale sp by @tastelikefeet in #4880
- fix seq_cls generation_config by @Jintao-Huang in #4882
- optimize imports by @tastelikefeet in #4883
- [model] fix qwen eos_token by @Jintao-Huang in #4888
- Fix: Correct training hang for Keye-VL on DeepSpeed with mixed data by @0russwest0 in #4889
- [megatron] support LoRA & support loss_scale by @Jintao-Huang in #4812
- update framework.txt by @Jintao-Huang in #4896
- [megatron] fix pp mla by @Jintao-Huang in https://gi...
Patch release v3.6.4
Full Changelog: v3.6.3...v3.6.4
Patch release v3.6.3
Full Changelog: v3.6.2...v3.6.3
Patch release v3.6.2
Full Changelog: v3.6.1...v3.6.2
Patch release v3.6.1
Full Changelog: v3.6.0...v3.6.1
v3.6.0
English Version
New Features
- Megatron-SWIFT:
a. Support for more MoE model architectures, including: DeepseekV3ForCausalLM, Dots1ForCausalLM, and Ernie4_5_MoeForCausalLM. Training script reference: https://github.com/modelscope/ms-swift/tree/main/examples/train/megatron/moe
b. Support for more Dense model architectures, including: MiMoForCausalLM, InternLM3ForCausalLM, and Ernie4_5_ForCausalLM. Training script reference: https://github.com/modelscope/ms-swift/tree/main/examples/train/megatron/dense
c. DPO training supported. Training script reference: https://github.com/modelscope/ms-swift/tree/main/examples/train/megatron/rlhf/dpo
d. FP8 training supported.
e. More rope scaling types supported, including: default, linear, yarn, dynamic, longrope, llama3, etc.
f. --test_convert_precision parameter optimized for easier testing of weight conversion precision between mcore and huggingface models.
- GRPO:
a. GRPO multi-turn training refactored, supporting accelerated multi-turn inference with AsyncEngine. Documentation: https://swift.readthedocs.io/zh-cn/latest/Instruction/GRPO/DeveloperGuide/%E5%A4%9A%E8%BD%AE%E8%AE%AD%E7%BB%83.html
b. The offload_model parameter now also offloads the reference model.
c. Optimized GPU memory management under sleep_level and offload_model parameters.
d. Added trainer_state as an input parameter to reward_funcs, making it easier to obtain the current and total training steps.
- Training:
a. Reranker training supported. Training script reference: https://github.com/modelscope/ms-swift/tree/main/examples/train/reranker
b. CPT/SFT/DPO/GRPO pure-text large model training supports ring-attention sequence length partitioning, reducing memory usage. Training script reference: https://github.com/modelscope/ms-swift/tree/main/examples/train/long_text/ring_attention
c. Channel loss in CPT/SFT training is compatible with padding_free and packing. Thanks to the technical team at China Merchants Bank for their contribution.
d. Optimized remove_unused_columns parameter. When set to False, extra dataset columns are passed to the Trainer for custom loss functions.
e. The default value for split_dataset_ratio changed from 0.01 to 0, so no validation set is split off by default. To get one, set --split_dataset_ratio or --val_dataset manually.
f. Fixed loss alignment issue between packing/padding_free for multimodal models. For details, see this PR: #4838
g. Swanlab now supports Feishu (Lark Suite) notification callback after training is completed.
- RLHF:
a. Pure-text and multimodal models support GKD training, with some scenarios supporting padding_free and packing. Training scripts:
i. Large models: https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/gkd.sh
ii. Multimodal large models: https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/rlhf/gkd.sh
b. Reward model training now supports the margin parameter. Documentation: https://swift.readthedocs.io/zh-cn/latest/Instruction/%E4%BA%BA%E7%B1%BB%E5%AF%B9%E9%BD%90.html#rm
- Full Pipeline:
a. The SGLang inference engine can be used to accelerate the ms-swift inference/deployment/evaluation/ui modules by setting --infer_backend sglang. Inference script reference: https://github.com/modelscope/ms-swift/tree/main/examples/infer/sglang
b. FP8 quantization supported. Quantization script reference: https://github.com/modelscope/ms-swift/blob/main/examples/export/quantize/fp8.sh
- Web-UI:
a. Supports SFT/RLHF/GRPO training on different Tab pages, and saves training command lines.
b. Web-UI interface supports data sampling.
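The trainer_state argument added to reward_funcs (GRPO item d above) makes step-dependent rewards possible. Below is a hedged sketch of such a reward, written as a plain callable so it runs without ms-swift installed. It assumes the state is passed as a keyword argument named trainer_state and exposes global_step and max_steps fields, as the Hugging Face TrainerState does; the class names and the length-penalty rule are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class FakeTrainerState:
    """Stand-in for the real trainer state; only the two fields used here."""
    global_step: int
    max_steps: int


class ProgressScaledLengthReward:
    """Hypothetical reward: penalize over-long completions more as training proceeds."""

    def __init__(self, target_len: int = 200) -> None:
        self.target_len = target_len

    def __call__(self, completions: List[str],
                 trainer_state: Optional[Any] = None, **kwargs: Any) -> List[float]:
        # Fraction of training completed; 0.0 when no state is supplied.
        progress = 0.0
        if trainer_state is not None and trainer_state.max_steps > 0:
            progress = trainer_state.global_step / trainer_state.max_steps

        rewards = []
        for text in completions:
            # Relative overshoot beyond the target length, capped at 1.0.
            overflow = max(0, len(text) - self.target_len) / self.target_len
            rewards.append(1.0 - progress * min(1.0, overflow))
        return rewards


if __name__ == "__main__":
    reward = ProgressScaledLengthReward(target_len=10)
    state = FakeTrainerState(global_step=50, max_steps=100)
    print(reward(["short", "x" * 20], trainer_state=state))
```

Accepting **kwargs keeps the callable tolerant of any other dataset columns or arguments the trainer may pass alongside the completions.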
New Models
- Multimodal Models:
a. ZhipuAI/GLM-4.1V-9B-Thinking series
b. Kwai-Keye/Keye-VL-8B-Preview
c. moonshotai/Kimi-VL-A3B-Thinking-2506
d. google/gemma-3n-E2B-it series
- Pure Text Models:
a. PaddlePaddle/ERNIE-4.5-21B-A3B-PT series
b. rednote-hilab/dots.llm1.inst series
c. Tencent-Hunyuan/Hunyuan-A13B-Instruct
d. MiniMax/MiniMax-M1-80k series (inference)
e. moonshotai/Kimi-Dev-72B
f. cognitivecomputations/DeepSeek-R1-0528-AWQ
What's Changed
- fix emb script and docs by @tastelikefeet in #4521
- [grpo] update doc about move_model_batches by @hjh0119 in #4523
- fix LoraModel by @Jintao-Huang in #4536
- support cognitivecomputations/DeepSeek-R1-0528-AWQ by @Jintao-Huang in #4537
- fix: handle INFONCE_HARD_NEGATIVES as integer if provided by @dlutwy in #4545
- fix qwen3 embedding saving by @tastelikefeet in #4548
- [megatron/dpo] fix megatron packing_cache & update DPOTrainer by @Jintao-Huang in #4556
- [megatron] support DPO by @Jintao-Huang in #4193
- support dots1 by @Jintao-Huang in #4560
- [grpo] support offloading reference model by @hjh0119 in #4554
- [grpo] fix the pickle data collator by @hjh0119 in #4562
- [dataset] fix toolbench (local) by @Jintao-Huang in #4563
- [Bug]Fix ulysses train steps, embedding negative sample length by @tastelikefeet in #4565
- fix args.json by @Jintao-Huang in #4566
- [model] fix ovis gradient_checkpointing vit no_grad by @Jintao-Huang in #4571
- [megatron] Fix megatron all_reduce warning by @Jintao-Huang in #4568
- [grpo] remove data collator to top-level to avoid pickle error in spawn mode by @hjh0119 in #4582
- [grpo] model weight synchronization before first turn rollout with async generation by @hjh0119 in #4584
- [megatron] support more rope_scaling & support deepseek-r1-qwen3-8b/internlm3/mimo-7b by @Jintao-Huang in #4576
- [grpo] restore num_generations check by @hjh0119 in #4590
- fix gc_kwargs by @Jintao-Huang in #4591
- Fix UI llm_train by @slin000111 in #4592
- [mirror] update swift mirror by @Jintao-Huang in #4601
- [megatron] compat megatron-core main branch by @Jintao-Huang in https://github.com/modelscope/ms-swift...
Patch release v3.5.3
Full Changelog: v3.5.2...v3.5.3