
Releases: modelscope/ms-swift

Patch release v3.7.3

30 Aug 08:28

Patch release v3.7.2

21 Aug 16:17

Patch release v3.7.1

16 Aug 09:37

v3.7.0

07 Aug 07:05

中文版

新特性

  1. GRPO:
    a. 支持GSPO算法,在GRPO训练中使用参数--importance_sampling_level sequence,参考文档:https://swift.readthedocs.io/zh-cn/latest/Instruction/GRPO/AdvancedResearch/GSPO.html
    b. GRPO server mode 支持多机 rollout,支持传入多个 vllm_server_host/port,参考脚本:https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/multi_node/server_multi_node.sh
    c. GRPO rollout 兼容 GYM 环境规范(感谢开发者Mouse的贡献),参考文档 https://swift.readthedocs.io/zh-cn/latest/Instruction/GRPO/DeveloperGuide/GYM%E7%8E%AF%E5%A2%83%E8%AE%AD%E7%BB%83.html
    d. GRPO 支持 entropy_mask 来过滤低熵token损失计算,同时logger支持记录熵值动态,参考文档https://swift.readthedocs.io/zh-cn/latest/Instruction/GRPO/AdvancedResearch/entropy_mask.html
    e. 支持多轮算法DeepEyes训练,文档参考:https://swift.readthedocs.io/zh-cn/latest/Instruction/GRPO/AdvancedResearch/deepeyes.html
    f. GRPO 支持--truncation_strategy delete,删除输入长度超过max_length的数据,并重新采样。
  2. Megatron-SWIFT:
    a. 支持LoRA训练,现支持CPT/SFT/DPO,显著加速MoE训练。
    - 文档参考:https://swift.readthedocs.io/zh-cn/latest/Instruction/Megatron-SWIFT%E8%AE%AD%E7%BB%83.html#lora
    - 训练脚本:https://github.com/modelscope/ms-swift/tree/main/examples/train/megatron/lora
    b. 支持loss scale,方便Agent训练,训练脚本参考:https://github.com/modelscope/ms-swift/blob/main/examples/train/megatron/lora/loss_scale.sh
    c. 默认megatron-core版本升级至0.13。
    d. 支持bshd格式,方便自定义attention_mask。
    e. 日志优化:新增GPU占用、剩余训练时间等信息打印,并输出logging.jsonl存储训练日志。
    f. 模型加载与转换速度优化,并增加模型加载进度条。
  3. 训练:
    a. 支持Flash-Attention-3(含Megatron-SWIFT),训练脚本参考:https://github.com/modelscope/ms-swift/tree/main/examples/train/flash_attention_3
    b. 新增--new_special_tokens参数,方便新增特殊tokens。训练脚本参考: https://github.com/modelscope/ms-swift/tree/main/examples/train/new_special_tokens
    c. 新增--cached_dataset参数,支持CPT/SFT的离线tokenize。训练脚本参考:https://github.com/modelscope/ms-swift/tree/main/examples/export/cached_dataset
    d. 序列Packing模块重构。加速Packing速度,并对多模态packing的磁盘存储问题优化。
    e. 支持Qwen2.5-VL混合模态数据(即单条数据中含多种模态) + deepspeed训练。
    f. 多模态模型训练支持 loss_scale。
    g. rope_scaling 支持传入字典,此外支持设置 max_model_len 对 rope_scaling 的 factor 自动调整。
    h. 支持DeepSpeed-AutoTP(该技术不支持LoRA)。
    i. 多模态Packing兼容 transformers>=4.53;序列并行兼容 transformers>=4.52。
    j. resume_only_model默认将进行数据跳过,并使用ignore_data_skip参数进行控制。
    k. MoE模型训练支持 router_aux_loss_coef 参数。
    l. template新增max_length裁剪保护机制,不对图像/视频等tokens进行裁剪。
    m. tuner_backend unsloth 支持moe模型、device_map和DDP。
    n. embedding训练支持liger_kernel。
  4. RLHF:
    a. 支持MPO训练,训练脚本参考:https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/mpo.sh
    b. 多模态DPO支持了拒绝图片输入,在数据集中加入rejected_images列。
  5. 推理部署:
    a. 支持embedding系列模型的推理部署,包括pt/vllm/sglang的infer_backend。部署脚本参考:https://github.com/modelscope/ms-swift/tree/main/examples/deploy/embedding
    b. InferEngine支持return_details参数,以输出prompt_token_ids和token_ids。
    c. vLLM推理引擎兼容更多多模态模型:ovis2, glm4_1v, keye-vl, kimi-vl, glm4v, phi4-multimodal, llama4。
    d. vLLM参数重构,参数名前加入vllm_前缀。GRPO模块复用vLLM参数。
  6. 导出:
    a. QLoRA支持Merge-LoRA,脚本参考:https://github.com/modelscope/ms-swift/tree/main/examples/train/qlora
    b. 支持MoE/多模态模型的FP8/BNB量化,脚本参考:https://github.com/modelscope/ms-swift/tree/main/examples/export/quantize

新模型

  1. 纯文本模型:
    a. Qwen/Qwen3-235B-A22B-[Instruct/Thinking]-2507, Qwen/Qwen3-Coder-480B-A35B-Instruct, Qwen/Qwen3-4B-[Instruct/Thinking]-2507系列(含Megatron-SWIFT),训练脚本参考:#5033
    b. openai-mirror/gpt-oss-20b系列,最佳实践参考:#5277
    c. ZhipuAI/GLM-4.5系列(含Megatron-SWIFT),训练脚本参考:https://github.com/modelscope/ms-swift/blob/main/examples/train/megatron/lora/glm4_5_106b.sh
    d. Hunyuan-7B-Instruct系列,最佳实践参考:#5236
    e. mistralai/Devstral-Small-2505
  2. 多模态模型:
    a. OpenBMB/MiniCPM-V-4,训练脚本参考:https://github.com/modelscope/ms-swift/blob/main/examples/models/minicpmv/train.sh

English Version

New Features

  1. GRPO
    a. Added support for the GSPO algorithm. Use --importance_sampling_level sequence during GRPO training. Docs: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/GSPO.html
    b. GRPO “server mode” now supports multi-node rollout; pass in multiple vllm_server_host/port. Example script: https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/multi_node/server_multi_node.sh
    c. GRPO rollout is now GYM-compatible (thanks to contributor Mouse). Docs: https://swift.readthedocs.io/en/latest/Instruction/GRPO/DeveloperGuide/gym_env.html
    d. Added entropy_mask for filtering low-entropy tokens during loss computation, and the logger now tracks entropy dynamics. Docs: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/entropy_mask.html
    e. Added support for the multi-round DeepEyes algorithm. Docs: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/deepeyes.html
    f. GRPO supports --truncation_strategy delete: remove samples whose input length exceeds max_length and resample.
  2. Megatron-SWIFT
    a. Added LoRA training, currently for CPT/SFT/DPO, which significantly speeds up MoE training.
    - Docs: https://swift.readthedocs.io/en/latest/Instruction/Megatron-SWIFT-Training.html#lora-training
    - Scripts: https://github.com/modelscope/ms-swift/tree/main/examples/train/megatron/lora
    b. Added loss-scaling to simplify Agent training. Script: https://github.com/modelscope/ms-swift/blob/main/examples/train/megatron/lora/loss_scale.sh
    c. Default megatron-core upgraded to 0.13.
    d. Added bshd tensor format to facilitate custom attention_mask.
    e. Logging improvements: prints GPU memory, estimated remaining time, and writes logging.jsonl.
    f. Faster model loading & conversion plus a progress bar.
  3. Training
    a. Added Flash-Attention-3 support (including Megatron-SWIFT). Scripts: https://github.com/modelscope/ms-swift/tree/main/examples/train/flash_attention_3
    b. New --new_special_tokens flag for adding special tokens. Scripts: https://github.com/modelscope/ms-swift/tree/main/examples/train/new_special_tokens
    c. New --cached_dataset flag for offline tokenization in CPT/SFT. Scripts: https://github.com/modelscope/ms-swift/tree/main/examples/export/cached_dataset
    d. Re-implemented the sequence-packing module for faster packing and better multimodal disk I/O.
    e. Added support for training Qwen2.5-VL on mixed-modality data (multiple modalities in a single sample) with DeepSpeed.
    f. Multimodal training now supports loss-scaling.
    g. rope_scaling now accepts a dict; in addition, setting max_model_len automatically adjusts the rope_scaling factor.
    h. Added DeepSpeed-AutoTP (not compatible with LoRA).
    i. Multimodal packing is compatible with transformers ≥ 4.53; sequence parallelism with transformers ≥ 4.52.
    j. resume_only_model now skips already-consumed data by default; use ignore_data_skip to control this behavior.
    k. MoE training supports router_aux_loss_coef.
    l. Templates now include a max_length truncation safeguard: image/video tokens are never truncated.
    m. tuner_backend unsloth now supports MoE models, device_map, and DDP.
    n. Embedding training supports liger_kernel.
  4. RLHF
    a. Added MPO training. Script: https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/mpo.sh
    b. Multimodal DPO now supports rejected images as input: add a rejected_images column to the dataset.
  5. Inference & Deployment
    a. Added deployment for embedding models across pt/vllm/sglang back-ends. Scripts: https://github.com/modelscope/ms-swift/tree/main/examples/deploy/embedding
    b. InferEngine supports return_details to output prompt_token_ids and token_ids.
    c. vLLM back-end now supports more multimodal models: ovis2, glm4_1v, keye-vl, kimi-vl, glm4v, phi4-multimodal, llama4.
    d. vLLM arguments refactored: all start with the vllm_ prefix. GRPO module reuses the same options.
  6. Export
    a. QLoRA now supports Merge-LoRA. Scripts: https://github.com/modelscope/ms-swift/tree/main/examples/train/qlora
    b. Added FP8 / BNB quantization for MoE and multimodal models. Scripts: https://github.com/modelscope/ms-swift/tree/main/examples/export/quantize
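
To illustrate item 1a above (GSPO via --importance_sampling_level sequence), here is a minimal sketch of the difference between token-level and sequence-level importance ratios. It follows the length-normalized ratio described in the GSPO paper and is not taken from the ms-swift source; the function names and the plain-list inputs are assumptions made for illustration only.

```python
import math


def token_level_ratios(new_logps, old_logps):
    """One importance ratio per token (token-level sampling, as in plain GRPO)."""
    if len(new_logps) != len(old_logps):
        raise ValueError("log-prob lists must have the same length")
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]


def sequence_level_ratio(new_logps, old_logps):
    """One length-normalized ratio per sequence (sequence-level sampling).

    Equals the geometric mean of the token-level ratios:
    exp(mean(new_logp - old_logp)).
    """
    if len(new_logps) != len(old_logps) or not new_logps:
        raise ValueError("log-prob lists must be non-empty and of equal length")
    mean_diff = sum(n - o for n, o in zip(new_logps, old_logps)) / len(new_logps)
    return math.exp(mean_diff)


# Hypothetical per-token log-probabilities of one sampled completion
# under the current policy and under the policy that generated it.
new = [-0.5, -1.0, -2.0]
old = [-0.7, -1.1, -1.8]
print(token_level_ratios(new, old))
print(sequence_level_ratio(new, old))
```

With sequence-level sampling, every token of a completion shares the same ratio, so clipping acts on whole responses rather than on individual tokens.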

New Models

  1. Text-only
    a. Qwen/Qwen3-235B-A22B-[Instruct/Thinking]-2507, Qwen/Qwen3-Coder-480B-A35B-Instruct, and Qwen/Qwen3-4B-[Instruct/Thinking]-2507 (Megatron-SWIFT supported). Training script: #5033
    b. openai-mirror/gpt-oss-20b family. Best-practice: #5277
    c. ZhipuAI/GLM-4.5 family (Megatron-SWIFT supported). Training script: https://github.com/modelscope/ms-swift/blob/main/examples/train/megatron/lora/glm4_5_106b.sh
    d. Hunyuan-7B-Instruct family. Best-practice: #5236
    e. mistralai/Devstral-Small-2505
  2. Multimodal
    a. OpenBMB/MiniCPM-V-4. Training script: https://github.com/modelscope/ms-swift/blob/main/examples/models/minicpmv/train.sh


Patch release v3.6.4

02 Aug 06:35

Patch release v3.6.3

29 Jul 06:24

Patch release v3.6.2

18 Jul 08:18

Patch release v3.6.1

11 Jul 02:14

v3.6.0

08 Jul 03:35

中文版

新特性

  1. Megatron-SWIFT:
    a. 支持更多的 MoE 模型结构,包括:DeepseekV3ForCausalLM、Dots1ForCausalLM 和 Ernie4_5_MoeForCausalLM。训练脚本参考:https://github.com/modelscope/ms-swift/tree/main/examples/train/megatron/moe
    b. 支持更多的 Dense 模型结构,包括:MiMoForCausalLM、InternLM3ForCausalLM 和 Ernie4_5_ForCausalLM。训练脚本参考:https://github.com/modelscope/ms-swift/tree/main/examples/train/megatron/dense
    c. 支持 DPO 训练。训练脚本参考:https://github.com/modelscope/ms-swift/tree/main/examples/train/megatron/rlhf/dpo
    d. 支持 FP8 训练。
    e. 支持更多 rope scaling 类型,包括:default、linear、yarn、dynamic、longrope、llama3 等。
    f. --test_convert_precision参数优化,方便测试 mcore 与 huggingface 模型权重转换精度。
  2. GRPO:
    a. GRPO 多轮训练重构,支持使用 AsyncEngine 加速多轮推理,参考文档:https://swift.readthedocs.io/zh-cn/latest/Instruction/GRPO/DeveloperGuide/%E5%A4%9A%E8%BD%AE%E8%AE%AD%E7%BB%83.html
    b. offload_model 参数额外对参考模型进行卸载。
    c. 优化 sleep_level 和 offload_model 参数下的显存管理。
    d. reward_funcs 增加了 trainer_state 入参,方便获取当前训练步数和总步数。
  3. 训练:
    a. 支持 reranker 训练,训练脚本参考:https://github.com/modelscope/ms-swift/tree/main/examples/train/reranker
    b. CPT/SFT/DPO/GRPO 纯文本大模型训练支持 ring-attention 切分序列长度,降低显存占用。训练脚本参考:https://github.com/modelscope/ms-swift/tree/main/examples/train/long_text/ring_attention
    c. channel loss 在CPT/SFT训练时,兼容 padding_free 与 packing。 感谢招商银行技术团队的贡献。
    d. remove_unused_columns 参数优化。设置为 False,则将额外数据集传递至 Trainer 内,方便自定义损失函数。
    e. split_dataset_ratio参数默认值从0.01修改为0,默认不再进行验证集切分,需要手动设置--split_dataset_ratio或者--val_dataset
    f. 多模态模型 packing/padding_free 损失对齐问题修复。详见此PR:#4838
    g. swanlab 支持训练完成后的飞书通知回调。
  4. RLHF:
    a. 纯文本/多模态模型支持 GKD 训练,部分场景下支持 padding_free 和 packing,训练脚本如下:
    i. 大模型:https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/gkd.sh
    ii. 多模态大模型:https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/rlhf/gkd.sh
    b. reward model 训练支持 margin 参数,参考文档:https://swift.readthedocs.io/zh-cn/latest/Instruction/%E4%BA%BA%E7%B1%BB%E5%AF%B9%E9%BD%90.html#rm
  5. 全链路:
    a. 支持使用 SGLang 推理引擎对 ms-swift 推理/部署/评测/ui模块进行加速,设置--infer_backend sglang即可。推理脚本参考:https://github.com/modelscope/ms-swift/tree/main/examples/infer/sglang
    b. 支持 FP8 量化,量化脚本参考:https://github.com/modelscope/ms-swift/blob/main/examples/export/quantize/fp8.sh
  6. Web-UI:
    a. 支持 SFT/RLHF/GRPO 在不同 Tab 页面训练,支持保存训练命令行。
    b. Web-UI 界面支持数据采样。

新模型

  1. 多模态模型:
    a. ZhipuAI/GLM-4.1V-9B-Thinking系列
    b. Kwai-Keye/Keye-VL-8B-Preview
    c. moonshotai/Kimi-VL-A3B-Thinking-2506
    d. google/gemma-3n-E2B-it系列
  2. 纯文本模型:
    a. PaddlePaddle/ERNIE-4.5-21B-A3B-PT系列
    b. rednote-hilab/dots.llm1.inst系列
    c. Tencent-Hunyuan/Hunyuan-A13B-Instruct
    d. MiniMax/MiniMax-M1-80k系列(推理)
    e. moonshotai/Kimi-Dev-72B
    f. cognitivecomputations/DeepSeek-R1-0528-AWQ

English Version

New Features

  1. Megatron-SWIFT:
    a. Support for more MoE model architectures, including: DeepseekV3ForCausalLM, Dots1ForCausalLM, and Ernie4_5_MoeForCausalLM. Training script reference: https://github.com/modelscope/ms-swift/tree/main/examples/train/megatron/moe
    b. Support for more Dense model architectures, including: MiMoForCausalLM, InternLM3ForCausalLM, and Ernie4_5_ForCausalLM. Training script reference: https://github.com/modelscope/ms-swift/tree/main/examples/train/megatron/dense
    c. DPO training supported. Training script reference: https://github.com/modelscope/ms-swift/tree/main/examples/train/megatron/rlhf/dpo
    d. FP8 training supported.
    e. More rope scaling types supported, including: default, linear, yarn, dynamic, longrope, llama3, etc.
    f. --test_convert_precision parameter optimized for easier testing of weight conversion precision between mcore and huggingface models.
  2. GRPO:
    a. GRPO multi-turn training refactored, supporting accelerated multi-turn inference with AsyncEngine. Documentation: https://swift.readthedocs.io/zh-cn/latest/Instruction/GRPO/DeveloperGuide/%E5%A4%9A%E8%BD%AE%E8%AE%AD%E7%BB%83.html
    b. The offload_model parameter now also offloads the reference model.
    c. Optimized GPU memory management under sleep_level and offload_model parameters.
    d. Added trainer_state as an input parameter to reward_funcs, making it easier to obtain the current and total training steps.
  3. Training:
    a. Reranker training supported. Training script reference: https://github.com/modelscope/ms-swift/tree/main/examples/train/reranker
    b. CPT/SFT/DPO/GRPO training of text-only LLMs supports ring-attention, which splits the sequence length across devices to reduce GPU memory usage. Training script reference: https://github.com/modelscope/ms-swift/tree/main/examples/train/long_text/ring_attention
    c. Channel loss in CPT/SFT training is compatible with padding_free and packing. Thanks to the technical team at China Merchants Bank for their contribution.
    d. Optimized remove_unused_columns parameter. When set to False, extra dataset columns are passed to the Trainer for custom loss functions.
    e. The default value for split_dataset_ratio changed from 0.01 to 0, so the validation set is not split by default. You now need to manually set --split_dataset_ratio or --val_dataset.
    f. Fixed loss alignment issue between packing/padding_free for multimodal models. For details, see this PR: #4838
    g. Swanlab now supports Feishu (Lark Suite) notification callback after training is completed.
  4. RLHF:
    a. Pure-text and multimodal models support GKD training, with some scenarios supporting padding_free and packing. Training scripts:
    i. Large models: https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/gkd.sh
    ii. Multimodal large models: https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/rlhf/gkd.sh
    b. Reward model training now supports the margin parameter. Documentation: https://swift.readthedocs.io/zh-cn/latest/Instruction/%E4%BA%BA%E7%B1%BB%E5%AF%B9%E9%BD%90.html#rm
  5. Full Pipeline:
    a. SGLang inference engine can be used to accelerate ms-swift inference/deployment/evaluation/ui modules, by setting --infer_backend sglang. Inference script reference: https://github.com/modelscope/ms-swift/tree/main/examples/infer/sglang
    b. FP8 quantization supported. Quantization script reference: https://github.com/modelscope/ms-swift/blob/main/examples/export/quantize/fp8.sh
  6. Web-UI:
    a. SFT, RLHF, and GRPO training now live on separate tabs, and the training command line can be saved.
    b. Web-UI interface supports data sampling.
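
As an illustration of item 2d above (trainer_state passed to reward_funcs), the sketch below shows a reward function whose length penalty ramps up as training progresses. It assumes that trainer_state arrives as a keyword argument and exposes global_step and max_steps, as transformers.TrainerState does; the function name, the target_len parameter, and the stand-in state class are hypothetical and exist only to keep the example self-contained.

```python
from dataclasses import dataclass


@dataclass
class FakeTrainerState:
    """Stand-in for the trainer state; only the two fields used here."""
    global_step: int
    max_steps: int


def length_reward(completions, trainer_state=None, target_len=200, **kwargs):
    """Penalize completions longer than target_len, more strongly later in training."""
    progress = 0.0
    if trainer_state is not None and trainer_state.max_steps > 0:
        # Fraction of training completed, clamped to [0, 1].
        progress = min(1.0, trainer_state.global_step / trainer_state.max_steps)
    rewards = []
    for text in completions:
        overflow = max(0, len(text) - target_len)
        rewards.append(-progress * overflow / target_len)
    return rewards


# Halfway through training, a completion twice the target length is penalized.
state = FakeTrainerState(global_step=50, max_steps=100)
print(length_reward(["a" * 400, "short answer"], trainer_state=state))
```

Accepting **kwargs keeps the function tolerant of any other arguments the trainer may pass alongside the completions.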

New Models

  1. Multimodal Models:
    a. ZhipuAI/GLM-4.1V-9B-Thinking series
    b. Kwai-Keye/Keye-VL-8B-Preview
    c. moonshotai/Kimi-VL-A3B-Thinking-2506
    d. google/gemma-3n-E2B-it series
  2. Pure Text Models:
    a. PaddlePaddle/ERNIE-4.5-21B-A3B-PT series
    b. rednote-hilab/dots.llm1.inst series
    c. Tencent-Hunyuan/Hunyuan-A13B-Instruct
    d. MiniMax/MiniMax-M1-80k series (inference)
    e. moonshotai/Kimi-Dev-72B
    f. cognitivecomputations/DeepSeek-R1-0528-AWQ


Patch release v3.5.3

27 Jun 05:12