v3.5.0

@Jintao-Huang released this 08 Jun 16:51 · 637 commits to main since this release

New Features

  1. GRPO:
    a. Refactored the GRPO code; the rollout backend is now selected via the vllm_mode parameter. For details, refer to the documentation: https://swift.readthedocs.io/en/latest/Instruction/GRPO.html#arguments-and-execution-script:~:text=vllm_mode%20server%20parameter,in%20colocate%20mode.
    b. GRPO long-text optimization with Ulysses sequence parallelism, significantly reducing GPU memory usage during long-text training. Training script: https://github.com/modelscope/ms-swift/blob/main/examples/train/long_text/sequence_parallel_grpo.sh
    c. Added sync_ref_model parameter to synchronize reference model weights during training.
    d. Supports Liger Kernel Loss via use_liger_kernel parameter, reducing GPU memory consumption.
    e. External mode supports move_model_batches to lower peak GPU memory during ZeRO-3 weight synchronization.
    f. Integrated INTELLECT-2’s Two-Sided Clipping algorithm using the delta parameter.
    g. Supports reward functions returning None, applicable for multi-task training. For details, refer to the documentation: https://swift.readthedocs.io/en/latest/Instruction/GRPO.html#multi-task-training
    h. Internal mode supports vllm_server_base_url for passing external vLLM server URLs.
    i. Plugin extension: Added QwenLong-L1 reward model plugin.
    j. Added steps_per_generation and generation_batch_size parameters for customizing sampling batch size.
    k. Web-UI supports GRPO training.
    l. The following parameters will be removed in v3.6: tensor_parallel_size, vllm_device, vllm_max_num_seqs, num_infer_workers.
  2. Training:
    a. CPT/SFT/DPO/GRPO support padding-free training. By flattening batch data to avoid padding, GPU memory usage is reduced and training speed is improved. Script: https://github.com/modelscope/ms-swift/tree/main/examples/train/padding_free
    b. Multimodal training enhancements: Supports separate learning rates for ViT and Aligner modules via vit_lr and aligner_lr parameters. Added vit_gradient_checkpointing to independently control gradient checkpointing for ViT modules. Benchmark: https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/vit_gradient_checkpointing.sh
    c. CPT/SFT support channel_loss, which tracks loss separately for each channel in the dataset. Thanks to the China Merchants Bank technical team for this contribution.
    d. CPT/SFT/DPO support use_logits_to_keep to reduce GPU memory usage and accelerate training.
    e. Qwen2.5-VL/Omni support video training with frames provided as an image directory.
  3. Inference & Deployment:
    a. Optimized swift infer batch processing; the new write_batch_size parameter controls how often batched inference results are written to result_path.
    b. vLLM inference engine now defaults to V1 engine and supports hybrid Tensor Parallelism (TP) and Data Parallelism (DP). Script: https://github.com/modelscope/ms-swift/blob/main/examples/infer/vllm/dp_tp.sh
  4. Megatron-SWIFT:
    a. For non-streaming datasets, train_iters can now be computed automatically from max_epochs.
    b. Added extra_megatron_kwargs to pass Megatron parameters that ms-swift does not yet expose directly.
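To illustrate how several of the new GRPO options fit together, here is a hypothetical launch command. The flag names follow the parameters listed above, but the model, dataset, and values are placeholders, and the exact invocation may differ; treat this as a sketch and verify against the linked GRPO documentation.

```shell
# Hypothetical GRPO launch combining several of the new options above.
# Model and dataset are placeholders; verify flags against the GRPO docs.
swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset my_dataset.jsonl \
    --vllm_mode server \
    --sync_ref_model true \
    --use_liger_kernel true \
    --delta 1.5 \
    --steps_per_generation 4
```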
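The new training knobs can likewise be sketched as a single SFT invocation. Again, the model and dataset are placeholders and the flag names are taken from the notes above; the maintained script under examples/train/padding_free is the authoritative reference.

```shell
# Hypothetical multimodal SFT launch using the new padding-free and
# learning-rate options. Values are illustrative only.
swift sft \
    --model Qwen/Qwen2.5-VL-7B-Instruct \
    --dataset my_multimodal_dataset.jsonl \
    --padding_free true \
    --vit_lr 1e-5 \
    --aligner_lr 1e-4 \
    --vit_gradient_checkpointing false \
    --use_logits_to_keep true
```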
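For the inference changes, a minimal batched-inference sketch might look like the following; the model and dataset names are hypothetical, and the dp_tp.sh example linked above shows the maintained TP+DP setup.

```shell
# Hypothetical batched inference run; vLLM now defaults to the V1 engine.
# write_batch_size controls how often results are flushed to result_path.
swift infer \
    --model Qwen/Qwen2.5-7B-Instruct \
    --infer_backend vllm \
    --val_dataset my_prompts.jsonl \
    --result_path result.jsonl \
    --write_batch_size 256
```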
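The Megatron-SWIFT additions can be sketched the same way. The command shape, the JSON form of extra_megatron_kwargs, and the forwarded key are assumptions for illustration, not confirmed by these notes.

```shell
# Hypothetical Megatron-SWIFT pretraining run: max_epochs derives train_iters
# for a non-streaming dataset, and extra_megatron_kwargs forwards a Megatron
# flag that ms-swift does not expose directly (the key shown is illustrative).
megatron pt \
    --model Qwen/Qwen2.5-7B \
    --dataset my_corpus.jsonl \
    --max_epochs 2 \
    --extra_megatron_kwargs '{"overlap_grad_reduce": true}'
```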

New Models

  1. Qwen/Qwen3-Embedding-0.6B series. Training script reference: https://github.com/modelscope/ms-swift/blob/main/examples/train/embedding/train_emb.sh
  2. deepseek-ai/DeepSeek-R1-0528-Qwen3-8B series. Best practices: https://mp.weixin.qq.com/s/-hhfGiiGTqXUybwPH525gw
  3. iic/QwenLong-L1-32B
  4. XiaomiMiMo/MiMo-7B-RL-0530 & XiaomiMiMo/MiMo-VL-7B-SFT series
  5. OpenBMB/MiniCPM4-0.5B series

Full Changelog: v3.4.1...v3.5.0