v3.6.0
English Version
New Features
- Megatron-SWIFT:
a. Support for more MoE model architectures, including: DeepseekV3ForCausalLM, Dots1ForCausalLM, and Ernie4_5_MoeForCausalLM. Training script reference: https://github.com/modelscope/ms-swift/tree/main/examples/train/megatron/moe
b. Support for more Dense model architectures, including: MiMoForCausalLM, InternLM3ForCausalLM, and Ernie4_5_ForCausalLM. Training script reference: https://github.com/modelscope/ms-swift/tree/main/examples/train/megatron/dense
c. DPO training supported. Training script reference: https://github.com/modelscope/ms-swift/tree/main/examples/train/megatron/rlhf/dpo
d. FP8 training supported.
e. More rope scaling types supported, including: default, linear, yarn, dynamic, longrope, llama3, etc.
f. --test_convert_precision parameter optimized for easier testing of weight conversion precision between mcore and huggingface models.
- GRPO:
a. GRPO multi-turn training refactored, supporting accelerated multi-turn inference with AsyncEngine. Documentation: https://swift.readthedocs.io/zh-cn/latest/Instruction/GRPO/DeveloperGuide/%E5%A4%9A%E8%BD%AE%E8%AE%AD%E7%BB%83.html
b. The offload_model parameter now also offloads the reference model.
c. Optimized GPU memory management under sleep_level and offload_model parameters.
d. Added trainer_state as an input parameter to reward_funcs, making it easier to obtain the current and total training steps.
- Training:
a. Reranker training supported. Training script reference: https://github.com/modelscope/ms-swift/tree/main/examples/train/reranker
b. CPT/SFT/DPO/GRPO pure-text large model training supports ring-attention sequence length partitioning, reducing memory usage. Training script reference: https://github.com/modelscope/ms-swift/tree/main/examples/train/long_text/ring_attention
c. Channel loss in CPT/SFT training is compatible with padding_free and packing. Thanks to the technical team at China Merchants Bank for their contribution.
d. Optimized remove_unused_columns parameter. When set to False, extra dataset columns are passed to the Trainer for custom loss functions.
e. The default value for split_dataset_ratio changed from 0.01 to 0, so the validation set is not split by default. You now need to manually set --split_dataset_ratio or --val_dataset.
f. Fixed loss alignment issue between packing/padding_free for multimodal models. For details, see this PR: #4838
g. Swanlab now supports Feishu (Lark Suite) notification callback after training is completed.
- RLHF:
a. Pure-text and multimodal models support GKD training, with some scenarios supporting padding_free and packing. Training scripts:
i. Large models: https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/gkd.sh
ii. Multimodal large models: https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/rlhf/gkd.sh
b. Reward model training now supports the margin parameter. Documentation: https://swift.readthedocs.io/zh-cn/latest/Instruction/%E4%BA%BA%E7%B1%BB%E5%AF%B9%E9%BD%90.html#rm
- Full Pipeline:
a. SGLang inference engine can be used to accelerate ms-swift inference/deployment/evaluation/ui modules, by setting --infer_backend sglang. Inference script reference: https://github.com/modelscope/ms-swift/tree/main/examples/infer/sglang
b. FP8 quantization supported. Quantization script reference: https://github.com/modelscope/ms-swift/blob/main/examples/export/quantize/fp8.sh
- Web-UI:
a. Supports SFT/RLHF/GRPO training on separate Tab pages, and supports saving the training command line.
b. Web-UI interface supports data sampling.
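
Because the split_dataset_ratio default changed from 0.01 to 0, runs that relied on the implicit validation split now train without one unless --split_dataset_ratio or --val_dataset is passed. The sketch below illustrates what a ratio-based split does in general. It is not ms-swift's implementation: the helper name split_by_ratio, the seed argument, and the truncating rounding rule are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_by_ratio(rows: Sequence[T], ratio: float, seed: int = 42) -> Tuple[List[T], List[T]]:
    """Hypothetical helper: hold out `ratio` of the rows as a validation set."""
    if not 0.0 <= ratio < 1.0:
        raise ValueError("ratio must be in [0, 1)")
    n_val = int(len(rows) * ratio)  # assumed rounding rule: truncate
    if n_val == 0:
        # ratio == 0 (the new default): nothing is held out.
        return list(rows), []
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    val_idx = set(indices[:n_val])
    train = [r for i, r in enumerate(rows) if i not in val_idx]
    val = [rows[i] for i in sorted(val_idx)]
    return train, val


rows = list(range(1000))
train_new, val_new = split_by_ratio(rows, 0.0)   # new default: no validation set
train_old, val_old = split_by_ratio(rows, 0.01)  # previous default: 1% held out
print(len(train_new), len(val_new))
print(len(train_old), len(val_old))
```

With 1,000 rows, the old default would have held out 10 rows for validation, while the new default holds out none, so evaluation metrics disappear from such runs unless one of the two options is set explicitly.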
New Models
- Multimodal Models:
a. ZhipuAI/GLM-4.1V-9B-Thinking series
b. Kwai-Keye/Keye-VL-8B-Preview
c. moonshotai/Kimi-VL-A3B-Thinking-2506
d. google/gemma-3n-E2B-it series
- Pure Text Models:
a. PaddlePaddle/ERNIE-4.5-21B-A3B-PT series
b. rednote-hilab/dots.llm1.inst series
c. Tencent-Hunyuan/Hunyuan-A13B-Instruct
d. MiniMax/MiniMax-M1-80k series (inference)
e. moonshotai/Kimi-Dev-72B
f. cognitivecomputations/DeepSeek-R1-0528-AWQ
What's Changed
- fix emb script and docs by @tastelikefeet in #4521
- [grpo] update doc about move_model_batches by @hjh0119 in #4523
- fix LoraModel by @Jintao-Huang in #4536
- support cognitivecomputations/DeepSeek-R1-0528-AWQ by @Jintao-Huang in #4537
- fix: handle INFONCE_HARD_NEGATIVES as integer if provided by @dlutwy in #4545
- fix qwen3 embedding saving by @tastelikefeet in #4548
- [megatron/dpo] fix megatron packing_cache & update DPOTrainer by @Jintao-Huang in #4556
- [megatron] support DPO by @Jintao-Huang in #4193
- support dots1 by @Jintao-Huang in #4560
- [grpo] support offloading reference model by @hjh0119 in #4554
- [grpo] fix the pickle data collator by @hjh0119 in #4562
- [dataset] fix toolbench (local) by @Jintao-Huang in #4563
- [Bug]Fix ulysses train steps, embedding negative sample length by @tastelikefeet in #4565
- fix args.json by @Jintao-Huang in #4566
- [model] fix ovis gradient_checkpointing vit no_grad by @Jintao-Huang in #4571
- [megatron] Fix megatron all_reduce warning by @Jintao-Huang in #4568
- [grpo] remove data collator to top-level to avoid pickle error in spawn mode by @hjh0119 in #4582
- [grpo] model weight synchronization before first turn rollout with async generation by @hjh0119 in #4584
- [megatron] support more rope_scaling & support deepseek-r1-qwen3-8b/internlm3/mimo-7b by @Jintao-Huang in #4576
- [grpo] restore num_generations check by @hjh0119 in #4590
- fix gc_kwargs by @Jintao-Huang in #4591
- Fix UI llm_train by @slin000111 in #4592
- [mirror] update swift mirror by @Jintao-Huang in #4601
- [megatron] compat megatron-core main branch by @Jintao-Huang in #4606
- [model] support minimax by @Jintao-Huang in #4610
- Update FAQ by @slin000111 in #4612
- [megatron] fix megatron pp max_epochs by @Jintao-Huang in #4608
- Fix minimax & fix agent_template by @Jintao-Huang in #4618
- [gkd] support gkd_trainer by @Jintao-Huang in #4587
- [docs] remove Qwen3-32B-Base by @Jintao-Huang in #4621
- [ppo] fix ppo by @Jintao-Huang in #4622
- fix max_epochs tp by @Jintao-Huang in #4624
- [loss_scale] support last_round_with_ignore_empty_think for rag by @sosofun in #4623
- [rollout] swift rollout add template by @Jintao-Huang in #4626
- [doc] LaTeX rendering by @hjh0119 in #4629
- [infer/deploy/eval/app] support sglang engine by @Jintao-Huang in #3810
- update docs & shell by @Jintao-Huang in #4637
- update docs readme by @Jintao-Huang in #4639
- [docs] update qwen3 best_practice by @Jintao-Huang in #4300
- [template] optimize get_length by @Jintao-Huang in #4641
- [model] fix model_meta by @Jintao-Huang in #4647
- fix packing & load_from_cache_file by @Jintao-Huang in #4649
- fix device_map & ddp rank0 by @Jintao-Huang in #4650
- [megatron] fix eval data_collator by @Jintao-Huang in #4654
- compat megatron-core 0.11 by @Jintao-Huang in #4655
- [docs] update gkd by @Jintao-Huang in #4657
- [gkd] support use_logits_to_keep/padding_free/packing & update gkd shell by @Jintao-Huang in #4658
- [template] optimize remove_unused_columns by @Jintao-Huang in #4661
- [grpo] refactor multi turn & support async engine & refactor grpo docs by @hjh0119 in #4380
- [dataset] fix grounding_dataset by @Jintao-Huang in #4664
- [docs] update docs by @Jintao-Huang in #4665
- [channel loss]support packing & padding free by @kevssim in #4666
- docs: correct typo "resonse" to "response" by @kv-chiu in #4672
- [doc] fix image link by @hjh0119 in #4674
- [doc] fix doc by @hjh0119 in #4675
- [rollout] fix dp args by @hjh0119 in #4678
- [grpo] fix grpo pt by @Jintao-Huang in #4683
- [feat] support fine-tuning of reranker models by @0russwest0 in #4671
- fix links by @tastelikefeet in #4690
- [megatron] support DeepseekV2ForCausalLM and DeepseekV3ForCausalLM by @Jintao-Huang in #4659
- [megatron] support rednote-hilab/dots.llm1.inst by @Jintao-Huang in #4707
- [grpo] fix colocate seed by @hjh0119 in #4712
- [doc] simplify environment variables & update best practices documentation by @0russwest0 in #4715
- support Kimi-VL-A3B-Thinking-2506 & Kimi-Dev-72B by @Jintao-Huang in #4719
- [quant] Support fp8 by @Jintao-Huang in #4729
- [grpo] fix max_step for dataloader when applying sequence parallel by @0russwest0 in #4731
- [grpo] check liger & sp by @hjh0119 in #4734
- compat transformers==4.52 (vlm) by @Jintao-Huang in #4738
- [grpo]Tool rl: add reward func for ToolRL by @tpx818 in #4694
- [model] support Tencent-Hunyuan/Hunyuan-A13B-Instruct by @Jintao-Huang in #4745
- [megatron] support fp8 by @Jintao-Huang in #4730
- fix remove_unused_columns by @Jintao-Huang in #4749
- [model] support ERNIE-4.5 by @Jintao-Huang in #4757
- update wechat by @Jintao-Huang in #4769
- update megatron shell by @Jintao-Huang in #4773
- [grpo] update vllm weight sync & wake up by @hjh0119 in #4770
- [docs] fix grpo docs by @hjh0119 in #4777
- [grpo] pass trainer state to reward funcs by @hjh0119 in #4779
- [grpo] check eval_dataset length by @hjh0119 in #4781
- Fix media downloading from hf by @tastelikefeet in #4788
- update resume from checkpoint & update timeout by @Jintao-Huang in #4774
- update custom_dataset_docs by @Jintao-Huang in #4792
- fix template bug for qwen3 reranker by @0russwest0 in #4795
- [model] support GLM4.1V by @hjh0119 in #4804
- [train] Update split_dataset_ratio by @Jintao-Huang in #4798
- Refactor Web-UI by @slin000111 in #4687
- Support ring attention for llm sft/dpo/grpo (packing/padding_free only). by @0russwest0 in #4814
- [RM] support margin & update doc by @hjh0119 in #4817
- [GITHUB WORKFLOW]add close stale issues workflow by @tastelikefeet in #4816
- [rollout] fix external plugins by @hjh0119 in #4822
- [rollout] Fix non-serializable torch.dtype bug in VLLM weight sync by @hjh0119 in #4825
- [rollout] fix request from dict by @hjh0119 in #4826
- [grpo] fix apply_chat_template by @hjh0119 in #4827
- Support gemma3n by @0russwest0 in #4836
- [train] fix multimodal packing & padding_free by @Jintao-Huang in #4838
- fix multimodal padding_free prediction_step by @Jintao-Huang in #4839
- [Feature] SwanLab Lark callback by @dykderrick in #4830
- update stream & fix bugs by @Jintao-Huang in #4842
- [megatron] Fix the display issue for train_type=lora by @Jintao-Huang in #4845
- fix bug: grpo train error for deepseek model by @aacedar in #4833
- [megatron] fix eval_iters -1 by @Jintao-Huang in #4847
- [grpo] deprecated params for 3.6 by @hjh0119 in #4848
- [grpo]Fix bug when repeatedly call inputs_to_rolloutrequest by @hrz394943230 in #4823
- [grpo] fix offpolicy check by @hjh0119 in #4852
- Fix test bug by @slin000111 in #4851
- [grpo] update doc by @hjh0119 in #4853
- [template] fix qwen3 remove '' by @Jintao-Huang in #4857
- Support Kwai-Keye/Keye-VL-8B-Preview by @0russwest0 in #4856
- [dataset] fix dataset ddp write conflict by @Jintao-Huang in #4860
- [web-ui]Modify open parameter for Accordion by @slin000111 in #4859
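
Change #4779 in the list above passes trainer_state to GRPO reward functions, which the feature notes describe as a way to obtain the current and total training steps. A minimal sketch of a step-aware reward follows. It is not ms-swift code: TrainerStateStub stands in for the real state object, the global_step and max_steps attribute names are assumed to follow the Hugging Face TrainerState convention, and the length_penalty_reward function with its schedule is hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrainerStateStub:
    """Stand-in for the trainer state; attribute names are an assumption."""
    global_step: int
    max_steps: int


def length_penalty_reward(completions: List[str], trainer_state: TrainerStateStub,
                          max_chars: int = 200, **kwargs) -> List[float]:
    """Hypothetical reward: penalize long completions more strongly as training progresses."""
    # Fraction of training completed, in [0, 1].
    progress = trainer_state.global_step / max(trainer_state.max_steps, 1)
    rewards = []
    for text in completions:
        # Relative overflow beyond max_chars, capped at 1.0 below.
        overflow = max(len(text) - max_chars, 0) / max_chars
        rewards.append(1.0 - progress * min(overflow, 1.0))
    return rewards


completions = ["short answer", "x" * 400]
early = length_penalty_reward(completions, TrainerStateStub(global_step=0, max_steps=100))
late = length_penalty_reward(completions, TrainerStateStub(global_step=100, max_steps=100))
print(early)
print(late)
```

At step 0 both completions receive the full reward, while at the final step the overlong completion is penalized; this kind of schedule is only possible when the reward function can see the training progress.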
New Contributors
- @dlutwy made their first contribution in #4545
- @sosofun made their first contribution in #4623
- @kv-chiu made their first contribution in #4672
- @0russwest0 made their first contribution in #4671
- @tpx818 made their first contribution in #4694
- @dykderrick made their first contribution in #4830
- @aacedar made their first contribution in #4833
- @hrz394943230 made their first contribution in #4823
Full Changelog: v3.5.0...v3.6.0