diff --git a/docs/source/BestPractices/NPU支持.md b/docs/source/BestPractices/NPU支持.md
Lines changed: 4 additions & 0 deletions
diff --git a/docs/source/BestPractices/Qwen3最佳实践.md b/docs/source/BestPractices/Qwen3最佳实践.md
Lines changed: 4 additions & 2 deletions
diff --git a/docs/source/BestPractices/快速训练VL模型.md b/docs/source/BestPractices/快速训练VL模型.md
Lines changed: 2 additions & 0 deletions
diff --git a/docs/source/GetStarted/SWIFT安装.md b/docs/source/GetStarted/SWIFT安装.md
Lines changed: 4 additions & 4 deletions
diff --git a/docs/source/Instruction/GRPO/GetStarted/GRPO.md b/docs/source/Instruction/GRPO/GetStarted/GRPO.md
Lines changed: 1 addition & 1 deletion
diff --git a/docs/source/Instruction/Megatron-SWIFT训练.md b/docs/source/Instruction/Megatron-SWIFT训练.md
Lines changed: 7 additions & 4 deletions
diff --git a/docs/source/Instruction/命令行参数.md b/docs/source/Instruction/命令行参数.md
Lines changed: 5 additions & 3 deletions
diff --git a/docs/source/Instruction/评测.md b/docs/source/Instruction/评测.md
Lines changed: 4 additions & 4 deletions
diff --git a/docs/source_en/BestPractices/NPU-support.md b/docs/source_en/BestPractices/NPU-support.md
Lines changed: 4 additions & 0 deletions
diff --git a/docs/source_en/BestPractices/Qwen3-Best-Practice.md b/docs/source_en/BestPractices/Qwen3-Best-Practice.md
Lines changed: 4 additions & 2 deletions
@@ -115,6 +115,7 @@ ASCEND_RT_VISIBLE_DEVICES=0 \
 swift sft \
     --model Qwen/Qwen2-7B-Instruct \
     --dataset AI-ModelScope/blossom-math-v2 \
+    --split_dataset_ratio 0.01 \
     --num_train_epochs 5 \
     --train_type lora \
     --output_dir output \
@@ -138,6 +139,7 @@ ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 \
 swift sft \
     --model Qwen/Qwen2-7B-Instruct \
     --dataset AI-ModelScope/blossom-math-v2 \
+    --split_dataset_ratio 0.01 \
     --num_train_epochs 5 \
     --train_type lora \
     --output_dir output \
@@ -157,6 +159,7 @@ ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 \
 swift sft \
     --model Qwen/Qwen2-7B-Instruct \
     --dataset AI-ModelScope/blossom-math-v2 \
+    --split_dataset_ratio 0.01 \
     --num_train_epochs 5 \
     --train_type lora \
     --output_dir output \
@@ -174,6 +177,7 @@ ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 \
 swift sft \
     --model Qwen/Qwen2-7B-Instruct \
     --dataset AI-ModelScope/blossom-math-v2 \
+    --split_dataset_ratio 0.01 \
     --num_train_epochs 5 \
     --train_type lora \
     --output_dir output \
 
@@ -222,6 +222,7 @@ swift sft \
     --model Qwen/Qwen3-8B \
     --train_type full \
     --dataset '<your-dataset>' \
+    --split_dataset_ratio 0.01 \
     --torch_dtype bfloat16 \
     --per_device_train_batch_size 1 \
     --per_device_eval_batch_size 1 \
@@ -280,7 +281,7 @@ pip install vllm==0.8.5.post1
 
 We use AI-MO/NuminaMath-TIR as the dataset and use the accuracy function to compute an accuracy reward for the model's responses.
 
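-During training, vLLM is used to accelerate sampling. By setting `num_infer_workers=8`, we deploy one vLLM engine per device to speed up sampling.
+During training, vLLM is used to accelerate sampling.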
 
 ```bash
 # 70G*8
@@ -313,7 +314,6 @@ swift rlhf \
     --offload_model true \
     --offload_optimizer true \
     --deepspeed zero3 \
-    --num_infer_workers 8 \
     --tensor_parallel_size 1 \
     --temperature 1.0 \
     --top_p 0.85 \
@@ -332,11 +332,13 @@ ms-swift 引入了 Megatron 并行技术以加速大模型的CPT/SFT/DPO。支
 ```bash
 # https://help.aliyun.com/zh/pai/user-guide/general-environment-variables
 # Ensure that the weight save path `--save` and the packing cache path `--packing_cache` are identical and shared across both nodes.
+PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
 NNODES=$WORLD_SIZE \
 NODE_RANK=$RANK \
 megatron sft \
     --load Qwen3-30B-A3B-Base-mcore \
     --dataset 'liucong/Chinese-DeepSeek-R1-Distill-data-110k-SFT' \
+    --split_dataset_ratio 0.01 \
     --tensor_model_parallel_size 2 \
     --expert_model_parallel_size 8 \
     --moe_grouped_gemm true \
 
@@ -114,6 +114,7 @@ swift sft \
     --model_type qwen2_5_vl \
     --train_type full \
     --dataset xxx  \
+    --split_dataset_ratio 0.01 \
     --torch_dtype bfloat16 \
     --attn_impl flash_attn \
     --freeze_vit true \
@@ -149,6 +150,7 @@ swift sft \
     --model_type qwen2_5_vl \
     --train_type full \
     --dataset xxx \
+    --split_dataset_ratio 0.01 \
     --torch_dtype bfloat16 \
     --attn_impl flash_attn \
     --freeze_vit false \
 
@@ -38,10 +38,10 @@ pip install ms-swift==2.*
 ## Images
 
 ```
-# swift3.5.1
-modelscope-registry.cn-hangzhou.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.4.0-py311-torch2.6.0-vllm0.8.5.post1-modelscope1.27.0-swift3.5.1
-modelscope-registry.cn-beijing.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.4.0-py311-torch2.6.0-vllm0.8.5.post1-modelscope1.27.0-swift3.5.1
-modelscope-registry.us-west-1.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.4.0-py311-torch2.6.0-vllm0.8.5.post1-modelscope1.27.0-swift3.5.1
+# swift3.5.3
+modelscope-registry.cn-hangzhou.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.4.0-py310-torch2.6.0-vllm0.8.5.post1-modelscope1.27.1-swift3.5.3
+modelscope-registry.cn-beijing.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.4.0-py310-torch2.6.0-vllm0.8.5.post1-modelscope1.27.1-swift3.5.3
+modelscope-registry.us-west-1.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.4.0-py310-torch2.6.0-vllm0.8.5.post1-modelscope1.27.1-swift3.5.3
 
 # swift3.4.1.post1
 modelscope-registry.cn-hangzhou.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.4.0-py311-torch2.6.0-vllm0.8.5.post1-modelscope1.26.0-swift3.4.1.post1
 
@@ -280,7 +280,7 @@ $
 
 **6. Why is there still a validation pass when val_dataset is not set, and how do I disable it?**
 
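-When `val_dataset` is not passed explicitly, the `split_dataset_ratio` parameter splits off part of `dataset` as the validation set; by default 1% of the data is split off.
+When `val_dataset` is not passed explicitly, the `split_dataset_ratio` parameter splits off part of `dataset` as the validation set; by default 1% of the data is split off (in "ms-swift>=3.6", the default value of `split_dataset_ratio` will change from 0.01 to 0.).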
 
 Set `--split_dataset_ratio 0` to disable the validation pass.
 
 
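The effect of `split_dataset_ratio` described in this FAQ entry can be sketched in a few lines of Python. This is only an illustration of the documented behaviour, not ms-swift's implementation: the helper name `split_train_val`, its signature, and the shuffle-then-slice strategy are all assumptions made for the example.

```python
import random


def split_train_val(samples, split_dataset_ratio=0.0, seed=42):
    """Hypothetical helper: hold out a fraction of `samples` for validation.

    Illustrates the documented behaviour only; it is not ms-swift's code.
    """
    if not 0.0 <= split_dataset_ratio < 1.0:
        raise ValueError("split_dataset_ratio must be in [0, 1)")
    samples = list(samples)
    val_size = int(len(samples) * split_dataset_ratio)
    if val_size == 0:
        # Ratio 0 (or too few samples): nothing is held out, so no eval pass.
        return samples, []
    # Shuffle deterministically, then slice off the validation portion.
    random.Random(seed).shuffle(samples)
    return samples[val_size:], samples[:val_size]


train, val = split_train_val(range(1000), split_dataset_ratio=0.01)
print(len(train), len(val))  # 990 10
```

With the ratio left at 0 the second list is empty, which matches the documented way to switch the validation pass off.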
@@ -26,9 +26,9 @@ pip install git+https://github.com/NVIDIA/Megatron-LM.git@core_r0.12.0
 
 Alternatively, you can use the image:
 ```
-modelscope-registry.cn-hangzhou.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.4.0-py311-torch2.6.0-vllm0.8.5.post1-modelscope1.27.0-swift3.5.1
-modelscope-registry.cn-beijing.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.4.0-py311-torch2.6.0-vllm0.8.5.post1-modelscope1.27.0-swift3.5.1
-modelscope-registry.us-west-1.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.4.0-py311-torch2.6.0-vllm0.8.5.post1-modelscope1.27.0-swift3.5.1
+modelscope-registry.cn-hangzhou.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.4.0-py310-torch2.6.0-vllm0.8.5.post1-modelscope1.27.1-swift3.5.3
+modelscope-registry.cn-beijing.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.4.0-py310-torch2.6.0-vllm0.8.5.post1-modelscope1.27.1-swift3.5.3
+modelscope-registry.us-west-1.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.4.0-py310-torch2.6.0-vllm0.8.5.post1-modelscope1.27.1-swift3.5.3
 ```
 
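 The training module of the dependency Megatron-LM will be git-cloned and installed by swift. You can also point the environment variable `MEGATRON_LM_PATH` at an already downloaded copy of the repo (for offline environments; [core_r0.12.0 branch](https://github.com/NVIDIA/Megatron-LM/tree/core_r0.12.0)).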
@@ -52,6 +52,7 @@ swift export \
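 Then train with the following script; training requires 2*80GiB of GPU memory:
 - For multi-node training, a shared disk is recommended, with `--save` set to the same path.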
 ```shell
+PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
 NPROC_PER_NODE=2 \
 CUDA_VISIBLE_DEVICES=0,1 \
 megatron sft \
@@ -212,6 +213,8 @@ I am a language model developed by swift, you can call me swift-robot. How can I
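 - 🔥no_load_optim: Do not load the optimizer. Defaults to False.
 - 🔥no_load_rng: Do not load the RNG state. Defaults to False.
 - 🔥finetune: Load the model and fine-tune it. The optimizer and random seed state of the checkpoint are not loaded, and the iteration count is set to 0. Defaults to False.
+  - Note: when resuming from a checkpoint with `--load`, setting `--finetune true` means the dataset will not be skipped; if it is not set, the previously trained samples will be skipped.
+  - Streaming datasets (`--streaming`) do not yet support skipping the dataset.
 - ckpt_format: Checkpoint format. Options are 'torch', 'torch_dist', 'zarr'. Defaults to 'torch_dist'.
 - no_initialization: Do not initialize the weights. Defaults to True.
 - auto_detect_ckpt_format: Automatically detect whether the ckpt format is legacy or distributed. Defaults to True.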
@@ -249,7 +252,7 @@ I am a language model developed by swift, you can call me swift-robot. How can I
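 **Evaluation parameters**:
 - 🔥eval_iters: Number of evaluation iterations. Defaults to -1, in which case a suitable value is set based on the size of the validation dataset.
   - Note: with a streaming dataset, this value must be set manually.
-- 🔥eval_interval: Evaluation interval (steps). Defaults to None, i.e. set to save_interval.
+- 🔥eval_interval: Evaluation interval (steps), i.e. how many training steps pass between evaluations. Defaults to None, i.e. set to save_interval.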
 
 **FP8 parameters**:
 - fp8_format: FP8 format scheme for FP8 tensors in the forward and backward passes. Options are 'e4m3', 'hybrid'. Defaults to None.
 
@@ -46,7 +46,8 @@
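   - Sub-datasets: This parameter only takes effect when dataset is an ID or a folder. If subsets were specified at registration and there is only one sub-dataset, the registered sub-dataset is selected by default; otherwise the default is 'default'. You can use `/` to select multiple sub-datasets, e.g. `<dataset_id>:subset1/subset2`. You can also use 'all' to select all sub-datasets, e.g. `<dataset_id>:all`.
   - Sample count: The full dataset is used by default. If the sample count is smaller than the total number of samples, samples are selected randomly (without replacement). If the sample count is larger than the total, every sample is repeated `sample_count // total_samples` times and only `sample_count % total_samples` extra samples are drawn at random. Note: streaming datasets are sampled sequentially only. With `--dataset_shuffle false`, non-streaming datasets are also sampled sequentially.
 - 🔥val_dataset: List of validation set IDs or paths. Defaults to `[]`.
-- 🔥split_dataset_ratio: How to split the training and validation sets when val_dataset is not specified. Defaults to 0.01. Set it to 0 if no validation split is needed.
+- 🔥split_dataset_ratio: Proportion of the training set to split off as the validation set when val_dataset is not specified. Defaults to 0., i.e. no validation set is split from the training set.
+  - Note: in "ms-swift<3.6" this parameter defaults to 0.01.
 - data_seed: Random seed for the dataset. Defaults to 42.
 - 🔥dataset_num_proc: Number of processes for dataset preprocessing. Defaults to 1.
 - 🔥load_from_cache_file: Whether to load the dataset from cache. Defaults to True.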
@@ -167,6 +168,7 @@
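 - 🔥save_strategy: Strategy for saving the model. Options are 'no', 'steps', 'epoch'. Defaults to 'steps'.
 - 🔥save_steps: Defaults to 500.
 - 🔥eval_strategy: Evaluation strategy. Defaults to None, following the `save_strategy` setting.
+  - If neither `val_dataset` nor `eval_dataset` is used and `split_dataset_ratio` is 0, it defaults to 'no'.
 - 🔥eval_steps: Defaults to None; if an evaluation dataset exists, it follows the `save_steps` setting.
 - 🔥save_total_limit: Maximum number of checkpoints to keep; expired checkpoints are deleted. Defaults to None, which keeps all checkpoints.
 - max_steps: Maximum number of training steps. Must be set when the dataset is streaming. Defaults to -1.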
@@ -388,8 +390,8 @@ Vera使用`target_modules`, `target_regex`, `modules_to_save`三个参数.
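 - optimizer: Name of the custom optimizer from the plugin. Defaults to None. See [here](https://github.com/modelscope/ms-swift/blob/main/swift/plugin/optimizer.py) for the available optimizers.
 - metric: Name of the custom metric from the plugin. Defaults to None, i.e. set to 'acc' when predict_with_generate=False and to 'nlg' when predict_with_generate=True.
 - eval_use_evalscope: Whether to use evalscope for evaluation during training. This parameter must be set to enable evaluation; see the [example](../Instruction/评测.md#训练中评测) for usage.
-- eval_datasets: Evaluation datasets. Multiple datasets can be set, separated by spaces.
-- eval_datasets_args: Evaluation dataset arguments in JSON format; arguments for multiple datasets can be set.
+- eval_dataset: Evaluation datasets. Multiple datasets can be set, separated by spaces.
+- eval_dataset_args: Evaluation dataset arguments in JSON format; arguments for multiple datasets can be set.
 - eval_limit: Number of samples drawn from the evaluation dataset.
 - eval_generation_config: Model inference configuration used during evaluation, in JSON format. Defaults to `{'max_tokens': 512}`.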
 
 
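The sample-count rule in the dataset parameter docs above (random selection without replacement when the count is below the dataset size, whole-dataset repeats plus a random remainder when it is above) can be written out as a small Python sketch. The function `sample_dataset` and its arguments are hypothetical names chosen for illustration; how ms-swift actually orders or shuffles the result is not stated in the docs and is not modelled here.

```python
import random


def sample_dataset(samples, num_samples, seed=42):
    """Hypothetical helper mirroring the documented sampling rule."""
    samples = list(samples)
    total = len(samples)
    if total == 0 or num_samples <= 0:
        return []
    rng = random.Random(seed)
    # repeats = num_samples // total, remainder = num_samples % total
    repeats, remainder = divmod(num_samples, total)
    # Every sample appears `repeats` times (0 when num_samples < total) ...
    picked = samples * repeats
    # ... plus `remainder` extra samples drawn without replacement.
    picked += rng.sample(samples, remainder)
    return picked


picked = sample_dataset(range(10), 25)
print(len(picked))  # 25
```

For a dataset of 10 items and a sample count of 25, each item appears twice and 5 distinct items appear a third time.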
@@ -131,17 +131,17 @@ swift sft \
   --eval_steps "5" \
   --per_device_eval_batch_size "5" \
   --eval_use_evalscope \
-  --eval_datasets "gsm8k" \
-  --eval_datasets_args '{"gsm8k": {"few_shot_num": 0}}' \
+  --eval_dataset "gsm8k" \
+  --eval_dataset_args '{"gsm8k": {"few_shot_num": 0}}' \
   --eval_limit "10"
 ```
 
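 Note that the launch command is `sft`. The eval-related parameters are:
 - eval_strategy: Evaluation strategy. Defaults to None, following the `save_strategy` setting
 - eval_steps: Defaults to None; if an evaluation dataset exists, it follows the `save_steps` setting
 - eval_use_evalscope: Whether to use evalscope for evaluation; this parameter must be set to enable evaluation
-- eval_datasets: Evaluation datasets. Multiple datasets can be set, separated by spaces
-- eval_datasets_args: Evaluation dataset arguments in JSON format; arguments for multiple datasets can be set
+- eval_dataset: Evaluation datasets. Multiple datasets can be set, separated by spaces
+- eval_dataset_args: Evaluation dataset arguments in JSON format; arguments for multiple datasets can be set
 - eval_limit: Number of samples drawn from the evaluation dataset
 - eval_generation_config: Model inference configuration used during evaluation, in JSON format. Defaults to `{'max_tokens': 512}`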
 
 
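The defaulting rules for `eval_strategy` and `eval_steps` touched by this PR interact with whether any evaluation data exists at all. A minimal Python sketch of that resolution follows. The function `resolve_eval_settings` and the `has_eval_data` flag are assumptions introduced for the example; `has_eval_data` stands for "a `val_dataset` or `eval_dataset` was given, or `split_dataset_ratio` is greater than 0". It is not the library's actual code path.

```python
def resolve_eval_settings(save_strategy="steps", save_steps=500,
                          eval_strategy=None, eval_steps=None,
                          has_eval_data=False):
    """Hypothetical resolution of the documented eval defaults."""
    if eval_strategy is None:
        # Follow save_strategy, unless there is nothing to evaluate on.
        eval_strategy = save_strategy if has_eval_data else "no"
    if eval_steps is None and has_eval_data:
        # With evaluation data present, eval_steps follows save_steps.
        eval_steps = save_steps
    return eval_strategy, eval_steps


print(resolve_eval_settings())                    # ('no', None)
print(resolve_eval_settings(has_eval_data=True))  # ('steps', 500)
```

Explicit values always win in this sketch, so passing `eval_steps=5` together with evaluation data keeps 5 rather than inheriting `save_steps`.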
@@ -115,6 +115,7 @@ ASCEND_RT_VISIBLE_DEVICES=0 \
 swift sft \
     --model Qwen/Qwen2-7B-Instruct \
     --dataset AI-ModelScope/blossom-math-v2 \
+    --split_dataset_ratio 0.01 \
     --num_train_epochs 5 \
     --train_type lora \
     --output_dir output \
@@ -136,6 +137,7 @@ ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 \
 swift sft \
     --model Qwen/Qwen2-7B-Instruct \
     --dataset AI-ModelScope/blossom-math-v2 \
+    --split_dataset_ratio 0.01 \
     --num_train_epochs 5 \
     --train_type lora \
     --output_dir output \
@@ -154,6 +156,7 @@ ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 \
 swift sft \
     --model Qwen/Qwen2-7B-Instruct \
     --dataset AI-ModelScope/blossom-math-v2 \
+    --split_dataset_ratio 0.01 \
     --num_train_epochs 5 \
     --train_type lora \
     --output_dir output \
@@ -171,6 +174,7 @@ ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 \
 swift sft \
     --model Qwen/Qwen2-7B-Instruct \
     --dataset AI-ModelScope/blossom-math-v2 \
+    --split_dataset_ratio 0.01 \
     --num_train_epochs 5 \
     --train_type lora \
     --output_dir output \
 
@@ -225,6 +225,7 @@ swift sft \
     --model Qwen/Qwen3-8B \
     --train_type full \
     --dataset '<your-dataset>' \
+    --split_dataset_ratio 0.01 \
     --torch_dtype bfloat16 \
     --per_device_train_batch_size 1 \
     --per_device_eval_batch_size 1 \
@@ -284,7 +285,7 @@ Notes on dataset requirements:
 
 We use AI-MO/NuminaMath-TIR as the dataset and compute the accuracy-based reward for model responses.
 
-During training, we utilize vLLM to accelerate the sampling process. By setting `num_infer_workers=8`, we deploy one vLLM engine per device to speed up sampling.
+During training, we utilize vLLM to accelerate the sampling process.
 
 ```bash
 # 70G*8
@@ -317,7 +318,6 @@ swift rlhf \
     --offload_model true \
     --offload_optimizer true \
     --deepspeed zero3 \
-    --num_infer_workers 8 \
     --tensor_parallel_size 1 \
     --temperature 1.0 \
     --top_p 0.85 \
@@ -336,11 +336,13 @@ We will use Alibaba Cloud DLC to launch training. The training environment consi
 ```bash
 # https://help.aliyun.com/zh/pai/user-guide/general-environment-variables
 # Ensure that the weight save path `--save` and packing cache path `--packing_cache` are the same and shared across both nodes.
+PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
 NNODES=$WORLD_SIZE \
 NODE_RANK=$RANK \
 megatron sft \
     --load Qwen3-30B-A3B-Base-mcore \
     --dataset 'liucong/Chinese-DeepSeek-R1-Distill-data-110k-SFT' \
+    --split_dataset_ratio 0.01 \
     --tensor_model_parallel_size 2 \
     --expert_model_parallel_size 8 \
     --moe_grouped_gemm true \