
Commit f958295

Merge branch 'main' into release/3.5
2 parents: c7fd1bd + 9c9e960

File tree

31 files changed: +461, -115 lines changed


docs/source/Instruction/GRPO.md

Lines changed: 65 additions & 2 deletions
@@ -221,7 +221,7 @@ A conversation between User and Assistant. The user asks a question, and the Ass
  - vllm_limit_mm_per_prompt: vLLM passthrough parameter, default is None.
  - vllm_enable_prefix_caching: vLLM passthrough parameter, default is True.
  - sleep_level: release vLLM GPU memory during training. Options are [0, 1]; default is 0 (no release).
- - move_model_batches: When moving model parameters to fast inference frameworks such as vLLM, how many batches to split the layers into. Default is None, meaning the model is not split; otherwise it is split into move_model_batches + 1 (non-layer parameters) + 1 (multimodal parameters) batches.
+ - move_model_batches: When moving model parameters to fast inference frameworks such as vLLM, how many batches to split the layers into. Default is None, meaning the model is not split; otherwise it is split into move_model_batches + 1 (non-layer parameters) + 1 (multimodal parameters) batches. Note: this parameter is only meaningful for LoRA (PEFT) training.
  - offload_optimizer: whether to offload optimizer parameters during vLLM inference, default is False.
  - offload_model: whether to offload the model itself during vLLM inference, default is False.
  - gc_collect_after_offload: whether to run garbage collection (Python GC and GPU GC) after offloading, default is False.
@@ -287,6 +287,55 @@ swift rlhf \
  1. In the GRPOTrainer, reward_model instances are appended in order to reward_funcs. Therefore, the order of reward_weights corresponds to [reward_funcs, reward_model].
  2. reward_model_plugin defaults to default, i.e., the ORM processing logic is used.
  
+ ## Multi-task Training
+ We can add a column to the dataset that identifies the task type and branch on it inside the reward function / reward model plugin, thereby enabling multi-task training. Suppose the dataset contains math and coding tasks, for example:
+ 
+ ```
+ {"query": "Solve the equation x + 2 = 5", "solution": "3", "task": "math"},
+ {"query": "Write a function to calculate the Fibonacci sequence", "solution": "xxx", "task": "code"},
+ {"query": "What is the integral of x^2?", "solution": "xxx", "task": "math"},
+ {"query": "Implement a sorting algorithm in Python", "solution": "xxx", "task": "code"},
+ ```
+ 
+ Below is an example of reward functions for the different tasks:
+ 
+ ```python
+ from swift.plugin import ORM, orms
+ import random
+ 
+ 
+ # Math-specific reward function
+ class MathRandomReward(ORM):
+ 
+     def __call__(self, completions, task, **kwargs):
+         rewards = []
+         for completion, t in zip(completions, task):
+             if t == "math":
+                 # implement the math accuracy logic here (placeholder: random reward)
+                 reward = random.random()
+                 rewards.append(reward)
+             else:
+                 # return None for non-math tasks
+                 rewards.append(None)
+         return rewards
+ 
+ 
+ # Coding-specific reward function
+ class CodeRandomReward(ORM):
+ 
+     def __call__(self, completions, task, **kwargs):
+         rewards = []
+         for completion, t in zip(completions, task):
+             if t == "code":
+                 # implement the code accuracy logic here (placeholder: random reward)
+                 reward = random.random()
+                 rewards.append(reward)
+             else:
+                 # return None for non-coding tasks
+                 rewards.append(None)
+         return rewards
+ 
+ 
+ orms['math_reward'] = MathRandomReward
+ orms['code_reward'] = CodeRandomReward
+ ```
+ Data that does not belong to the current task is handled by returning None, so each reward is computed only over the data within its own task.
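As a quick sanity check, the two reward functions above can also be called directly (during GRPO training the trainer invokes them for you). The sketch below assumes that ORM subclasses can be instantiated without arguments, and the completions are made up:

```python
# Illustrative sanity check of the task-specific rewards defined above.
completions = ["x = 3", "def fib(n): ...", "x^3/3 + C", "def bubble_sort(a): ..."]
tasks = ["math", "code", "math", "code"]

math_rewards = MathRandomReward()(completions=completions, task=tasks)
code_rewards = CodeRandomReward()(completions=completions, task=tasks)

print(math_rewards)  # e.g. [0.42, None, 0.87, None]  -> None for non-math samples
print(code_rewards)  # e.g. [None, 0.13, None, 0.55]  -> None for non-code samples
```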
  
  ## DAPO
  [Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO)](https://arxiv.org/abs/2503.14476) adds several tricks on top of GRPO, namely:
@@ -363,7 +412,21 @@ num_generations = 64
  
  **5. Why is clip_ratio always 1?**
  
- Under num_iterations = 1 and async_generate = False, training is on-policy RL, and old_policy equals policy at this point.
+ The core purpose of the clip mechanism is to limit the magnitude of policy updates and prevent a single overly large update from collapsing policy performance (i.e., a sharp drop in performance after the update).
+ The clip operation is defined as:
+ $$
+ L_{\text{CLIP}}(\theta) = \mathbb{E}_{t} \left[ \min\left(r_{t}(\theta) \hat{A}_{t}, \text{clip}(r_{t}(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_{t} \right) \right]
+ $$
+ 
+ where $r_{t}(\theta) = \frac{\pi_{\theta}(a_{t} \mid s_{t})}{\pi_{\text{old}}(a_{t} \mid s_{t})}$ is the importance sampling ratio, which measures the difference between the new and old policies; $\hat{A}_{t}$ is the advantage function, representing the relative benefit of an action; and $\epsilon$ limits how far $r_{t}(\theta)$ may deviate from 1.
+ 
+ In on-policy training, every update uses data generated by the latest policy, so the new and old policies are identical, i.e., $\pi_{\theta} = \pi_{\text{old}}$.
+ 
+ The importance sampling ratio is therefore always 1, and the clip operation does not take effect.
+ 
+ With the following settings, the algorithm becomes off-policy (near-on-policy):
+ 1. num_iterations > 1
+ 2. steps_per_generation > gradient_accumulation_steps
  
  See this [issue](https://github.com/huggingface/open-r1/issues/239#issuecomment-2646297851) for reference.

docs/source/Instruction/命令行参数.md

Lines changed: 3 additions & 1 deletion
@@ -150,6 +150,7 @@
  - 🔥report_to: Default is `tensorboard`. You can also specify `--report_to tensorboard wandb swanlab` or `--report_to all`.
  - logging_first_step: Whether to log the first step, default is True.
  - logging_steps: Logging interval, default is 5.
+ - logging_dir: Path for TensorBoard logs. Default is None, i.e., it is set to `f'{self.output_dir}/runs'`.
  - predict_with_generate: Whether to use generation during validation, default is False.
  - metric_for_best_model: Default is None, i.e., set to 'loss' when `predict_with_generate` is False, otherwise 'rouge-l' (no default is set for PPO training; GRPO training sets it to 'reward').
  - greater_is_better: Default is None, i.e., set to False when `metric_for_best_model` contains 'loss', otherwise True.
@@ -351,6 +352,7 @@ Vera uses the three parameters `target_modules`, `target_regex`, and `modules_to_save`.
  - check_model: Check whether local model files are corrupted or modified and give a prompt, default is True. In offline environments, set this to False.
  - 🔥create_checkpoint_symlink: Create additional checkpoint symlinks to make automated training scripts easier to write. The symlink paths for best_model and last_model are f'{output_dir}/best' and f'{output_dir}/last' respectively.
  - loss_type: Loss type. Default is None, which uses the model's built-in loss function.
+ - channels: The set of channels contained in the dataset. Default is None. Used together with `--loss_type channel_loss`; see [this example](https://github.com/modelscope/ms-swift/blob/main/examples/train/plugins/channel_loss.sh).
  - 🔥packing: Whether to use sequence packing to improve computational efficiency, default is False. Currently supports `swift pt/sft`.
    - Note: when using packing, combine it with `--attn_impl flash_attn` and make sure "transformers>=4.44"; see [this PR](https://github.com/huggingface/transformers/pull/31629) for details.
    - Supported multimodal models: https://github.com/modelscope/ms-swift/blob/main/examples/train/packing/qwen2_5_vl.sh
@@ -442,7 +444,7 @@ Reward model parameters are used in PPO and GRPO.
  - vllm_limit_mm_per_prompt: vLLM passthrough parameter, default is None.
  - vllm_enable_prefix_caching: vLLM passthrough parameter, default is True.
  - sleep_level: release vLLM GPU memory during training. Options are [0, 1]; default is 0 (no release).
- - move_model_batches: When moving model parameters to fast inference frameworks such as vLLM, how many batches to split the layers into. Default is None, meaning the model is not split; otherwise it is split into move_model_batches + 1 (non-layer parameters) + 1 (multimodal parameters) batches.
+ - move_model_batches: When moving model parameters to fast inference frameworks such as vLLM, how many batches to split the layers into. Default is None, meaning the model is not split; otherwise it is split into move_model_batches + 1 (non-layer parameters) + 1 (multimodal parameters) batches. Note: this parameter is only meaningful for LoRA (PEFT) training.
  - offload_optimizer: whether to offload optimizer parameters during vLLM inference, default is False.
  - offload_model: whether to offload the model itself during vLLM inference, default is False.
  - gc_collect_after_offload: whether to run garbage collection (Python GC and GPU GC) after offloading, default is False.

docs/source/Instruction/支持的模型和数据集.md

Lines changed: 3 additions & 0 deletions
@@ -213,6 +213,9 @@
  |[Qwen/Qwen3-235B-A22B-FP8](https://modelscope.cn/models/Qwen/Qwen3-235B-A22B-FP8)|qwen3_moe|qwen3|transformers>=4.51|✘|-|[Qwen/Qwen3-235B-A22B-FP8](https://huggingface.co/Qwen/Qwen3-235B-A22B-FP8)|
  |[swift/Qwen3-30B-A3B-AWQ](https://modelscope.cn/models/swift/Qwen3-30B-A3B-AWQ)|qwen3_moe|qwen3|transformers>=4.51|✘|-|[cognitivecomputations/Qwen3-30B-A3B-AWQ](https://huggingface.co/cognitivecomputations/Qwen3-30B-A3B-AWQ)|
  |[swift/Qwen3-235B-A22B-AWQ](https://modelscope.cn/models/swift/Qwen3-235B-A22B-AWQ)|qwen3_moe|qwen3|transformers>=4.51|✘|-|[cognitivecomputations/Qwen3-235B-A22B-AWQ](https://huggingface.co/cognitivecomputations/Qwen3-235B-A22B-AWQ)|
+ |[Qwen/Qwen3-Embedding-0.6B](https://modelscope.cn/models/Qwen/Qwen3-Embedding-0.6B)|qwen3_emb|qwen3_emb|-|✘|-|[Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B)|
+ |[Qwen/Qwen3-Embedding-4B](https://modelscope.cn/models/Qwen/Qwen3-Embedding-4B)|qwen3_emb|qwen3_emb|-|✘|-|[Qwen/Qwen3-Embedding-4B](https://huggingface.co/Qwen/Qwen3-Embedding-4B)|
+ |[Qwen/Qwen3-Embedding-8B](https://modelscope.cn/models/Qwen/Qwen3-Embedding-8B)|qwen3_emb|qwen3_emb|-|✘|-|[Qwen/Qwen3-Embedding-8B](https://huggingface.co/Qwen/Qwen3-Embedding-8B)|
  |[iic/gte_Qwen2-1.5B-instruct](https://modelscope.cn/models/iic/gte_Qwen2-1.5B-instruct)|qwen2_gte|dummy|-|✘|-|[Alibaba-NLP/gte-Qwen2-1.5B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct)|
  |[iic/gte_Qwen2-7B-instruct](https://modelscope.cn/models/iic/gte_Qwen2-7B-instruct)|qwen2_gte|dummy|-|✘|-|[Alibaba-NLP/gte-Qwen2-7B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct)|
  |[codefuse-ai/CodeFuse-QWen-14B](https://modelscope.cn/models/codefuse-ai/CodeFuse-QWen-14B)|codefuse_qwen|codefuse|-|✘|coding|[codefuse-ai/CodeFuse-QWen-14B](https://huggingface.co/codefuse-ai/CodeFuse-QWen-14B)|

docs/source_en/Instruction/Command-line-parameters.md

Lines changed: 3 additions & 1 deletion
@@ -153,6 +153,7 @@ This parameter list inherits from transformers `Seq2SeqTrainingArguments`, with
  - 🔥report_to: Default value is `tensorboard`. You can also specify `--report_to tensorboard wandb swanlab` or `--report_to all`.
  - logging_first_step: Whether to log the first step, defaults to True.
  - logging_steps: Interval for logging, defaults to 5.
+ - logging_dir: The path for TensorBoard logs. Defaults to None, which means it is set to `f'{self.output_dir}/runs'`.
  - predict_with_generate: Whether to use generative method during validation, default is False.
  - metric_for_best_model: Default is None, which means that when predict_with_generate is set to False, it is set to 'loss'; otherwise, it is set to 'rouge-l' (during PPO training, the default value is not set; in GRPO training, it is set to 'reward').
  - greater_is_better: Defaults to None, which sets it to False when `metric_for_best_model` contains 'loss', otherwise sets to True.
@@ -360,6 +361,7 @@ Training arguments include the [base arguments](#base-arguments), [Seq2SeqTraine
  - check_model: Check local model files for corruption or modification and give a prompt, default is True. If in an offline environment, please set to False.
  - 🔥create_checkpoint_symlink: Creates additional checkpoint symlinks to facilitate writing automated training scripts. The symlink paths for `best_model` and `last_model` are `f'{output_dir}/best'` and `f'{output_dir}/last'` respectively.
  - loss_type: Type of loss. Defaults to None, which uses the model's built-in loss function.
+ - channels: Set of channels included in the dataset. Defaults to None. Used in conjunction with `--loss_type channel_loss`. Refer to [this example](https://github.com/modelscope/ms-swift/blob/main/examples/train/plugins/channel_loss.sh) for more details (a rough sketch also follows this list).
  - 🔥packing: Whether to use sequence packing to improve computational efficiency. The default value is False. Currently supports `swift pt/sft`.
    - Note: When using packing, please combine it with `--attn_impl flash_attn` and ensure "transformers>=4.44". For details, see [this PR](https://github.com/huggingface/transformers/pull/31629).
    - Supported multimodal models reference: https://github.com/modelscope/ms-swift/blob/main/examples/train/packing/qwen2_5_vl.sh
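A rough sketch of what a channel-tagged dataset might look like; the `channel` field name and the flag combination shown are assumptions and should be checked against the channel_loss.sh example linked above:

```python
import json

# Hypothetical channel-tagged samples in the standard messages format; the
# "channel" key is an assumption based on the channel_loss example script.
rows = [
    {"messages": [{"role": "user", "content": "hi"},
                  {"role": "assistant", "content": "hello"}], "channel": "chat"},
    {"messages": [{"role": "user", "content": "1 + 1 = ?"},
                  {"role": "assistant", "content": "2"}], "channel": "math"},
]
with open("channel_demo.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")

# Training would then combine, e.g.:  --loss_type channel_loss --channels chat math
```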
@@ -454,7 +456,7 @@ The meanings of the following parameters can be referenced [here](https://huggin
  - vllm_limit_mm_per_prompt: vLLM passthrough parameter, default is None.
  - vllm_tensor_parallel_size: the tensor parallel size of vLLM engine, default is 1.
  - sleep_level: make vLLM sleep while the model is training. Options are 0 or 1, default is 0 (no sleep).
- - move_model_batches: When moving model parameters to fast inference frameworks such as vLLM/LMDeploy, determines how many batches to divide the layers into. The default is `None`, which means the entire model is not split. Otherwise, the model is split into `move_model_batches + 1` (non-layer parameters) + `1` (multi-modal component parameters) batches.
+ - move_model_batches: When moving model parameters to fast inference frameworks such as vLLM/LMDeploy, determines how many batches to divide the layers into. The default is `None`, which means the entire model is not split. Otherwise, the model is split into `move_model_batches + 1` (non-layer parameters) + `1` (multi-modal component parameters) batches. This parameter is only meaningful for LoRA (PEFT).
  - offload_optimizer: Whether to offload optimizer parameters during inference with vLLM/LMDeploy. The default is `False`.
  - offload_model: Whether to offload the model itself during inference with vLLM/LMDeploy. The default is `False`.
  - gc_collect_after_offload: Whether to perform garbage collection (both Python GC and GPU GC) after offloading. The default is `False`.

docs/source_en/Instruction/GRPO.md

Lines changed: 70 additions & 2 deletions
@@ -231,7 +231,7 @@ Arguments
  - vllm_limit_mm_per_prompt: vLLM passthrough parameter, default is None.
  - vllm_tensor_parallel_size: the tensor parallel size of vLLM engine, default is 1.
  - sleep_level: make vLLM sleep while the model is training. Options are 0 or 1, default is 0 (no sleep).
- - move_model_batches: When moving model parameters to fast inference frameworks such as vLLM, determines how many batches to divide the layers into. The default is `None`, which means the entire model is not split. Otherwise, the model is split into `move_model_batches + 1` (non-layer parameters) + `1` (multi-modal component parameters) batches.
+ - move_model_batches: When moving model parameters to fast inference frameworks such as vLLM, determines how many batches to divide the layers into. The default is `None`, which means the entire model is not split. Otherwise, the model is split into `move_model_batches + 1` (non-layer parameters) + `1` (multi-modal component parameters) batches. This parameter is only meaningful for LoRA (PEFT) (see the sketch after this parameter list).
  - offload_optimizer: Whether to offload optimizer parameters during inference with vLLM. The default is `False`.
  - offload_model: Whether to offload the model itself during inference with vLLM. The default is `False`.
  - gc_collect_after_offload: Whether to perform garbage collection (both Python GC and GPU GC) after offloading. The default is `False`.
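For intuition about how many parameter groups move_model_batches produces, here is an illustrative sketch; the layer and parameter names are hypothetical and this is not ms-swift's actual implementation:

```python
# Hypothetical parameter names; only the grouping arithmetic matters here.
layer_params = [f"model.layers.{i}" for i in range(32)]
non_layer_params = ["model.embed_tokens", "lm_head"]
multimodal_params = ["visual.patch_embed"]

move_model_batches = 4
per_batch = (len(layer_params) + move_model_batches - 1) // move_model_batches
layer_batches = [layer_params[i:i + per_batch]
                 for i in range(0, len(layer_params), per_batch)]

# Layers are synced in `move_model_batches` chunks, plus one chunk for
# non-layer parameters and one for multimodal parameters.
batches = layer_batches + [non_layer_params, multimodal_params]
print(len(batches))  # move_model_batches + 1 + 1 = 6
```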
@@ -301,6 +301,60 @@ Notes:
  1. In the GRPOTrainer, reward_model instances are appended sequentially to reward_funcs. Therefore, the order of reward_weights corresponds to [reward_funcs, reward_model].
  2. The default value for reward_model_plugin is default, which uses the ORM processing logic.
  
+ ## Multi-task training
+ 
+ We can add a column to the dataset to identify the task type and make judgments based on the task type in the reward function/reward model plugin, thereby enabling multi-task training. Suppose the dataset contains math and programming tasks, such as:
+ ```
+ {"query": "Solve the equation x + 2 = 5", "solution": "3", "task": "math"},
+ {"query": "Write a function to calculate the Fibonacci sequence", "solution": "xxx", "task": "code"},
+ {"query": "What is the integral of x^2?", "solution": "xxx", "task": "math"},
+ {"query": "Implement a sorting algorithm in Python", "solution": "xxx", "task": "code"},
+ ```
+ 
+ Below are examples of reward functions for different tasks:
+ 
+ ```python
+ from swift.plugin import ORM, orms
+ import random
+ 
+ 
+ # Math-specific reward function
+ class MathRandomReward(ORM):
+ 
+     def __call__(self, completions, task, **kwargs):
+         rewards = []
+         for completion, t in zip(completions, task):
+             if t == "math":
+                 # implement the math accuracy logic here (placeholder: random reward)
+                 reward = random.random()
+                 rewards.append(reward)
+             else:
+                 # return None for non-math tasks
+                 rewards.append(None)
+         return rewards
+ 
+ 
+ # Coding-specific reward function
+ class CodeRandomReward(ORM):
+ 
+     def __call__(self, completions, task, **kwargs):
+         rewards = []
+         for completion, t in zip(completions, task):
+             if t == "code":
+                 # implement the code accuracy logic here (placeholder: random reward)
+                 reward = random.random()
+                 rewards.append(reward)
+             else:
+                 # return None for non-coding tasks
+                 rewards.append(None)
+         return rewards
+ 
+ 
+ orms['math_reward'] = MathRandomReward
+ orms['code_reward'] = CodeRandomReward
+ ```
+ 
+ Data that does not belong to the current task is handled by returning None, ensuring that the reward calculation only applies to data within each task.
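For intuition, here is a simplified sketch of how returning None keeps a task-specific reward out of the aggregate for other tasks' samples: missing entries are treated as NaN and skipped when the weighted rewards are summed. This is an illustration under that assumption, not the actual GRPOTrainer aggregation code:

```python
import numpy as np

# Per-function rewards for four samples: math_reward and code_reward each
# return None for samples outside their own task.
reward_outputs = [
    [0.8, None, 0.3, None],   # math_reward
    [None, 0.6, None, 0.9],   # code_reward
]
reward_weights = [1.0, 1.0]

# Treat None as NaN and ignore it when summing the weighted rewards.
per_func = np.array([[np.nan if r is None else r for r in row] for row in reward_outputs])
weighted = per_func * np.array(reward_weights)[:, None]
total = np.nansum(weighted, axis=0)
print(total)  # [0.8 0.6 0.3 0.9]
```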
  
  ## DAPO
  Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) introduces several tricks based on GRPO, which are:
@@ -380,7 +434,21 @@ See reference: [issue](https://github.com/modelscope/ms-swift/issues/3912)
  
  **5. Why is clip_ratio always 1?**
  
- When num_iterations = 1 and async_generate = False, it's on-policy RL, and old_policy is equal to policy.
+ The core purpose of the Clip mechanism is to limit the magnitude of policy updates, preventing a single update from being too large and causing a collapse in policy performance (i.e., a sudden drop in performance after the policy is updated). The specific formula for the Clip operation is as follows:
+ 
+ $$
+ L_{\text{CLIP}}(\theta) = \mathbb{E}_{t} \left[ \min\left(r_{t}(\theta) \hat{A}_{t}, \text{clip}(r_{t}(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_{t} \right) \right]
+ $$
+ 
+ Where $r_{t}(\theta) = \frac{\pi_{\theta}(a_{t} \mid s_{t})}{\pi_{\text{old}}(a_{t} \mid s_{t})}$ is the importance sampling ratio, measuring the difference between the new and old policies. $\hat{A}_{t}$ is the advantage function, representing the relative reward of an action. $\epsilon$ is used to limit the deviation range of $r_{t}(\theta)$.
+ 
+ In on-policy training, every update uses data generated by the latest policy, so the new and old policies are identical, i.e., $\pi_{\theta} = \pi_{\text{old}}$.
+ 
+ Therefore, the importance sampling ratio is always equal to 1, and in this case, the clip operation will not take effect.
+ 
+ Under the following parameter settings, the algorithm is off-policy (near-on-policy):
+ 
+ 1. num_iterations > 1
+ 2. steps_per_generation > gradient_accumulation_steps
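As a small numeric check of the clipped objective above (assuming an illustrative clip range of $\epsilon = 0.2$):

```python
epsilon = 0.2

def clipped_objective_term(ratio, advantage):
    unclipped = ratio * advantage
    clipped = min(max(ratio, 1 - epsilon), 1 + epsilon) * advantage
    return min(unclipped, clipped)

# On-policy: the ratio is exactly 1, so clipping changes nothing.
print(clipped_objective_term(1.0, 2.0))   # 2.0
# Off-policy: the ratio drifts from 1 and the clip caps the update.
print(clipped_objective_term(1.5, 2.0))   # 2.4  (capped at (1 + 0.2) * 2.0)
print(clipped_objective_term(0.7, -1.0))  # -0.8 (the more pessimistic clipped value)
```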
  
  See reference: [issue](https://github.com/huggingface/open-r1/issues/239#issuecomment-2646297851)
