Changes from 3 commits
2 changes: 1 addition & 1 deletion docs/source/Customization/Architecture.md
@@ -84,7 +84,7 @@ class CustomLossScale(LossScale):
In the example, we put more weight on the two words "数学" (math) and "重要" (important), since their loss_scale is 2.0.


Of course, we also need to pay attention to the core logic of the `__call__` method, namely how the loss_scale base strategy (base_strategy) all/default/last_round affects loss_scale; see the introduction in the [Command-line parameters documentation](../Instruction/Command-line-parameters.md) for details. Also see the [Custom dataset documentation](../Customization/Custom-dataset.md) for how the 'loss' field in the dataset affects loss_scale.
Of course, we also need to pay attention to the core logic of the `__call__` method, namely how the loss_scale base strategy (base_strategy) all/default/last_round affects loss_scale; see the introduction in the [Command-line parameters documentation](../Instruction/Command-line-parameters.md) for details. Also see the [Custom dataset documentation](../Customization/Custom-dataset.md) for how the 'loss' and 'loss_scale' fields in the dataset affect loss_scale. The 'loss_scale' field supports specifying a different loss computation strategy (e.g. 'last_round', 'all') for each data row, taking priority over the command-line argument.
```python
if loss or loss is None and (self.base_strategy == 'all' or
(self.base_strategy == 'default' and is_assistant) or
```
10 changes: 10 additions & 0 deletions docs/source/Customization/Custom-dataset.md
@@ -65,6 +65,16 @@ alpaca format:
{"messages": [{"role": "user", "content": "你好"}, {"role": "assistant", "content": "你好,有什么可以帮助你的吗?", "loss": false}, {"role": "user", "content": "1+1等于几?"}, {"role": "assistant", "content": "等于2", "loss": true}]}
```

- The row-level `"loss_scale"` field can be used to specify a different loss computation strategy for each data row; this field takes priority over the command-line argument `--loss_scale`. Supported values include `'default'`, `'last_round'`, `'all'`, as well as combined strategies such as `'last_round+ignore_empty_think'`. This lets different rows flexibly use different loss strategies. Example data format:
```jsonl
# Use the last_round strategy: only compute loss on the final round of the conversation
{"messages": [{"role": "user", "content": "Hello!"}, {"role": "assistant", "content": "Hi there!"}, {"role": "user", "content": "What is 1+1?"}, {"role": "assistant", "content": "It equals 2"}], "loss_scale": "last_round"}
# Use the all strategy: compute loss on all tokens (including the system and user parts)
{"messages": [{"role": "system", "content": "You are a math expert"}, {"role": "user", "content": "What is 1+1?"}, {"role": "assistant", "content": "It equals 2"}], "loss_scale": "all"}
# Use a combined strategy: only the final round, ignoring empty think tags
{"messages": [{"role": "user", "content": "Hello!"}, {"role": "assistant", "content": "<think>\n\n</think>\n\nHi there!"}], "loss_scale": "last_round+ignore_empty_think"}
```


#### channel loss
If you want to use channel loss, set `--enable_channel_loss true` and add a "channel" field to the dataset. Channel loss is compatible with techniques such as packing/padding_free/loss_scale.
4 changes: 4 additions & 0 deletions docs/source/Instruction/Agent-support.md
@@ -206,6 +206,10 @@ The loss_scale parameter can be used to adjust the loss weight of the model output part during training

- All strings matching the regular expression `<think>\\s*</think>\\s*` get a loss_scale of 0, i.e. no loss is computed for them.

3. Row-level setting

The row-level `"loss_scale"` field can be used to specify a different loss computation strategy for each data row; it takes priority over the command-line argument. Supported values include `'default'`, `'last_round'`, `'all'`, and combined strategies such as `'last_round+ignore_empty_think'` and `'default+react'`. See the [Custom dataset documentation](../Customization/Custom-dataset.md#监督微调).
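For instance, a single row opting into the combined `'default+react'` strategy might look like the following (an illustrative row; the tool-call content and tool name are invented for this example):

```jsonl
{"messages": [{"role": "user", "content": "What's the weather in Beijing?"}, {"role": "assistant", "content": "Thought: I need to call the weather tool.\nAction: get_weather\nAction Input: {\"city\": \"Beijing\"}"}], "loss_scale": "default+react"}
```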

Test loss_scale with code:
```python
from swift import get_processor, get_template
```
1 change: 1 addition & 0 deletions docs/source/Instruction/Command-line-parameters.md
@@ -115,6 +115,7 @@
- 'all': compute the loss on all tokens. (**default for `swift pt`**)
- 'ignore_empty_think': ignore loss computation for empty `'<think>\n\n</think>\n\n'` (anything matching the regex `'<think>\\s*</think>\\s*'`).
- 'react', 'hermes', 'qwen': adjust the loss weight of the `tool_call` part to 2.
- **Row-level setting**: the row-level `"loss_scale"` field can specify a different loss computation strategy for each data row; it takes priority over the command-line argument. See the [Custom dataset documentation](../Customization/Custom-dataset.md#监督微调).
- sequence_parallel_size: sequence parallelism size, default 1. Currently supports CPT/SFT/DPO/GRPO. See the training scripts [here](https://github.com/modelscope/ms-swift/tree/main/examples/train/sequence_parallel).
- template_backend: which template backend to use, either 'swift' or 'jinja'; default 'swift'. With jinja, transformers' `apply_chat_template` is used.
  - Note: the jinja template backend only supports inference, not training (it cannot determine the token range for loss computation).
3 changes: 1 addition & 2 deletions docs/source/Megatron-SWIFT/Command-line-parameters.md
@@ -100,7 +100,6 @@
- muon_tp_mode: NS computation mode for tensor-model-parallel weights. Options: 'blockwise', 'duplicated', 'distributed'. Default 'blockwise'.
- muon_extra_scale_factor: extra scale factor for Muon updates, default 1.


**Checkpoint parameters**:
- 🔥output_dir: checkpoint output directory, default None. During training, if unset it defaults to `f'megatron_output/{model_suffix}'`, e.g. `'megatron_output/Qwen2.5-7B-Instruct'`.
  - Note: **for multi-node training, make sure the save path on every node points to the same location**, otherwise you will need to gather these weights manually after training.
@@ -293,7 +292,7 @@ Megatron training parameters inherit from Megatron parameters and basic parameters (**shared with ms-swift
## RLHF parameters
In addition to inheriting the training parameters, the following are also supported:
- 🔥rlhf_type: default 'dpo'. Currently one of 'dpo', 'grpo', 'kto', 'rm', and 'gkd'.
- loss_scale: overrides the loss_scale in the [basic parameters](../Instruction/Command-line-parameters.md). Default 'last_round'.
- loss_scale: overrides the loss_scale in the [basic parameters](../Instruction/Command-line-parameters.md). Default 'last_round'. The row-level `"loss_scale"` field can specify a different loss computation strategy for each data row; it takes priority over the command-line argument. See the [Custom dataset documentation](../Customization/Custom-dataset.md#监督微调).
- calculate_per_token_loss: overrides the Megatron parameter, default False.


2 changes: 1 addition & 1 deletion docs/source_en/Customization/Architecture.md
@@ -85,7 +85,7 @@ The `get_loss_scale` function returns a Tuple. The first return is a list of dec
```
In the example, we place more emphasis on the words "数学" and "重要" because their loss_scale is 2.0.

Of course, we also need to pay attention to the core logic of the `__call__` method, namely the influence of the loss_scale base strategy (base_strategy) all/default/last_round on loss_scale. For details, refer to the introduction in the [Command-line Parameters Documentation](../Instruction/Command-line-parameters.md). Also, refer to the influence of the 'loss' field in the dataset on loss_scale in the [Custom Dataset Documentation](../Customization/Custom-dataset.md).
Of course, we also need to pay attention to the core logic of the `__call__` method, namely the influence of the loss_scale base strategy (base_strategy) all/default/last_round on loss_scale. For details, refer to the introduction in the [Command-line Parameters Documentation](../Instruction/Command-line-parameters.md). Also, refer to the influence of the 'loss' field and 'loss_scale' field in the dataset on loss_scale in the [Custom Dataset Documentation](../Customization/Custom-dataset.md). The 'loss_scale' field supports specifying different loss computation strategies (e.g., 'last_round', 'all', etc.) for each data row, with higher priority than the command-line argument.

```python
if loss or loss is None and (self.base_strategy == 'all' or
```
11 changes: 11 additions & 0 deletions docs/source_en/Customization/Custom-dataset.md
@@ -67,6 +67,17 @@ The following outlines the standard dataset format for ms-swift, where the "syst
{"messages": [{"role": "user", "content": "Hello!"}, {"role": "assistant", "content": "Hi, how can I help you?", "loss": false}, {"role": "user", "content": "What is 1+1?"}, {"role": "assistant", "content": "It equals 2", "loss": true}]}
```

- You can also specify different loss computation strategies for each data row using the row-level `"loss_scale"` field. This field has higher priority than the command-line argument `--loss_scale`. Supported values include: `'default'`, `'last_round'`, `'all'`, and combined strategies like `'last_round+ignore_empty_think'`. This allows different data rows to use different loss strategies flexibly. Example data format:

```jsonl
# Using last_round strategy: only compute loss for the last round of conversation
{"messages": [{"role": "user", "content": "Hello!"}, {"role": "assistant", "content": "Hi there!"}, {"role": "user", "content": "What is 1+1?"}, {"role": "assistant", "content": "It equals 2"}], "loss_scale": "last_round"}
# Using all strategy: compute loss for all tokens (including system and user parts)
{"messages": [{"role": "system", "content": "You are a math expert"}, {"role": "user", "content": "What is 1+1?"}, {"role": "assistant", "content": "It equals 2"}], "loss_scale": "all"}
# Using combined strategy: only compute loss for the last round and ignore empty think tags
{"messages": [{"role": "user", "content": "Hello!"}, {"role": "assistant", "content": "<think>\n\n</think>\n\nHi there!"}], "loss_scale": "last_round+ignore_empty_think"}
```
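The per-row field travels with each JSONL line like any other top-level key. As a minimal sketch (plain stdlib only, no ms-swift API; `train.jsonl` is a made-up path), the rows above can be written and read back like this:

```python
import json

# Illustrative sketch: build a tiny JSONL file in the format described above,
# where each row carries its own "loss_scale" strategy, then read it back.
rows = [
    {"messages": [{"role": "user", "content": "Hello!"},
                  {"role": "assistant", "content": "Hi there!"}],
     "loss_scale": "last_round"},
    {"messages": [{"role": "system", "content": "You are a math expert"},
                  {"role": "user", "content": "What is 1+1?"},
                  {"role": "assistant", "content": "It equals 2"}],
     "loss_scale": "all"},
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        # One JSON object per line -- the JSONL convention used above.
        f.write(json.dumps(row, ensure_ascii=False) + "\n")

with open("train.jsonl", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print([r["loss_scale"] for r in loaded])  # ['last_round', 'all']
```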

#### Channel Loss
If you want to use channel loss, you need to set `--enable_channel_loss true` and add a "channel" field to your dataset. Channel loss is compatible with techniques such as packing, padding-free, and loss scaling.

5 changes: 5 additions & 0 deletions docs/source_en/Instruction/Agent-support.md
@@ -221,6 +221,11 @@ The specific effect of this setting is:
- Any string matching the regular expression `<think>\\s*</think>\\s*` is assigned a `loss_scale` of 0, meaning no loss is computed for these segments.


3. Row-level Setting

You can also specify different loss computation strategies for each data row using the row-level `"loss_scale"` field, which has higher priority than the command-line argument. Supported values include: `'default'`, `'last_round'`, `'all'`, and combined strategies like `'last_round+ignore_empty_think'`, `'default+react'`, etc. Refer to [Custom Dataset documentation](../Customization/Custom-dataset.md#supervised-fine-tuning) for details.


Testing loss_scale using code:
```python
from swift import get_processor, get_template
```
1 change: 1 addition & 0 deletions docs/source_en/Instruction/Command-line-parameters.md
@@ -115,6 +115,7 @@ The command-line arguments will be introduced in four categories: basic argument
- 'all': Calculate loss for all tokens. (**Default value for `swift pt`**)
- 'ignore_empty_think': Ignore loss computation for empty `'<think>\n\n</think>\n\n'` (as long as it matches the regex `'<think>\\s*</think>\\s*'`).
- 'react', 'hermes', 'qwen': Adjust the loss weight of the `tool_call` part to 2.
- **Row-level setting**: You can also specify different loss computation strategies for each data row using the row-level `"loss_scale"` field, which has higher priority than the command-line argument. Refer to [Custom Dataset documentation](../Customization/Custom-dataset.md#supervised-fine-tuning) for details.
- sequence_parallel_size: Size for sequence parallelism. Default is 1. Currently supported in CPT/SFT/DPO/GRPO. Training scripts can be found [here](https://github.com/modelscope/ms-swift/tree/main/examples/train/sequence_parallel).
- template_backend: Backend for template processing. Options are `'swift'` or `'jinja'`. Default is `'swift'`. If `'jinja'` is used, `apply_chat_template` from Transformers will be applied.
- Note: The `'jinja'` backend only supports inference and does not support training (as it cannot determine the token ranges for loss computation).
2 changes: 1 addition & 1 deletion docs/source_en/Megatron-SWIFT/Command-line-parameters.md
@@ -315,7 +315,7 @@ Megatron training parameters are inherited from Megatron parameters and basic pa
In addition to inheriting the training parameters, the following parameters are also supported:

- 🔥rlhf_type: Default is 'dpo'. Currently, 'dpo', 'grpo', 'kto', 'rm', and 'gkd' are available.
- loss_scale: Overrides the `loss_scale` in [basic parameters](../Instruction/Command-line-parameters.md). Default is 'last_round'.
- loss_scale: Overrides the `loss_scale` in [basic parameters](../Instruction/Command-line-parameters.md). Default is 'last_round'. You can also specify different loss computation strategies for each data row using the row-level `"loss_scale"` field, which has higher priority than the command-line argument. Refer to [Custom Dataset documentation](../Customization/Custom-dataset.md#supervised-fine-tuning) for details.
- calculate_per_token_loss: Overrides the Megatron parameter. Default is False.


1 change: 1 addition & 0 deletions swift/dataset/preprocessor/core.py
@@ -31,6 +31,7 @@ class RowPreprocessor:
'channel',
'margin',
'teacher_prompt',
'loss_scale',
]

def __init__(self,
6 changes: 5 additions & 1 deletion swift/dataset/utils.py
@@ -119,7 +119,11 @@ def __init__(self, template: 'Template'):
self.template = template

def preprocess(self, row: Dict[str, Any]) -> Optional[Dict[str, Any]]:
return self.template.encode(row, return_length=True)
encoded = self.template.encode(row, return_length=True)
# Preserve loss_scale from data row if present (for per-row loss_scale strategy)
if 'loss_scale' in row:
encoded['loss_scale'] = row['loss_scale']
**Collaborator:** Can this be implemented by modifying here?

```python
def from_dict(cls, inputs: Dict[str, Any]) -> 'StdTemplateInputs':
    inputs = deepcopy(inputs)
    kwargs = {}
    for key in ['label', 'channel', 'margin', 'rejected_response']:
        if key in inputs:
            kwargs[key] = inputs[key]
```

**Contributor Author:** This part of the code is not affected; it has been reverted.
return encoded


class AddLengthPreprocessor(EncodePreprocessor):
23 changes: 23 additions & 0 deletions swift/loss_scale/base.py
@@ -4,8 +4,11 @@
from typing import List, Literal, Optional, Tuple

from swift.template import ContextType, Messages, get_last_user_round
from swift.utils import get_logger
from .utils import calculate_loss_scale

logger = get_logger()

ALL_BASE_STRATEGY = ['default', 'last_round', 'all']


@@ -77,6 +80,10 @@ def __call__(self, context_list: List[str], context_types: List[ContextType], me
context_types: List of context types corresponding to each context, indicating
whether it's a system prompt, user query, assistant response, etc.
messages: Complete message list containing the conversation history.
**kwargs: Additional keyword arguments. Supports 'loss_scale' to override
the global loss scale strategy for this specific data row. The value
can be a string like 'default', 'last_round', 'all', or combined
strategies like 'last_round+ignore_empty_think'.

Returns:
A tuple containing:
@@ -85,6 +92,22 @@
- List[float]: Loss scale values corresponding one-to-one with the
returned context list
"""
# Check for per-row loss_scale override in kwargs (from data row)
**Collaborator:** Is it possible to use different loss_scale in the template?

```python
self.loss_scale: LossScale = get_loss_scale(loss_scale)
```

**Contributor Author:** Yes, the loss_scale from the data can be passed in:

```python
res_context_list, loss_scale_list = self.loss_scale(res_context_list, res_context_types, inputs.messages,
                                                    **inputs.extra_kwargs)
```

row_loss_scale = kwargs.get('loss_scale')
if row_loss_scale is not None:
# Use per-row loss_scale with higher priority than global setting
from .mapping import get_loss_scale
**Contributor** (severity: medium): Move the `from .mapping import get_loss_scale` statement to the top of the file (module level). Importing inside a method re-runs the import machinery on every call, which hurts performance and can lead to unexpected behavior. Placing all imports at the top of the file is Python best practice and improves readability and efficiency.

```python
from swift.template import ContextType, Messages, get_last_user_round
from swift.utils import get_logger
from .utils import calculate_loss_scale
from .mapping import get_loss_scale
```

try:
loss_scale_handler = get_loss_scale(row_loss_scale)
# Call the handler without 'loss_scale' in kwargs to avoid infinite recursion
kwargs_without_loss_scale = {k: v for k, v in kwargs.items() if k != 'loss_scale'}
return loss_scale_handler(context_list, context_types, messages, **kwargs_without_loss_scale)
except (KeyError, ValueError) as e:
# If invalid loss_scale specified in data row, fall back to global setting
logger.warning(f"Invalid loss_scale '{row_loss_scale}' specified in data row, "
f"falling back to global setting '{self.base_strategy}'. Error: {e}")
pass

res_context_list = []
res_loss_scale = []
i = 0
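To make the control flow of the per-row override added in `swift/loss_scale/base.py` concrete, here is a standalone sketch. The two-entry registry and the handlers below are hypothetical stand-ins for swift's actual `get_loss_scale` mapping and loss-scale classes, invented purely for illustration:

```python
def last_round_handler(contexts, **kwargs):
    # Stand-in handler: loss only on the final context.
    return [0.0] * (len(contexts) - 1) + [1.0]

def all_handler(contexts, **kwargs):
    # Stand-in handler: loss on every context.
    return [1.0] * len(contexts)

REGISTRY = {"last_round": last_round_handler, "all": all_handler}

def get_loss_scale(name):
    # Mirrors the mapping lookup: unknown names raise KeyError.
    return REGISTRY[name]

def compute_loss_scale(contexts, base_strategy="default", **kwargs):
    row_loss_scale = kwargs.get("loss_scale")
    if row_loss_scale is not None:
        try:
            handler = get_loss_scale(row_loss_scale)
            # Strip 'loss_scale' before delegating, mirroring the recursion
            # guard in the diff above.
            rest = {k: v for k, v in kwargs.items() if k != "loss_scale"}
            return handler(contexts, **rest)
        except KeyError:
            pass  # invalid per-row value: fall back to the global strategy
    # Global fallback (simplified here to: loss on everything).
    return [1.0] * len(contexts)

print(compute_loss_scale(["sys", "user", "resp"], loss_scale="last_round"))
# [0.0, 0.0, 1.0]
print(compute_loss_scale(["sys", "user", "resp"], loss_scale="bogus"))
# [1.0, 1.0, 1.0]  (invalid row value falls back to the global setting)
```

A valid per-row value wins over the global strategy, while an unknown value degrades gracefully instead of failing the whole encode, matching the warning-and-fallback behavior in the diff.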