diff --git a/docs/source/Customization/Architecture.md b/docs/source/Customization/Architecture.md
index 736ce9e061..b8646c62f8 100644
--- a/docs/source/Customization/Architecture.md
+++ b/docs/source/Customization/Architecture.md
@@ -84,7 +84,7 @@ class CustomLossScale(LossScale):
In the example, we place more emphasis on the words "数学" and "重要" because their loss_scale is 2.0.
-Of course, we also need to pay attention to the core logic of the `__call__` method, namely the influence of the loss_scale base strategy (base_strategy) all/default/last_round on loss_scale. For details, refer to the introduction in the [Command-line Parameters Documentation](../Instruction/Command-line-parameters.md). Also refer to the influence of the 'loss' field in the dataset on loss_scale in the [Custom Dataset Documentation](../Customization/Custom-dataset.md).
+Of course, we also need to pay attention to the core logic of the `__call__` method, namely the influence of the loss_scale base strategy (base_strategy) all/default/last_round on loss_scale. For details, refer to the introduction in the [Command-line Parameters Documentation](../Instruction/Command-line-parameters.md). Also refer to the influence of the 'loss' and 'loss_scale' fields in the dataset on loss_scale in the [Custom Dataset Documentation](../Customization/Custom-dataset.md). The 'loss_scale' field allows specifying a different loss computation strategy (e.g., 'last_round' or 'all') for each data row and takes priority over the command-line argument.
```python
if loss or loss is None and (self.base_strategy == 'all' or
(self.base_strategy == 'default' and is_assistant) or
diff --git a/docs/source/Customization/Custom-dataset.md b/docs/source/Customization/Custom-dataset.md
index 8c75b8856e..26504d0fb3 100644
--- a/docs/source/Customization/Custom-dataset.md
+++ b/docs/source/Customization/Custom-dataset.md
@@ -65,6 +65,16 @@ alpaca格式:
{"messages": [{"role": "user", "content": "你好"}, {"role": "assistant", "content": "你好,有什么可以帮助你的吗?", "loss": false}, {"role": "user", "content": "1+1等于几?"}, {"role": "assistant", "content": "等于2", "loss": true}]}
```
+- You can also specify a different loss computation strategy for each data row via the row-level `"loss_scale"` field, which takes priority over the command-line argument `--loss_scale`. Supported values include `'default'`, `'last_round'`, `'all'`, and combined strategies such as `'last_round+ignore_empty_think'`. This allows different data rows to flexibly use different loss strategies. Example data format:
+```jsonl
+# last_round strategy: compute loss only on the last round of the conversation
+{"messages": [{"role": "user", "content": "你好"}, {"role": "assistant", "content": "你好!"}, {"role": "user", "content": "1+1等于几?"}, {"role": "assistant", "content": "等于2"}], "loss_scale": "last_round"}
+# all strategy: compute loss on all tokens (including the system and user parts)
+{"messages": [{"role": "system", "content": "你是数学专家"}, {"role": "user", "content": "1+1等于几?"}, {"role": "assistant", "content": "等于2"}], "loss_scale": "all"}
+# combined strategy: compute loss only on the last round and ignore empty think tags
+{"messages": [{"role": "user", "content": "你好"}, {"role": "assistant", "content": "<think>\n\n</think>\n\n你好!"}], "loss_scale": "last_round+ignore_empty_think"}
+```
+
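The rows above can also be generated programmatically. A minimal sketch using only the standard library (the temporary file path is illustrative; in practice you would write to your own dataset file):

```python
import json
import tempfile

# Each row carries its own "loss_scale" strategy alongside the messages.
rows = [
    {"messages": [{"role": "user", "content": "你好"},
                  {"role": "assistant", "content": "你好!"}],
     "loss_scale": "last_round"},
    {"messages": [{"role": "system", "content": "你是数学专家"},
                  {"role": "user", "content": "1+1等于几?"},
                  {"role": "assistant", "content": "等于2"}],
     "loss_scale": "all"},
]

# Write one JSON object per line (JSONL); ensure_ascii=False keeps CJK text readable.
with tempfile.NamedTemporaryFile('w', suffix='.jsonl', delete=False,
                                 encoding='utf-8') as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + '\n')
    path = f.name

# Reading the file back yields one dict per row, each with its own strategy.
loaded = [json.loads(line) for line in open(path, encoding='utf-8')]
```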
#### channel loss
If you want to use channel loss, you need to set `--enable_channel_loss true` and add a "channel" field to the dataset. Channel loss is compatible with techniques such as packing/padding_free/loss_scale.
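For example, a row tagged with a channel might look like the following sketch (the channel name "math" is made up for illustration):

```python
import json

# A dataset row tagged with a "channel" field so loss can be tracked per channel.
row = {
    "messages": [{"role": "user", "content": "1+1等于几?"},
                 {"role": "assistant", "content": "等于2"}],
    "channel": "math",  # hypothetical channel name
}
line = json.dumps(row, ensure_ascii=False)
```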
diff --git a/docs/source/Instruction/Agent-support.md b/docs/source/Instruction/Agent-support.md
index 6883a18198..906e766a78 100644
--- a/docs/source/Instruction/Agent-support.md
+++ b/docs/source/Instruction/Agent-support.md
@@ -206,6 +206,10 @@ loss_scale参数可用于调节模型输出部分在训练过程中的损失权
- Any string matching the regular expression `<think>\\s*</think>\\s*` has a loss_scale of 0, i.e., no loss is computed for it.
+3. Row-level setting
+
+You can also specify a different loss computation strategy for each data row via the row-level `"loss_scale"` field, which takes priority over the command-line argument. Supported values include `'default'`, `'last_round'`, `'all'`, and combined strategies such as `'last_round+ignore_empty_think'` and `'default+react'`. See the [Custom Dataset documentation](../Customization/Custom-dataset.md#监督微调).
+
Test loss_scale using code:
```python
from swift import get_processor, get_template
diff --git a/docs/source/Instruction/Command-line-parameters.md b/docs/source/Instruction/Command-line-parameters.md
index aba906e36d..941a5b4679 100644
--- a/docs/source/Instruction/Command-line-parameters.md
+++ b/docs/source/Instruction/Command-line-parameters.md
@@ -115,6 +115,7 @@
- 'all': Calculate loss for all tokens. (**This is the default for `swift pt`**)
- 'ignore_empty_think': Ignore loss computation for empty `'<think>\n\n</think>\n\n'` (as long as it matches the regex `'<think>\\s*</think>\\s*'`).
- 'react', 'hermes', 'qwen': Adjust the loss weight of the `tool_call` part to 2.
+ - **Row-level setting**: You can also specify a different loss computation strategy for each data row via the row-level `"loss_scale"` field, which takes priority over the command-line argument. See the [Custom Dataset documentation](../Customization/Custom-dataset.md#监督微调).
- sequence_parallel_size: Sequence parallelism size, default 1. Currently supports CPT/SFT/DPO/GRPO. Training scripts can be found [here](https://github.com/modelscope/ms-swift/tree/main/examples/train/sequence_parallel).
- template_backend: Select the template backend. Options are 'swift' and 'jinja'; default is 'swift'. If jinja is used, transformers' `apply_chat_template` is applied.
- Note: The jinja template backend only supports inference, not training (it cannot determine the range of tokens for loss computation).
diff --git a/docs/source/Megatron-SWIFT/Command-line-parameters.md b/docs/source/Megatron-SWIFT/Command-line-parameters.md
index 7da461a999..8506a4e54c 100644
--- a/docs/source/Megatron-SWIFT/Command-line-parameters.md
+++ b/docs/source/Megatron-SWIFT/Command-line-parameters.md
@@ -100,7 +100,6 @@
- muon_tp_mode: NS computation mode for tensor-model-parallel weights. Options are 'blockwise', 'duplicated', and 'distributed'. Default is 'blockwise'.
- muon_extra_scale_factor: Extra scale factor for Muon updates. Default is 1.
-
**Checkpoint parameters**:
- 🔥output_dir: Output directory for checkpoints, default None. During training, if this parameter is not set, it defaults to `f'megatron_output/{model_suffix}'`, e.g., `'megatron_output/Qwen2.5-7B-Instruct'`.
- Note: **For multi-node training, make sure the save path on each node points to the same location**; otherwise you will need to manually gather these weights after training.
@@ -293,7 +292,7 @@ Megatron训练参数继承自Megatron参数和基本参数(**与ms-swift共用
## RLHF Parameters
In addition to inheriting the training parameters, the following parameters are also supported:
- 🔥rlhf_type: Default is 'dpo'. Currently, 'dpo', 'grpo', 'kto', 'rm', and 'gkd' are available.
-- loss_scale: Overrides the loss_scale in [basic parameters](../Instruction/Command-line-parameters.md). Default is 'last_round'.
+- loss_scale: Overrides the loss_scale in [basic parameters](../Instruction/Command-line-parameters.md). Default is 'last_round'. You can also specify a different loss computation strategy for each data row via the row-level `"loss_scale"` field, which takes priority over the command-line argument. See the [Custom Dataset documentation](../Customization/Custom-dataset.md#监督微调) for details.
- calculate_per_token_loss: Overrides the Megatron parameter. Default is False.
diff --git a/docs/source_en/Customization/Architecture.md b/docs/source_en/Customization/Architecture.md
index 8570f82b61..e13d9b2a2b 100644
--- a/docs/source_en/Customization/Architecture.md
+++ b/docs/source_en/Customization/Architecture.md
@@ -85,7 +85,7 @@ The `get_loss_scale` function returns a Tuple. The first return is a list of dec
```
In the example, we place more emphasis on the words "数学" and "重要" because their loss_scale is 2.0.
-Of course, we also need to pay attention to the core logic of the `__call__` method, namely the influence of the loss_scale base strategy (base_strategy) all/default/last_round on loss_scale. For details, refer to the introduction in the [Command-line Parameters Documentation](../Instruction/Command-line-parameters.md). Also, refer to the influence of the 'loss' field in the dataset on loss_scale in the [Custom Dataset Documentation](../Customization/Custom-dataset.md).
+Of course, we also need to pay attention to the core logic of the `__call__` method, namely the influence of the loss_scale base strategy (base_strategy) all/default/last_round on loss_scale. For details, refer to the introduction in the [Command-line Parameters Documentation](../Instruction/Command-line-parameters.md). Also, refer to the influence of the 'loss' and 'loss_scale' fields in the dataset on loss_scale in the [Custom Dataset Documentation](../Customization/Custom-dataset.md). The 'loss_scale' field allows specifying a different loss computation strategy (e.g., 'last_round' or 'all') for each data row and takes priority over the command-line argument.
```python
if loss or loss is None and (self.base_strategy == 'all' or
diff --git a/docs/source_en/Customization/Custom-dataset.md b/docs/source_en/Customization/Custom-dataset.md
index 42801c5bba..cec34c9b14 100644
--- a/docs/source_en/Customization/Custom-dataset.md
+++ b/docs/source_en/Customization/Custom-dataset.md
@@ -67,6 +67,17 @@ The following outlines the standard dataset format for ms-swift, where the "syst
{"messages": [{"role": "user", "content": "Hello!"}, {"role": "assistant", "content": "Hi, how can I help you?", "loss": false}, {"role": "user", "content": "What is 1+1?"}, {"role": "assistant", "content": "It equals 2", "loss": true}]}
```
+- You can also specify different loss computation strategies for each data row using the row-level `"loss_scale"` field. This field has higher priority than the command-line argument `--loss_scale`. Supported values include: `'default'`, `'last_round'`, `'all'`, and combined strategies like `'last_round+ignore_empty_think'`. This allows different data rows to use different loss strategies flexibly. Example data format:
+
+```jsonl
+# Using last_round strategy: only compute loss for the last round of conversation
+{"messages": [{"role": "user", "content": "Hello!"}, {"role": "assistant", "content": "Hi there!"}, {"role": "user", "content": "What is 1+1?"}, {"role": "assistant", "content": "It equals 2"}], "loss_scale": "last_round"}
+# Using all strategy: compute loss for all tokens (including system and user parts)
+{"messages": [{"role": "system", "content": "You are a math expert"}, {"role": "user", "content": "What is 1+1?"}, {"role": "assistant", "content": "It equals 2"}], "loss_scale": "all"}
+# Using combined strategy: only compute loss for the last round and ignore empty think tags
+{"messages": [{"role": "user", "content": "Hello!"}, {"role": "assistant", "content": "<think>\n\n</think>\n\nHi there!"}], "loss_scale": "last_round+ignore_empty_think"}
+```
+
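Since an invalid row-level value falls back to the global setting at training time, it can be worth validating rows up front. A minimal sketch, assuming only the strategy names listed in this document (the library itself may accept more; this validator is illustrative, not the library's own check):

```python
import json

# Base strategies and the '+'-combinable modifiers named in this document.
BASE_STRATEGIES = {"default", "last_round", "all"}
MODIFIERS = {"ignore_empty_think", "react", "hermes", "qwen"}

def is_valid_loss_scale(value: str) -> bool:
    """Check a row-level value such as 'last_round+ignore_empty_think'."""
    head, *mods = value.split("+")
    return head in BASE_STRATEGIES and all(m in MODIFIERS for m in mods)

rows = [
    '{"messages": [], "loss_scale": "last_round+ignore_empty_think"}',
    '{"messages": [], "loss_scale": "every_round"}',  # not a valid strategy
]
results = [is_valid_loss_scale(json.loads(r)["loss_scale"]) for r in rows]
```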
#### Channel Loss
If you want to use channel loss, you need to set `--enable_channel_loss true` and add a "channel" field to your dataset. Channel loss is compatible with techniques such as packing, padding-free, and loss scaling.
diff --git a/docs/source_en/Instruction/Agent-support.md b/docs/source_en/Instruction/Agent-support.md
index 6cf6b5ce18..9644777bf0 100644
--- a/docs/source_en/Instruction/Agent-support.md
+++ b/docs/source_en/Instruction/Agent-support.md
@@ -221,6 +221,11 @@ The specific effect of this setting is:
- Any string matching the regular expression `<think>\\s*</think>\\s*` is assigned a `loss_scale` of 0, meaning no loss is computed for these segments.
+3. Row-level Setting
+
+You can also specify different loss computation strategies for each data row using the row-level `"loss_scale"` field, which has higher priority than the command-line argument. Supported values include: `'default'`, `'last_round'`, `'all'`, and combined strategies like `'last_round+ignore_empty_think'`, `'default+react'`, etc. Refer to [Custom Dataset documentation](../Customization/Custom-dataset.md#supervised-fine-tuning) for details.
+
Testing loss_scale using code:
```python
from swift import get_processor, get_template
diff --git a/docs/source_en/Instruction/Command-line-parameters.md b/docs/source_en/Instruction/Command-line-parameters.md
index d5fecb6b4e..32fb6befe3 100644
--- a/docs/source_en/Instruction/Command-line-parameters.md
+++ b/docs/source_en/Instruction/Command-line-parameters.md
@@ -115,6 +115,7 @@ The command-line arguments will be introduced in four categories: basic argument
- 'all': Calculate loss for all tokens. (**Default value for `swift pt`**)
- 'ignore_empty_think': Ignore loss computation for empty `'<think>\n\n</think>\n\n'` (as long as it matches the regex `'<think>\\s*</think>\\s*'`).
- 'react', 'hermes', 'qwen': Adjust the loss weight of the `tool_call` part to 2.
+ - **Row-level setting**: You can also specify different loss computation strategies for each data row using the row-level `"loss_scale"` field, which has higher priority than the command-line argument. Refer to [Custom Dataset documentation](../Customization/Custom-dataset.md#supervised-fine-tuning) for details.
- sequence_parallel_size: Size for sequence parallelism. Default is 1. Currently supported in CPT/SFT/DPO/GRPO. Training scripts can be found [here](https://github.com/modelscope/ms-swift/tree/main/examples/train/sequence_parallel).
- template_backend: Backend for template processing. Options are `'swift'` or `'jinja'`. Default is `'swift'`. If `'jinja'` is used, `apply_chat_template` from Transformers will be applied.
- Note: The `'jinja'` backend only supports inference and does not support training (as it cannot determine the token ranges for loss computation).
diff --git a/docs/source_en/Megatron-SWIFT/Command-line-parameters.md b/docs/source_en/Megatron-SWIFT/Command-line-parameters.md
index ab0ccdb457..7bacd6b8e9 100644
--- a/docs/source_en/Megatron-SWIFT/Command-line-parameters.md
+++ b/docs/source_en/Megatron-SWIFT/Command-line-parameters.md
@@ -315,7 +315,7 @@ Megatron training parameters are inherited from Megatron parameters and basic pa
In addition to inheriting the training parameters, the following parameters are also supported:
- 🔥rlhf_type: Default is 'dpo'. Currently, 'dpo', 'grpo', 'kto', 'rm', and 'gkd' are available.
-- loss_scale: Overrides the `loss_scale` in [basic parameters](../Instruction/Command-line-parameters.md). Default is 'last_round'.
+- loss_scale: Overrides the `loss_scale` in [basic parameters](../Instruction/Command-line-parameters.md). Default is 'last_round'. You can also specify different loss computation strategies for each data row using the row-level `"loss_scale"` field, which has higher priority than the command-line argument. Refer to [Custom Dataset documentation](../Customization/Custom-dataset.md#supervised-fine-tuning) for details.
- calculate_per_token_loss: Overrides the Megatron parameter. Default is False.
diff --git a/swift/dataset/preprocessor/core.py b/swift/dataset/preprocessor/core.py
index 993ae4104a..141c513621 100644
--- a/swift/dataset/preprocessor/core.py
+++ b/swift/dataset/preprocessor/core.py
@@ -31,6 +31,7 @@ class RowPreprocessor:
'channel',
'margin',
'teacher_prompt',
+ 'loss_scale',
]
def __init__(self,
diff --git a/swift/loss_scale/base.py b/swift/loss_scale/base.py
index d37e4f20e9..0333438aa0 100644
--- a/swift/loss_scale/base.py
+++ b/swift/loss_scale/base.py
@@ -4,7 +4,11 @@
from typing import List, Literal, Optional, Tuple
from swift.template import ContextType, Messages, get_last_user_round
+from swift.utils import get_logger
from .utils import calculate_loss_scale
+from .mapping import get_loss_scale
+
+logger = get_logger()
ALL_BASE_STRATEGY = ['default', 'last_round', 'all']
@@ -77,6 +81,10 @@ def __call__(self, context_list: List[str], context_types: List[ContextType], me
context_types: List of context types corresponding to each context, indicating
whether it's a system prompt, user query, assistant response, etc.
messages: Complete message list containing the conversation history.
+ **kwargs: Additional keyword arguments. Supports 'loss_scale' to override
+ the global loss scale strategy for this specific data row. The value
+ can be a string like 'default', 'last_round', 'all', or combined
+ strategies like 'last_round+ignore_empty_think'.
Returns:
A tuple containing:
@@ -85,6 +93,21 @@ def __call__(self, context_list: List[str], context_types: List[ContextType], me
- List[float]: Loss scale values corresponding one-to-one with the
returned context list
"""
+ # Check for per-row loss_scale override in kwargs (from data row)
+ row_loss_scale = kwargs.get('loss_scale')
+ if row_loss_scale is not None:
+ # Use per-row loss_scale with higher priority than global setting
+ try:
+ loss_scale_handler = get_loss_scale(row_loss_scale)
+ # Call the handler without 'loss_scale' in kwargs to avoid infinite recursion
+ kwargs_without_loss_scale = {k: v for k, v in kwargs.items() if k != 'loss_scale'}
+ return loss_scale_handler(context_list, context_types, messages, **kwargs_without_loss_scale)
+ except (KeyError, ValueError) as e:
+ # If invalid loss_scale specified in data row, fall back to global setting
+ logger.warning(f"Invalid loss_scale '{row_loss_scale}' specified in data row, "
+ f"falling back to global setting '{self.base_strategy}'. Error: {e}")
+
res_context_list = []
res_loss_scale = []
i = 0