2 changes: 1 addition & 1 deletion docs/source/Customization/Architecture.md
@@ -84,7 +84,7 @@ class CustomLossScale(LossScale):
In this example, we place more emphasis on the two words "数学" and "重要", because their loss_scale is 2.0.


Of course, we also need to pay attention to the core logic of the `__call__` method, namely the influence of the loss_scale base strategy (base_strategy) all/default/last_round on loss_scale. For details, refer to the introduction in the [Command-line Parameters Documentation](../Instruction/Command-line-parameters.md). Also refer to the influence of the 'loss' field in the dataset on loss_scale in the [Custom Dataset Documentation](../Customization/Custom-dataset.md).
Of course, we also need to pay attention to the core logic of the `__call__` method, namely the influence of the loss_scale base strategy (base_strategy) all/default/last_round on loss_scale. For details, refer to the introduction in the [Command-line Parameters Documentation](../Instruction/Command-line-parameters.md). Also refer to the influence of the 'loss' field and the 'loss_scale' field in the dataset on loss_scale in the [Custom Dataset Documentation](../Customization/Custom-dataset.md). The 'loss_scale' field supports specifying a different loss computation strategy (e.g., 'last_round', 'all') for each data row, with higher priority than the command-line argument.
```python
if loss or loss is None and (self.base_strategy == 'all' or
(self.base_strategy == 'default' and is_assistant) or
10 changes: 10 additions & 0 deletions docs/source/Customization/Custom-dataset.md
@@ -65,6 +65,16 @@ alpaca format:
{"messages": [{"role": "user", "content": "你好"}, {"role": "assistant", "content": "你好,有什么可以帮助你的吗?", "loss": false}, {"role": "user", "content": "1+1等于几?"}, {"role": "assistant", "content": "等于2", "loss": true}]}
```

- You can specify a different loss computation strategy for each data row via the row-level `"loss_scale"` field, which has higher priority than the command-line argument `--loss_scale`. Supported values include `'default'`, `'last_round'`, `'all'`, and combined strategies such as `'last_round+ignore_empty_think'`. This lets different data rows flexibly use different loss strategies. Example data format:
```jsonl
# last_round strategy: only compute loss for the last round of the conversation
{"messages": [{"role": "user", "content": "你好"}, {"role": "assistant", "content": "你好!"}, {"role": "user", "content": "1+1等于几?"}, {"role": "assistant", "content": "等于2"}], "loss_scale": "last_round"}
# all strategy: compute loss for all tokens (including the system and user parts)
{"messages": [{"role": "system", "content": "你是数学专家"}, {"role": "user", "content": "1+1等于几?"}, {"role": "assistant", "content": "等于2"}], "loss_scale": "all"}
# combined strategy: only compute loss for the last round and ignore empty think tags
{"messages": [{"role": "user", "content": "你好"}, {"role": "assistant", "content": "<think>\n\n</think>\n\n你好!"}], "loss_scale": "last_round+ignore_empty_think"}
```
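Note that the `#` comment lines above are only illustrative; a real JSONL file contains one JSON object per line and nothing else. As a minimal sketch (the file name and example rows are assumptions, not part of ms-swift's API), rows carrying per-row strategies can be written and read back like this:

```python
import json

# Illustrative rows; the "loss_scale" field carries the per-row strategy.
rows = [
    {"messages": [{"role": "user", "content": "Hello"},
                  {"role": "assistant", "content": "Hi!"}],
     "loss_scale": "last_round"},
    {"messages": [{"role": "system", "content": "You are a math expert"},
                  {"role": "user", "content": "What is 1+1?"},
                  {"role": "assistant", "content": "It equals 2"}],
     "loss_scale": "all"},
]

# Write one JSON object per line (JSONL); ensure_ascii=False keeps
# non-ASCII message content readable.
with open("train.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")

# Read back and inspect the per-row strategies.
with open("train.jsonl", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print([r["loss_scale"] for r in loaded])  # ['last_round', 'all']
```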


#### channel loss
If you want to use channel loss, set `--enable_channel_loss true` and add a "channel" field to the dataset. Channel loss is compatible with techniques such as packing/padding_free/loss_scale.
4 changes: 4 additions & 0 deletions docs/source/Instruction/Agent-support.md
@@ -206,6 +206,10 @@ The loss_scale parameter can be used to adjust the loss weight of the model output portion during training

- All strings matching the regular expression `<think>\\s*</think>\\s*` are assigned a loss_scale of 0, i.e., no loss is computed for them.

3. Row-level setting

You can specify a different loss computation strategy for each data row via the row-level `"loss_scale"` field, which has higher priority than the command-line argument. Supported values include `'default'`, `'last_round'`, `'all'`, and combined strategies such as `'last_round+ignore_empty_think'` and `'default+react'`. See the [Custom Dataset documentation](../Customization/Custom-dataset.md#监督微调).

Testing loss_scale with code:
```python
from swift import get_processor, get_template
1 change: 1 addition & 0 deletions docs/source/Instruction/Command-line-parameters.md
@@ -115,6 +115,7 @@
- 'all': Compute loss for all tokens. (**Default for `swift pt`**)
- 'ignore_empty_think': Ignore loss computation for empty `'<think>\n\n</think>\n\n'` (anything matching the regex `'<think>\\s*</think>\\s*'`).
- 'react', 'hermes', 'qwen': Adjust the loss weight of the `tool_call` part to 2.
- **Row-level setting**: A different loss computation strategy can be specified for each data row via the row-level `"loss_scale"` field, which has higher priority than the command-line argument. See the [Custom Dataset documentation](../Customization/Custom-dataset.md#监督微调).
- sequence_parallel_size: Sequence parallelism size. Default is 1. Currently supports CPT/SFT/DPO/GRPO. Training scripts can be found [here](https://github.com/modelscope/ms-swift/tree/main/examples/train/sequence_parallel).
- template_backend: Template backend to use. Options are 'swift' and 'jinja'; default is 'swift'. If 'jinja' is used, transformers' `apply_chat_template` is applied.
- Note: The jinja template backend only supports inference, not training (the token range for loss computation cannot be determined).
3 changes: 1 addition & 2 deletions docs/source/Megatron-SWIFT/Command-line-parameters.md
@@ -100,7 +100,6 @@
- muon_tp_mode: NS computation mode for tensor-model-parallel weights. Options are 'blockwise', 'duplicated', 'distributed'. Default is 'blockwise'.
- muon_extra_scale_factor: Extra scaling factor for Muon updates. Default is 1.


**Checkpoint parameters**:
- 🔥output_dir: Output directory for checkpoints. Default is None. During training, if this parameter is not set, it defaults to `f'megatron_output/{model_suffix}'`, e.g. `'megatron_output/Qwen2.5-7B-Instruct'`.
- Note: **For multi-node training, make sure the save path on every node points to the same location**; otherwise you will need to manually gather these weights after training.
@@ -293,7 +292,7 @@ Megatron training parameters inherit from Megatron parameters and basic parameters (**shared with ms-swift
## RLHF parameters
In addition to inheriting the training parameters, the following parameters are also supported:
- 🔥rlhf_type: Default is 'dpo'. Currently 'dpo', 'grpo', 'kto', 'rm', and 'gkd' are available.
- loss_scale: Overrides the loss_scale in [basic parameters](../Instruction/Command-line-parameters.md). Default is 'last_round'.
- loss_scale: Overrides the loss_scale in [basic parameters](../Instruction/Command-line-parameters.md). Default is 'last_round'. A different loss computation strategy can be specified for each data row via the row-level `"loss_scale"` field, which has higher priority than the command-line argument. See the [Custom Dataset documentation](../Customization/Custom-dataset.md#监督微调).
- calculate_per_token_loss: Overrides the Megatron parameter. Default is False.


2 changes: 1 addition & 1 deletion docs/source_en/Customization/Architecture.md
@@ -85,7 +85,7 @@ The `get_loss_scale` function returns a Tuple. The first return is a list of dec
```
In the example, we place more emphasis on the words "数学" and "重要" because their loss_scale is 2.0.

Of course, we also need to pay attention to the core logic of the `__call__` method, namely the influence of the loss_scale base strategy (base_strategy) all/default/last_round on loss_scale. For details, refer to the introduction in the [Command-line Parameters Documentation](../Instruction/Command-line-parameters.md). Also, refer to the influence of the 'loss' field in the dataset on loss_scale in the [Custom Dataset Documentation](../Customization/Custom-dataset.md).
Of course, we also need to pay attention to the core logic of the `__call__` method, namely the influence of the loss_scale base strategy (base_strategy) all/default/last_round on loss_scale. For details, refer to the introduction in the [Command-line Parameters Documentation](../Instruction/Command-line-parameters.md). Also, refer to the influence of the 'loss' field and 'loss_scale' field in the dataset on loss_scale in the [Custom Dataset Documentation](../Customization/Custom-dataset.md). The 'loss_scale' field supports specifying different loss computation strategies (e.g., 'last_round', 'all', etc.) for each data row, with higher priority than the command-line argument.

```python
if loss or loss is None and (self.base_strategy == 'all' or
11 changes: 11 additions & 0 deletions docs/source_en/Customization/Custom-dataset.md
@@ -67,6 +67,17 @@ The following outlines the standard dataset format for ms-swift, where the "syst
{"messages": [{"role": "user", "content": "Hello!"}, {"role": "assistant", "content": "Hi, how can I help you?", "loss": false}, {"role": "user", "content": "What is 1+1?"}, {"role": "assistant", "content": "It equals 2", "loss": true}]}
```

- You can also specify different loss computation strategies for each data row using the row-level `"loss_scale"` field. This field has higher priority than the command-line argument `--loss_scale`. Supported values include: `'default'`, `'last_round'`, `'all'`, and combined strategies like `'last_round+ignore_empty_think'`. This allows different data rows to use different loss strategies flexibly. Example data format:

```jsonl
# Using last_round strategy: only compute loss for the last round of conversation
{"messages": [{"role": "user", "content": "Hello!"}, {"role": "assistant", "content": "Hi there!"}, {"role": "user", "content": "What is 1+1?"}, {"role": "assistant", "content": "It equals 2"}], "loss_scale": "last_round"}
# Using all strategy: compute loss for all tokens (including system and user parts)
{"messages": [{"role": "system", "content": "You are a math expert"}, {"role": "user", "content": "What is 1+1?"}, {"role": "assistant", "content": "It equals 2"}], "loss_scale": "all"}
# Using combined strategy: only compute loss for the last round and ignore empty think tags
{"messages": [{"role": "user", "content": "Hello!"}, {"role": "assistant", "content": "<think>\n\n</think>\n\nHi there!"}], "loss_scale": "last_round+ignore_empty_think"}
```
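The documented precedence rule (row-level field wins over `--loss_scale`, with fallback on an invalid row-level value) can be sketched with a small helper. The helper name and the strategy set below are assumptions for illustration, not ms-swift's actual API:

```python
# Sketch of the documented precedence: a row-level "loss_scale" overrides
# the command-line value, and an unknown row-level value falls back to it.
# KNOWN_BASE_STRATEGIES is an assumed set for this sketch.
KNOWN_BASE_STRATEGIES = {"default", "last_round", "all", "ignore_empty_think",
                         "react", "hermes", "qwen"}

def resolve_loss_scale(row: dict, cli_value: str = "default") -> str:
    value = row.get("loss_scale")
    if value is None:
        return cli_value
    # Combined strategies such as 'last_round+ignore_empty_think' are
    # accepted when every '+'-separated part is known.
    if all(part in KNOWN_BASE_STRATEGIES for part in value.split("+")):
        return value
    return cli_value  # invalid row-level value: keep the global setting

row = {"messages": [], "loss_scale": "last_round+ignore_empty_think"}
print(resolve_loss_scale(row, cli_value="default"))  # last_round+ignore_empty_think
print(resolve_loss_scale({"messages": []}, cli_value="all"))  # all
```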

#### Channel Loss
If you want to use channel loss, you need to set `--enable_channel_loss true` and add a "channel" field to your dataset. Channel loss is compatible with techniques such as packing, padding-free, and loss scaling.

5 changes: 5 additions & 0 deletions docs/source_en/Instruction/Agent-support.md
@@ -221,6 +221,11 @@ The specific effect of this setting is:
- Any string matching the regular expression `<think>\\s*</think>\\s*` is assigned a `loss_scale` of 0, meaning no loss is computed for these segments.


3. Row-level Setting

You can also specify different loss computation strategies for each data row using the row-level `"loss_scale"` field, which has higher priority than the command-line argument. Supported values include: `'default'`, `'last_round'`, `'all'`, and combined strategies like `'last_round+ignore_empty_think'`, `'default+react'`, etc. Refer to [Custom Dataset documentation](../Customization/Custom-dataset.md#supervised-fine-tuning) for details.


Testing loss_scale using code:
```python
from swift import get_processor, get_template
1 change: 1 addition & 0 deletions docs/source_en/Instruction/Command-line-parameters.md
@@ -115,6 +115,7 @@ The command-line arguments will be introduced in four categories: basic argument
- 'all': Calculate loss for all tokens. (**Default value for `swift pt`**)
- 'ignore_empty_think': Ignore loss computation for empty `'<think>\n\n</think>\n\n'` (as long as it matches the regex `'<think>\\s*</think>\\s*'`).
- 'react', 'hermes', 'qwen': Adjust the loss weight of the `tool_call` part to 2.
- **Row-level setting**: You can also specify different loss computation strategies for each data row using the row-level `"loss_scale"` field, which has higher priority than the command-line argument. Refer to [Custom Dataset documentation](../Customization/Custom-dataset.md#supervised-fine-tuning) for details.
- sequence_parallel_size: Size for sequence parallelism. Default is 1. Currently supported in CPT/SFT/DPO/GRPO. Training scripts can be found [here](https://github.com/modelscope/ms-swift/tree/main/examples/train/sequence_parallel).
- template_backend: Backend for template processing. Options are `'swift'` or `'jinja'`. Default is `'swift'`. If `'jinja'` is used, `apply_chat_template` from Transformers will be applied.
- Note: The `'jinja'` backend only supports inference and does not support training (as it cannot determine the token ranges for loss computation).
2 changes: 1 addition & 1 deletion docs/source_en/Megatron-SWIFT/Command-line-parameters.md
@@ -315,7 +315,7 @@ Megatron training parameters are inherited from Megatron parameters and basic pa
In addition to inheriting the training parameters, the following parameters are also supported:

- 🔥rlhf_type: Default is 'dpo'. Currently, 'dpo', 'grpo', 'kto', 'rm', and 'gkd' are available.
- loss_scale: Overrides the `loss_scale` in [basic parameters](../Instruction/Command-line-parameters.md). Default is 'last_round'.
- loss_scale: Overrides the `loss_scale` in [basic parameters](../Instruction/Command-line-parameters.md). Default is 'last_round'. You can also specify different loss computation strategies for each data row using the row-level `"loss_scale"` field, which has higher priority than the command-line argument. Refer to [Custom Dataset documentation](../Customization/Custom-dataset.md#supervised-fine-tuning) for details.
- calculate_per_token_loss: Overrides the Megatron parameter. Default is False.


1 change: 1 addition & 0 deletions swift/dataset/preprocessor/core.py
@@ -31,6 +31,7 @@ class RowPreprocessor:
'channel',
'margin',
'teacher_prompt',
'loss_scale',
]

def __init__(self,
23 changes: 23 additions & 0 deletions swift/loss_scale/base.py
@@ -4,7 +4,11 @@
from typing import List, Literal, Optional, Tuple

from swift.template import ContextType, Messages, get_last_user_round
from swift.utils import get_logger
from .utils import calculate_loss_scale
from .mapping import get_loss_scale

logger = get_logger()

ALL_BASE_STRATEGY = ['default', 'last_round', 'all']

@@ -77,6 +81,10 @@ def __call__(self, context_list: List[str], context_types: List[ContextType], me
context_types: List of context types corresponding to each context, indicating
whether it's a system prompt, user query, assistant response, etc.
messages: Complete message list containing the conversation history.
**kwargs: Additional keyword arguments. Supports 'loss_scale' to override
the global loss scale strategy for this specific data row. The value
can be a string like 'default', 'last_round', 'all', or combined
strategies like 'last_round+ignore_empty_think'.

Returns:
A tuple containing:
@@ -85,6 +93,21 @@
- List[float]: Loss scale values corresponding one-to-one with the
returned context list
"""
# Check for per-row loss_scale override in kwargs (from data row)
> **Review comment (Collaborator):** Is it possible to use different loss_scale in the template?
>
>     self.loss_scale: LossScale = get_loss_scale(loss_scale)
>
> **Reply (Contributor, author):** Yes, the loss_scale from the data can be passed in:
>
>     res_context_list, loss_scale_list = self.loss_scale(res_context_list, res_context_types, inputs.messages,
>                                                         **inputs.extra_kwargs)

row_loss_scale = kwargs.get('loss_scale')
if row_loss_scale is not None:
    # Per-row loss_scale takes priority over the global setting
    try:
        loss_scale_handler = get_loss_scale(row_loss_scale)
        # Call the handler without 'loss_scale' in kwargs to avoid infinite recursion
        kwargs_without_loss_scale = {k: v for k, v in kwargs.items() if k != 'loss_scale'}
        return loss_scale_handler(context_list, context_types, messages, **kwargs_without_loss_scale)
    except (KeyError, ValueError) as e:
        # An invalid loss_scale in the data row falls back to the global setting
        logger.warning(f"Invalid loss_scale '{row_loss_scale}' specified in data row, "
                       f"falling back to global setting '{self.base_strategy}'. Error: {e}")

res_context_list = []
res_loss_scale = []
i = 0