Changes from 3 commits
2 changes: 1 addition & 1 deletion docs/source/Customization/Architecture.md
@@ -84,7 +84,7 @@ class CustomLossScale(LossScale):
In the example, we put more weight on the two words "数学" (math) and "重要" (important), since their loss_scale is 2.0.


Of course, we also need to pay attention to the core logic of the `__call__` method, namely how the loss_scale base strategy (base_strategy) all/default/last_round affects loss_scale; see the introduction in the [Command-line parameters documentation](../Instruction/Command-line-parameters.md) for details. Also see the [Custom dataset documentation](../Customization/Custom-dataset.md) for how the 'loss' field in the dataset affects loss_scale.
Of course, we also need to pay attention to the core logic of the `__call__` method, namely how the loss_scale base strategy (base_strategy) all/default/last_round affects loss_scale; see the introduction in the [Command-line parameters documentation](../Instruction/Command-line-parameters.md) for details. Also see the [Custom dataset documentation](../Customization/Custom-dataset.md) for how the 'loss' and 'loss_scale' fields in the dataset affect loss_scale. The 'loss_scale' field supports specifying a different loss computation strategy (e.g. 'last_round', 'all') for each data row, taking priority over the command-line argument.
```python
if loss or loss is None and (self.base_strategy == 'all' or
(self.base_strategy == 'default' and is_assistant) or
```
10 changes: 10 additions & 0 deletions docs/source/Customization/Custom-dataset.md
@@ -65,6 +65,16 @@ alpaca format:
{"messages": [{"role": "user", "content": "你好"}, {"role": "assistant", "content": "你好,有什么可以帮助你的吗?", "loss": false}, {"role": "user", "content": "1+1等于几?"}, {"role": "assistant", "content": "等于2", "loss": true}]}
```

- The row-level `"loss_scale"` field can be used to specify a different loss computation strategy for each data row; this field takes priority over the command-line argument `--loss_scale`. Supported values include `'default'`, `'last_round'`, `'all'`, as well as combined strategies such as `'last_round+ignore_empty_think'`. This lets different rows flexibly use different loss strategies. Example data format:
```jsonl
# Use the last_round strategy: only compute loss on the final round of the conversation
{"messages": [{"role": "user", "content": "Hello!"}, {"role": "assistant", "content": "Hi there!"}, {"role": "user", "content": "What is 1+1?"}, {"role": "assistant", "content": "It equals 2"}], "loss_scale": "last_round"}
# Use the all strategy: compute loss on all tokens (including the system and user parts)
{"messages": [{"role": "system", "content": "You are a math expert"}, {"role": "user", "content": "What is 1+1?"}, {"role": "assistant", "content": "It equals 2"}], "loss_scale": "all"}
# Use a combined strategy: only the final round, ignoring empty think tags
{"messages": [{"role": "user", "content": "Hello!"}, {"role": "assistant", "content": "<think>\n\n</think>\n\nHi there!"}], "loss_scale": "last_round+ignore_empty_think"}
```


#### channel loss
If you want to use channel loss, set `--enable_channel_loss true` and add a "channel" field to the dataset. Channel loss is compatible with techniques such as packing/padding_free/loss_scale.
4 changes: 4 additions & 0 deletions docs/source/Instruction/Agent-support.md
@@ -206,6 +206,10 @@ The loss_scale parameter can be used to adjust the loss weight of the model output part during training

- All strings matching the regular expression `<think>\\s*</think>\\s*` get a loss_scale of 0, i.e. no loss is computed for them.

3. Row-level setting

The row-level `"loss_scale"` field can be used to specify a different loss computation strategy for each data row; it takes priority over the command-line argument. Supported values include `'default'`, `'last_round'`, `'all'`, and combined strategies such as `'last_round+ignore_empty_think'` and `'default+react'`. See the [Custom dataset documentation](../Customization/Custom-dataset.md#监督微调).
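For instance, a single row opting into the combined `'default+react'` strategy might look like the following (an illustrative row; the tool-call content and tool name are invented for this example):

```jsonl
{"messages": [{"role": "user", "content": "What's the weather in Beijing?"}, {"role": "assistant", "content": "Thought: I need to call the weather tool.\nAction: get_weather\nAction Input: {\"city\": \"Beijing\"}"}], "loss_scale": "default+react"}
```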

Test loss_scale with code:
```python
from swift import get_processor, get_template
```
1 change: 1 addition & 0 deletions docs/source/Instruction/Command-line-parameters.md
@@ -115,6 +115,7 @@
- 'all': compute the loss on all tokens. (**default for `swift pt`**)
- 'ignore_empty_think': ignore loss computation for empty `'<think>\n\n</think>\n\n'` (anything matching the regex `'<think>\\s*</think>\\s*'`).
- 'react', 'hermes', 'qwen': adjust the loss weight of the `tool_call` part to 2.
- **Row-level setting**: the row-level `"loss_scale"` field can specify a different loss computation strategy for each data row; it takes priority over the command-line argument. See the [Custom dataset documentation](../Customization/Custom-dataset.md#监督微调).
- sequence_parallel_size: sequence parallelism size, default 1. Currently supports CPT/SFT/DPO/GRPO. See the training scripts [here](https://github.com/modelscope/ms-swift/tree/main/examples/train/sequence_parallel).
- template_backend: which template backend to use, either 'swift' or 'jinja'; default 'swift'. With jinja, transformers' `apply_chat_template` is used.
  - Note: the jinja template backend only supports inference, not training (it cannot determine the token range for loss computation).
3 changes: 1 addition & 2 deletions docs/source/Megatron-SWIFT/Command-line-parameters.md
@@ -100,7 +100,6 @@
- muon_tp_mode: NS computation mode for tensor-model-parallel weights. Options: 'blockwise', 'duplicated', 'distributed'. Default 'blockwise'.
- muon_extra_scale_factor: extra scale factor for Muon updates, default 1.


**Checkpoint parameters**:
- 🔥output_dir: checkpoint output directory, default None. During training, if unset it defaults to `f'megatron_output/{model_suffix}'`, e.g. `'megatron_output/Qwen2.5-7B-Instruct'`.
  - Note: **for multi-node training, make sure the save path on every node points to the same location**, otherwise you will need to gather these weights manually after training.
@@ -293,7 +292,7 @@ Megatron training parameters inherit from Megatron parameters and basic parameters (**shared with ms-swift
## RLHF parameters
In addition to inheriting the training parameters, the following are also supported:
- 🔥rlhf_type: default 'dpo'. Currently one of 'dpo', 'grpo', 'kto', 'rm', and 'gkd'.
- loss_scale: overrides the loss_scale in the [basic parameters](../Instruction/Command-line-parameters.md). Default 'last_round'.
- loss_scale: overrides the loss_scale in the [basic parameters](../Instruction/Command-line-parameters.md). Default 'last_round'. The row-level `"loss_scale"` field can specify a different loss computation strategy for each data row; it takes priority over the command-line argument. See the [Custom dataset documentation](../Customization/Custom-dataset.md#监督微调).
- calculate_per_token_loss: overrides the Megatron parameter, default False.


2 changes: 1 addition & 1 deletion docs/source_en/Customization/Architecture.md
@@ -85,7 +85,7 @@ The `get_loss_scale` function returns a Tuple. The first return is a list of dec
```
In the example, we place more emphasis on the words "数学" and "重要" because their loss_scale is 2.0.

Of course, we also need to pay attention to the core logic of the `__call__` method, namely the influence of the loss_scale base strategy (base_strategy) all/default/last_round on loss_scale. For details, refer to the introduction in the [Command-line Parameters Documentation](../Instruction/Command-line-parameters.md). Also, refer to the influence of the 'loss' field in the dataset on loss_scale in the [Custom Dataset Documentation](../Customization/Custom-dataset.md).
Of course, we also need to pay attention to the core logic of the `__call__` method, namely the influence of the loss_scale base strategy (base_strategy) all/default/last_round on loss_scale. For details, refer to the introduction in the [Command-line Parameters Documentation](../Instruction/Command-line-parameters.md). Also, refer to the influence of the 'loss' field and 'loss_scale' field in the dataset on loss_scale in the [Custom Dataset Documentation](../Customization/Custom-dataset.md). The 'loss_scale' field supports specifying different loss computation strategies (e.g., 'last_round', 'all', etc.) for each data row, with higher priority than the command-line argument.

```python
if loss or loss is None and (self.base_strategy == 'all' or
```
11 changes: 11 additions & 0 deletions docs/source_en/Customization/Custom-dataset.md
@@ -67,6 +67,17 @@ The following outlines the standard dataset format for ms-swift, where the "syst
{"messages": [{"role": "user", "content": "Hello!"}, {"role": "assistant", "content": "Hi, how can I help you?", "loss": false}, {"role": "user", "content": "What is 1+1?"}, {"role": "assistant", "content": "It equals 2", "loss": true}]}
```

- You can also specify different loss computation strategies for each data row using the row-level `"loss_scale"` field. This field has higher priority than the command-line argument `--loss_scale`. Supported values include: `'default'`, `'last_round'`, `'all'`, and combined strategies like `'last_round+ignore_empty_think'`. This allows different data rows to use different loss strategies flexibly. Example data format:

```jsonl
# Using last_round strategy: only compute loss for the last round of conversation
{"messages": [{"role": "user", "content": "Hello!"}, {"role": "assistant", "content": "Hi there!"}, {"role": "user", "content": "What is 1+1?"}, {"role": "assistant", "content": "It equals 2"}], "loss_scale": "last_round"}
# Using all strategy: compute loss for all tokens (including system and user parts)
{"messages": [{"role": "system", "content": "You are a math expert"}, {"role": "user", "content": "What is 1+1?"}, {"role": "assistant", "content": "It equals 2"}], "loss_scale": "all"}
# Using combined strategy: only compute loss for the last round and ignore empty think tags
{"messages": [{"role": "user", "content": "Hello!"}, {"role": "assistant", "content": "<think>\n\n</think>\n\nHi there!"}], "loss_scale": "last_round+ignore_empty_think"}
```
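The per-row field travels with each JSONL line like any other top-level key. As a minimal sketch (plain stdlib only, no ms-swift API; `train.jsonl` is a made-up path), the rows above can be written and read back like this:

```python
import json

# Illustrative sketch: build a tiny JSONL file in the format described above,
# where each row carries its own "loss_scale" strategy, then read it back.
rows = [
    {"messages": [{"role": "user", "content": "Hello!"},
                  {"role": "assistant", "content": "Hi there!"}],
     "loss_scale": "last_round"},
    {"messages": [{"role": "system", "content": "You are a math expert"},
                  {"role": "user", "content": "What is 1+1?"},
                  {"role": "assistant", "content": "It equals 2"}],
     "loss_scale": "all"},
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        # One JSON object per line -- the JSONL convention used above.
        f.write(json.dumps(row, ensure_ascii=False) + "\n")

with open("train.jsonl", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print([r["loss_scale"] for r in loaded])  # ['last_round', 'all']
```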

#### Channel Loss
If you want to use channel loss, you need to set `--enable_channel_loss true` and add a "channel" field to your dataset. Channel loss is compatible with techniques such as packing, padding-free, and loss scaling.

5 changes: 5 additions & 0 deletions docs/source_en/Instruction/Agent-support.md
@@ -221,6 +221,11 @@ The specific effect of this setting is:
- Any string matching the regular expression `<think>\\s*</think>\\s*` is assigned a `loss_scale` of 0, meaning no loss is computed for these segments.


3. Row-level Setting

You can also specify different loss computation strategies for each data row using the row-level `"loss_scale"` field, which has higher priority than the command-line argument. Supported values include: `'default'`, `'last_round'`, `'all'`, and combined strategies like `'last_round+ignore_empty_think'`, `'default+react'`, etc. Refer to [Custom Dataset documentation](../Customization/Custom-dataset.md#supervised-fine-tuning) for details.


Testing loss_scale using code:
```python
from swift import get_processor, get_template
```
1 change: 1 addition & 0 deletions docs/source_en/Instruction/Command-line-parameters.md
@@ -115,6 +115,7 @@ The command-line arguments will be introduced in four categories: basic argument
- 'all': Calculate loss for all tokens. (**Default value for `swift pt`**)
- 'ignore_empty_think': Ignore loss computation for empty `'<think>\n\n</think>\n\n'` (as long as it matches the regex `'<think>\\s*</think>\\s*'`).
- 'react', 'hermes', 'qwen': Adjust the loss weight of the `tool_call` part to 2.
- **Row-level setting**: You can also specify different loss computation strategies for each data row using the row-level `"loss_scale"` field, which has higher priority than the command-line argument. Refer to [Custom Dataset documentation](../Customization/Custom-dataset.md#supervised-fine-tuning) for details.
- sequence_parallel_size: Size for sequence parallelism. Default is 1. Currently supported in CPT/SFT/DPO/GRPO. Training scripts can be found [here](https://github.com/modelscope/ms-swift/tree/main/examples/train/sequence_parallel).
- template_backend: Backend for template processing. Options are `'swift'` or `'jinja'`. Default is `'swift'`. If `'jinja'` is used, `apply_chat_template` from Transformers will be applied.
- Note: The `'jinja'` backend only supports inference and does not support training (as it cannot determine the token ranges for loss computation).
2 changes: 1 addition & 1 deletion docs/source_en/Megatron-SWIFT/Command-line-parameters.md
@@ -315,7 +315,7 @@ Megatron training parameters are inherited from Megatron parameters and basic pa
In addition to inheriting the training parameters, the following parameters are also supported:

- 🔥rlhf_type: Default is 'dpo'. Currently, 'dpo', 'grpo', 'kto', 'rm', and 'gkd' are available.
- loss_scale: Overrides the `loss_scale` in [basic parameters](../Instruction/Command-line-parameters.md). Default is 'last_round'.
- loss_scale: Overrides the `loss_scale` in [basic parameters](../Instruction/Command-line-parameters.md). Default is 'last_round'. You can also specify different loss computation strategies for each data row using the row-level `"loss_scale"` field, which has higher priority than the command-line argument. Refer to [Custom Dataset documentation](../Customization/Custom-dataset.md#supervised-fine-tuning) for details.
- calculate_per_token_loss: Overrides the Megatron parameter. Default is False.


1 change: 1 addition & 0 deletions swift/dataset/preprocessor/core.py
@@ -31,6 +31,7 @@ class RowPreprocessor:
'channel',
'margin',
'teacher_prompt',
'loss_scale',
]

def __init__(self,
6 changes: 5 additions & 1 deletion swift/dataset/utils.py
@@ -119,7 +119,11 @@ def __init__(self, template: 'Template'):
self.template = template

def preprocess(self, row: Dict[str, Any]) -> Optional[Dict[str, Any]]:
return self.template.encode(row, return_length=True)
encoded = self.template.encode(row, return_length=True)
# Preserve loss_scale from data row if present (for per-row loss_scale strategy)
if 'loss_scale' in row:
encoded['loss_scale'] = row['loss_scale']
**Collaborator:** Can this be implemented by modifying here?

```python
def from_dict(cls, inputs: Dict[str, Any]) -> 'StdTemplateInputs':
    inputs = deepcopy(inputs)
    kwargs = {}
    for key in ['label', 'channel', 'margin', 'rejected_response']:
        if key in inputs:
            kwargs[key] = inputs[key]
```

**Contributor Author:** This part of the code is not affected; it has been reverted.
return encoded


class AddLengthPreprocessor(EncodePreprocessor):
23 changes: 23 additions & 0 deletions swift/loss_scale/base.py
@@ -4,8 +4,11 @@
from typing import List, Literal, Optional, Tuple

from swift.template import ContextType, Messages, get_last_user_round
from swift.utils import get_logger
from .utils import calculate_loss_scale

logger = get_logger()

ALL_BASE_STRATEGY = ['default', 'last_round', 'all']


@@ -77,6 +80,10 @@ def __call__(self, context_list: List[str], context_types: List[ContextType], me
context_types: List of context types corresponding to each context, indicating
whether it's a system prompt, user query, assistant response, etc.
messages: Complete message list containing the conversation history.
**kwargs: Additional keyword arguments. Supports 'loss_scale' to override
the global loss scale strategy for this specific data row. The value
can be a string like 'default', 'last_round', 'all', or combined
strategies like 'last_round+ignore_empty_think'.

Returns:
A tuple containing:
@@ -85,6 +92,22 @@
- List[float]: Loss scale values corresponding one-to-one with the
returned context list
"""
# Check for per-row loss_scale override in kwargs (from data row)
**Collaborator:** Is it possible to use different loss_scale in the template?

```python
self.loss_scale: LossScale = get_loss_scale(loss_scale)
```

**Contributor Author:** Yes, the loss_scale from the data can be passed in:

```python
res_context_list, loss_scale_list = self.loss_scale(res_context_list, res_context_types, inputs.messages,
                                                    **inputs.extra_kwargs)
```

row_loss_scale = kwargs.get('loss_scale')
if row_loss_scale is not None:
# Use per-row loss_scale with higher priority than global setting
from .mapping import get_loss_scale
**Contributor** (severity: medium): Move the `from .mapping import get_loss_scale` statement to the top of the file (module level). Importing inside a method re-runs the import machinery on every call, which hurts performance and can lead to unexpected behavior. Placing all imports at the top of the file is Python best practice and improves readability and efficiency.

```python
from swift.template import ContextType, Messages, get_last_user_round
from swift.utils import get_logger
from .utils import calculate_loss_scale
from .mapping import get_loss_scale
```

try:
loss_scale_handler = get_loss_scale(row_loss_scale)
# Call the handler without 'loss_scale' in kwargs to avoid infinite recursion
kwargs_without_loss_scale = {k: v for k, v in kwargs.items() if k != 'loss_scale'}
return loss_scale_handler(context_list, context_types, messages, **kwargs_without_loss_scale)
except (KeyError, ValueError) as e:
# If invalid loss_scale specified in data row, fall back to global setting
logger.warning(f"Invalid loss_scale '{row_loss_scale}' specified in data row, "
f"falling back to global setting '{self.base_strategy}'. Error: {e}")
pass

res_context_list = []
res_loss_scale = []
i = 0
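To make the control flow of the per-row override added in `swift/loss_scale/base.py` concrete, here is a standalone sketch. The two-entry registry and the handlers below are hypothetical stand-ins for swift's actual `get_loss_scale` mapping and loss-scale classes, invented purely for illustration:

```python
def last_round_handler(contexts, **kwargs):
    # Stand-in handler: loss only on the final context.
    return [0.0] * (len(contexts) - 1) + [1.0]

def all_handler(contexts, **kwargs):
    # Stand-in handler: loss on every context.
    return [1.0] * len(contexts)

REGISTRY = {"last_round": last_round_handler, "all": all_handler}

def get_loss_scale(name):
    # Mirrors the mapping lookup: unknown names raise KeyError.
    return REGISTRY[name]

def compute_loss_scale(contexts, base_strategy="default", **kwargs):
    row_loss_scale = kwargs.get("loss_scale")
    if row_loss_scale is not None:
        try:
            handler = get_loss_scale(row_loss_scale)
            # Strip 'loss_scale' before delegating, mirroring the recursion
            # guard in the diff above.
            rest = {k: v for k, v in kwargs.items() if k != "loss_scale"}
            return handler(contexts, **rest)
        except KeyError:
            pass  # invalid per-row value: fall back to the global strategy
    # Global fallback (simplified here to: loss on everything).
    return [1.0] * len(contexts)

print(compute_loss_scale(["sys", "user", "resp"], loss_scale="last_round"))
# [0.0, 0.0, 1.0]
print(compute_loss_scale(["sys", "user", "resp"], loss_scale="bogus"))
# [1.0, 1.0, 1.0]  (invalid row value falls back to the global setting)
```

A valid per-row value wins over the global strategy, while an unknown value degrades gracefully instead of failing the whole encode, matching the warning-and-fallback behavior in the diff.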