|
86 | 86 | - Note: For `swift pt`, this defaults to False, i.e., the generation template is used.
87 | 87 | - 🔥padding_free: Flattens the data in a batch to avoid padding, thereby reducing memory usage and accelerating training. Default is False. Currently supported in CPT/SFT/DPO/GRPO/GKD. A usage sketch follows this parameter list.
88 | 88 | - Note: When using `padding_free`, it should be combined with `--attn_impl flash_attn` and "transformers>=4.44". For details, see [this PR](https://github.com/huggingface/transformers/pull/31629). (Same as packing) |
89 | | - - The supported multimodal models are the same as those supported for multimodal packing. Compared to packing, padding_free does not consume additional time or space. |
| 89 | + - The supported multimodal models are the same as those supported for multimodal packing. Compared to packing, padding_free does not consume additional time or space. Note: Please use "ms-swift>=3.6"; for details, see [this PR](https://github.com/modelscope/ms-swift/pull/4838).
90 | 90 | - Megatron-SWIFT uses `padding_free` by default, i.e., `qkv_format='thd'`, and no additional configuration is required. |
91 | 91 | - padding_side: Padding side when `batch_size>=2` during training. Options are 'left' and 'right', with 'right' as the default. (For inference with batch_size>=2, only left padding is applied.) |
92 | 92 | - Note: PPO and GKD are set to 'left' by default. |
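
A minimal usage sketch for the padding_free-related flags above. The model and dataset IDs below are illustrative placeholders, not taken from this document; `--padding_free`, `--attn_impl`, and `--padding_side` are the parameters described above.

```bash
# Hedged sketch: SFT with padding_free enabled (assumes flash-attn is installed,
# plus "transformers>=4.44" and "ms-swift>=3.6" as noted above).
# The model and dataset IDs are illustrative placeholders.
swift sft \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset AI-ModelScope/alpaca-gpt4-data-en \
    --padding_free true \
    --attn_impl flash_attn \
    --padding_side right
```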
@@ -388,7 +388,7 @@ Training arguments include the [base arguments](#base-arguments), [Seq2SeqTraine |
388 | 388 | - channels: Set of channels included in the dataset. Defaults to None. Used in conjunction with `--loss_type channel_loss`. Refer to [this example](https://github.com/modelscope/ms-swift/blob/main/examples/train/plugins/channel_loss.sh) for more details. |
389 | 389 | - 🔥packing: Whether to use sequence packing to improve computational efficiency. The default value is False. Currently supports `swift pt/sft`. A usage sketch follows this parameter list.
390 | 390 | - Note: When using packing, please combine it with `--attn_impl flash_attn` and ensure "transformers>=4.44". For details, see [this PR](https://github.com/huggingface/transformers/pull/31629). |
391 | | - - Supported multimodal models reference: https://github.com/modelscope/ms-swift/blob/main/examples/train/packing/qwen2_5_vl.sh |
| 391 | + - Supported multimodal models reference: https://github.com/modelscope/ms-swift/blob/main/examples/train/packing/qwen2_5_vl.sh. Note: Please use "ms-swift>=3.6"; for details, see [this PR](https://github.com/modelscope/ms-swift/pull/4838).
392 | 392 | - packing_cache: Specifies the directory for packing cache. The default value is `None`, which means the cache will be stored in the path defined by the environment variable `$MODELSCOPE_CACHE`. When using the packing feature across multiple nodes, ensure that all nodes share the same packing cache directory. You can achieve this by setting the `MODELSCOPE_CACHE` environment variable or by adding the `--packing_cache <shared_path>` argument in the command line. |
393 | 393 | - 🔥lazy_tokenize: Whether to use lazy tokenization. If set to False, all dataset samples are tokenized before training (for multimodal models, this includes reading images from disk). This parameter defaults to False for LLM training, and True for MLLM training, to save memory. |
394 | 394 | - use_logits_to_keep: Pass `logits_to_keep` in the `forward` method based on labels to reduce the computation and storage of unnecessary logits, thereby reducing memory usage and accelerating training. The default is `None`, which enables automatic selection. |
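
A minimal usage sketch for sequence packing with a shared cache directory. The model, dataset, and shared path are illustrative placeholders; `--packing`, `--attn_impl`, and `--packing_cache` are the parameters described above.

```bash
# Hedged sketch: multimodal SFT with sequence packing (assumes flash-attn,
# "transformers>=4.44", and "ms-swift>=3.6" as noted above).
# The model, dataset, and shared cache path are illustrative placeholders.
swift sft \
    --model Qwen/Qwen2.5-VL-7B-Instruct \
    --dataset AI-ModelScope/LaTeX_OCR \
    --packing true \
    --attn_impl flash_attn \
    --packing_cache /shared/packing_cache
```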
|