
Commit f958295

Merge branch 'main' into release/3.5
2 parents: c7fd1bd + 9c9e960

File tree

31 files changed: +461, -115 lines changed


docs/source/Instruction/GRPO.md

Lines changed: 65 additions & 2 deletions
@@ -221,7 +221,7 @@ A conversation between User and Assistant. The user asks a question, and the Ass
  - vllm_limit_mm_per_prompt: vLLM passthrough parameter, default is None.
  - vllm_enable_prefix_caching: vLLM passthrough parameter, default is True.
  - sleep_level: release vLLM GPU memory during training. Options are [0, 1]; default is 0 (no release).
- - move_model_batches: When moving model parameters to fast inference frameworks such as vLLM, how many batches to split the layers into. Default is None, meaning the model is not split; otherwise it is split into move_model_batches + 1 (non-layer parameters) + 1 (multimodal parameters) batches.
+ - move_model_batches: When moving model parameters to fast inference frameworks such as vLLM, how many batches to split the layers into. Default is None, meaning the model is not split; otherwise it is split into move_model_batches + 1 (non-layer parameters) + 1 (multimodal parameters) batches. Note: this parameter is only meaningful for LoRA (PEFT) training.
  - offload_optimizer: whether to offload optimizer parameters during vLLM inference, default is False.
  - offload_model: whether to offload the model itself during vLLM inference, default is False.
  - gc_collect_after_offload: whether to run garbage collection (Python GC and GPU GC) after offloading, default is False.
@@ -287,6 +287,55 @@ swift rlhf \
  1. In the GRPOTrainer, reward_model instances are appended in order to reward_funcs. Therefore, the order of reward_weights corresponds to [reward_funcs, reward_model].
  2. reward_model_plugin defaults to default, i.e., the ORM processing logic is used.
  
+ ## Multi-task Training
+ We can add a column to the dataset that identifies the task type and branch on it inside the reward function / reward model plugin, thereby enabling multi-task training. Suppose the dataset contains math and coding tasks, for example:
+ 
+ ```
+ {"query": "Solve the equation x + 2 = 5", "solution": "3", "task": "math"},
+ {"query": "Write a function to calculate the Fibonacci sequence", "solution": "xxx", "task": "code"},
+ {"query": "What is the integral of x^2?", "solution": "xxx", "task": "math"},
+ {"query": "Implement a sorting algorithm in Python", "solution": "xxx", "task": "code"},
+ ```
+ 
+ Below is an example of reward functions for the different tasks:
+ 
+ ```python
+ from swift.plugin import ORM, orms
+ import random
+ 
+ 
+ # Math-specific reward function
+ class MathRandomReward(ORM):
+ 
+     def __call__(self, completions, task, **kwargs):
+         rewards = []
+         for completion, t in zip(completions, task):
+             if t == "math":
+                 # implement the math accuracy logic here (placeholder: random reward)
+                 reward = random.random()
+                 rewards.append(reward)
+             else:
+                 # return None for non-math tasks
+                 rewards.append(None)
+         return rewards
+ 
+ 
+ # Coding-specific reward function
+ class CodeRandomReward(ORM):
+ 
+     def __call__(self, completions, task, **kwargs):
+         rewards = []
+         for completion, t in zip(completions, task):
+             if t == "code":
+                 # implement the code accuracy logic here (placeholder: random reward)
+                 reward = random.random()
+                 rewards.append(reward)
+             else:
+                 # return None for non-coding tasks
+                 rewards.append(None)
+         return rewards
+ 
+ 
+ orms['math_reward'] = MathRandomReward
+ orms['code_reward'] = CodeRandomReward
+ ```
+ Data that does not belong to the current task is handled by returning None, so each reward is computed only over the data within its own task.
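As a quick sanity check, the two reward functions above can also be called directly (during GRPO training the trainer invokes them for you). The sketch below assumes that ORM subclasses can be instantiated without arguments, and the completions are made up:

```python
# Illustrative sanity check of the task-specific rewards defined above.
completions = ["x = 3", "def fib(n): ...", "x^3/3 + C", "def bubble_sort(a): ..."]
tasks = ["math", "code", "math", "code"]

math_rewards = MathRandomReward()(completions=completions, task=tasks)
code_rewards = CodeRandomReward()(completions=completions, task=tasks)

print(math_rewards)  # e.g. [0.42, None, 0.87, None]  -> None for non-math samples
print(code_rewards)  # e.g. [None, 0.13, None, 0.55]  -> None for non-code samples
```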
  
  ## DAPO
  [Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO)](https://arxiv.org/abs/2503.14476) adds several tricks on top of GRPO, namely:
@@ -363,7 +412,21 @@ num_generations = 64
  
  **5. Why is clip_ratio always 1?**
  
- Under num_iterations = 1 and async_generate = False, training is on-policy RL, and old_policy equals policy at this point.
+ The core purpose of the clip mechanism is to limit the magnitude of policy updates and prevent a single overly large update from collapsing policy performance (i.e., a sharp drop in performance after the update).
+ The clip operation is defined as:
+ $$
+ L_{\text{CLIP}}(\theta) = \mathbb{E}_{t} \left[ \min\left(r_{t}(\theta) \hat{A}_{t}, \text{clip}(r_{t}(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_{t} \right) \right]
+ $$
+ 
+ where $r_{t}(\theta) = \frac{\pi_{\theta}(a_{t} \mid s_{t})}{\pi_{\text{old}}(a_{t} \mid s_{t})}$ is the importance sampling ratio, which measures the difference between the new and old policies; $\hat{A}_{t}$ is the advantage function, representing the relative benefit of an action; and $\epsilon$ limits how far $r_{t}(\theta)$ may deviate from 1.
+ 
+ In on-policy training, every update uses data generated by the latest policy, so the new and old policies are identical, i.e., $\pi_{\theta} = \pi_{\text{old}}$.
+ 
+ The importance sampling ratio is therefore always 1, and the clip operation does not take effect.
+ 
+ With the following settings, the algorithm becomes off-policy (near-on-policy):
+ 1. num_iterations > 1
+ 2. steps_per_generation > gradient_accumulation_steps
  
  See this [issue](https://github.com/huggingface/open-r1/issues/239#issuecomment-2646297851) for reference.

docs/source/Instruction/命令行参数.md

Lines changed: 3 additions & 1 deletion
@@ -150,6 +150,7 @@
  - 🔥report_to: Default is `tensorboard`. You can also specify `--report_to tensorboard wandb swanlab` or `--report_to all`.
  - logging_first_step: Whether to log the first step, default is True.
  - logging_steps: Logging interval, default is 5.
+ - logging_dir: Path for TensorBoard logs. Default is None, i.e., it is set to `f'{self.output_dir}/runs'`.
  - predict_with_generate: Whether to use generation during validation, default is False.
  - metric_for_best_model: Default is None, i.e., set to 'loss' when `predict_with_generate` is False, otherwise 'rouge-l' (no default is set for PPO training; GRPO training sets it to 'reward').
  - greater_is_better: Default is None, i.e., set to False when `metric_for_best_model` contains 'loss', otherwise True.
@@ -351,6 +352,7 @@ Vera uses the three parameters `target_modules`, `target_regex`, and `modules_to_save`.
  - check_model: Check whether local model files are corrupted or modified and give a prompt, default is True. In offline environments, set this to False.
  - 🔥create_checkpoint_symlink: Create additional checkpoint symlinks to make automated training scripts easier to write. The symlink paths for best_model and last_model are f'{output_dir}/best' and f'{output_dir}/last' respectively.
  - loss_type: Loss type. Default is None, which uses the model's built-in loss function.
+ - channels: The set of channels contained in the dataset. Default is None. Used together with `--loss_type channel_loss`; see [this example](https://github.com/modelscope/ms-swift/blob/main/examples/train/plugins/channel_loss.sh).
  - 🔥packing: Whether to use sequence packing to improve computational efficiency, default is False. Currently supports `swift pt/sft`.
    - Note: when using packing, combine it with `--attn_impl flash_attn` and make sure "transformers>=4.44"; see [this PR](https://github.com/huggingface/transformers/pull/31629) for details.
    - Supported multimodal models: https://github.com/modelscope/ms-swift/blob/main/examples/train/packing/qwen2_5_vl.sh
@@ -442,7 +444,7 @@ Reward model parameters are used in PPO and GRPO.
  - vllm_limit_mm_per_prompt: vLLM passthrough parameter, default is None.
  - vllm_enable_prefix_caching: vLLM passthrough parameter, default is True.
  - sleep_level: release vLLM GPU memory during training. Options are [0, 1]; default is 0 (no release).
- - move_model_batches: When moving model parameters to fast inference frameworks such as vLLM, how many batches to split the layers into. Default is None, meaning the model is not split; otherwise it is split into move_model_batches + 1 (non-layer parameters) + 1 (multimodal parameters) batches.
+ - move_model_batches: When moving model parameters to fast inference frameworks such as vLLM, how many batches to split the layers into. Default is None, meaning the model is not split; otherwise it is split into move_model_batches + 1 (non-layer parameters) + 1 (multimodal parameters) batches. Note: this parameter is only meaningful for LoRA (PEFT) training.
  - offload_optimizer: whether to offload optimizer parameters during vLLM inference, default is False.
  - offload_model: whether to offload the model itself during vLLM inference, default is False.
  - gc_collect_after_offload: whether to run garbage collection (Python GC and GPU GC) after offloading, default is False.

docs/source/Instruction/支持的模型和数据集.md

Lines changed: 3 additions & 0 deletions
@@ -213,6 +213,9 @@
  |[Qwen/Qwen3-235B-A22B-FP8](https://modelscope.cn/models/Qwen/Qwen3-235B-A22B-FP8)|qwen3_moe|qwen3|transformers>=4.51|✘|-|[Qwen/Qwen3-235B-A22B-FP8](https://huggingface.co/Qwen/Qwen3-235B-A22B-FP8)|
  |[swift/Qwen3-30B-A3B-AWQ](https://modelscope.cn/models/swift/Qwen3-30B-A3B-AWQ)|qwen3_moe|qwen3|transformers>=4.51|✘|-|[cognitivecomputations/Qwen3-30B-A3B-AWQ](https://huggingface.co/cognitivecomputations/Qwen3-30B-A3B-AWQ)|
  |[swift/Qwen3-235B-A22B-AWQ](https://modelscope.cn/models/swift/Qwen3-235B-A22B-AWQ)|qwen3_moe|qwen3|transformers>=4.51|✘|-|[cognitivecomputations/Qwen3-235B-A22B-AWQ](https://huggingface.co/cognitivecomputations/Qwen3-235B-A22B-AWQ)|
+ |[Qwen/Qwen3-Embedding-0.6B](https://modelscope.cn/models/Qwen/Qwen3-Embedding-0.6B)|qwen3_emb|qwen3_emb|-|✘|-|[Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B)|
+ |[Qwen/Qwen3-Embedding-4B](https://modelscope.cn/models/Qwen/Qwen3-Embedding-4B)|qwen3_emb|qwen3_emb|-|✘|-|[Qwen/Qwen3-Embedding-4B](https://huggingface.co/Qwen/Qwen3-Embedding-4B)|
+ |[Qwen/Qwen3-Embedding-8B](https://modelscope.cn/models/Qwen/Qwen3-Embedding-8B)|qwen3_emb|qwen3_emb|-|✘|-|[Qwen/Qwen3-Embedding-8B](https://huggingface.co/Qwen/Qwen3-Embedding-8B)|
  |[iic/gte_Qwen2-1.5B-instruct](https://modelscope.cn/models/iic/gte_Qwen2-1.5B-instruct)|qwen2_gte|dummy|-|✘|-|[Alibaba-NLP/gte-Qwen2-1.5B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct)|
  |[iic/gte_Qwen2-7B-instruct](https://modelscope.cn/models/iic/gte_Qwen2-7B-instruct)|qwen2_gte|dummy|-|✘|-|[Alibaba-NLP/gte-Qwen2-7B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct)|
  |[codefuse-ai/CodeFuse-QWen-14B](https://modelscope.cn/models/codefuse-ai/CodeFuse-QWen-14B)|codefuse_qwen|codefuse|-|✘|coding|[codefuse-ai/CodeFuse-QWen-14B](https://huggingface.co/codefuse-ai/CodeFuse-QWen-14B)|

docs/source_en/Instruction/Command-line-parameters.md

Lines changed: 3 additions & 1 deletion
@@ -153,6 +153,7 @@ This parameter list inherits from transformers `Seq2SeqTrainingArguments`, with
  - 🔥report_to: Default value is `tensorboard`. You can also specify `--report_to tensorboard wandb swanlab` or `--report_to all`.
  - logging_first_step: Whether to log the first step, defaults to True.
  - logging_steps: Interval for logging, defaults to 5.
+ - logging_dir: The path for TensorBoard logs. Defaults to None, which means it is set to `f'{self.output_dir}/runs'`.
  - predict_with_generate: Whether to use generative method during validation, default is False.
  - metric_for_best_model: Default is None, which means that when predict_with_generate is set to False, it is set to 'loss'; otherwise, it is set to 'rouge-l' (during PPO training, the default value is not set; in GRPO training, it is set to 'reward').
  - greater_is_better: Defaults to None, which sets it to False when `metric_for_best_model` contains 'loss', otherwise sets to True.
@@ -360,6 +361,7 @@ Training arguments include the [base arguments](#base-arguments), [Seq2SeqTraine
  - check_model: Check local model files for corruption or modification and give a prompt, default is True. If in an offline environment, please set to False.
  - 🔥create_checkpoint_symlink: Creates additional checkpoint symlinks to facilitate writing automated training scripts. The symlink paths for `best_model` and `last_model` are `f'{output_dir}/best'` and `f'{output_dir}/last'` respectively.
  - loss_type: Type of loss. Defaults to None, which uses the model's built-in loss function.
+ - channels: Set of channels included in the dataset. Defaults to None. Used in conjunction with `--loss_type channel_loss`. Refer to [this example](https://github.com/modelscope/ms-swift/blob/main/examples/train/plugins/channel_loss.sh) for more details (a rough sketch also follows this list).
  - 🔥packing: Whether to use sequence packing to improve computational efficiency. The default value is False. Currently supports `swift pt/sft`.
    - Note: When using packing, please combine it with `--attn_impl flash_attn` and ensure "transformers>=4.44". For details, see [this PR](https://github.com/huggingface/transformers/pull/31629).
    - Supported multimodal models reference: https://github.com/modelscope/ms-swift/blob/main/examples/train/packing/qwen2_5_vl.sh
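A rough sketch of what a channel-tagged dataset might look like; the `channel` field name and the flag combination shown are assumptions and should be checked against the channel_loss.sh example linked above:

```python
import json

# Hypothetical channel-tagged samples in the standard messages format; the
# "channel" key is an assumption based on the channel_loss example script.
rows = [
    {"messages": [{"role": "user", "content": "hi"},
                  {"role": "assistant", "content": "hello"}], "channel": "chat"},
    {"messages": [{"role": "user", "content": "1 + 1 = ?"},
                  {"role": "assistant", "content": "2"}], "channel": "math"},
]
with open("channel_demo.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")

# Training would then combine, e.g.:  --loss_type channel_loss --channels chat math
```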
@@ -454,7 +456,7 @@ The meanings of the following parameters can be referenced [here](https://huggin
  - vllm_limit_mm_per_prompt: vLLM passthrough parameter, default is None.
  - vllm_tensor_parallel_size: the tensor parallel size of vLLM engine, default is 1.
  - sleep_level: make vLLM sleep while the model is training. Options are 0 or 1, default is 0 (no sleep).
- - move_model_batches: When moving model parameters to fast inference frameworks such as vLLM/LMDeploy, determines how many batches to divide the layers into. The default is `None`, which means the entire model is not split. Otherwise, the model is split into `move_model_batches + 1` (non-layer parameters) + `1` (multi-modal component parameters) batches.
+ - move_model_batches: When moving model parameters to fast inference frameworks such as vLLM/LMDeploy, determines how many batches to divide the layers into. The default is `None`, which means the entire model is not split. Otherwise, the model is split into `move_model_batches + 1` (non-layer parameters) + `1` (multi-modal component parameters) batches. This parameter is only meaningful for LoRA (PEFT).
  - offload_optimizer: Whether to offload optimizer parameters during inference with vLLM/LMDeploy. The default is `False`.
  - offload_model: Whether to offload the model itself during inference with vLLM/LMDeploy. The default is `False`.
  - gc_collect_after_offload: Whether to perform garbage collection (both Python GC and GPU GC) after offloading. The default is `False`.

docs/source_en/Instruction/GRPO.md

Lines changed: 70 additions & 2 deletions
@@ -231,7 +231,7 @@ Arguments
  - vllm_limit_mm_per_prompt: vLLM passthrough parameter, default is None.
  - vllm_tensor_parallel_size: the tensor parallel size of vLLM engine, default is 1.
  - sleep_level: make vLLM sleep while the model is training. Options are 0 or 1, default is 0 (no sleep).
- - move_model_batches: When moving model parameters to fast inference frameworks such as vLLM, determines how many batches to divide the layers into. The default is `None`, which means the entire model is not split. Otherwise, the model is split into `move_model_batches + 1` (non-layer parameters) + `1` (multi-modal component parameters) batches.
+ - move_model_batches: When moving model parameters to fast inference frameworks such as vLLM, determines how many batches to divide the layers into. The default is `None`, which means the entire model is not split. Otherwise, the model is split into `move_model_batches + 1` (non-layer parameters) + `1` (multi-modal component parameters) batches. This parameter is only meaningful for LoRA (PEFT) (see the sketch after this parameter list).
  - offload_optimizer: Whether to offload optimizer parameters during inference with vLLM. The default is `False`.
  - offload_model: Whether to offload the model itself during inference with vLLM. The default is `False`.
  - gc_collect_after_offload: Whether to perform garbage collection (both Python GC and GPU GC) after offloading. The default is `False`.
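For intuition about how many parameter groups move_model_batches produces, here is an illustrative sketch; the layer and parameter names are hypothetical and this is not ms-swift's actual implementation:

```python
# Hypothetical parameter names; only the grouping arithmetic matters here.
layer_params = [f"model.layers.{i}" for i in range(32)]
non_layer_params = ["model.embed_tokens", "lm_head"]
multimodal_params = ["visual.patch_embed"]

move_model_batches = 4
per_batch = (len(layer_params) + move_model_batches - 1) // move_model_batches
layer_batches = [layer_params[i:i + per_batch]
                 for i in range(0, len(layer_params), per_batch)]

# Layers are synced in `move_model_batches` chunks, plus one chunk for
# non-layer parameters and one for multimodal parameters.
batches = layer_batches + [non_layer_params, multimodal_params]
print(len(batches))  # move_model_batches + 1 + 1 = 6
```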
@@ -301,6 +301,60 @@ Notes:
  1. In the GRPOTrainer, reward_model instances are appended sequentially to reward_funcs. Therefore, the order of reward_weights corresponds to [reward_funcs, reward_model].
  2. The default value for reward_model_plugin is default, which uses the ORM processing logic.
  
+ ## Multi-task training
+ 
+ We can add a column to the dataset to identify the task type and make judgments based on the task type in the reward function/reward model plugin, thereby enabling multi-task training. Suppose the dataset contains math and programming tasks, such as:
+ ```
+ {"query": "Solve the equation x + 2 = 5", "solution": "3", "task": "math"},
+ {"query": "Write a function to calculate the Fibonacci sequence", "solution": "xxx", "task": "code"},
+ {"query": "What is the integral of x^2?", "solution": "xxx", "task": "math"},
+ {"query": "Implement a sorting algorithm in Python", "solution": "xxx", "task": "code"},
+ ```
+ 
+ Below are examples of reward functions for different tasks:
+ 
+ ```python
+ from swift.plugin import ORM, orms
+ import random
+ 
+ 
+ # Math-specific reward function
+ class MathRandomReward(ORM):
+ 
+     def __call__(self, completions, task, **kwargs):
+         rewards = []
+         for completion, t in zip(completions, task):
+             if t == "math":
+                 # implement the math accuracy logic here (placeholder: random reward)
+                 reward = random.random()
+                 rewards.append(reward)
+             else:
+                 # return None for non-math tasks
+                 rewards.append(None)
+         return rewards
+ 
+ 
+ # Coding-specific reward function
+ class CodeRandomReward(ORM):
+ 
+     def __call__(self, completions, task, **kwargs):
+         rewards = []
+         for completion, t in zip(completions, task):
+             if t == "code":
+                 # implement the code accuracy logic here (placeholder: random reward)
+                 reward = random.random()
+                 rewards.append(reward)
+             else:
+                 # return None for non-coding tasks
+                 rewards.append(None)
+         return rewards
+ 
+ 
+ orms['math_reward'] = MathRandomReward
+ orms['code_reward'] = CodeRandomReward
+ ```
+ 
+ Data that does not belong to the current task is handled by returning None, ensuring that the reward calculation only applies to data within each task.
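For intuition, here is a simplified sketch of how returning None keeps a task-specific reward out of the aggregate for other tasks' samples: missing entries are treated as NaN and skipped when the weighted rewards are summed. This is an illustration under that assumption, not the actual GRPOTrainer aggregation code:

```python
import numpy as np

# Per-function rewards for four samples: math_reward and code_reward each
# return None for samples outside their own task.
reward_outputs = [
    [0.8, None, 0.3, None],   # math_reward
    [None, 0.6, None, 0.9],   # code_reward
]
reward_weights = [1.0, 1.0]

# Treat None as NaN and ignore it when summing the weighted rewards.
per_func = np.array([[np.nan if r is None else r for r in row] for row in reward_outputs])
weighted = per_func * np.array(reward_weights)[:, None]
total = np.nansum(weighted, axis=0)
print(total)  # [0.8 0.6 0.3 0.9]
```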
  
  ## DAPO
  Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) introduces several tricks based on GRPO, which are:
@@ -380,7 +434,21 @@ See reference: [issue](https://github.com/modelscope/ms-swift/issues/3912)
  
  **5. Why is clip_ratio always 1?**
  
- When num_iterations = 1 and async_generate = False, it's on-policy RL, and old_policy is equal to policy.
+ The core purpose of the Clip mechanism is to limit the magnitude of policy updates, preventing a single update from being too large and causing a collapse in policy performance (i.e., a sudden drop in performance after the policy is updated). The specific formula for the Clip operation is as follows:
+ 
+ $$
+ L_{\text{CLIP}}(\theta) = \mathbb{E}_{t} \left[ \min\left(r_{t}(\theta) \hat{A}_{t}, \text{clip}(r_{t}(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_{t} \right) \right]
+ $$
+ 
+ Where $r_{t}(\theta) = \frac{\pi_{\theta}(a_{t} \mid s_{t})}{\pi_{\text{old}}(a_{t} \mid s_{t})}$ is the importance sampling ratio, measuring the difference between the new and old policies. $\hat{A}_{t}$ is the advantage function, representing the relative reward of an action. $\epsilon$ is used to limit the deviation range of $r_{t}(\theta)$.
+ 
+ In on-policy training, every update uses data generated by the latest policy, so the new and old policies are identical, i.e., $\pi_{\theta} = \pi_{\text{old}}$.
+ 
+ Therefore, the importance sampling ratio is always equal to 1, and in this case, the clip operation will not take effect.
+ 
+ Under the following parameter settings, the algorithm is off-policy (near-on-policy):
+ 
+ 1. num_iterations > 1
+ 2. steps_per_generation > gradient_accumulation_steps
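As a small numeric check of the clipped objective above (assuming an illustrative clip range of $\epsilon = 0.2$):

```python
epsilon = 0.2

def clipped_objective_term(ratio, advantage):
    unclipped = ratio * advantage
    clipped = min(max(ratio, 1 - epsilon), 1 + epsilon) * advantage
    return min(unclipped, clipped)

# On-policy: the ratio is exactly 1, so clipping changes nothing.
print(clipped_objective_term(1.0, 2.0))   # 2.0
# Off-policy: the ratio drifts from 1 and the clip caps the update.
print(clipped_objective_term(1.5, 2.0))   # 2.4  (capped at (1 + 0.2) * 2.0)
print(clipped_objective_term(0.7, -1.0))  # -0.8 (the more pessimistic clipped value)
```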
  
  See reference: [issue](https://github.com/huggingface/open-r1/issues/239#issuecomment-2646297851)
