**docs/source_en/Instruction/Command-line-parameters.md** (3 additions, 1 deletion)
@@ -153,6 +153,7 @@ This parameter list inherits from transformers `Seq2SeqTrainingArguments`, with
- 🔥report_to: Default value is `tensorboard`. You can also specify `--report_to tensorboard wandb swanlab` or `--report_to all`.
- logging_first_step: Whether to log the first step, defaults to True.
- logging_steps: Interval for logging, defaults to 5.
- logging_dir: The path for TensorBoard logs. Defaults to None, which means it is set to `f'{self.output_dir}/runs'`. (See the sketch after this parameter group.)
- predict_with_generate: Whether to use a generative method during validation; default is False.
- metric_for_best_model: Default is None, which means that when predict_with_generate is set to False, it is set to 'loss'; otherwise, it is set to 'rouge-l' (during PPO training, the default value is not set; in GRPO training, it is set to 'reward').
- greater_is_better: Defaults to None; it is set to False when `metric_for_best_model` contains 'loss', and to True otherwise.
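These logging and model-selection arguments can also be set programmatically. The following is a minimal sketch, assuming the Python entry points `TrainArguments` and `sft_main` exported by `swift.llm` (as in ms-swift's Python usage examples); the model id and dataset spec are placeholders, and exact field availability should be checked against the installed version.

```python
# Minimal sketch: configuring the logging/reporting arguments above via the Python API.
# Assumes `swift.llm` exposes `TrainArguments` and `sft_main`; model/dataset are placeholders.
from swift.llm import TrainArguments, sft_main

args = TrainArguments(
    model='Qwen/Qwen2.5-7B-Instruct',                    # placeholder model id
    dataset=['AI-ModelScope/alpaca-gpt4-data-en#500'],   # placeholder dataset spec
    train_type='lora',
    output_dir='output',
    report_to=['tensorboard'],        # or ['tensorboard', 'wandb', 'swanlab']
    logging_first_step=True,
    logging_steps=5,
    logging_dir=None,                 # None -> f'{output_dir}/runs'
    metric_for_best_model=None,       # None -> 'loss' when predict_with_generate=False
)
result = sft_main(args)
```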
@@ -360,6 +361,7 @@ Training arguments include the [base arguments](#base-arguments), [Seq2SeqTraine
- check_model: Check local model files for corruption or modification and give a prompt, default is True. If in an offline environment, please set to False.
- 🔥create_checkpoint_symlink: Creates additional checkpoint symlinks to facilitate writing automated training scripts. The symlink paths for `best_model` and `last_model` are `f'{output_dir}/best'` and `f'{output_dir}/last'` respectively.
- loss_type: Type of loss. Defaults to None, which uses the model's built-in loss function.
- channels: Set of channels included in the dataset. Defaults to None. Used in conjunction with `--loss_type channel_loss`. Refer to [this example](https://github.com/modelscope/ms-swift/blob/main/examples/train/plugins/channel_loss.sh) for more details. (A data-layout sketch follows this parameter group.)
- 🔥packing: Whether to use sequence packing to improve computational efficiency. The default value is False. Currently supports `swift pt/sft`.
  - Note: When using packing, please combine it with `--attn_impl flash_attn` and ensure "transformers>=4.44". For details, see [this PR](https://github.com/huggingface/transformers/pull/31629).
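To make the `channels` / `channel_loss` combination concrete, a possible data layout is sketched below. This is only an illustration: the per-sample field name `channel` and the channel values are assumptions, not confirmed here; the linked channel_loss.sh example is the authoritative reference for the expected format.

```
{"messages": [{"role": "user", "content": "What is 1 + 1?"}, {"role": "assistant", "content": "2"}], "channel": "math"},
{"messages": [{"role": "user", "content": "Write a short poem about autumn."}, {"role": "assistant", "content": "..."}], "channel": "writing"},
```

Training would then pass the set of channel names, for example `--loss_type channel_loss --channels math writing`, so that the loss can be computed and logged per channel.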
@@ -454,7 +456,7 @@ The meanings of the following parameters can be referenced [here](https://huggin
- vllm_limit_mm_per_prompt: vLLM passthrough parameter, default is None.
- vllm_tensor_parallel_size: The tensor parallel size of the vLLM engine, default is 1.
- sleep_level: Put vLLM to sleep while the model is training. Options are 0 or 1; default is 0 (no sleep).
- move_model_batches: When moving model parameters to fast inference frameworks such as vLLM/LMDeploy, determines how many batches to divide the layers into. The default is `None`, which means the entire model is not split. Otherwise, the model is split into `move_model_batches + 1` (non-layer parameters) + `1` (multi-modal component parameters) batches. This parameter is only meaningful for LoRA (PEFT).
- offload_optimizer: Whether to offload optimizer parameters during inference with vLLM/LMDeploy. The default is `False`.
- offload_model: Whether to offload the model itself during inference with vLLM/LMDeploy. The default is `False`.
- gc_collect_after_offload: Whether to perform garbage collection (both Python GC and GPU GC) after offloading. The default is `False`.
**docs/source_en/Instruction/GRPO.md** (70 additions, 2 deletions)
@@ -231,7 +231,7 @@ Arguments
- vllm_limit_mm_per_prompt: vLLM passthrough parameter, default is None.
- vllm_tensor_parallel_size: The tensor parallel size of the vLLM engine, default is 1.
- sleep_level: Put vLLM to sleep while the model is training. Options are 0 or 1; default is 0 (no sleep).
- move_model_batches: When moving model parameters to fast inference frameworks such as vLLM, determines how many batches to divide the layers into. The default is `None`, which means the entire model is not split. Otherwise, the model is split into `move_model_batches + 1` (non-layer parameters) + `1` (multi-modal component parameters) batches. This parameter is only meaningful for LoRA (PEFT).
- offload_optimizer: Whether to offload optimizer parameters during inference with vLLM. The default is `False`.
- offload_model: Whether to offload the model itself during inference with vLLM. The default is `False`.
- gc_collect_after_offload: Whether to perform garbage collection (both Python GC and GPU GC) after offloading. The default is `False`.
@@ -301,6 +301,60 @@ Notes:
1. In the GRPOTrainer, reward_model instances are appended sequentially to reward_funcs. Therefore, the order of reward_weights corresponds to [reward_funcs, reward_model].
2. The default value for reward_model_plugin is default, which uses the ORM processing logic.
## Multi-task training

We can add a column to the dataset to identify the task type and make judgments based on the task type in the reward function/reward model plugin, thereby enabling multi-task training. Suppose the dataset contains math and programming tasks, such as:
```
{"query": "Solve the equation x + 2 = 5", "solution": "3", "task": "math"},
{"query": "Write a function to calculate the Fibonacci sequence", "solution": "xxx", "task": "code"},
{"query": "What is the integral of x^2?", "solution": "xxx", "task": "math"},
{"query": "Implement a sorting algorithm in Python", "solution": "xxx", "task": "code"},
```

Below are examples of reward functions for different tasks:
```python
import random

from swift.plugin import ORM, orms


# Math-specific reward function
class MathRandomReward(ORM):

    def __call__(self, completions, task, **kwargs):
        rewards = []
        for completion, t in zip(completions, task):
            if t == "math":
                # implement the math accuracy logic here; a random value is used as a placeholder
                reward = random.random()
                rewards.append(reward)
            else:
                # Return None for non-math tasks
                rewards.append(None)
        return rewards


# Coding-specific reward function
class CodeRandomReward(ORM):

    def __call__(self, completions, task, **kwargs):
        rewards = []
        for completion, t in zip(completions, task):
            if t == "code":
                # implement the coding accuracy logic here; a random value is used as a placeholder
                reward = random.random()
                rewards.append(reward)
            else:
                # Return None for non-coding tasks
                rewards.append(None)
        return rewards


orms['math_reward'] = MathRandomReward
orms['code_reward'] = CodeRandomReward
```

For data that does not belong to the current task, the reward function returns None, ensuring that each reward is computed only for the samples of its own task.
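As a quick sanity check of this None-handling, the minimal standalone sketch below calls the two reward functions defined earlier on a small mixed batch; each function scores only the samples of its own task and returns None for the rest. In actual training the classes are instantiated by the trainer after being registered in `orms` and selected by name, for example via `--reward_funcs math_reward code_reward`.

```python
# Minimal sketch: invoking the task-specific reward functions defined above directly.
completions = ["x = 3", "def fib(n): ..."]
tasks = ["math", "code"]

math_reward = MathRandomReward()
code_reward = CodeRandomReward()

print(math_reward(completions, tasks))  # e.g. [0.42, None] -> only the math sample is scored
print(code_reward(completions, tasks))  # e.g. [None, 0.87] -> only the code sample is scored
```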
## DAPO
Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) builds on GRPO with several additional tricks, namely:
@@ -380,7 +434,21 @@ See reference: [issue](https://github.com/modelscope/ms-swift/issues/3912)
**5. Why is clip_ratio always 1?**
The core purpose of the Clip mechanism is to limit the magnitude of policy updates, preventing a single update from being too large and causing a collapse in policy performance (i.e., a sudden drop in performance after the policy is updated). The specific formula for the Clip operation is as follows:

$$
L^{\text{CLIP}}(\theta) = \mathbb{E}_{t}\left[\min\left(r_{t}(\theta)\,\hat{A}_{t},\ \text{clip}\left(r_{t}(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_{t}\right)\right]
$$

Where $r_{t}(\theta) = \frac{\pi_{\theta}(a_{t} \mid s_{t})}{\pi_{\text{old}}(a_{t} \mid s_{t})}$ is the importance sampling ratio, measuring the difference between the new and old policies; $\hat{A}_{t}$ is the advantage function, representing the relative reward of an action; and $\epsilon$ is used to limit the deviation range of $r_{t}(\theta)$.

When num_iterations = 1 and async_generate = False, training is on-policy and the old policy is identical to the current policy. Therefore, the importance sampling ratio is always equal to 1, and in this case, the clip operation will not take effect.
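A small numeric sketch (illustrative only, not the trainer's actual implementation) makes this concrete: when the sampling policy and the trained policy coincide, the old and new log-probabilities are identical, so the ratio is exactly 1 and clipping never activates; once the policies diverge, the ratio moves away from 1 and may be clipped.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """Per-token PPO/GRPO-style clipped objective (illustrative sketch)."""
    ratio = math.exp(logp_new - logp_old)          # importance sampling ratio r_t
    clipped = max(min(ratio, 1 + eps), 1 - eps)    # clip(r_t, 1 - eps, 1 + eps)
    return min(ratio * advantage, clipped * advantage)

# On-policy (num_iterations=1, async_generate=False): old and new log-probs coincide.
print(clipped_surrogate(logp_new=-1.3, logp_old=-1.3, advantage=0.5))  # ratio = 1.0, clip has no effect

# Off-policy: the policy has drifted since sampling, so the ratio deviates from 1 and gets clipped.
print(clipped_surrogate(logp_new=-0.9, logp_old=-1.3, advantage=0.5))  # ratio ~ 1.49, clipped to 1.2
```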
Under the following parameter settings, the algorithm is off-policy (near-on-policy).