
Commit ec74e2b (parent d3b6b56)

[grpo] Two-Sided Clipping for GRPO Trainer (#4450)

File tree

7 files changed (+10 lines, −0 lines)

docs/source/Instruction/GRPO.md

Lines changed: 1 addition & 0 deletions

@@ -231,6 +231,7 @@ A conversation between User and Assistant. The user asks a question, and the Ass
  - num_iterations: number of update iterations per batch. Default is 1.
  - epsilon: clip coefficient. Default is 0.2.
  - epsilon_high: upper clip coefficient. Default is None; when set, it forms the clipping range [epsilon, epsilon_high] together with epsilon.
+ - delta: upper clipping bound for two-sided GRPO from the [INTELLECT-2 tech report](https://huggingface.co/papers/2505.07291). If set, a value greater than 1 + epsilon is recommended. Default is None.
  - sync_ref_model: whether to periodically synchronize the ref_model. Default is False.
  - ref_model_mixup_alpha: controls the mix between the model and the previous ref_model during updates. The update rule is $π_{ref} = α * π_θ + (1 - α) * π_{ref_{prev}}$. Default is 0.6.
  - ref_model_sync_steps: synchronization frequency. Default is 512.

docs/source/Instruction/命令行参数.md

Lines changed: 1 addition & 0 deletions

@@ -451,6 +451,7 @@ The reward model parameters are used in PPO and GRPO.
  - num_iterations: number of update iterations per batch. Default is 1.
  - epsilon: clip coefficient. Default is 0.2.
  - epsilon_high: upper clip coefficient. Default is None; when set, it forms the clipping range [epsilon, epsilon_high] together with epsilon.
+ - delta: upper clipping bound for two-sided GRPO from the [INTELLECT-2 tech report](https://huggingface.co/papers/2505.07291). If set, a value greater than 1 + epsilon is recommended. Default is None.
  - sync_ref_model: whether to periodically synchronize the ref_model. Default is False.
  - ref_model_mixup_alpha: controls the mix between the model and the previous ref_model during updates. The update rule is $π_{ref} = α * π_θ + (1 - α) * π_{ref_{prev}}$. Default is 0.6.
  - ref_model_sync_steps: synchronization frequency. Default is 512.

docs/source_en/Instruction/Command-line-parameters.md

Lines changed: 1 addition & 0 deletions

@@ -468,6 +468,7 @@ The meanings of the following parameters can be referenced [here](https://huggin
  - num_iterations: number of iterations per batch. Default is 1.
  - epsilon: epsilon value for clipping. Default is 0.2.
  - epsilon_high: upper clip coefficient. Default is None; when set, it forms a clipping range of [epsilon, epsilon_high] together with epsilon.
+ - delta: delta value for the upper clipping bound in two-sided GRPO, introduced in the [INTELLECT-2 tech report](https://huggingface.co/papers/2505.07291). Recommended to be > 1 + epsilon. Default is None.
  - sync_ref_model: whether to synchronize the reference model. Default is False.
  - ref_model_mixup_alpha: controls the mix between the current policy and the previous reference policy during updates. The reference policy is updated according to: $π_{ref} = α * π_θ + (1 - α) * π_{ref_{prev}}$. Default is 0.6.
  - ref_model_sync_steps: determines how frequently the current policy is synchronized with the reference policy. Default is 512.

docs/source_en/Instruction/GRPO.md

Lines changed: 1 addition & 0 deletions

@@ -242,6 +242,7 @@ Arguments
  - num_iterations: number of iterations per batch. Default is 1.
  - epsilon: epsilon value for clipping. Default is 0.2.
  - epsilon_high: upper clip coefficient. Default is None; when set, it forms a clipping range of [epsilon, epsilon_high] together with epsilon.
+ - delta: delta value for the upper clipping bound in two-sided GRPO, introduced in the [INTELLECT-2 tech report](https://huggingface.co/papers/2505.07291). Recommended to be > 1 + epsilon. Default is None.
  - sync_ref_model: whether to synchronize the reference model. Default is False.
  - ref_model_mixup_alpha: controls the mix between the current policy and the previous reference policy during updates. The reference policy is updated according to: $π_{ref} = α * π_θ + (1 - α) * π_{ref_{prev}}$. Default is 0.6.
  - ref_model_sync_steps: determines how frequently the current policy is synchronized with the reference policy. Default is 512.
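
Read together, epsilon, epsilon_high, and the new delta define a two-sided clipped per-token objective. The following display (my notation, derived from the trainer change further below, with $r_t$ the importance ratio and $A$ the group-relative advantage) summarizes it:

$$
r_t = \exp\bigl(\log \pi_\theta(y_t) - \log \pi_{\theta_{\mathrm{old}}}(y_t)\bigr), \qquad
\mathcal{L}_t = -\min\Bigl(\min(r_t,\ \delta)\, A,\ \operatorname{clip}\bigl(r_t,\ 1-\epsilon,\ 1+\epsilon_{\mathrm{high}}\bigr)\, A\Bigr)
$$

Without delta, a token with a very large ratio and a negative advantage makes the outer min select the unclipped term, so its loss contribution is unbounded; clamping the ratio at delta bounds that term while leaving the usual [1 − epsilon, 1 + epsilon_high] clipping untouched.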

swift/llm/argument/rlhf_args.py

Lines changed: 2 additions & 0 deletions

@@ -225,6 +225,8 @@ def _check_grpo(self):
                'Please update it by running: pip install -U trl')

        if self.use_liger_kernel:
+           if self.delta is not None:
+               raise ValueError('Liger loss does not support two-sided GRPO loss yet.')
            from trl.import_utils import is_liger_kernel_available
            assert is_liger_kernel_available(), (
                'Please install/update liger-kernel by running: pip install -U liger-kernel')

swift/trainers/arguments.py

Lines changed: 1 addition & 0 deletions

@@ -153,6 +153,7 @@ def place_model_on_device(self):
class GRPOArgumentsMixin:
    epsilon: float = 0.2
    epsilon_high: Optional[float] = None
+   delta: Optional[float] = None
    top_k: int = 50
    top_p: float = 0.9
    repetition_penalty: float = 1.

swift/trainers/rlhf_trainer/grpo_trainer.py

Lines changed: 3 additions & 0 deletions

@@ -1073,6 +1073,9 @@ def _compute_loss(self, model, inputs):

        coef_1 = torch.exp(per_token_logps - old_per_token_logps)
        coef_2 = torch.clamp(coef_1, 1 - self.epsilon_low, 1 + self.epsilon_high)
+       if self.args.delta is not None:
+           coef_1 = torch.clamp(coef_1, max=self.args.delta)
+
        per_token_loss1 = coef_1 * advantages.unsqueeze(1)
        per_token_loss2 = coef_2 * advantages.unsqueeze(1)
        per_token_loss = -torch.min(per_token_loss1, per_token_loss2)
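
To see what the new branch changes, here is a minimal standalone sketch of the same per-token computation on dummy tensors. The variable names mirror the trainer's, but the shapes, values, and the three constants are illustrative assumptions, not the trainer's actual inputs.

```python
import torch

# Illustrative inputs (assumed, not taken from the trainer): one sequence of
# three tokens with per-token log-probs under the current and old policies.
per_token_logps = torch.tensor([[0.6, -0.3, 0.1]])
old_per_token_logps = torch.tensor([[-0.4, -0.1, 0.2]])
advantages = torch.tensor([-1.0])        # one group-relative advantage per sequence
epsilon_low, epsilon_high, delta = 0.2, 0.2, 1.5

# Importance ratio r = pi_theta / pi_old, computed in log space.
coef_1 = torch.exp(per_token_logps - old_per_token_logps)
# Standard PPO-style clipped ratio.
coef_2 = torch.clamp(coef_1, 1 - epsilon_low, 1 + epsilon_high)
# Two-sided GRPO: additionally cap the unclipped ratio from above at delta.
if delta is not None:
    coef_1 = torch.clamp(coef_1, max=delta)

per_token_loss1 = coef_1 * advantages.unsqueeze(1)
per_token_loss2 = coef_2 * advantages.unsqueeze(1)
per_token_loss = -torch.min(per_token_loss1, per_token_loss2)
print(per_token_loss)  # first token's ratio (exp(1.0) ≈ 2.72) is capped at delta = 1.5
```

With the negative advantage used here, the uncapped first token would contribute a loss of about 2.72; with delta = 1.5 it contributes 1.5, which is exactly the bounding effect the delta cap is meant to provide.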
