### INTELLECT-2

INTELLECT-2 is the first globally distributed reinforcement learning training run of a 32 billion parameter language model using fully asynchronous RL across a dynamic, heterogeneous swarm of permissionless compute contributors. The authors propose modifications to the standard GRPO training recipe, including two-sided GRPO clipping for increased training stability. To reproduce the paper's setting, use this configuration:
```python
from trl import GRPOConfig

training_args = GRPOConfig(
    delta=4,  # δ in Section 4.1 of the paper
    epsilon=0.2,  # ε in Section 4.1 of the paper
    beta=0.001,  # KL divergence coefficient in Section 4.1 of the paper
    num_generations=16,  # responses per prompt in Section 4.1 of the paper
    learning_rate=3e-7,  # Section 4.1 of the paper
)
```
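The two-sided clipping idea can be sketched per token as follows. This is a minimal illustration of the recipe described above, not TRL's internal implementation; the function name and scalar formulation are assumptions:

```python
def two_sided_grpo_term(ratio, advantage, epsilon=0.2, delta=4.0):
    """Per-token GRPO surrogate with two-sided clipping (illustrative sketch).

    Standard GRPO/PPO clipping only bounds the ratio within [1-eps, 1+eps] in
    the clipped branch; for negative advantages a very large raw ratio can
    still produce an unboundedly negative surrogate and hence huge gradients.
    Two-sided clipping additionally caps the raw ratio at delta in that case.
    """
    clipped = min(max(ratio, 1 - epsilon), 1 + epsilon)
    if advantage >= 0:
        # Standard clip: cap how strongly a good token is pushed up.
        return min(ratio * advantage, clipped * advantage)
    # Two-sided clip: also cap the raw ratio at delta so a large ratio
    # cannot drive the term to arbitrarily large negative values.
    return min(min(ratio, delta) * advantage, clipped * advantage)
```

For example, with `ratio=10` and `advantage=-1`, the standard surrogate would be `-10`, while the two-sided version bounds it at `delta * advantage = -4`.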
### Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
### VESPO
VESPO addresses training instability in off-policy RL caused by policy staleness, asynchronous updates, and train-inference mismatches. Rather than relying on heuristic token-level clipping (GRPO) or sequence-length normalization (GSPO), VESPO derives a principled reshaping kernel from a variational framework. In practice, this yields a smooth, asymmetric Gamma weighting function that gracefully suppresses extreme sequence-level importance weights without introducing length bias.
where \\( W(\tau) = \frac{\pi_\theta(\tau)}{\mu(\tau)} \\) is the sequence-level importance ratio, and \\( \phi(W) \\) is detached from the computation graph to serve as a gradient-scaling coefficient.
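To build intuition for the Gamma-shaped weighting, the sketch below uses an assumed functional form \\( \phi(W) = W^{k} e^{-\lambda (W - 1)} \\), chosen only because the configuration exposes a power exponent `k` and a decay factor `λ`; see Section 3.4 of the paper for the exact kernel:

```python
import math

def gamma_reshape(w, k=2.0, lam=3.0):
    """Illustrative Gamma-shaped reshaping kernel: phi(W) = W**k * exp(-lam*(W-1)).

    NOTE: assumed functional form for illustration, not the paper's exact
    definition. It satisfies phi(1) == 1 (on-policy weights pass through
    roughly unchanged) and decays smoothly for large W, so extreme
    sequence-level importance weights are suppressed without a hard clip.
    """
    return (w ** k) * math.exp(-lam * (w - 1.0))
```

Unlike hard clipping, this weighting is smooth everywhere, and the separate positive/negative parameters in the configuration below make it asymmetric across advantage signs.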
To reproduce the paper's setting, use this configuration:

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    loss_type="vespo",
    use_vllm=True,  # or False if not using any token-level `vllm_importance_sampling_correction` methods
    vllm_importance_sampling_mode="token_truncate",  # default correction mode for VESPO; `token_mask` also supported
    vespo_k_pos=2.0,  # power exponent (c1 in paper Section 3.4) for positive advantages
    vespo_lambda_pos=3.0,  # decay factor (c2 in paper Section 3.4) for positive advantages
    vespo_k_neg=3.0,  # power exponent (c1 in paper Section 3.4) for negative advantages
    vespo_lambda_neg=2.0,  # decay factor (c2 in paper Section 3.4) for negative advantages
)
```