diff --git a/chapters/en/chapter12/3a.mdx b/chapters/en/chapter12/3a.mdx index b4effb193..1ef12c17d 100644 --- a/chapters/en/chapter12/3a.mdx +++ b/chapters/en/chapter12/3a.mdx @@ -2,7 +2,7 @@ -This section dives into the technical and mathematical details of GRPO. It was authored by [Shirin Yamani](https://github.com/shirinyamani). +This section dives into the technical and mathematical details of GRPO. It was authored by Shirin Yamani. @@ -10,11 +10,11 @@ Let's deepen our understanding of GRPO so that we can improve our model's traini GRPO directly evaluates the model-generated responses by comparing them within groups of generations to optimize the policy model, instead of training a separate value model (Critic). This approach leads to a significant reduction in computational cost! -GRPO can be applied to any verifiable task where the correctness of the response can be determined. For instance, in math reasoning, the correctness of the response can be easily verified by comparing it to the ground truth. +GRPO can be applied to any verifiable task where the correctness of the response can be determined. For instance, in math reasoning, the correctness of the response can be easily verified by comparing it to the ground truth. Before diving into the technical details, let's visualize how GRPO works at a high level: -![deep](./img/2.jpg) +![deep](https://huggingface.co/reasoning-course/images/resolve/main/grpo/16.png) Now that we have a visual overview, let's break down how GRPO works step by step. @@ -28,14 +28,19 @@ Let's walk through each step of the algorithm in detail: The first step is to generate multiple possible answers for each question. This creates a diverse set of outputs that can be compared against each other. -For each question $q$, the model will generate $G$ outputs (group size) from the trained policy:{ ${o_1, o_2, o_3, \dots, o_G}\pi_{\theta_{\text{old}}}$ }, $G=8$ where each $o_i$ represents one completion from the model. +For each question \\( q \\), the model will generate \\( G \\) outputs (the group size) from the trained policy: \\( \{o_1, o_2, o_3, \dots, o_G\} \sim \pi_{\theta_{\text{old}}} \\), with \\( G=8 \\), where each \\( o_i \\) represents one completion from the model. -#### Example: +#### Example To make this concrete, let's look at a simple arithmetic problem: -- **Question** $q$ : $\text{Calculate}\space2 + 2 \times 6$ -- **Outputs** $(G = 8)$: $\{o_1:14 \text{ (correct)}, o_2:16 \text{ (wrong)}, o_3:10 \text{ (wrong)}, \ldots, o_8:14 \text{ (correct)}\}$ +**Question** + +\\( q \\) : \\( \text{Calculate}\space2 + 2 \times 6 \\) + +**Outputs** + +\\( (G = 8) \\): \\( \{o_1:14 \text{ (correct)}, o_2:16 \text{ (wrong)}, o_3:10 \text{ (wrong)}, \ldots, o_8:14 \text{ (correct)}\} \\) Notice how some of the generated answers are correct (14) while others are wrong (16 or 10). This diversity is crucial for the next step. @@ -43,34 +48,36 @@ Notice how some of the generated answers are correct (14) while others are wrong Once we have multiple responses, we need a way to determine which ones are better than others. This is where the advantage calculation comes in. -#### Reward Distribution: +#### Reward Distribution First, we assign a reward score to each generated response. In this example, we'll use a reward model, but as we learnt in the previous section, we can use any reward returning function. -Assign a RM score to each of the generated responses based on the correctness $r_i$ *(e.g.
1 for correct response, 0 for wrong response)* then for each of the $r_i$ calculate the following Advantage value +Assign an RM score \\( r_i \\) to each of the generated responses based on its correctness *(e.g. 1 for a correct response, 0 for a wrong one)*, then for each \\( r_i \\) calculate the following advantage value. -#### Advantage Value Formula: +#### Advantage Value Formula The key insight of GRPO is that we don't need absolute measures of quality - we can compare outputs within the same group. This is done using standardization: $$A_i = \frac{r_i - \text{mean}(\{r_1, r_2, \ldots, r_G\})}{\text{std}(\{r_1, r_2, \ldots, r_G\})}$$ -#### Example: +#### Example Continuing with the arithmetic example above, imagine we have 8 responses, 4 of which are correct and the rest wrong; therefore: -- Group Average: $mean(r_i) = 0.5$ -- Std: $std(r_i) = 0.53$ -- Advantage Value: - - Correct response: $A_i = \frac{1 - 0.5}{0.53}= 0.94$ - - Wrong response: $A_i = \frac{0 - 0.5}{0.53}= -0.94$ -#### Interpretation: +| Metric | Value | +|--------|-------| +| Group Average | \\( mean(r_i) = 0.5 \\) | +| Standard Deviation | \\( std(r_i) = 0.53 \\) | +| Advantage Value (Correct response) | \\( A_i = \frac{1 - 0.5}{0.53}= 0.94 \\) | +| Advantage Value (Wrong response) | \\( A_i = \frac{0 - 0.5}{0.53}= -0.94 \\) | + +#### Interpretation Now that we have calculated the advantage values, let's understand what they mean: -This standardization (i.e. $A_i$ weighting) allows the model to assess each response's relative performance, guiding the optimization process to favour responses that are better than average (high reward) and discourage those that are worse. For instance if $A_i > 0$, then the $o_i$ is better response than the average level within its group; and if $A_i < 0$, then the $o_i$ then the quality of the response is less than the average (i.e. poor quality/performance). +This standardization (i.e. \\( A_i \\) weighting) allows the model to assess each response's relative performance, guiding the optimization process to favour responses that are better than average (high reward) and discourage those that are worse. For instance, if \\( A_i > 0 \\), then \\( o_i \\) is a better response than the average within its group; and if \\( A_i < 0 \\), then the quality of \\( o_i \\) is below the group average (i.e. poor quality/performance). -For the example above, if $A_i = 0.94 \text{(correct output)}$ then during optimization steps its generation probability will be increased. +For the example above, if \\( A_i = 0.94 \\) (correct output), then during the optimization steps its generation probability will be increased. With our advantage values calculated, we're now ready to update the policy.
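To make the standardization above concrete, here is a minimal sketch in Python of how the group-relative advantages could be computed from the rewards of the \\( G = 8 \\) sampled completions. This is an illustrative example using PyTorch, not the exact implementation used later in the chapter or in TRL, and the function name `group_relative_advantages` is just a placeholder.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Standardize rewards within one group: A_i = (r_i - mean) / std."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Rewards for 8 sampled answers to "2 + 2 x 6": four correct (1.0) and four wrong (0.0)
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0])
print(group_relative_advantages(rewards))
# roughly [0.94, -0.94, -0.94, 0.94, 0.94, -0.94, 0.94, -0.94], matching the table above
```

The small `eps` in the denominator is a common safeguard for the degenerate case where every response in the group receives the same reward (a standard deviation of zero).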
@@ -80,7 +87,7 @@ The final step is to use these advantage values to update our model so that it b The target function for policy update is: -$$J_{GRPO}(\theta) = \left[\frac{1}{G} \sum_{i=1}^{G} \min \left( \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)} A_i \text{clip}\left( \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)}, 1 - \epsilon, 1 + \epsilon \right) A_i \right)\right]- \beta D_{KL}(\pi_{\theta} || \pi_{ref})$$ +$$J_{GRPO}(\theta) = \left[\frac{1}{G} \sum_{i=1}^{G} \min \left( \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)} A_i, \text{clip}\left( \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)}, 1 - \epsilon, 1 + \epsilon \right) A_i \right)\right] - \beta D_{KL}(\pi_{\theta} \|\| \pi_{ref})$$ This formula might look intimidating at first, but it's built from several components that each serve an important purpose. Let's break them down one by one. @@ -92,13 +99,14 @@ The GRPO update function combines several techniques to ensure stable and effect The probability ratio is defined as: -$\left(\frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)}\right)$ +\\( \left(\frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)}\right) \\) Intuitively, the formula compares how much the new model's response probability differs from the old model's response probability while incorporating a preference for responses that improve the expected outcome. -#### Interpretation: -- If $\text{ratio} > 1$, the new model assigns a higher probability to response $o_i$ than the old model. -- If $\text{ratio} < 1$, the new model assigns a lower probability to $o_i$ +#### Interpretation + +- If \\( \text{ratio} > 1 \\), the new model assigns a higher probability to response \\( o_i \\) than the old model. +- If \\( \text{ratio} < 1 \\), the new model assigns a lower probability to \\( o_i \\) than the old model. This ratio allows us to control how much the model changes at each step, which leads us to the next component. @@ -106,22 +114,25 @@ This ratio allows us to control how much the model changes at each step, which l The clipping function is defined as: -$\text{clip}\left( \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)}, 1 - \epsilon, 1 + \epsilon\right)$ +\\( \text{clip}\left( \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)}, 1 - \epsilon, 1 + \epsilon\right) \\) -Limit the ratio discussed above to be within $[1 - \epsilon, 1 + \epsilon]$ to avoid/control drastic changes or crazy updates and stepping too far off from the old policy. In other words, it limit how much the probability ratio can increase to help maintaining stability by avoiding updates that push the new model too far from the old one. +This limits the ratio discussed above to the range \\( [1 - \epsilon, 1 + \epsilon] \\), preventing drastic updates that step too far away from the old policy. In other words, it limits how much the probability ratio can change, helping maintain stability by avoiding updates that push the new model too far from the old one.
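Before walking through a numeric example of the clipping, here is how these pieces could fit together in code. The sketch below is a simplified, per-sequence illustration in PyTorch and is not the implementation used later in the chapter; real implementations such as TRL's `GRPOTrainer` work with per-token log-probabilities and masking. The tensor names `new_logprobs`, `old_logprobs` and `ref_logprobs` are assumptions made for this example, and the KL penalty term is discussed in more detail below.

```python
import torch

def grpo_objective(new_logprobs, old_logprobs, ref_logprobs, advantages,
                   eps: float = 0.2, beta: float = 0.04) -> torch.Tensor:
    """Simplified per-sequence GRPO objective (to be maximized; the loss is its negative)."""
    # Probability ratio pi_theta(o_i|q) / pi_theta_old(o_i|q), computed in log-space
    ratio = torch.exp(new_logprobs - old_logprobs)

    # Clipped surrogate: keep the more pessimistic of the unclipped and clipped terms
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    surrogate = torch.min(unclipped, clipped).mean()

    # KL penalty toward the reference policy, using the estimator
    # pi_ref/pi_theta - log(pi_ref/pi_theta) - 1 from the DeepSeekMath paper
    log_ratio_ref = ref_logprobs - new_logprobs
    kl = (torch.exp(log_ratio_ref) - log_ratio_ref - 1).mean()

    return surrogate - beta * kl
```

Note how `torch.clamp` implements the \\( \text{clip} \\) function and `torch.min` keeps the update conservative: once the ratio moves outside \\( [1 - \epsilon, 1 + \epsilon] \\) in the direction favoured by the advantage, the clipped term takes over and further movement contributes no extra gradient, which is exactly the behaviour illustrated in the example that follows.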
+ +#### Example (ε = 0.2) -#### Example $\space \text{suppose}(\epsilon = 0.2)$ Let's look at two different scenarios to better understand this clipping function: - **Case 1**: if the new policy has a probability of 0.9 for a specific response and the old policy has a probability of 0.5, it means this response is getting reinforced by the new policy to have a higher probability, but within a controlled limit, which the clipping enforces so the update does not get too drastic - - \\( \text{Ratio}: \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)} = \frac{0.9}{0.5} = 1.8 → \text{Clip}\space1.2 \\) (upper bound limit 1.2) - **Case 2**: If the new policy is not in favour of a response (a lower probability, e.g. 0.2), meaning the response is not beneficial, then increasing it would be incorrect and the model is penalized. - - \\( \text{Ratio}: \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)} = \frac{0.2}{0.5} = 0.4 →\text{Clip}\space0.8 \\) (lower bound limit 0.8) -#### Interpretation: + - \\( \text{Ratio}: \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)} = \frac{0.2}{0.5} = 0.4 →\text{Clip}\space0.8 \\) (lower bound limit 0.8) + +#### Interpretation + - The formula encourages the new model to favour responses that the old model underweighted **if they improve the outcome**. -- If the old model already favoured a response with a high probability, the new model can still reinforce it **but only within a controlled limit $[1 - \epsilon, 1 + \epsilon]$, $\text{(e.g., }\epsilon = 0.2, \space \text{so} \space [0.8-1.2])$**. +- If the old model already favoured a response with a high probability, the new model can still reinforce it **but only within a controlled limit \\( [1 - \epsilon, 1 + \epsilon] \\), \\( \text{(e.g., }\epsilon = 0.2, \space \text{so} \space [0.8-1.2]) \\)**. - If the old model overestimated a response that performs poorly, the new model is **discouraged** from maintaining that high probability. -- Therefore, intuitively, By incorporating the probability ratio, the objective function ensures that updates to the policy are proportional to the advantage $A_i$ while being moderated to prevent drastic changes. T +- Therefore, intuitively, by incorporating the probability ratio, the objective function ensures that updates to the policy are proportional to the advantage \\( A_i \\) while being moderated to prevent drastic changes. While the clipping function helps prevent drastic changes, we need one more safeguard to ensure our model doesn't deviate too far from its original behavior. @@ -129,33 +140,35 @@ While the clipping function helps prevent drastic changes, we need one more safe The KL divergence term is: -$\beta D_{KL}(\pi_{\theta} || \pi_{ref})$ +\\( \beta D_{KL}(\pi_{\theta} \|\| \pi_{ref}) \\) -In the KL divergence term, the $\pi_{ref}$ is basically the pre-update model's output, `per_token_logps` and $\pi_{\theta}$ is the new model's output, `new_per_token_logps`. Theoretically, KL divergence is minimized to prevent the model from deviating too far from its original behavior during optimization. This helps strike a balance between improving performance based on the reward signal and maintaining coherence. In this context, minimizing KL divergence reduces the risk of the model generating nonsensical text or, in the case of mathematical reasoning, producing extremely incorrect answers.
+In the KL divergence term, \\( \pi_{ref} \\) is basically the pre-update model's output, `per_token_logps`, and \\( \pi_{\theta} \\) is the new model's output, `new_per_token_logps`. Theoretically, KL divergence is minimized to prevent the model from deviating too far from its original behavior during optimization. This helps strike a balance between improving performance based on the reward signal and maintaining coherence. In this context, minimizing KL divergence reduces the risk of the model generating nonsensical text or, in the case of mathematical reasoning, producing extremely incorrect answers. #### Interpretation + - A KL divergence penalty keeps the model's outputs close to its original distribution, preventing extreme shifts. - Instead of drifting towards completely irrational outputs, the model would refine its understanding while still allowing some exploration. #### Math Definition + For those interested in the mathematical details, let's look at the formal definition: Recall that KL distance is defined as follows: -$$D_{KL}(P || Q) = \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)}$$ +$$D_{KL}(P \|\| Q) = \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)}$$ In RLHF, the two distributions of interest are often the distribution of the new model version, P(x), and a distribution of the reference policy, Q(x). -#### The Role of $\beta$ Parameter +#### The Role of the β Parameter -The coefficient $\beta$ controls how strongly we enforce the KL divergence constraint: +The coefficient \\( \beta \\) controls how strongly we enforce the KL divergence constraint: -- **Higher $\beta$ (Stronger KL Penalty)** +- **Higher β (Stronger KL Penalty)** - More constraint on policy updates. The model remains close to its reference distribution. - Can slow down adaptation: The model may struggle to explore better responses. -- **Lower $\beta$ (Weaker KL Penalty)** +- **Lower β (Weaker KL Penalty)** - More freedom to update policy: The model can deviate more from the reference. - Faster adaptation but risk of instability: The model might learn reward-hacking behaviors. - Over-optimization risk: If the reward model is flawed, the policy might generate nonsensical outputs. -- **Original** [DeepSeekMath](https://arxiv.org/abs/2402.03300) paper set this $\beta= 0.04$ +- The **original** [DeepSeekMath](https://arxiv.org/abs/2402.03300) paper set \\( \beta = 0.04 \\). Now that we understand the components of GRPO, let's see how they work together in a complete example. @@ -169,9 +182,9 @@ $$\text{Q: Calculate}\space2 + 2 \times 6$$ ### Step 1: Group Sampling -First, we generate multiple responses from our model: +First, we generate multiple responses from our model. -Generate $(G = 8)$ responses, $4$ of which are correct answer ($14, \text{reward=} 1$) and $4$ incorrect $\text{(reward= 0)}$, Therefore: +Generate \\( G = 8 \\) responses, \\( 4 \\) of which give the correct answer (\\( 14 \\), reward \\( = 1 \\)) and \\( 4 \\) of which are incorrect (reward \\( = 0 \\)); therefore: $${o_1:14(correct), o_2:10 (wrong), o_3:16 (wrong), ... o_G:14(correct)}$$ @@ -179,20 +192,20 @@ $${o_1:14(correct), o_2:10 (wrong), o_3:16 (wrong), ...
o_G:14(correct)}$$ Next, we calculate the advantage values to determine which responses are better than average: -- Group Average: -$$mean(r_i) = 0.5$$ -- Std: $$std(r_i) = 0.53$$ -- Advantage Value: - - Correct response: $A_i = \frac{1 - 0.5}{0.53}= 0.94$ - - Wrong response: $A_i = \frac{0 - 0.5}{0.53}= -0.94$ +| Statistic | Value | +|-----------|-------| +| Group Average | \\( mean(r_i) = 0.5 \\) | +| Standard Deviation | \\( std(r_i) = 0.53 \\) | +| Advantage Value (Correct response) | \\( A_i = \frac{1 - 0.5}{0.53}= 0.94 \\) | +| Advantage Value (Wrong response) | \\( A_i = \frac{0 - 0.5}{0.53}= -0.94 \\) | ### Step 3: Policy Update Finally, we update our model to reinforce the correct responses: -- Assuming the probability of old policy ($\pi_{\theta_{old}}$) for a correct output $o_1$ is $0.5$ and the new policy increases it to $0.7$ then: +- Assuming the old policy (\\( \pi_{\theta_{old}} \\)) assigns a probability of \\( 0.5 \\) to a correct output \\( o_1 \\) and the new policy increases it to \\( 0.7 \\), then: $$\text{Ratio}: \frac{0.7}{0.5} = 1.4 →\text{after Clip}\space1.2 \space (\epsilon = 0.2)$$ -- Then when the target function is re-weighted, the model tends to reinforce the generation of correct output, and the $\text{KL Divergence}$ limits the deviation from the reference policy. +- Then, when the target function is re-weighted, the model tends to reinforce the generation of the correct output, and the \\( \text{KL divergence} \\) term limits the deviation from the reference policy. With the theoretical understanding in place, let's see how GRPO can be implemented in code. @@ -385,7 +398,6 @@ As you continue exploring GRPO, consider experimenting with different group size Happy training! 🚀 ## References - 1. [RLHF Book by Nathan Lambert](https://github.com/natolambert/rlhf-book) 2. [DeepSeek-V3 Technical Report](https://huggingface.co/papers/2412.19437) 3. [DeepSeekMath](https://huggingface.co/papers/2402.03300)