
Commit dc9a34b

Merge pull request huggingface#861 from huggingface/fix/broken-links-in-chapter-12
fix links in chapter 12
2 parents 9409920 + 7279630 · commit dc9a34b

7 files changed: 25 additions, 8 deletions


chapters/en/chapter12/1.mdx

Lines changed: 1 addition & 1 deletion
@@ -79,7 +79,7 @@ Don't worry if you're missing some of these – we'll explain key concepts as we

 <Tip>

-If you don't have all the prerequisites, check out this [course](chapter1/1.mdx) from units 1 to 11
+If you don't have all the prerequisites, check out this [course](/course/chapter1/1) from units 1 to 11

 </Tip>

chapters/en/chapter12/2.mdx

Lines changed: 4 additions & 2 deletions
@@ -5,7 +5,9 @@ Welcome to the first page!
 We're going to start our journey into the exciting world of Reinforcement Learning (RL) and discover how it's revolutionizing the way we train Language Models like the ones you might use every day.

 <Tip>
+
 In this chapter, we are focusing on reinforcement learning for language models. However, reinforcement learning is a broad field with many applications beyond language models. If you're interested in learning more about reinforcement learning, you should check out the [Deep Reinforcement Learning course](https://huggingface.co/courses/deep-rl-course/en/unit1/introduction).
+
 </Tip>

 This page will give you a friendly and clear introduction to RL, even if you've never encountered it before. We'll break down the core ideas and see why RL is becoming so important in the field of Large Language Models (LLMs).

@@ -66,7 +68,7 @@ Think about learning to ride a bike. You might wobble and fall at first (negativ

 Now, why is RL so important for Large Language Models?

-Well, training really good LLMs is tricky. We can train them on massive amounts of text from the internet, and they become very good at predicting the next word in a sentence. This is how they learn to generate fluent and grammatically correct text, as we learned in [chapter 2](/chapters/en/chapter2/1).
+Well, training really good LLMs is tricky. We can train them on massive amounts of text from the internet, and they become very good at predicting the next word in a sentence. This is how they learn to generate fluent and grammatically correct text, as we learned in [chapter 2](/course/chapter2/1).

 However, just being fluent isn't enough. We want our LLMs to be more than just good at stringing words together. We want them to be:

@@ -76,7 +78,7 @@ However, just being fluent isn't enough. We want our LLMs to be more than just g

 Pre-training LLM methods, which mostly rely on predicting the next word from text data, sometimes fall short on these aspects.

-Whilst supervised training is excellent at producing structured outputs, it can be less effective at producing helpful, harmless, and aligned responses. We explore supervised training in [chapter 11](/chapters/en/chapter11/1).
+Whilst supervised training is excellent at producing structured outputs, it can be less effective at producing helpful, harmless, and aligned responses. We explore supervised training in [chapter 11](/course/chapter11/1).

 Fine-tuned models might generate fluent and structured text that is still factually incorrect, biased, or doesn't really answer the user's question in a helpful way.

chapters/en/chapter12/3.mdx

Lines changed: 3 additions & 1 deletion
@@ -11,7 +11,9 @@ In the next chapter, we will build on this knowledge and implement GRPO in pract
 The initial goal of the paper was to explore whether pure reinforcement learning could develop reasoning capabilities without supervised fine-tuning.

 <Tip>
-Up until that point, all the popular LLMs required some supervised fine-tuning, which we explored in [chapter 11](/chapters/en/chapter11/1).
+
+Up until that point, all the popular LLMs required some supervised fine-tuning, which we explored in [chapter 11](/course/chapter11/1).
+
 </Tip>

 ## The Breakthrough 'Aha' Moment

chapters/en/chapter12/3a.mdx

Lines changed: 3 additions & 0 deletions
@@ -1,7 +1,9 @@
 # Advanced Understanding of Group Relative Policy Optimization (GRPO) in DeepSeekMath

 <Tip>
+
 This section dives into the technical and mathematical details of GRPO. It was authored by [Shirin Yamani](https://github.com/shirinyamani).
+
 </Tip>

 Let's deepen our understanding of GRPO so that we can improve our model's training process.

@@ -383,6 +385,7 @@ As you continue exploring GRPO, consider experimenting with different group size
 Happy training! 🚀

 ## References
+
 1. [RLHF Book by Nathan Lambert](https://github.com/natolambert/rlhf-book)
 2. [DeepSeek-V3 Technical Report](https://huggingface.co/papers/2412.19437)
 3. [DeepSeekMath](https://huggingface.co/papers/2402.03300)

chapters/en/chapter12/4.mdx

Lines changed: 2 additions & 0 deletions
@@ -5,7 +5,9 @@ In this page, we'll learn how to implement Group Relative Policy Optimization (G
 We'll explore the core concepts of GRPO as they are embodied in TRL's GRPOTrainer, using snippets from the official TRL documentation to guide us.

 <Tip>
+
 This chapter is aimed at TRL beginners. If you are already familiar with TRL, you might want to also check out the [Open R1 implementation](https://github.com/huggingface/open-r1/blob/main/src/open_r1/grpo.py) of GRPO.
+
 </Tip>

 First, let's remind ourselves of some of the important concepts of GRPO algorithm:
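Context note: the hunk above touches chapter12/4.mdx, which walks through TRL's GRPOTrainer. For orientation, here is a minimal sketch of the kind of setup that page describes, adapted from TRL's documented quick-start; it is not part of this commit, and the model id, dataset, and reward function below are illustrative assumptions.

```python
# Minimal GRPO training loop with TRL (illustrative sketch, not code from this commit).
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Any prompt-style dataset works; "trl-lib/tldr" is the example used in TRL's docs.
dataset = load_dataset("trl-lib/tldr", split="train")

# A GRPO reward function scores each generated completion in the sampled group.
def reward_len(completions, **kwargs):
    # Toy verifiable criterion: prefer completions close to 50 characters.
    return [-abs(50 - len(completion)) for completion in completions]

training_args = GRPOConfig(output_dir="Qwen2-0.5B-GRPO", logging_steps=10)
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",  # example base model
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```

GRPO samples a group of completions per prompt and scores them with the reward functions, using the group-relative reward to compute advantages, which is why the trainer only needs reward functions rather than a separate value model.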

chapters/en/chapter12/5.mdx

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -9,7 +9,9 @@
99
Now that you've seen the theory, let's put it into practice! In this exercise, you'll fine-tune a model with GRPO.
1010

1111
<Tip>
12+
1213
This exercise was written by LLM fine-tuning expert [@mlabonne](https://huggingface.co/mlabonne).
14+
1315
</Tip>
1416

1517
## Install dependencies

chapters/en/chapter12/6.mdx

Lines changed: 10 additions & 4 deletions
@@ -6,13 +6,15 @@

 # Practical Exercise: GRPO with Unsloth

-In this exercise, you'll fine-tune a model with GRPO (Group Relative Policy Optimization) using Unsloth, to improve a model's reasoning capabilities. We covered GRPO in [Chapter 3](/en/chapter3/3).
+In this exercise, you'll fine-tune a model with GRPO (Group Relative Policy Optimization) using Unsloth, to improve a model's reasoning capabilities. We covered GRPO in [Chapter 3](/course/chapter3/3).

 Unsloth is a library that accelerates LLM fine-tuning, making it possible to train models faster and with less computational resources. Unsloth is plugs into TRL, so we'll build on what we learned in the previous sections, and adapt it for Unsloth specifics.


 <Tip>
+
 This exercise can be run on a free Google Colab T4 GPU. For the best experience, follow along with the notebook linked above and try it out yourself.
+
 </Tip>

 ## Install dependencies

@@ -71,7 +73,9 @@ model = FastLanguageModel.get_peft_model(
 This code loads the model in 4-bit quantization to save memory and applies LoRA (Low-Rank Adaptation) for efficient fine-tuning. The `target_modules` parameter specifies which layers of the model to fine-tune, and `use_gradient_checkpointing` enables training with longer contexts.

 <Tip>
-We won't cover the details of LoRA in this chapter, but you can learn more in [Chapter 11](/en/chapter11/3).
+
+We won't cover the details of LoRA in this chapter, but you can learn more in [Chapter 11](/course/chapter11/3).
+
 </Tip>

 ## Data Preparation
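Context note: the hunk above explains the Unsloth loading step (4-bit quantization plus LoRA via `FastLanguageModel.get_peft_model`). Below is a hedged sketch of what that setup typically looks like with Unsloth's API; it is not part of this commit, and the model name and hyperparameter values are chosen purely for illustration.

```python
# Sketch of the Unsloth load + LoRA step described above (illustrative values).
from unsloth import FastLanguageModel

max_seq_length = 1024

# Load the base model in 4-bit to fit on a small GPU; the model name is an example.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    max_seq_length=max_seq_length,
    load_in_4bit=True,
)

# Attach LoRA adapters; only the listed target modules receive trainable weights.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                                  # LoRA rank
    lora_alpha=16,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    use_gradient_checkpointing="unsloth",  # trades compute for memory, enabling longer contexts
    random_state=3407,
)
```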
@@ -144,7 +148,7 @@ The dataset is prepared by extracting the answer from the dataset and formatting

 ## Defining Reward Functions

-As we discussed in [an earlier page](/en/chapter13/4), GRPO can use reward functions to guide the model's learning based on verifiable criteria like length and formatting.
+As we discussed in [an earlier page](/course/chapter13/4), GRPO can use reward functions to guide the model's learning based on verifiable criteria like length and formatting.

 In this exercise, we'll define several reward functions that encourage different aspects of good reasoning. For example, we'll reward the model for providing an integer answer, and for following the strict format.

@@ -219,7 +223,7 @@ These reward functions serve different purposes:

 ## Training with GRPO

-Now we'll set up the GRPO trainer with our model, tokenizer, and reward functions. This part follows the same approach as the [previous exercise](/en/chapter12/5).
+Now we'll set up the GRPO trainer with our model, tokenizer, and reward functions. This part follows the same approach as the [previous exercise](/course/chapter12/5).

 ```python
 from trl import GRPOConfig, GRPOTrainer

@@ -276,7 +280,9 @@ trainer.train()
 ```

 <Tip warning={true}>
+
 Training may take some time. You might not see rewards increase immediately - it can take 150-200 steps before you start seeing improvements. Be patient!
+
 </Tip>

 ## Testing the Model
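Context note: the chapter12/6 hunks above mention rewarding the model for providing an integer answer and for following a strict format. Below is a sketch of what such verifiable reward functions can look like in the completions-in, list-of-floats-out shape GRPO expects; it is not part of this commit, and the XML-style tags plus the assumption that completions arrive as plain strings are illustrative.

```python
# Illustrative verifiable reward functions (sketch, not code from this commit).
import re


def int_reward_func(completions, **kwargs):
    """Reward completions whose extracted answer is an integer."""
    # Assumes completions are plain strings containing <answer>...</answer>;
    # with a chat-formatted dataset they would be lists of message dicts instead.
    answers = [c.split("<answer>")[-1].split("</answer>")[0].strip() for c in completions]
    return [0.5 if a.lstrip("-").isdigit() else 0.0 for a in answers]


def strict_format_reward_func(completions, **kwargs):
    """Reward completions that follow a strict <reasoning>/<answer> layout."""
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n?$"
    return [0.5 if re.match(pattern, c, re.DOTALL) else 0.0 for c in completions]
```

Each function returns one score per completion, which is the contract GRPO reward functions follow: higher scores for completions that satisfy the verifiable criterion, so the group-relative advantage pushes the policy toward them.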
