Added new Post-training an LLM using GRPO with TRL recipe 🧑🍳️ #278
stevhliu merged 17 commits into huggingface:main
Conversation
Check out this pull request on ReviewNB. See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB.
Ready to be reviewed 😄
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Nice! A few remarks:
- You can now use trl==0.14 (latest, released today) to use GRPO. Plus, datasets, accelerate and transformers are trl deps:
- !pip install -U -q transformers trl datasets peft accelerate
+ !pip install -U -q trl peft
- You don't need to pass the tokenizer. GRPO will load it for you.
- You write:
In the case of the DeepSeek-R1 training, they use an accuracy-based reward model to evaluate whether the response is correct, along with a format-based reward that ensures the model places its reasoning process between tags. You can find more details here.
Why don't you use a similar reward function as well? If you remove remove_unused_columns=True, you'll get access to the "solution" column of the dataset in the reward function.
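For reference, a reward with that interface might look roughly like the sketch below. It assumes completions arrive in conversational format and that the dataset keeps a plain-text "solution" column once `remove_unused_columns` is disabled; the function name and the exact-substring check are illustrative placeholders, not the verification actually used for R1-style training.

```python
def accuracy_reward(completions, solution, **kwargs):
    """Toy accuracy-style reward: 1.0 if the reference solution appears verbatim."""
    rewards = []
    for completion, reference in zip(completions, solution):
        # Conversational format: the generated turn is a list with one assistant message.
        text = completion[0]["content"]
        # Placeholder check; a real setup would parse and verify the math answer.
        rewards.append(1.0 if reference.strip() in text else 0.0)
    return rewards
```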
OpenAI o1-o3 models >> OpenAI o1 and o3 models
exclusively employs pure RL >> exclusively employs RL | employs pure RL
to handle more complex and nuanced tasks >> to handle complex and nuanced tasks
Maybe link to the diagram in the TRL [docs](https://huggingface.co/docs/trl/main/en/grpo_trainer)
Line #2. # Tested with transformers==4.48.1, trl==0.14.0.dev0, datasets==3.2.0, peft==0.14.0, accelerate==1.3.0
I think trl was just released so you can drop the dev release.
Line #1. !pip install git+https://github.com/huggingface/trl.git@main
As above, I think that means we can skip installing from main.
Line #1. print(train_dataset[0])
If you wanted, you could render this math with IPython:
from IPython.display import display, Math
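A minimal sketch of that suggestion, assuming the sample stores its LaTeX in a "problem" field (the actual column name may differ):

```python
from IPython.display import Math, display

sample = train_dataset[0]
# Render the (assumed) LaTeX "problem" field instead of printing the raw string.
display(Math(sample["problem"]))
```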
I think this should say what the 'baseline model' is in relation to the figure above.
I would move this paragraph up and then introduce this notebook's implementation by saying something like "We will simplify this slight..."
In the case of the DeepSeek-R1 training, they use an >> For training, the DeepSeek-R1 authors used an
Looks really good. I left some small readability nits. On the title, have you thought about something that relates a bit more to the task, e.g. "Post training an LLM for reasoning with GRPO in TRL"?
Can you add the notebook to https://huggingface.co/docs/trl/en/community_tutorials when it's merged?
I would consider leaving only the most important text/things you want to highlight in bold so as not to make it too distracting. For example, I think text like Group Relative Policy Optimization can be bold, but it's not necessary to bold Large Language Model.
To begin, we'll load the baseline model >> To begin, we'll load Qwen/Qwen2-0.5B-Instruct as the baseline model. With only 0.5 billion parameters, it is lightweight and fits within the available resources.
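A loading snippet along those lines could look like this; the dtype choice is an assumption for fitting the model on a small GPU, not necessarily what the recipe uses:

```python
import torch
from transformers import AutoModelForCausalLM

model_id = "Qwen/Qwen2-0.5B-Instruct"
# bfloat16 keeps the 0.5B model comfortably within modest GPU memory (assumption).
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
```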
I think it'd be easier to follow if we do something like:
In this case, we will use two reward functions. The first reward function assigns higher scores to longer completions.
<code for length_reward function here>
The second reward function ensures the generation follows a specific format, using ...
<code for format_reward function here>
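To make the suggested structure concrete, the two functions could look roughly like the sketch below; the names, the length normalization, and the `<think>...</think>` tags are assumptions about the recipe rather than its exact code:

```python
import re

def length_reward(completions, **kwargs):
    # Higher score for longer completions, capped at 1.0 (arbitrary normalization).
    texts = [completion[0]["content"] for completion in completions]
    return [min(len(text) / 1000.0, 1.0) for text in texts]

def format_reward(completions, **kwargs):
    # 1.0 when the reasoning is wrapped in <think>...</think> tags, else 0.0.
    pattern = re.compile(r"<think>.*?</think>", re.DOTALL)
    texts = [completion[0]["content"] for completion in completions]
    return [1.0 if pattern.search(text) else 0.0 for text in texts]
```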
we pass a list of reward functions to the trainer that we previously defined >>> we pass the two reward functions we previously defined to the trainer
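In code, that wiring might look like the following sketch; it assumes the reward functions and `train_dataset` defined earlier, and the config values shown are placeholders:

```python
from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    output_dir="Qwen2-0.5B-GRPO",   # placeholder output path
    remove_unused_columns=False,    # keep extra dataset columns for the reward functions
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=[format_reward, length_reward],  # the functions defined above
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()
```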
Thanks a lot for the feedback @qgallouedec @burtenshaw @stevhliu!! Really interesting suggestions that I believe improve the overall quality a lot 😄 I've incorporated your feedback and updated the recipe accordingly. Following @qgallouedec's suggestion, I introduced a third reward function for accuracy, though the results and conclusions remain largely similar. I've also restructured some sections and expanded the final part with observations that I believe will be relevant to readers. Let me know your thoughts! 😊
Just to be clear: even though the completion length increases in the R1 paper, there is no explicit incentive (reward) for it.
I think you should remove this length reward (the completion length is logged anyway if you want to monitor it).
Thanks for the feedback @qgallouedec! 😄
What does this PR do?
Draft! Still in progress...
Fixes #277
Who can review?
@merveenoyan and @stevhliu