[RFC] verl recipe and extensions #1561
eric-haibin-lin started this conversation in RFC
Replies: 2 comments
I'm sorry, I'm really busy with other things, but I'll mention the first author of DrGRPO to see if he is willing to help: @lkevinzc. Currently, a short intro and instructions for Dr.GRPO are in
Since we expect recipes to show how developers can customize verl, it might be nice to have a table showing which components each recipe customizes. Furthermore, we could decide whether a new recipe is representative based on whether it customizes a new component.
motivation
RL training can be quite complicated. The goal of verl is to provide an RL library for LLMs that is easy for researchers to hack for exploration, while providing the right level of abstraction so that it can scale to production usage. verl needs a place to put end-to-end RL training pipelines that are promising for the community to reproduce and explore further, while keeping the verl core relatively stable.
There are ongoing PRs adding more functionality and training pipelines to verl. We need consensus on how they should be maintained, and this RFC discusses the recipes verl currently maintains.
what's good for the verl community:
what we want to avoid:
proposed change
verl shall maintain a few folders:
/verl: contains the most common building blocks / utils for RL training
/examples: scripts showing how to use verl for common use cases
/recipe: end-to-end RL recipes covering data preparation, reward definition, env/tool setup, and the training algorithm, usually extending verl's APIs. Once a certain recipe is mature, we move shareable components back to /verl.
The recipes are representative extensions. Specific environment and version/commit info should be documented to ensure reproducibility. However, in the long term the verl team will only ensure the recipes are runnable through tests, and does not guarantee that the full results can be reproduced across versions.
To make verl easier to extend, we need to define the interfaces. Let's break it down:
data format
The RLDataset currently provides data in the format described in https://verl.readthedocs.io/en/latest/preparation/prepare_data.html#prepare-data-for-post-training. For other tasks such as browser use / code / chat preferences, verl should document the data format and provide example Dataset implementation.
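To make the expected shape concrete, here is a sketch of a single record in the chat-style parquet format described in the linked prepare-data doc. The exact field names may differ across verl versions, so treat this as illustrative and verify against the documentation for your version:

```python
# Sketch of one training record in the parquet-based format that verl's
# default dataset expects (see the prepare_data doc linked above).
# Field names follow that doc but may vary across verl versions.
record = {
    "data_source": "openai/gsm8k",   # identifies which reward function applies
    "prompt": [                       # chat-format prompt, a list of messages
        {"role": "user", "content": "Natalia sold 48 clips in April and half as many in May. How many in total?"}
    ],
    "ability": "math",                # task category
    "reward_model": {
        "style": "rule",              # rule-based reward rather than a learned RM
        "ground_truth": "72",
    },
    "extra_info": {"split": "train", "index": 0},
}

# A dataset is a list of such records, typically serialized to parquet
# (e.g. pandas.DataFrame(records).to_parquet("train.parquet")).
required = {"data_source", "prompt", "reward_model"}
assert required <= record.keys()
```

For new task types (browser use, code, preferences), documenting an analogous record schema per task would let recipe authors implement a compatible Dataset without reading the loader source.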
TODO:
trainer
The RayPPOTrainer provides an example training loop for a typical on-policy RL algorithm with a reward model/function. Since #1282, dataset construction is done outside of the trainer, so it is more flexible for dataset customization. For any new algorithm that requires a different training loop (e.g. async RL, DPO) or considerable changes to the existing trainer, we highly recommend first trying to inherit from the existing trainer, or simply making a copy of the trainer for your own exploration.
The RayPPOTrainer is expected to be maintained as an API with backward compatibility. For plugins or integrations that should be added to RayPPOTrainer interface to improve its functionality/performance, please leave a comment in this thread for discussions.
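The inheritance route above can be sketched as follows. Note that `RayPPOTrainer`'s real constructor and `fit` signatures differ; the stub base class here is purely hypothetical, to illustrate the shape of a recipe that overrides one hook while reusing the loop:

```python
# Sketch of the recommended extension pattern: inherit from the existing
# trainer and override only the pieces your algorithm changes.
# BaseTrainer is a hypothetical stand-in for verl's RayPPOTrainer.

class BaseTrainer:
    total_steps = 3

    def fit(self):
        # The base class owns the generic rollout -> update loop.
        for step in range(self.total_steps):
            batch = self.generate_rollouts(step)
            self.update_policy(batch)

    def generate_rollouts(self, step):
        return {"step": step}

    def update_policy(self, batch):
        raise NotImplementedError


class MyRecipeTrainer(BaseTrainer):
    """A recipe overrides the hooks it needs; the loop stays in the base."""

    def update_policy(self, batch):
        # e.g. swap the PPO update for a DPO-style loss here
        self.last_batch = batch


trainer = MyRecipeTrainer()
trainer.fit()
assert trainer.last_batch == {"step": 2}
```

If the changes grow beyond a couple of overridden methods, copying the trainer into the recipe folder (the second option above) avoids coupling the recipe to internal details of the base class.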
TODO:
entry script
You should always make a copy of the entry script, such as main_ppo.py, should you need to make changes.
recipe code structure
To add a new recipe, a few things are expected from recipe maintainers:
- a README.md file that describes what this recipe is about, plus any background/paper for educational purposes
- no changes to /verl itself (unless the added functions are highly likely to be reused by other recipes)
Example structure:
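For illustration, a recipe folder following the expectations above might look like this (directory and file names are hypothetical, not a prescribed layout):

```
recipe/my_algo/
├── README.md           # what the recipe does, paper link, environment/commit info
├── prepare_data.py     # dataset download + conversion to verl's data format
├── my_algo_trainer.py  # trainer subclass (or copy) with the algorithm changes
├── main_my_algo.py     # entry script, copied from main_ppo.py
└── run_my_algo.sh      # launch script pinning the config for reproducibility
```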
In the case of drgrpo, it's already included in the common ppo trainer. I recommend we document each individual algorithm's introduction and usage as a standalone page under a new algorithms section in the verl documentation site https://verl.readthedocs.io/en/latest, covering DrGRPO, GRPO, RF++, etc.
nit: in theory the compute_advantage function could be simplified further with a registration mechanism, but I won't discuss it here.
Other things to discuss in future RFCs:
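Returning briefly to the nit above: the registration idea could look like the sketch below, where advantage estimators register by name so compute_advantage becomes a dictionary lookup instead of a growing if/elif chain. All names here are hypothetical, not verl's actual API:

```python
# Hypothetical registry for advantage estimators, sketching the nit above.
ADV_ESTIMATORS = {}

def register_adv_est(name):
    """Decorator that registers an advantage estimator under a string key."""
    def wrap(fn):
        ADV_ESTIMATORS[name] = fn
        return fn
    return wrap

@register_adv_est("grpo")
def grpo_advantage(rewards):
    # group-relative baseline: each reward minus the group mean
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

@register_adv_est("reinforce")
def reinforce_advantage(rewards):
    # no baseline: the reward is the advantage
    return list(rewards)

def compute_advantage(name, rewards):
    # dispatch by name; adding an algorithm needs no change here
    return ADV_ESTIMATORS[name](rewards)

assert compute_advantage("grpo", [1.0, 2.0, 3.0]) == [-1.0, 0.0, 1.0]
```

A new recipe would then only add a decorated function, keeping the dispatch code untouched.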
feedback period
5/1 - 5/8
cc list
@maksimstw @zpqiu @yhyang201 @cedricbeta @vermouth1992 @hiyouga @sunjin-k @tongyx361 @ZefanW