[RFC] verl recipe and extensions #1561
eric-haibin-lin started this conversation in RFC
Replies: 2 comments
I'm sorry, I'm really busy with other things, but I'll mention the first author of DrGRPO to see if he is willing to help: @lkevinzc. Currently, a short intro and instructions for Dr.GRPO are in
Since we expect recipes to show how developers can customize verl, it might be nice to have a table showing which components each recipe customizes. Furthermore, we could decide whether a new recipe is representative based on whether it customizes a new component.
motivation
RL training can be quite complicated. The goal of verl is to provide an RL library for LLMs that is easy for researchers to hack for exploration, while providing the right level of abstraction so that it can scale to production usage. verl needs a place to put end-to-end RL training pipelines that are promising for the community to reproduce and explore further, while keeping the verl core relatively stable.
There are ongoing PRs adding more functionality and training pipelines to verl. We need consensus on how they should be maintained, and this RFC discusses the recipes verl currently maintains.
what's good for the verl community:
what we want to avoid:
proposed change
verl shall maintain a few folders:
/verl: contains the most common building blocks / utils for RL training
/examples: scripts showing how to use verl for common use cases
/recipe: end-to-end RL recipes covering data preparation, reward definition, env/tool setup, and the training algorithm, usually extending verl's APIs. Once a certain recipe is mature, we move shareable components back to /verl.
The recipes are representative extensions. Specific environment and version/commit info should be documented to ensure reproducibility. However, in the long term the verl team will only ensure the recipes are runnable through tests, and does not guarantee that the full results can be reproduced across versions.
To make verl easier to extend, we need to define the interfaces. Let's break it down:
data format
The RLDataset currently provides data in the format described in https://verl.readthedocs.io/en/latest/preparation/prepare_data.html#prepare-data-for-post-training. For other tasks such as browser use / code / chat preferences, verl should document the data format and provide example Dataset implementation.
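To make the expected shape concrete, here is a sketch of a single record in the chat-style parquet format described in the linked prepare-data doc. The exact field names may differ across verl versions, so treat this as illustrative and verify against the documentation for your version:

```python
# Sketch of one training record in the parquet-based format that verl's
# default dataset expects (see the prepare_data doc linked above).
# Field names follow that doc but may vary across verl versions.
record = {
    "data_source": "openai/gsm8k",   # identifies which reward function applies
    "prompt": [                       # chat-format prompt, a list of messages
        {"role": "user", "content": "Natalia sold 48 clips in April and half as many in May. How many in total?"}
    ],
    "ability": "math",                # task category
    "reward_model": {
        "style": "rule",              # rule-based reward rather than a learned RM
        "ground_truth": "72",
    },
    "extra_info": {"split": "train", "index": 0},
}

# A dataset is a list of such records, typically serialized to parquet
# (e.g. pandas.DataFrame(records).to_parquet("train.parquet")).
required = {"data_source", "prompt", "reward_model"}
assert required <= record.keys()
```

For new task types (browser use, code, preferences), documenting an analogous record schema per task would let recipe authors implement a compatible Dataset without reading the loader source.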
TODO:
trainer
The RayPPOTrainer provides an example training loop for a typical on-policy RL algorithm with a reward model/function. Since #1282, dataset construction is done outside of the trainer, so it is more flexible for dataset customization. For any new algorithm that requires a different training loop (e.g. async RL, DPO) or considerable changes to the existing trainer, we highly recommend first trying to inherit from the existing trainer, or simply making a copy of the trainer for your own exploration.
The RayPPOTrainer is expected to be maintained as an API with backward compatibility. For plugins or integrations that should be added to RayPPOTrainer interface to improve its functionality/performance, please leave a comment in this thread for discussions.
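The inheritance route above can be sketched as follows. Note that `RayPPOTrainer`'s real constructor and `fit` signatures differ; the stub base class here is purely hypothetical, to illustrate the shape of a recipe that overrides one hook while reusing the loop:

```python
# Sketch of the recommended extension pattern: inherit from the existing
# trainer and override only the pieces your algorithm changes.
# BaseTrainer is a hypothetical stand-in for verl's RayPPOTrainer.

class BaseTrainer:
    total_steps = 3

    def fit(self):
        # The base class owns the generic rollout -> update loop.
        for step in range(self.total_steps):
            batch = self.generate_rollouts(step)
            self.update_policy(batch)

    def generate_rollouts(self, step):
        return {"step": step}

    def update_policy(self, batch):
        raise NotImplementedError


class MyRecipeTrainer(BaseTrainer):
    """A recipe overrides the hooks it needs; the loop stays in the base."""

    def update_policy(self, batch):
        # e.g. swap the PPO update for a DPO-style loss here
        self.last_batch = batch


trainer = MyRecipeTrainer()
trainer.fit()
assert trainer.last_batch == {"step": 2}
```

If the changes grow beyond a couple of overridden methods, copying the trainer into the recipe folder (the second option above) avoids coupling the recipe to internal details of the base class.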
TODO:
entry script
You should always make a copy of the entry script, such as main_ppo.py, should you need to make changes.
recipe code structure
To add a new recipe, a few things are expected from recipe maintainers:
- a README.md file that describes what this recipe is about, plus any background/paper for educational purposes
- no changes to /verl itself (unless the added functions are highly likely to be reused by other recipes)
Example structure:
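For illustration, a recipe folder following the expectations above might look like this (directory and file names are hypothetical, not a prescribed layout):

```
recipe/my_algo/
├── README.md           # what the recipe does, paper link, environment/commit info
├── prepare_data.py     # dataset download + conversion to verl's data format
├── my_algo_trainer.py  # trainer subclass (or copy) with the algorithm changes
├── main_my_algo.py     # entry script, copied from main_ppo.py
└── run_my_algo.sh      # launch script pinning the config for reproducibility
```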
In the case of drgrpo, it's already included in the common ppo trainer. I recommend we document each individual algorithm's introduction and usage as a standalone page under a new algorithms section in the verl documentation site https://verl.readthedocs.io/en/latest, covering DrGRPO, GRPO, RF++, etc.
nit: in theory the compute_advantage function could be simplified further with a registration mechanism, but I won't discuss it here.
Other things to discuss in future RFCs:
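Returning briefly to the nit above: the registration idea could look like the sketch below, where advantage estimators register by name so compute_advantage becomes a dictionary lookup instead of a growing if/elif chain. All names here are hypothetical, not verl's actual API:

```python
# Hypothetical registry for advantage estimators, sketching the nit above.
ADV_ESTIMATORS = {}

def register_adv_est(name):
    """Decorator that registers an advantage estimator under a string key."""
    def wrap(fn):
        ADV_ESTIMATORS[name] = fn
        return fn
    return wrap

@register_adv_est("grpo")
def grpo_advantage(rewards):
    # group-relative baseline: each reward minus the group mean
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

@register_adv_est("reinforce")
def reinforce_advantage(rewards):
    # no baseline: the reward is the advantage
    return list(rewards)

def compute_advantage(name, rewards):
    # dispatch by name; adding an algorithm needs no change here
    return ADV_ESTIMATORS[name](rewards)

assert compute_advantage("grpo", [1.0, 2.0, 3.0]) == [-1.0, 0.0, 1.0]
```

A new recipe would then only add a decorated function, keeping the dispatch code untouched.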
feedback period
5/1 - 5/8
cc list
@maksimstw @zpqiu @yhyang201 @cedricbeta @vermouth1992 @hiyouga @sunjin-k @tongyx361 @ZefanW