|
1 | | -<h1 style="text-align: center;">verl: Volcano Engine Reinforcement Learning for LLM</h1> |
2 | | - |
3 | | -[GitHub Repo stars](https://github.com/volcengine/verl/stargazers)
4 | | - |
5 | | -[Twitter](https://twitter.com/verl_project)
6 | | -<a href="https://join.slack.com/t/verlgroup/shared_invite/zt-2w5p9o4c3-yy0x2Q56s_VlGLsJ93A6vA"><img src="https://img.shields.io/badge/Slack-verl-blueviolet?logo=slack&"></a> |
7 | | -<a href="https://arxiv.org/pdf/2409.19256"><img src="https://img.shields.io/static/v1?label=EuroSys&message=Paper&color=red"></a> |
8 | | - |
9 | | -[Documentation](https://verl.readthedocs.io/en/latest/)
10 | | -<a href="https://raw.githubusercontent.com/eric-haibin-lin/verl-community/refs/heads/main/WeChat.JPG"><img src="https://img.shields.io/badge/微信-green?logo=wechat&"></a> |
11 | | - |
12 | | - |
13 | | -verl is a flexible, efficient and production-ready RL training library for large language models (LLMs). |
14 | | - |
15 | | -verl is the open-source implementation of the **[HybridFlow: A Flexible and Efficient RLHF Framework](https://arxiv.org/abs/2409.19256v2)** paper.
16 | | - |
17 | | -verl is flexible and easy to use with: |
18 | | - |
19 | | -- **Easy extension of diverse RL algorithms**: The hybrid-controller programming model enables flexible representation and efficient execution of complex post-training dataflows. Build RL dataflows such as GRPO and PPO in a few lines of code (a minimal launch sketch follows this list).
20 | | - |
21 | | -- **Seamless integration of existing LLM infra with modular APIs**: Decouples computation and data dependencies, enabling integration with existing LLM frameworks such as FSDP, Megatron-LM, vLLM, and SGLang.
22 | | - |
23 | | -- **Flexible device mapping**: Supports various placement of models onto different sets of GPUs for efficient resource utilization and scalability across different cluster sizes. |
24 | | - |
25 | | -- Ready integration with popular Hugging Face models
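
For illustration, a minimal launch sketch assembled from the Quickstart pages linked in Getting Started below. The entry point and config keys mirror verl's documentation, but the data paths, model name, and GPU counts are placeholders, and key names can change between releases:

```bash
# Hedged sketch: launch PPO from the command line via Hydra-style overrides.
# Paths, model names, and GPU counts below are placeholders.
python3 -m verl.trainer.main_ppo \
    data.train_files=$HOME/data/gsm8k/train.parquet \
    data.val_files=$HOME/data/gsm8k/test.parquet \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
    critic.model.path=Qwen/Qwen2.5-0.5B-Instruct \
    actor_rollout_ref.rollout.name=vllm \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=1
# Switching the dataflow (e.g. PPO -> GRPO) is likewise a config-level change;
# the GRPO examples set algorithm.adv_estimator=grpo.
```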
26 | | - |
27 | | - |
28 | | -verl is fast with: |
29 | | - |
30 | | -- **State-of-the-art throughput**: Integrates SOTA LLM training and inference engines and delivers SOTA RL throughput.
31 | | - |
32 | | -- **Efficient actor model resharding with 3D-HybridEngine**: Eliminates memory redundancy and significantly reduces communication overhead during transitions between training and generation phases. |
33 | | - |
35 | | - |
36 | | -## News |
37 | | -- [2025/03] [DAPO](https://dapo-sia.github.io/) is an open-source SOTA RL algorithm that achieves 50 points on AIME 2024 with the Qwen2.5-32B pre-trained model, surpassing the previous SOTA achieved by DeepSeek's GRPO (DeepSeek-R1-Zero-Qwen-32B). DAPO's training is fully powered by verl, and the reproduction code is now [publicly available](https://github.com/volcengine/verl/tree/gm-tyx/puffin/main/recipe/dapo).
38 | | -- [2025/03] We will present verl (HybridFlow) at EuroSys 2025. See you in Rotterdam!
39 | | -- [2025/03] We introduced the programming model of verl at the [vLLM Beijing Meetup](https://mp.weixin.qq.com/s/n77GibL2corAtQHtVEAzfg) and gave a [verl intro and updates](https://github.com/eric-haibin-lin/verl-community/blob/main/slides/verl-lmsys-meetup.pdf) talk at the [LMSys Meetup](https://lu.ma/ntjrr7ig) in Sunnyvale in mid-March.
40 | | -- [2025/02] verl v0.2.0.post2 is released! See [release note](https://github.com/volcengine/verl/releases/) for details. |
41 | | -- [2025/01] [Doubao-1.5-pro](https://team.doubao.com/zh/special/doubao_1_5_pro) is released with SOTA-level performance on LLM & VLM. The RL scaling preview model is trained using verl, reaching OpenAI O1-level performance on math benchmarks (70.0 pass@1 on AIME). |
42 | | -<details><summary> more... </summary> |
43 | | -<ul> |
44 | | - <li>[2025/02] We presented verl in the <a href="https://lu.ma/ji7atxux">Bytedance/NVIDIA/Anyscale Ray Meetup</a>. See you in San Jose!</li> |
45 | | - <li>[2024/12] verl is presented at Ray Forward 2024. Slides available <a href="https://github.com/eric-haibin-lin/verl-community/blob/main/slides/Ray_Forward_2024_%E5%B7%AB%E9%94%A1%E6%96%8C.pdf">here</a></li> |
46 | | - <li>[2024/10] verl is presented at Ray Summit. <a href="https://www.youtube.com/watch?v=MrhMcXkXvJU&list=PLzTswPQNepXntmT8jr9WaNfqQ60QwW7-U&index=37">Youtube video</a> available.</li> |
47 | | - <li>[2024/12] The team presented <a href="https://neurips.cc/Expo/Conferences/2024/workshop/100677">Post-training LLMs: From Algorithms to Infrastructure</a> at NeurIPS 2024. <a href="https://github.com/eric-haibin-lin/verl-data/tree/neurips">Slides</a> and <a href="https://neurips.cc/Expo/Conferences/2024/workshop/100677">video</a> available.</li> |
48 | | - <li>[2024/08] HybridFlow (verl) is accepted to EuroSys 2025.</li> |
49 | | -</ul> |
50 | | -</details> |
51 | | - |
52 | | -## Key Features |
53 | | - |
54 | | -- **FSDP** and **Megatron-LM** for training. |
55 | | -- **vLLM**, **SGLang** (experimental), and **HF Transformers** for rollout generation.
56 | | -- Compatible with Hugging Face Transformers and ModelScope Hub: Qwen-2.5, Llama3.1, Gemma2, DeepSeek-LLM, etc.
57 | | -- Supervised fine-tuning. |
58 | | -- Reinforcement learning with [PPO](examples/ppo_trainer/), [GRPO](examples/grpo_trainer/), [ReMax](examples/remax_trainer/), [Reinforce++](https://verl.readthedocs.io/en/latest/examples/config.html#algorithm), [RLOO](examples/rloo_trainer/), [PRIME](recipe/prime/), etc. |
59 | | - - Support model-based reward and function-based reward (verifiable reward) |
60 | | - - Support vision-language models (VLMs) and [multi-modal RL](examples/grpo_trainer/run_qwen2_5_vl-7b.sh) |
61 | | -- Flash attention 2, [sequence packing](examples/ppo_trainer/run_qwen2-7b_seq_balance.sh), [sequence parallelism](examples/ppo_trainer/run_deepseek7b_llm_sp2.sh) support via DeepSpeed Ulysses, [LoRA](examples/sft/gsm8k/run_qwen_05_peft.sh), [Liger-kernel](examples/sft/gsm8k/run_qwen_05_sp2_liger.sh). |
62 | | -- Scales up to 70B models and hundreds of GPUs. |
63 | | -- Experiment tracking with wandb, swanlab, mlflow and tensorboard. |
64 | | - |
65 | | -## Upcoming Features |
66 | | -- DeepSeek 671b optimizations with Megatron v0.11 |
67 | | -- Multi-turn rollout optimizations |
68 | | - |
69 | | -## Getting Started |
70 | | - |
71 | | -<a href="https://verl.readthedocs.io/en/latest/index.html"><b>Documentation</b></a> |
72 | | - |
73 | | -**Quickstart:** |
74 | | -- [Installation](https://verl.readthedocs.io/en/latest/start/install.html) (a from-source install sketch follows this list)
75 | | -- [Quickstart](https://verl.readthedocs.io/en/latest/start/quickstart.html) |
76 | | -- [Programming Guide](https://verl.readthedocs.io/en/latest/hybrid_flow.html) |
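
As a hedged sketch of the from-source route described in the Installation guide (engine-specific extras such as vLLM, SGLang, or Megatron dependencies are covered there and omitted here):

```bash
# Hedged sketch: install verl from source; see the Installation guide for
# inference-engine (vLLM/SGLang) and Megatron-specific dependencies.
git clone https://github.com/volcengine/verl.git
cd verl
pip3 install -e .
```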
77 | | - |
78 | | -**Running a PPO example step-by-step:** |
79 | | -- Data and Reward Preparation |
80 | | -  - [Prepare Data for Post-Training](https://verl.readthedocs.io/en/latest/preparation/prepare_data.html) (a GSM8K preprocessing sketch follows this list)
81 | | - - [Implement Reward Function for Dataset](https://verl.readthedocs.io/en/latest/preparation/reward_function.html) |
82 | | -- Understanding the PPO Example |
83 | | - - [PPO Example Architecture](https://verl.readthedocs.io/en/latest/examples/ppo_code_architecture.html) |
84 | | - - [Config Explanation](https://verl.readthedocs.io/en/latest/examples/config.html) |
85 | | - - [Run GSM8K Example](https://verl.readthedocs.io/en/latest/examples/gsm8k_example.html) |
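
For reference, a hedged sketch of the data-preparation step from the GSM8K walkthrough; the script path and `--local_dir` flag follow the linked pages and may differ across versions:

```bash
# Hedged sketch: preprocess GSM8K into the train/test parquet files the PPO
# example consumes (the output directory is a placeholder).
python3 examples/data_preprocess/gsm8k.py --local_dir ~/data/gsm8k
```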
86 | | - |
87 | | -**Reproducible algorithm baselines:** |
88 | | -- [PPO, GRPO, ReMax](https://verl.readthedocs.io/en/latest/experiment/ppo.html) |
89 | | - |
90 | | -**For code explanation and advanced usage (extension):**
91 | | -- PPO Trainer and Workers |
92 | | - - [PPO Ray Trainer](https://verl.readthedocs.io/en/latest/workers/ray_trainer.html) |
93 | | - - [PyTorch FSDP Backend](https://verl.readthedocs.io/en/latest/workers/fsdp_workers.html) |
94 | | - - [Megatron-LM Backend](https://verl.readthedocs.io/en/latest/index.html) |
95 | | -- Advanced Usage and Extension
96 | | - - [Ray API design tutorial](https://verl.readthedocs.io/en/latest/advance/placement.html) |
97 | | - - [Extend to Other RL(HF) algorithms](https://verl.readthedocs.io/en/latest/advance/dpo_extension.html) |
98 | | - - [Add Models with the FSDP Backend](https://verl.readthedocs.io/en/latest/advance/fsdp_extension.html) |
99 | | - - [Add Models with the Megatron-LM Backend](https://verl.readthedocs.io/en/latest/advance/megatron_extension.html) |
100 | | - - [Deployment using Separate GPU Resources](https://github.com/volcengine/verl/tree/main/examples/split_placement) |
101 | | - |
102 | | -**Blogs from the community** |
103 | | -- [Best practices for distributed GRPO reinforcement learning training with verl](https://www.volcengine.com/docs/6459/1463942)
104 | | -- [A walkthrough of the HybridFlow (veRL) paper](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/verl/readme.md)
105 | | -- [Up to 20x higher throughput! The Doubao LLM team releases a brand-new RLHF framework, now open source!](https://team.doubao.com/en/blog/%E6%9C%80%E9%AB%98%E6%8F%90%E5%8D%8720%E5%80%8D%E5%90%9E%E5%90%90%E9%87%8F-%E8%B1%86%E5%8C%85%E5%A4%A7%E6%A8%A1%E5%9E%8B%E5%9B%A2%E9%98%9F%E5%8F%91%E5%B8%83%E5%85%A8%E6%96%B0-rlhf-%E6%A1%86%E6%9E%B6-%E7%8E%B0%E5%B7%B2%E5%BC%80%E6%BA%90)
| 1 | +<h1 style="text-align: center;">Embodied R1: Incentivizing Environment Interaction |
| 2 | +Ability in LLMs via Reinforcement Learning</h1> |
106 | 3 |
|
| 4 | +## 1. Installation
| 5 | +1. Embodied-R1 is based on verl with vLLM >= 0.8.
| 6 | +``` |
| 7 | +# Create the conda environment |
| 8 | +conda create -n embodied-r1 python==3.10 |
| 9 | +conda activate embodied-r1 |
107 | 10 |
|
108 | | -## Performance Tuning Guide |
109 | | -Performance is essential for on-policy RL algorithms. We provide a detailed performance tuning guide to help users tune their setups; see [here](https://verl.readthedocs.io/en/latest/perf/perf_tuning.html) for more details.
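
As a rough, hedged illustration, the knobs that guide discusses are set as config overrides like the ones below (key names follow the guide and config docs and may differ across verl versions):

```bash
# Hedged sketch: throughput-related overrides appended to a full launch
# command such as the Quickstart PPO invocation; values are placeholders.
python3 -m verl.trainer.main_ppo \
    ... \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True
```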
| 11 | +cd embodied-r1 |
| 12 | +pip3 install -e . |
110 | 13 |
|
111 | | -## Use vLLM v0.8 |
112 | | -verl now supports vLLM >= 0.8.0 when using FSDP as the training backend. Please refer to [this document](docs/README_vllm0.8.md) for the installation guide and more information.
| 14 | +# Install the latest stable version of vLLM |
| 15 | +pip3 install vllm==0.8.3 |
113 | 16 |
|
114 | | -## Citation and acknowledgement |
| 17 | +# Install flash-attn |
| 18 | +pip3 install flash-attn --no-build-isolation |
| 19 | +``` |
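
Optionally, a quick hedged sanity check that the inference and attention dependencies installed above resolve (module names are the standard ones for these packages):

```bash
# Hedged sanity check: confirm vLLM and flash-attn import and that vLLM
# reports the expected 0.8.x version.
python3 -c "import vllm, flash_attn; print('vLLM', vllm.__version__)"
```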
115 | 20 |
|
116 | | -If you find the project helpful, please cite: |
117 | | -- [HybridFlow: A Flexible and Efficient RLHF Framework](https://arxiv.org/abs/2409.19256v2) |
118 | | -- [A Framework for Training Large Language Models for Code Generation via Proximal Policy Optimization](https://i.cs.hku.hk/~cwu/papers/gmsheng-NL2Code24.pdf) |
| 21 | +2. Prepare the environment for ALFWorld
| 22 | +``` |
| 23 | +conda create -n alfworld python=3.9 |
| 24 | +conda activate alfworld |
119 | 25 |
|
120 | | -```tex |
121 | | -@article{sheng2024hybridflow, |
122 | | - title = {HybridFlow: A Flexible and Efficient RLHF Framework}, |
123 | | - author = {Guangming Sheng and Chi Zhang and Zilingfeng Ye and Xibin Wu and Wang Zhang and Ru Zhang and Yanghua Peng and Haibin Lin and Chuan Wu}, |
124 | | - year = {2024}, |
125 | | - journal = {arXiv preprint arXiv: 2409.19256} |
126 | | -} |
| 26 | +# Download the ALFWorld task data for training
| 27 | +pip install alfworld |
| 28 | +alfworld-download |
127 | 29 | ``` |
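
To confirm the ALFWorld setup before training, a hedged check using a console script the `alfworld` package documents (command names may vary across alfworld versions):

```bash
# Hedged sketch: interactively play one TextWorld-backed ALFWorld game to
# verify that the data fetched by alfworld-download is found.
conda activate alfworld
alfworld-play-tw
```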
128 | 30 |
|
129 | | -verl is inspired by the design of NeMo-Aligner, DeepSpeed-Chat, and OpenRLHF. The project is adopted and supported by Anyscale, Bytedance, LMSys.org, Shanghai AI Lab, Tsinghua University, UC Berkeley, UCLA, UIUC, University of Hong Kong, and many more.
| 31 | +3. Prepare the environment for ScienceWorld
| 32 | +``` |
| 33 | +conda create --name scienceworld python=3.8 |
| 34 | +conda activate scienceworld |
130 | 35 |
|
131 | | -## Awesome work using verl |
132 | | -- [TinyZero](https://github.com/Jiayi-Pan/TinyZero): a reproduction of the **DeepSeek R1 Zero** recipe for reasoning tasks
133 | | -- [DAPO](https://dapo-sia.github.io/): the fully open-source SOTA RL algorithm that beats DeepSeek-R1-zero-32B
134 | | -- [SkyThought](https://github.com/NovaSky-AI/SkyThought): RL training for Sky-T1-7B by the NovaSky AI team.
135 | | -- [Easy-R1](https://github.com/hiyouga/EasyR1): **Multi-modal** RL training framework  |
136 | | -- [OpenManus-RL](https://github.com/OpenManus/OpenManus-RL): an RL tuning framework for LLM agents across multiple agent environments.
137 | | -- [deepscaler](https://github.com/agentica-project/deepscaler): iterative context scaling with GRPO  |
138 | | -- [PRIME](https://github.com/PRIME-RL/PRIME): Process reinforcement through implicit rewards  |
139 | | -- [RAGEN](https://github.com/ZihanWang314/ragen): a general-purpose reasoning **agent** training framework  |
140 | | -- [Logic-RL](https://github.com/Unakar/Logic-RL): a reproduction of DeepSeek R1 Zero on 2K Tiny Logic Puzzle Dataset.  |
141 | | -- [Search-R1](https://github.com/PeterGriffinJin/Search-R1): RL with reasoning and **searching (tool-call)** interleaved LLMs  |
142 | | -- [ReSearch](https://github.com/Agent-RL/ReSearch): Learning to **Re**ason with **Search** for LLMs via Reinforcement Learning  |
143 | | -- [DeepRetrieval](https://github.com/pat-jj/DeepRetrieval): Hacking **Real Search Engines** and **retrievers** with LLMs via RL for **information retrieval**  |
144 | | -- [cognitive-behaviors](https://github.com/kanishkg/cognitive-behaviors): Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs  |
145 | | -- [MetaSpatial](https://github.com/PzySeere/MetaSpatial): Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse  |
146 | | -- [DeepEnlighten](https://github.com/DolbyUUU/DeepEnlighten): Reproduce R1 with **social reasoning** tasks and analyze key findings  |
147 | | -- [Code-R1](https://github.com/ganler/code-r1): Reproducing R1 for **Code** with Reliable Rewards  |
148 | | -- [self-rewarding-reasoning-LLM](https://arxiv.org/pdf/2502.19613): self-rewarding and correction with **generative reward models**  |
149 | | -- [critic-rl](https://github.com/HKUNLP/critic-rl): LLM critics for code generation  |
150 | | -- [DQO](https://arxiv.org/abs/2410.09302): Enhancing multi-step reasoning abilities of language models through direct Q-function optimization
151 | | -- [FIRE](https://arxiv.org/abs/2410.21236): Flaming-hot initiation with regular execution sampling for large language models |
| 36 | +pip install scienceworld |
| 37 | +``` |
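
Similarly, a hedged check for the ScienceWorld environment; note that creating an actual environment also needs a Java runtime (1.8+) on the PATH, since ScienceWorld drives a Java backend:

```bash
# Hedged sanity check: confirm the scienceworld package imports; building an
# environment additionally requires Java 1.8+ on the PATH.
conda activate scienceworld
python -c "from scienceworld import ScienceWorldEnv; print('ScienceWorld import OK')"
```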
152 | 38 |
|
153 | | -## Contribution Guide |
154 | | -Contributions from the community are welcome! Please check out our [roadmap](https://github.com/volcengine/verl/issues/22) and [release plan](https://github.com/volcengine/verl/issues/354).
| 39 | +## 2. Prepare the data
| 40 | +``` |
| 41 | +# Get task data for RL training
| 42 | +bash get_data/get_data_for_training.sh |
| 43 | +``` |
155 | 44 |
|
156 | | -### Code formatting |
157 | | -We use yapf (Google style) to enforce strict code formatting when reviewing PRs. To reformat your code locally, make sure you have installed the **latest** `yapf`:
158 | | -```bash |
159 | | -pip3 install yapf --upgrade |
| 45 | +## 3. Start training |
160 | 46 | ``` |
161 | | -Then, make sure you are at the top level of the verl repo and run:
162 | | -```bash |
163 | | -bash scripts/format.sh |
| 47 | +# Remember to replace the path in the shell script with your local path |
| 48 | +bash cmd/alf.sh |
| 49 | +
|
| 50 | +bash cmd/sci_easy.sh |
164 | 51 | ``` |
165 | | -We are HIRING! Send us an [email](mailto:[email protected]) if you are interested in internship/FTE opportunities in MLSys, LLM reasoning, or multimodal alignment.