thumbnail-img: /assets/figures/openrlhf-vllm/ray.png
share-img: /assets/figures/openrlhf-vllm/ray.png
---

As the demand for training reasoning-capable large language models (LLMs) grows, Reinforcement Learning from Human Feedback (RLHF) has become a pivotal technique. However, traditional RLHF training pipelines, especially those involving Proximal Policy Optimization (PPO), often face significant computational bottlenecks. In particular, for models that excel at complex reasoning tasks (often referred to as O1 models), generating long chain-of-thought (CoT) outputs can consume up to 90% of total training time. Because these models produce detailed step-by-step reasoning that can span thousands of tokens, the inference phase becomes significantly more time-consuming than the training phase itself.

## Design Philosophy

To address these challenges, OpenRLHF is designed as a user-friendly, high-performance RLHF framework that integrates key technologies such as Ray, vLLM, the Zero Redundancy Optimizer (ZeRO-3), and Automatic Tensor Parallelism (AutoTP):

**Ray** serves as the backbone for distributed programming within OpenRLHF. Its robust scheduling and orchestration capabilities make it ideal for managing the complex data flows and computations inherent in RLHF training, including the distribution of reward models across multiple nodes.
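
To make Ray's role concrete, the following is a minimal, hedged sketch of the primitives OpenRLHF builds on; the `RewardModelWorker` class and its scoring logic are purely illustrative:

```python
import ray

ray.init()

# Each RLHF component (Actor, Critic, Reference, Reward model) can be
# expressed as a Ray actor that the scheduler pins to its own GPU.
@ray.remote(num_gpus=1)
class RewardModelWorker:
    def score(self, prompt: str, response: str) -> float:
        # Placeholder logic; a real worker would run a reward-model forward pass.
        return float(len(response) - len(prompt))

# Ray schedules the workers across the nodes of the cluster and lets the
# driver orchestrate them through asynchronous remote calls.
workers = [RewardModelWorker.remote() for _ in range(4)]
futures = [w.score.remote("prompt", "a longer response") for w in workers]
print(ray.get(futures))
```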

**vLLM with Ray Executor and AutoTP** is central to accelerating inference within OpenRLHF. It natively supports Ray Executors and integrates with Hugging Face Transformers, enabling efficient weight updates through AutoTP. This combination delivers high-throughput, memory-efficient text generation with large language models.
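
From the user's side, this looks roughly like the sketch below (the model name is illustrative, and the `distributed_executor_backend="ray"` option assumes a recent vLLM release):

```python
from vllm import LLM, SamplingParams

# Shard the model across 4 GPUs via tensor parallelism; the Ray executor
# backend lets these workers participate in a larger Ray cluster.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model choice
    tensor_parallel_size=4,
    distributed_executor_backend="ray",
)

outputs = llm.generate(
    ["Explain RLHF in one sentence."],
    SamplingParams(temperature=0.8, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```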

**ZeRO-3 with Hugging Face Transformers** enables OpenRLHF to train large-scale models without complex frameworks like Megatron. ZeRO-3, a memory optimization strategy from DeepSpeed, partitions parameters, gradients, and optimizer states across GPUs, and it integrates seamlessly with Hugging Face Transformers, facilitating straightforward loading and fine-tuning of pre-trained models.
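
A representative sketch of such a setup follows; this is not OpenRLHF's actual configuration, and the hyperparameter values are illustrative:

```python
import deepspeed
from transformers import AutoModelForCausalLM

# Minimal DeepSpeed config enabling ZeRO stage 3, which partitions
# parameters, gradients, and optimizer states across data-parallel ranks.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-6}},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,  # overlap gradient communication with compute
    },
}

# The model is loaded with plain Hugging Face APIs; no Megatron-style
# model surgery is required.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```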

By combining Ray, vLLM, ZeRO-3, and Hugging Face Transformers, OpenRLHF offers a simple, high-performance solution for accelerating RLHF training. This architecture has influenced other frameworks, such as veRL, which adopt a similar paradigm for efficient and scalable RLHF training.

<img align="center" src="/assets/figures/openrlhf-vllm/ray.png" alt="Ray and vLLM in OpenRLHF" width="90%" height="90%">

As illustrated in the figure, OpenRLHF utilizes [Ray's placement group API](https://docs.ray.io/en/latest/ray-core/scheduling/placement-group.html) to flexibly schedule the various RLHF components, including the vLLM engine, Actor, Critic, Reference, and Reward models. Although these models are depicted separately, they can be co-located within shared Ray placement groups to optimize resource utilization: all modules can share the same GPU group in a Hybrid Engine configuration, or specific components, such as the Actor and Critic, can be assigned to the same GPU group. All of these modules are scheduled and controlled through a centralized Ray actor that orchestrates the entire training process. Weight synchronization between the Actor and the vLLM engine is achieved through high-performance communication mechanisms such as NVIDIA's Collective Communications Library (NCCL) or CUDA Inter-Process Communication (IPC) memory copying, particularly in Hybrid Engine setups.
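
A hedged sketch of this scheduling pattern is shown below; the bundle sizes and fractional GPU shares are illustrative, and the empty model classes stand in for real workers:

```python
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init()

# Reserve 4 GPU bundles; PACK places them on as few nodes as possible.
pg = placement_group([{"GPU": 1, "CPU": 4}] * 4, strategy="PACK")
ray.get(pg.ready())

@ray.remote(num_gpus=0.5)
class ActorModel:
    pass

@ray.remote(num_gpus=0.5)
class CriticModel:
    pass

# Fractional `num_gpus` lets the Actor and Critic share bundle 0's GPU,
# mirroring the co-located (Hybrid Engine) configuration described above.
shared = PlacementGroupSchedulingStrategy(placement_group=pg,
                                          placement_group_bundle_index=0)
actor = ActorModel.options(scheduling_strategy=shared).remote()
critic = CriticModel.options(scheduling_strategy=shared).remote()
```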

## Implementing RLHF Acceleration with vLLM Ray Executor

vLLM provides examples demonstrating how to accelerate RLHF training using Ray. By defining a custom `WorkerExtension` class, users can implement logic for weight synchronization between training and inference components. The `VLLM_RAY_PER_WORKER_GPUS` environment variable facilitates the allocation of GPU resources per worker, enabling configurations like hybrid engines where multiple components share the same GPU group.
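
In the spirit of the linked examples, a worker extension looks roughly like the condensed sketch below. This is not vLLM's actual code: the method bodies are simplified, and attributes such as `self.model_runner` are supplied by the vLLM worker that the class is mixed into (via the `worker_extension_cls` engine argument in recent vLLM releases):

```python
import torch

class ColocateWorkerExtension:
    """Condensed, illustrative sketch of a worker extension used for
    weight synchronization between training and inference."""

    def update_weights_from_ipc_handles(self, ipc_handles):
        # Rebuild each tensor from a CUDA IPC handle exported by the
        # co-located training process, then hot-swap it into the model.
        weights = []
        for name, (rebuild_func, args) in ipc_handles.items():
            weights.append((name, rebuild_func(*args)))
        self.model_runner.model.load_weights(weights=weights)
        torch.cuda.synchronize()

    def check_weights_changed(self):
        # Illustrative sanity check that the updated weights are resident
        # on the GPU; the real example compares values against a pattern.
        return all(p.is_cuda for p in self.model_runner.model.parameters())
```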

[The example](https://docs.vllm.ai/en/latest/getting_started/examples/rlhf_colocate.html) setup involves initializing Ray with a specified number of GPUs, creating a placement group for resource allocation, and defining training actors and inference engines. The training actors handle model initialization and weight updates, while the inference engines serve the models using vLLM. Weight synchronization between these components is achieved through inter-process communication mechanisms like CUDA IPC or NCCL, ensuring consistency across the training pipeline. The excerpt below shows the final weight-update step of that example:

```python
# ... (Ray initialization, placement-group setup, and the definitions of
# training_actors, inference_engines, and ipc_handles are elided here;
# see the linked example for the full script) ...

# Push the updated weights, exported as CUDA IPC handles by the training
# actors, into every co-located vLLM inference engine.
for llm in inference_engines:
    ray.get(llm.collective_rpc.remote("update_weights_from_ipc_handles",
                                      args=(ipc_handles,)))
    # Verify that each engine is now serving the updated weights.
    assert ray.get(llm.collective_rpc.remote("check_weights_changed", args=tuple()))
```
