thumbnail-img: /assets/figures/openrlhf-vllm/ray.png
share-img: /assets/figures/openrlhf-vllm/ray.png
---

As the demand for training reasoning-capable large language models (LLMs) grows, Reinforcement Learning from Human Feedback (RLHF) has become a pivotal technique. However, traditional RLHF training pipelines, especially those involving Proximal Policy Optimization (PPO), often face significant computational bottlenecks. In particular, for models that excel at complex reasoning tasks (often referred to as O1 models), generating long chain-of-thought (CoT) outputs can consume up to 90% of total training time. Because these models produce detailed step-by-step reasoning that can span thousands of tokens, the inference phase becomes significantly more time-consuming than the training phase itself.

## Design Philosophy

To address these challenges, OpenRLHF is designed as a user-friendly, high-performance RLHF framework that integrates key technologies such as Ray, vLLM, the Zero Redundancy Optimizer (ZeRO-3), and Automatic Tensor Parallelism (AutoTP):

**Ray** serves as the backbone for distributed programming within OpenRLHF. Its robust scheduling and orchestration capabilities make it ideal for managing the complex data flows and computations inherent in RLHF training, including the distribution of reward models across multiple nodes.
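
To make Ray's role concrete, the following is a minimal, hedged sketch of the primitives OpenRLHF builds on; the `RewardModelWorker` class and its scoring logic are purely illustrative:

```python
import ray

ray.init()

# Each RLHF component (Actor, Critic, Reference, Reward model) can be
# expressed as a Ray actor that the scheduler pins to its own GPU.
@ray.remote(num_gpus=1)
class RewardModelWorker:
    def score(self, prompt: str, response: str) -> float:
        # Placeholder logic; a real worker would run a reward-model forward pass.
        return float(len(response) - len(prompt))

# Ray schedules the workers across the nodes of the cluster and lets the
# driver orchestrate them through asynchronous remote calls.
workers = [RewardModelWorker.remote() for _ in range(4)]
futures = [w.score.remote("prompt", "a longer response") for w in workers]
print(ray.get(futures))
```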

**vLLM with Ray Executor and AutoTP** is central to accelerating inference within OpenRLHF. It natively supports Ray Executors and integrates with Hugging Face Transformers, enabling efficient weight updates through AutoTP. This combination delivers high-throughput, memory-efficient text generation with large language models.
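
From the user's side, this looks roughly like the sketch below (the model name is illustrative, and the `distributed_executor_backend="ray"` option assumes a recent vLLM release):

```python
from vllm import LLM, SamplingParams

# Shard the model across 4 GPUs via tensor parallelism; the Ray executor
# backend lets these workers participate in a larger Ray cluster.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model choice
    tensor_parallel_size=4,
    distributed_executor_backend="ray",
)

outputs = llm.generate(
    ["Explain RLHF in one sentence."],
    SamplingParams(temperature=0.8, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```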

**ZeRO-3 with Hugging Face Transformers** enables OpenRLHF to train large-scale models without complex frameworks like Megatron. ZeRO-3, a memory optimization strategy from DeepSpeed, partitions parameters, gradients, and optimizer states across GPUs, and it integrates seamlessly with Hugging Face Transformers, facilitating straightforward loading and fine-tuning of pre-trained models.
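
A representative sketch of such a setup follows; this is not OpenRLHF's actual configuration, and the hyperparameter values are illustrative:

```python
import deepspeed
from transformers import AutoModelForCausalLM

# Minimal DeepSpeed config enabling ZeRO stage 3, which partitions
# parameters, gradients, and optimizer states across data-parallel ranks.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-6}},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,  # overlap gradient communication with compute
    },
}

# The model is loaded with plain Hugging Face APIs; no Megatron-style
# model surgery is required.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```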

By combining Ray, vLLM, ZeRO-3, and Hugging Face Transformers, OpenRLHF offers a simple, high-performance solution for accelerating RLHF training. This architecture has influenced other frameworks, such as veRL, which adopt a similar paradigm for efficient and scalable RLHF training.

<img align="center" src="/assets/figures/openrlhf-vllm/ray.png" alt="Ray and vLLM in OpenRLHF" width="90%" height="90%">

As illustrated in the figure, OpenRLHF utilizes [Ray's placement group API](https://docs.ray.io/en/latest/ray-core/scheduling/placement-group.html) to flexibly schedule the various RLHF components, including the vLLM engine, Actor, Critic, Reference, and Reward models. Although these models are depicted separately, they can be co-located within shared Ray placement groups to optimize resource utilization: all modules can share the same GPU group in a Hybrid Engine configuration, or specific components, such as the Actor and Critic, can be assigned to the same GPU group. All of these modules are scheduled and controlled through a centralized Ray actor that orchestrates the entire training process. Weight synchronization between the Actor and the vLLM engine is achieved through high-performance communication mechanisms such as NVIDIA's Collective Communications Library (NCCL) or CUDA Inter-Process Communication (IPC) memory copying, particularly in Hybrid Engine setups.
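
A hedged sketch of this scheduling pattern is shown below; the bundle sizes and fractional GPU shares are illustrative, and the empty model classes stand in for real workers:

```python
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init()

# Reserve 4 GPU bundles; PACK places them on as few nodes as possible.
pg = placement_group([{"GPU": 1, "CPU": 4}] * 4, strategy="PACK")
ray.get(pg.ready())

@ray.remote(num_gpus=0.5)
class ActorModel:
    pass

@ray.remote(num_gpus=0.5)
class CriticModel:
    pass

# Fractional `num_gpus` lets the Actor and Critic share bundle 0's GPU,
# mirroring the co-located (Hybrid Engine) configuration described above.
shared = PlacementGroupSchedulingStrategy(placement_group=pg,
                                          placement_group_bundle_index=0)
actor = ActorModel.options(scheduling_strategy=shared).remote()
critic = CriticModel.options(scheduling_strategy=shared).remote()
```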

## Implementing RLHF Acceleration with vLLM Ray Executor

vLLM provides examples demonstrating how to accelerate RLHF training using Ray. By defining a custom `WorkerExtension` class, users can implement logic for weight synchronization between training and inference components. The `VLLM_RAY_PER_WORKER_GPUS` environment variable facilitates the allocation of GPU resources per worker, enabling configurations like hybrid engines where multiple components share the same GPU group.
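
In the spirit of the linked examples, a worker extension looks roughly like the condensed sketch below. This is not vLLM's actual code: the method bodies are simplified, and attributes such as `self.model_runner` are supplied by the vLLM worker that the class is mixed into (via the `worker_extension_cls` engine argument in recent vLLM releases):

```python
import torch

class ColocateWorkerExtension:
    """Condensed, illustrative sketch of a worker extension used for
    weight synchronization between training and inference."""

    def update_weights_from_ipc_handles(self, ipc_handles):
        # Rebuild each tensor from a CUDA IPC handle exported by the
        # co-located training process, then hot-swap it into the model.
        weights = []
        for name, (rebuild_func, args) in ipc_handles.items():
            weights.append((name, rebuild_func(*args)))
        self.model_runner.model.load_weights(weights=weights)
        torch.cuda.synchronize()

    def check_weights_changed(self):
        # Illustrative sanity check that the updated weights are resident
        # on the GPU; the real example compares values against a pattern.
        return all(p.is_cuda for p in self.model_runner.model.parameters())
```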

[The example](https://docs.vllm.ai/en/latest/getting_started/examples/rlhf_colocate.html) setup involves initializing Ray with a specified number of GPUs, creating a placement group for resource allocation, and defining training actors and inference engines. The training actors handle model initialization and weight updates, while the inference engines serve the models using vLLM. Weight synchronization between these components is achieved through inter-process communication mechanisms like CUDA IPC or NCCL, ensuring consistency across the training pipeline. The excerpt below shows the final weight-update step of that example:

```python
# ... (Ray initialization, placement-group setup, and the definitions of
# training_actors, inference_engines, and ipc_handles are elided here;
# see the linked example for the full script) ...

# Push the updated weights, exported as CUDA IPC handles by the training
# actors, into every co-located vLLM inference engine.
for llm in inference_engines:
    ray.get(llm.collective_rpc.remote("update_weights_from_ipc_handles",
                                      args=(ipc_handles,)))
    # Verify that each engine is now serving the updated weights.
    assert ray.get(llm.collective_rpc.remote("check_weights_changed", args=tuple()))
```
