As demand grows for training reasoning-capable large language models (LLMs), Reinforcement Learning from Human Feedback (RLHF) has emerged as a cornerstone technique. However, conventional RLHF pipelines—especially those using Proximal Policy Optimization (PPO)—are often hindered by substantial computational overhead. This challenge is particularly pronounced with models that excel at complex reasoning tasks (commonly referred to as O1 models), where generating long chain-of-thought (CoT) outputs can account for up to 90% of total training time. These models must produce detailed, step-by-step reasoning that can span thousands of tokens, making inference significantly more time-consuming than the training phase itself. As a pioneering inference framework, vLLM provides a user-friendly interface for generating RLHF samples and updating model weights.
## Design of OpenRLHF
To strike a balance between performance and usability in RLHF frameworks, [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF) is designed as a high-performance yet user-friendly solution that integrates key technologies like Ray, vLLM, Zero Redundancy Optimizer (ZeRO-3), and Automatic Tensor Parallelism (AutoTP):
**Ray** acts as the backbone of OpenRLHF's distributed architecture. With powerful scheduling and orchestration features, Ray efficiently manages complex data flows and computations, including distributing rule-based reward models across multiple nodes.
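As a minimal sketch of this pattern, a rule-based reward function can be wrapped in a Ray actor and scheduled on any node in the cluster; the `RuleBasedRewardModel` class and its `score` method below are hypothetical stand-ins for illustration, not OpenRLHF's actual reward interface.

```python
import ray

ray.init()  # in practice, connect to the existing training cluster

@ray.remote(num_cpus=1)  # rule-based rewards need no GPU
class RuleBasedRewardModel:
    """Hypothetical rule-based reward actor, used only to sketch the pattern."""

    def score(self, prompt: str, response: str) -> float:
        # Toy rule: reward responses that state an explicit final answer.
        return 1.0 if "Final Answer:" in response else 0.0

# Ray schedules the actor wherever resources are free; calls return futures.
reward_model = RuleBasedRewardModel.remote()
reward = ray.get(reward_model.score.remote("2 + 2 = ?", "... Final Answer: 4"))
```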
**vLLM with Ray Executor and AutoTP** plays a central role in accelerating inference. With built-in support for Ray Executors and integration with HuggingFace Transformers, it enables efficient weight updates through AutoTP, resulting in high-throughput and memory-efficient LLM generation.
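As a rough sketch of this setup (the model name, parallel size, and sampling settings below are placeholders), vLLM can be told to use Ray as its distributed executor so that its tensor-parallel workers run as Ray actors:

```python
from vllm import LLM, SamplingParams

# Launch a vLLM engine whose tensor-parallel workers are scheduled by Ray.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",      # placeholder model
    tensor_parallel_size=2,                # shard weights across 2 GPUs
    distributed_executor_backend="ray",    # run workers as Ray actors
)

sampling_params = SamplingParams(temperature=0.8, max_tokens=1024)
outputs = llm.generate(["Solve step by step: what is 17 * 24?"], sampling_params)
print(outputs[0].outputs[0].text)
```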
**ZeRO-3 with HuggingFace Transformers**, a memory optimization approach from DeepSpeed, empowers OpenRLHF to train large models without requiring heavyweight frameworks like Megatron. This seamless integration with HuggingFace allows for simple loading and fine-tuning of pre-trained models.
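For reference, a minimal sketch of pairing a HuggingFace model with DeepSpeed ZeRO-3 might look like the following; the model name, batch size, and learning rate are placeholder values, not OpenRLHF's defaults:

```python
# Run under the DeepSpeed launcher, e.g.: deepspeed --num_gpus=8 train_sketch.py
import deepspeed
from transformers import AutoModelForCausalLM

# ZeRO-3 partitions parameters, gradients, and optimizer state across
# data-parallel ranks, so large models fit without Megatron-style model code.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,                       # placeholder
    "zero_optimization": {"stage": 3},
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-6}},    # placeholder
}

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # placeholder
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```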
Together, Ray, vLLM, ZeRO-3, and HuggingFace Transformers create a cutting-edge yet streamlined solution for accelerating RLHF training. The architecture has also influenced other frameworks such as [veRL](https://github.com/volcengine/verl), which adopt similar paradigms for scalable and efficient RLHF training.
<img align="center" src="/assets/figures/openrlhf-vllm/ray.png" alt="Ray and vLLM in OpenRLHF" width="90%" height="90%">
As illustrated above, OpenRLHF uses [Ray’s Placement Group API](https://docs.ray.io/en/latest/ray-core/scheduling/placement-group.html) to flexibly schedule components of the RLHF pipeline, including the vLLM engine, Actor, Critic, Reference, and Reward models. Although represented separately, these components can be colocated in shared Ray placement groups to maximize resource efficiency. For example, all modules can operate within the same GPU group in a hybrid engine configuration, or specific components, such as the Actor and Critic, can be grouped together. All modules are orchestrated by a central Ray Actor, which manages the entire training lifecycle. Weight synchronization between the Actor and the vLLM engine is handled via high-performance communication methods, such as NVIDIA Collective Communications Library (NCCL) or CUDA Inter-Process Communication (IPC) memory transfers in hybrid engine settings.
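The colocation pattern can be sketched directly with the placement group API; the bundle layout, fractional GPU share, and `TrainingActor` class below are illustrative assumptions rather than OpenRLHF's actual scheduling code:

```python
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init()

# Reserve four GPU bundles; actors pinned to the same bundle share that GPU.
pg = placement_group([{"GPU": 1, "CPU": 1}] * 4, strategy="PACK")
ray.get(pg.ready())

@ray.remote(num_gpus=0.5)  # fractional GPUs let two components share one device
class TrainingActor:  # hypothetical stand-in for an RLHF component
    def node_id(self):
        return ray.get_runtime_context().get_node_id()

actors = [
    TrainingActor.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=pg,
            placement_group_bundle_index=i,
        )
    ).remote()
    for i in range(4)
]
print(ray.get([actor.node_id.remote() for actor in actors]))
```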
## Implementing RLHF Acceleration with vLLM Ray Executor
OpenRLHF and vLLM provide a clean and efficient set of APIs to simplify interaction within RLHF pipelines. By implementing a custom `WorkerExtension` class, users can handle weight synchronization between training and inference components. The environment variable `VLLM_RAY_PER_WORKER_GPUS` allows fine-grained GPU resource allocation per worker, enabling hybrid engine configurations where multiple components share a GPU group:
```python
import os
import ray
import torch
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy
from vllm import LLM
from transformers import AutoModelForCausalLM

class ColocateWorkerExtension:
    """
    Extension class for vLLM workers to handle weight synchronization.
    Its methods run inside each vLLM worker and are invoked from the
    training side via `LLM.collective_rpc` (full implementations are in
    the complete example linked below).
    """

    def report_device_id(self) -> str:
        # Report which GPU this worker owns so weights can be routed to it.
        ...

    def update_weights_from_ipc_handles(self, ipc_handles):
        # Rebuild tensors from CUDA IPC handles and load them into the model.
        ...
```

[The complete example](https://docs.vllm.ai/en/latest/getting_started/examples/rlhf_colocate.html) walks through initializing Ray with a specified GPU count, creating a placement group to manage resources, and defining both training actors and inference engines. The training actors manage model initialization and weight updates, while the inference engines serve models via vLLM. Weight synchronization is carried out using CUDA IPC or NCCL, ensuring coherence and efficiency throughout the RLHF pipeline.
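To make that flow concrete, the sketch below assumes the extension class above lives in a module named `my_extension` and is registered through vLLM's `worker_extension_cls` argument; the fractional GPU share, model name, and memory settings are placeholders:

```python
import os
from vllm import LLM

# Let each vLLM Ray worker claim only a fraction of a GPU so it can be
# colocated with training actors on the same device (hybrid engine).
os.environ["VLLM_RAY_PER_WORKER_GPUS"] = "0.4"   # placeholder share

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",            # placeholder model
    tensor_parallel_size=2,
    distributed_executor_backend="ray",
    gpu_memory_utilization=0.4,                  # leave room for the trainer
    worker_extension_cls="my_extension.ColocateWorkerExtension",  # assumed module path
)

# Call extension methods on every worker from the training process.
device_ids = llm.collective_rpc("report_device_id", args=tuple())

# After each training step, the trainer exports updated weights as CUDA IPC
# handles and pushes them to the colocated workers, e.g.:
# llm.collective_rpc("update_weights_from_ipc_handles", args=(ipc_handles,))
```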