
Commit f4e7738

[None][doc] Ray orchestrator initial doc (#8373)
Signed-off-by: Erin Ho <[email protected]>
1 parent c822c11 commit f4e7738

File tree

3 files changed

+46
-3
lines changed

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
# Ray Orchestrator (Prototype)

```{note}
This project is under active development and currently in a prototype stage. The current focus is core functionality, with performance optimization to follow. While we strive for correctness, there are currently no guarantees regarding functionality, stability, or reliability.
```

## Motivation

The **Ray orchestrator** uses [Ray](https://docs.ray.io/en/latest/index.html) instead of MPI to manage workers for single- and multi-node inference. It is a first step toward making TensorRT-LLM a better fit for Reinforcement Learning from Human Feedback (RLHF) workflows, where Ray can dynamically spawn and reconnect distributed inference actors, each with its own parallelism strategy. This feature is a prototype under active development; MPI remains the default orchestrator in TensorRT-LLM.

## Basic Usage

To use the Ray orchestrator, first install Ray:

```shell
cd examples/ray_orchestrator
pip install -r requirements.txt
```

To run a simple `TP=2` example with a Hugging Face model:

```shell
python llm_inference_distributed_ray.py
```

This example is the same as the one in `/examples/llm-api`; the only change is passing `orchestrator_type="ray"` to `LLM()`. Other examples can be adapted the same way by setting this flag.
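The toggle can be sketched as below. Note this is an illustrative helper, not part of the TensorRT-LLM API: the helper name (`llm_kwargs`) and the model id are assumptions for the sketch; only `orchestrator_type="ray"` and `tensor_parallel_size` come from the examples themselves.

```python
# Illustrative sketch: the only change vs. the MPI-based examples is passing
# orchestrator_type="ray" to LLM(). This helper just assembles the kwargs;
# with the flag set, workers are managed via Ray instead of MPI.
def llm_kwargs(model: str, tensor_parallel_size: int = 2, use_ray: bool = True) -> dict:
    kwargs = {"model": model, "tensor_parallel_size": tensor_parallel_size}
    if use_ray:
        # Omitting orchestrator_type keeps the MPI default.
        kwargs["orchestrator_type"] = "ray"
    return kwargs

# In the example script this would be used as: llm = LLM(**llm_kwargs(...))
print(llm_kwargs("meta-llama/Llama-3.1-8B"))
```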

## Features

Currently available:

- Asynchronous text generation (refer to [llm_inference_async_ray.py](/examples/ray_orchestrator/llm_inference_async_ray.py))
- Multi-node inference (refer to [multi-node README](/examples/ray_orchestrator/multi_nodes/README.md))
- Disaggregated serving (refer to [disagg README](/examples/ray_orchestrator/disaggregated/README.md))

*Initial testing has focused on LLaMA and DeepSeek variants. Please open an issue if you encounter problems with other models so we can prioritize support.*

## Roadmap

- Performance optimization
- Integration with RLHF frameworks, such as [Verl](https://github.com/volcengine/verl) and [NVIDIA NeMo-RL](https://github.com/NVIDIA-NeMo/RL)

## Architecture

This feature introduces new classes such as [RayExecutor](/tensorrt_llm/executor/ray_executor.py) and [RayGPUWorker](/tensorrt_llm/executor/ray_gpu_worker.py) for Ray actor lifecycle management and distributed inference. In Ray mode, collective ops run over [torch.distributed](https://docs.pytorch.org/tutorials/beginner/dist_overview.html) without MPI. We welcome contributions to improve and extend this support.

![Ray orchestrator architecture](/docs/source/media/ray_orchestrator_architecture.jpg)
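The MPI-free collective setup mentioned above can be illustrated with a minimal single-process sketch. This is not the orchestrator's actual code: the real workers join a multi-rank group (typically NCCL across GPUs, with rank and world size supplied by the executor); the `gloo` backend, addresses, and `world_size=1` here are assumptions chosen so the sketch runs standalone.

```python
# Minimal sketch of a torch.distributed process group without MPI.
# Conceptually, each worker joins such a group and collectives (e.g.
# all_reduce) run over it; here a single-process "gloo" group stands in.
import os
import torch
import torch.distributed as dist

# Rendezvous info that a launcher/orchestrator would normally provide.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(backend="gloo", rank=0, world_size=1)

t = torch.tensor([1.0, 2.0, 3.0])
dist.all_reduce(t, op=dist.ReduceOp.SUM)  # identity with world_size=1
print(t.tolist())  # [1.0, 2.0, 3.0]

dist.destroy_process_group()
```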

docs/source/index.rst

Lines changed: 1 addition & 0 deletions

@@ -73,6 +73,7 @@ Welcome to TensorRT LLM's Documentation!
     features/speculative-decoding.md
     features/checkpoint-loading.md
     features/auto_deploy/auto-deploy.md
+    features/ray-orchestrator.md

 .. toctree::
    :maxdepth: 2

examples/ray_orchestrator/README.md

Lines changed: 3 additions & 3 deletions

@@ -18,7 +18,7 @@ cd examples/ray_orchestrator
 pip install -r requirements.txt
 ```

-Run a simple `TP=2` example with a Hugging Face model:
+To run a simple `TP=2` example with a Hugging Face model:

 ```shell
 python llm_inference_distributed_ray.py
@@ -33,11 +33,11 @@ This example is the same as in `/examples/llm-api`, with the only change being `
 - Multi-node inference (refer to [multi-node README](./multi_nodes/README.md))
 - Disaggregated serving (refer to [disagg README](./disaggregated/README.md))

-**Initial testing has been focused on LLaMA and DeepSeek variants. Please open an Issue if you encounter problems with other models so we can prioritize support.**
+*Initial testing has been focused on LLaMA and DeepSeek variants. Please open an Issue if you encounter problems with other models so we can prioritize support.*

 ### Upcoming
 - Performance optimization
-- Integration with RLHF frameworks, such as [NVIDIA Nemo-RL](https://github.com/NVIDIA-NeMo/RL) and [Verl](https://github.com/volcengine/verl).
+- Integration with RLHF frameworks, such as [Verl](https://github.com/volcengine/verl) and [NVIDIA Nemo-RL](https://github.com/NVIDIA-NeMo/RL).

 ## Architecture
 This feature introduces new classes such as [RayExecutor](/tensorrt_llm/executor/ray_executor.py) and [RayGPUWorker](/tensorrt_llm/executor/ray_gpu_worker.py) for Ray actor lifecycle management and distributed inference. In Ray mode, collective ops run on [torch.distributed](https://docs.pytorch.org/tutorials/beginner/dist_overview.html) without MPI. We welcome contributions to improve and extend this support.
