
Commit f65fd72

Authored by joyang-nv and Superjomn

[ray,rollout,trtllm] feat: Adding tensorrt_llm as new rollout engine (verl-project#4665)
## What does this PR do?

[TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) has recently added a [Ray orchestrator](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/ray_orchestrator) and the essential features required for the RL workflow. This PR introduces TensorRT-LLM as a new rollout engine for VeRL.

VeRL currently supports several rollout modes:

- **Hybrid engine:** The training and rollout engines share the same process group. VeRL uses the `WorkerDict` class to manage multiple workers within a single process group. Communication between training and rollout workers takes place within the same process, allowing them to share the Torch GPU memory pool.
- **Colocated:** Different engines use the same set of GPUs but run in separate process groups. Currently, this mode is used only by the reward model.
- **Standalone:** Rollout engines use completely independent GPU resources.

Unlike other rollout engines, TensorRT-LLM primarily targets the *colocated* mode. However, instead of relying purely on the standard colocated mode, we introduced a mixed design combining aspects of the hybrid engine and colocated mode. The design goals are:

- Clear resource separation through distinct process groups, offering maximum flexibility between training and rollout processes.
- Hybrid workers that act as proxies to LLM servers.
- Fully RESTful rollout API support through `TRTLLMHttpServer`.
- A unified framework for both asynchronous and synchronous RL workflows.

This PR aims to make the integration as minimally intrusive as possible to VeRL's infrastructure. Currently, it only invokes `RolloutReplica.init_hybrid_colocated()` when both the hybrid engine is enabled and the rollout engine is set to TensorRT-LLM.

## High Level Design

Please refer to [workers/rollout/trtllm_rollout/trtllm_async_rollout.md](https://github.com/davidmlw/verl/pull/43/changes#diff-96bab8796296991333a973a5211166f45993b13d7c533732c83bcf23c5664f39) for more details.
```mermaid
%%{init: {'theme':'base', 'themeVariables': { 'fontSize':'18px', 'edgeLabelBackground':'#eeeeee'}}}%%
flowchart TB
    space1[" "]
    style space1 fill:none,stroke:none
    subgraph VERL["<b>VERL Training Pipeline</b>"]
        subgraph Workers["<b>Training Workers</b>"]
            Actor["<b>Actor Worker</b>"]
            Critic["<b>Critic Worker</b>"]
            RefModel["<b>Ref Model Worker</b>"]
        end
        Actor -->|<b>Weight Updates<br/>IPC</b>| Rollout["<b>TensorRT-LLM Rollout</b>"]
        subgraph RayCluster["<b>Rollout Workers<br/>(Ray Cluster)</b>"]
            space2[" "]
            style space2 fill:none,stroke:none
            subgraph AsyncRollout["<b>TRTLLMAsyncRollout<br/>(per DP rank)</b>"]
                DPLeader["<b>• DP Leader coordination</b>"]
                IPCMgmt["<b>• IPC handle management</b>"]
                HTTPAdapter["<b>• HTTP adapter for server communication</b>"]
            end
            AsyncRollout -->|<b>HTTP/REST API</b>| HTTPServer
            subgraph HTTPServer["<b>TRTLLMHttpServer<br/>(Ray Actor per Replica)</b>"]
                OpenAI["<b>• OpenAI Server wrapper</b>"]
                EngMgmt["<b>• AsyncLLM engine management</b>"]
                MemMgmt["<b>• Memory management (resume/release)</b>"]
            end
            HTTPServer --> AsyncLLM
            subgraph AsyncLLM["<b>TensorRT-LLM<br/>AsyncLLM Engine</b>"]
                GPUWorkers["<b>• GPU workers (Tensor Parallel)</b>"]
                KVCache["<b>• KV Cache management</b>"]
                CUDAGraph["<b>• CUDA Graph optimization</b>"]
            end
        end
    end
    space1 ~~~ VERL
    style VERL fill:#e1f5ff
    style RayCluster fill:#fff4e6
    style AsyncRollout fill:#f3e5f5
    style HTTPServer fill:#e8f5e9
    style AsyncLLM fill:#fce4ec
```

## Experiment results

Setup: single node with H100 * 8, Slurm env.

1. FSDP/GRPO: Qwen2-7B (TP1 * 8 on 8 GPUs, launch cmd `bash examples/grpo_trainer/run_qwen2-7b_math_trtllm.sh 1`)
   * Convergence: <img width="563" height="352" alt="image" src="https://github.com/user-attachments/assets/5df943a7-e4ce-416f-8601-0655738bb33d" />
   * Validation: <img width="1155" height="344" alt="image" src="https://github.com/user-attachments/assets/a1a203e1-a85e-46c9-a9ea-e9c0f3caf683" />
2. FSDP/GRPO: Qwen2-7B (TP4 * 2 on 8 GPUs, launch cmd `bash examples/grpo_trainer/run_qwen2-7b_math_trtllm.sh 4`)
   * Convergence: <img width="553" height="354" alt="image" src="https://github.com/user-attachments/assets/dedfe3e2-498e-4d77-80bb-f1cd5d916c21" />
   * Validation: <img width="1132" height="353" alt="image" src="https://github.com/user-attachments/assets/fbf6ae33-3643-466a-94e7-7edd70f53b3c" />
3. Megatron/GRPO: Qwen2-7B (TP1 * 8 on 8 GPUs, launch cmd `bash examples/grpo_trainer/run_qwen2-7b_math_megatron_trtllm.sh 1`)
   * Convergence: <img width="766" height="323" alt="image" src="https://github.com/user-attachments/assets/6d9bc023-c5e7-466a-bf31-7ef9eda7b06d" />
   * Validation: <img width="1546" height="338" alt="image" src="https://github.com/user-attachments/assets/ee6e263c-7779-4915-93dd-2f414370a9fc" />
4. Megatron/GRPO: Qwen2-7B (TP2 * 2 on 8 GPUs, launch cmd `bash examples/grpo_trainer/run_qwen2-7b_math_megatron_trtllm.sh 4`)
   * Convergence: <img width="746" height="322" alt="image" src="https://github.com/user-attachments/assets/7a21dc39-0467-4b85-a231-8e5994b76a8a" />
   * Validation: <img width="1552" height="334" alt="image" src="https://github.com/user-attachments/assets/00a3307b-a0d6-4d13-8f30-85f903f0c946" />

## Special notes for using VeRL with TensorRT-LLM

1. All RL-required APIs for VeRL were implemented within [TensorRT-LLM 1.2.0rc6](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release?version=1.2.0rc6). To install VeRL with TensorRT-LLM, please use the command `pip install -e ".[trtllm]" --extra-index-url https://pypi.nvidia.com/`.
2. All verification of the integration work was primarily done in a Slurm environment.
3. The current design requires `export RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1` and the following env settings before launching the Ray cluster.
While these have been included in all example scripts and tests added here, we will work toward removing such dependencies to improve the user experience in the near future.

```
# Clean all SLURM / MPI / PMIx env to avoid PMIx mismatch errors.
for v in $(env | awk -F= '/^(PMI|PMIX|MPI|OMPI|SLURM)_/{print $1}'); do
    unset "$v"
done

# Force UCX to use only eth0; otherwise, it will attempt to use all available devices and raise warnings if any issues occur.
export TRTLLM_UCX_INTERFACE=eth0
```

## Outstanding issues for this MR

1. WIP on passing CI tests

## Upcoming works (in separate MRs)

1. Further performance optimization.
2. Multi-node testing and functionality will be delivered in the near future.
3. The current MR focuses on and was validated with Qwen model variants. We'll work on validations and optimizations for MoE models as the next step.

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [ ] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [ ] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
---------

Signed-off-by: Jonas Yang <joyang@nvidia.com>
Co-authored-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
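The environment-cleanup loop from the special notes above can be exercised in isolation. `SLURM_DEMO_VAR` and `KEEP_ME` below are dummy variables introduced purely for illustration:

```shell
# Seed a dummy SLURM-style variable plus an unrelated one,
# then run the same cleanup loop shown in the PR notes.
export SLURM_DEMO_VAR=1
export KEEP_ME=1

for v in $(env | awk -F= '/^(PMI|PMIX|MPI|OMPI|SLURM)_/{print $1}'); do
    unset "$v"
done

# SLURM_DEMO_VAR is gone; unrelated variables survive.
echo "SLURM_DEMO_VAR='${SLURM_DEMO_VAR:-}' KEEP_ME='${KEEP_ME:-}'"
```

Only variables whose names start with one of the listed prefixes (`PMI_`, `PMIX_`, `MPI_`, `OMPI_`, `SLURM_`) are removed, which is why the loop is safe to run in any shell.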
Parent: 3b1c139 · Commit: f65fd72

File tree

23 files changed: +1676 −18 lines changed

Lines changed: 197 additions & 0 deletions
@@ -0,0 +1,197 @@
# # Tests layout

# Each folder under tests/ corresponds to a test category for a sub-namespace in verl. For instance:
# - `tests/trainer` for testing functionality related to `verl/trainer`
# - `tests/models` for testing functionality related to `verl/models`
# - ...

# There are a few folders with the `special_` prefix, created for special purposes:
# - `special_distributed`: unit tests that must run with multiple GPUs
# - `special_e2e`: end-to-end tests with training/generation scripts
# - `special_npu`: tests for NPUs
# - `special_sanity`: a suite of quick sanity tests
# - `special_standalone`: a set of tests designed to run in dedicated environments

# Accelerators for tests
# - By default, tests are run with GPUs available, except for those under `special_npu` and any test script whose name ends with `on_cpu.py`.
# - Test scripts with the `on_cpu.py` name suffix are run on CPU resources in a Linux environment.

# # Workflow layout

# All CI tests are configured by yaml files in `.github/workflows/`. Here's an overview of all test configs:
# 1. A list of always-triggered CPU sanity tests: `check-pr-title.yml`, `secrets_scan.yml`, `pre-commit.yml`, `doc.yml`
# 2. Some heavy multi-GPU unit tests, such as `model.yml`, `vllm.yml`, `sgl.yml`
# 3. End-to-end tests: `e2e_*.yml`
# 4. Unit tests
#   - `cpu_unit_tests.yml`, run pytest on all scripts with file name pattern `tests/**/test_*_on_cpu.py`
#   - `gpu_unit_tests.yml`, run pytest on all test scripts without the `on_cpu.py` suffix.
#   - Since cpu/gpu unit tests by default run all tests under `tests`, please make sure tests are manually excluded from them when
#     - a new workflow yaml is added to `.github/workflows`
#     - new tests are added to a workflow mentioned in 2.

name: e2e_ppo_trainer_megatron_trtllm

on:
  # Trigger the workflow on push or pull request,
  # but only for the main branch.
  # For push, for now only anti-patterns are specified so it is more conservative
  # and achieves higher coverage.
  push:
    branches:
      - main
      - v0.*
    paths:
      - "**/*.py"
      # Other entrypoints
      - "!verl/trainer/fsdp_sft_trainer.py"
      # Recipes
      - "!recipe/**"
      # FSDP
      - "!verl/workers/**/*dp_*.py"
  pull_request:
    branches:
      - main
      - v0.*
    paths:
      - "**/*.py"
      # Other entrypoints
      - "!docker/**"
      # Docs
      - "!**/*.md"
      - "!docs/**"
      - "!examples/**"
      - "!tests/**"
      - "!verl/trainer/main_*.py"
      - "!verl/trainer/fsdp_sft_trainer.py"
      # Recipes
      - "!recipe/**"
      # FSDP
      - "!verl/workers/**/*dp_*.py"
      # Entrypoints
      - "verl/workers/rollout/trtllm_rollout/*"
      - ".github/workflows/e2e_ppo_grpo_trainer_trtllm"
      - "examples/data_preprocess/gsm8k.py"
      - "examples/data_preprocess/geo3k.py"
      # add back when ppo flow is ready
      # - "tests/special_e2e/run_ppo_trainer_megatron.sh"
      # - "verl/trainer/main_ppo.py"
      # - "verl/trainer/config/ppo_megatron_trainer.yaml"

# Cancel jobs on the same ref if a new one is triggered
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}

# Declare permissions: just read content.
permissions:
  contents: read

env:
  IMAGE: "verl-ci-cn-beijing.cr.volces.com/verlai/verl:trtllm1.2.0rc6"
  DYNAMIC_RUNNER_ENDPOINT: "https://sd10g3clalm04ug7alq90.apigateway-cn-beijing.volceapi.com/runner"

jobs:
  setup:
    if: github.repository_owner == 'volcengine'
    runs-on: ubuntu-latest
    outputs:
      runner-label: ${{ steps.create-runner.outputs.runner-label }}
      mlp-task-id: ${{ steps.create-runner.outputs.mlp-task-id }}
    steps:
      - uses: actions/checkout@v4
      - id: create-runner
        uses: volcengine/vemlp-github-runner@v1
        with:
          mode: "create"
          faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
          mlp-image: "${{ env.IMAGE }}"

  e2e_grpo_trainer_fsdp-qwen2:
    needs: setup
    runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
    timeout-minutes: 30 # Increase this timeout value as needed
    env:
      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
      NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
      RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES: "1"
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: Install the current repository
        run: |
          pip3 install -r requirements-test.txt
          pip3 install --no-deps -e .
      - name: Prepare GSM8K dataset
        run: |
          python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/models/hf_data/gsm8k --local_save_dir ${PWD}/data/gsm8k
      - name: Running GSM8K E2E training tests with FSDP on 8 L20 GPUs (Qwen)
        run: |
          ray stop --force
          DATADIR=${HOME}/data \
          bash examples/grpo_trainer/run_qwen2-7b_math_trtllm.sh 2 \
            trainer.total_training_steps=1 \
            data.train_files="['${PWD}/data/gsm8k/train.parquet']" \
            data.val_files="['${PWD}/data/gsm8k/test.parquet']" \
            trainer.logger='["console"]' \
            actor_rollout_ref.model.path="${HOME}/models/Qwen/Qwen2.5-0.5B-Instruct"
      - name: clean up
        run: |
          rm -rf checkpoints

  e2e_grpo_trainer_megatron-qwen2:
    needs: setup
    runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
    timeout-minutes: 30 # Increase this timeout value as needed
    env:
      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
      NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
      RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES: "1"
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: Install the current repository
        run: |
          pip3 install -r requirements-test.txt
          pip3 install --no-deps -e .
      - name: Prepare GSM8K dataset
        run: |
          python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/models/hf_data/gsm8k --local_save_dir ${PWD}/data/gsm8k
      - name: Running GSM8K E2E training tests with 3D parallelism on 8 L20 GPUs with Megatron (Qwen)
        run: |
          ray stop --force
          DATADIR=${HOME}/data \
          ACTOR_TP=2 \
          bash examples/grpo_trainer/run_qwen2-7b_math_megatron_trtllm.sh 2 \
            trainer.total_training_steps=1 \
            data.train_files="['${PWD}/data/gsm8k/train.parquet']" \
            data.val_files="['${PWD}/data/gsm8k/test.parquet']" \
            trainer.logger='["console"]' \
            actor_rollout_ref.model.path="${HOME}/models/Qwen/Qwen2.5-0.5B-Instruct"
      - name: clean up
        run: |
          rm -rf checkpoints

  cleanup:
    runs-on: ubuntu-latest
    needs:
      [
        setup,
        e2e_grpo_trainer_fsdp-qwen2,
        e2e_grpo_trainer_megatron-qwen2,
      ]
    if: always()
    steps:
      - id: destroy-runner
        uses: volcengine/vemlp-github-runner@v1
        with:
          mode: "destroy"
          faas-url: "${{ env.DYNAMIC_RUNNER_ENDPOINT }}"
          mlp-task-id: "${{ needs.setup.outputs.mlp-task-id }}"
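The CPU/GPU split described in the workflow's comment header can be sketched as a simple filename check. `classify_test` is a hypothetical helper written for illustration, not part of the CI:

```shell
# Hypothetical helper mirroring the CI convention:
# test files ending in `_on_cpu.py` run on CPU; everything else needs GPUs.
classify_test() {
    case "$1" in
        *_on_cpu.py) echo cpu ;;
        *)           echo gpu ;;
    esac
}

classify_test tests/trainer/test_config_on_cpu.py   # prints: cpu
classify_test tests/models/test_transformer.py      # prints: gpu
```

This is the same pattern `cpu_unit_tests.yml` and `gpu_unit_tests.yml` use to partition `tests/**` between runner types.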

docker/Dockerfile.stable.trtllm

Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
# Base image from NGC TensorRT-LLM, which includes a pre-installed TensorRT-LLM.
# For available images, visit: https://nvidia.github.io/TensorRT-LLM/installation/containers.html
# Use TRTLLM_BASE_IMAGE to specify the base image (default: release:1.2.0rc6)
ARG TRTLLM_BASE_IMAGE=nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6
FROM ${TRTLLM_BASE_IMAGE}


# ==============================================================================
# Install Megatron dependencies
# ==============================================================================
# DeepEP is required for IBGDA support.
# Clone and build gdrcopy and deepep-nvshmem dependencies.
WORKDIR /home/dpsk_a2a
RUN git clone -b v2.3.1 https://github.com/NVIDIA/gdrcopy.git && \
    git clone https://github.com/deepseek-ai/DeepEP.git && cd DeepEP && git checkout a84a248 && \
    cd /home/dpsk_a2a && \
    wget https://developer.nvidia.com/downloads/assets/secure/nvshmem/nvshmem_src_3.2.5-1.txz && \
    tar -xvf nvshmem_src_3.2.5-1.txz && mv nvshmem_src deepep-nvshmem && \
    cd deepep-nvshmem && git apply /home/dpsk_a2a/DeepEP/third-party/nvshmem.patch && \
    sed -i '16i#include <getopt.h>' /home/dpsk_a2a/deepep-nvshmem/examples/moe_shuffle.cu && \
    sed -i 's/CUDA_STANDARD 11/CUDA_STANDARD 17/g' /home/dpsk_a2a/deepep-nvshmem/src/CMakeLists.txt && \
    # Cleanup downloaded archive
    rm /home/dpsk_a2a/nvshmem_src_3.2.5-1.txz

# Set environment variables
ENV CUDA_HOME=/usr/local/cuda \
    CPATH=/usr/local/mpi/include \
    LD_LIBRARY_PATH=/usr/local/mpi/lib:/usr/local/x86_64-linux-gnu:$LD_LIBRARY_PATH \
    GDRCOPY_HOME=/home/dpsk_a2a/gdrcopy

# Build deepep-nvshmem
WORKDIR /home/dpsk_a2a/deepep-nvshmem
ARG CUDA_ARCHS="80;90;100"
RUN NVSHMEM_SHMEM_SUPPORT=0 \
    NVSHMEM_UCX_SUPPORT=0 \
    NVSHMEM_USE_NCCL=0 \
    NVSHMEM_MPI_SUPPORT=0 \
    NVSHMEM_IBGDA_SUPPORT=1 \
    NVSHMEM_PMIX_SUPPORT=0 \
    NVSHMEM_TIMEOUT_DEVICE_POLLING=0 \
    NVSHMEM_USE_GDRCOPY=1 \
    NVSHMEM_BUILD_EXAMPLES=0 \
    NVSHMEM_BUILD_TESTS=0 \
    cmake -S . -B build/ -DCMAKE_INSTALL_PREFIX=/home/dpsk_a2a/deepep-nvshmem/install -DCMAKE_CUDA_ARCHITECTURES="${CUDA_ARCHS}" && \
    cd build && make install -j && \
    # Cleanup build directory
    rm -rf /home/dpsk_a2a/deepep-nvshmem/build

# Build deepep
WORKDIR /home/dpsk_a2a/DeepEP
ENV NVSHMEM_DIR=/home/dpsk_a2a/deepep-nvshmem/install
RUN NVSHMEM_DIR=/home/dpsk_a2a/deepep-nvshmem/install python setup.py install

# Install Python dependencies
RUN pip3 install --no-cache-dir --no-deps trl && \
    pip3 install --no-cache-dir nvtx matplotlib liger_kernel && \
    pip install --no-cache-dir -U git+https://github.com/ISEEKYAN/mbridge.git && \
    pip install --no-deps --no-cache-dir git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.14.0rc7


# ==============================================================================
# Install verl dependencies
# ==============================================================================
RUN pip install git+https://github.com/volcengine/verl.git@v0.6.0
RUN pip uninstall -y verl


# ==============================================================================
# Install a specific TensorRT-LLM on demand
# ==============================================================================
# Note: The NGC image already includes a pre-installed TensorRT-LLM, but you can install a specific version if needed.
# Refer to https://nvidia.github.io/TensorRT-LLM/installation/index.html for more details.
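A possible local build invocation for this Dockerfile. The image tag, the `TRTLLM_BASE_IMAGE` override, and the single-architecture `CUDA_ARCHS` value below are illustrative choices, not prescribed by the repo; the command is assembled and echoed rather than executed, since the actual build needs the repo checkout and network access:

```shell
# Assemble a docker build command for docker/Dockerfile.stable.trtllm.
BASE_IMAGE=nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6
BUILD_CMD="docker build -f docker/Dockerfile.stable.trtllm \
  --build-arg TRTLLM_BASE_IMAGE=${BASE_IMAGE} \
  --build-arg CUDA_ARCHS=90 \
  -t verl-trtllm:dev ."

echo "$BUILD_CMD"
```

Narrowing `CUDA_ARCHS` to the architecture you deploy on (e.g. `90` for Hopper) shortens the nvshmem/DeepEP compile step considerably.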

docs/index.rst

Lines changed: 1 addition & 0 deletions
@@ -90,6 +90,7 @@ verl is fast with:
    workers/fsdp_workers
    workers/megatron_workers
    workers/sglang_worker
+   workers/trtllm_worker
    workers/model_engine

 .. toctree::

docs/workers/trtllm_worker.rst

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
TensorRT-LLM Backend
====================

Last updated: 12/31/2025.

**Authored By TensorRT-LLM Team**

Introduction
------------

`TensorRT-LLM <https://github.com/NVIDIA/TensorRT-LLM>`_ is a high-performance LLM inference engine with state-of-the-art optimizations for NVIDIA GPUs.
The verl integration of TensorRT-LLM is based on TensorRT-LLM's `Ray orchestrator <https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/ray_orchestrator>`_. This integration is in its early stage, with more features and optimizations to come.

The TensorRT-LLM rollout engine primarily targets the colocated mode. Instead of relying purely on the standard colocated mode, we adopted a mixed design combining aspects of the hybrid engine and colocated mode.

Installation
------------

We provide ``docker/Dockerfile.stable.trtllm`` for building a docker image with TensorRT-LLM pre-installed. The verl integration is supported from ``nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6``, and you can choose other TensorRT-LLM versions via ``TRTLLM_BASE_IMAGE`` from the `NGC Catalog <https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release>`_.

Alternatively, refer to the `TensorRT-LLM installation guide <https://nvidia.github.io/TensorRT-LLM/installation/index.html>`_ for compatible environments if you want to build your own.

Install verl with TensorRT-LLM:

.. code-block:: bash

    pip install --upgrade pip
    pip install -e ".[trtllm]" --extra-index-url https://pypi.nvidia.com/

.. note::

    Using the TensorRT-LLM rollout requires setting the following environment variables before launching the Ray cluster. These have been included in all the example scripts:

    .. code-block:: bash

        # Clean all SLURM/MPI/PMIx env to avoid PMIx mismatch errors.
        for v in $(env | awk -F= '/^(PMI|PMIX|MPI|OMPI|SLURM)_/{print $1}'); do
            unset "$v"
        done

        # Required for IPC UUID detection
        export RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1

Using TensorRT-LLM as the Rollout Engine for GRPO
-------------------------------------------------

We provide the following GRPO recipe scripts for you to test the performance and accuracy curve of TensorRT-LLM as the rollout engine:

.. code-block:: bash

    ## For FSDP training engine
    bash examples/grpo_trainer/run_qwen2-7b_math_trtllm.sh
    ## For Megatron-Core training engine
    bash examples/grpo_trainer/run_qwen2-7b_math_megatron_trtllm.sh

Using TensorRT-LLM as the Rollout Engine for DAPO
-------------------------------------------------

We provide a DAPO recipe script ``recipe/dapo/test_dapo_7b_math_trtllm.sh``.

.. code-block:: bash

    ## For FSDP training engine
    bash recipe/dapo/test_dapo_7b_math_trtllm.sh
    ## For Megatron-Core training engine
    TRAIN_ENGINE=megatron bash recipe/dapo/test_dapo_7b_math_trtllm.sh
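The trailing numeric argument passed to these recipe scripts in the PR's experiments is the rollout tensor-parallel size; on a single 8-GPU node it determines how many data-parallel rollout replicas fit (e.g. TP4 * 2 on 8 GPUs). A minimal sketch with a hypothetical `dp_replicas` helper, not part of the scripts themselves:

```shell
# On an 8-GPU node, the number of rollout replicas is GPUs divided by
# the rollout tensor-parallel size (matching the TP1 * 8 / TP4 * 2
# configurations reported in the PR's experiments).
NUM_GPUS=8
dp_replicas() {
    echo $(( NUM_GPUS / $1 ))
}

dp_replicas 1   # prints: 8
dp_replicas 4   # prints: 2
```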
