
Commit f6c3bc1

[None][docs] Add NIXL-Libfabric Usage to Documentation (#10205)
Signed-off-by: Yoray Zack <[email protected]>
1 parent 7b84e48 commit f6c3bc1

File tree: 2 files changed (+297 −1 lines)

docs/source/features/disagg-serving.md

Lines changed: 31 additions & 1 deletion
@@ -3,6 +3,7 @@
- [Motivation](#Motivation)
- [KV Cache Exchange](#KV-Cache-Exchange)
- [Multi-backend Support](#Multi-backend-Support)
- [NIXL Backend Configuration](#nixl-backend-configuration)
- [Overlap Optimization](#Overlap-Optimization)
- [Cache Layout Transformation](#Cache-Layout-Transformation)
- [Usage](#Usage)

@@ -53,6 +54,18 @@ In TensorRT-LLM, the KV cache exchange is modularly decoupled from the KV cache
</div>
<p align="center"><sub><em>Figure 3. KV cache exchange architecture</em></sub></p>

### NIXL Backend Configuration

NIXL supports multiple underlying communication backends for KV cache exchange in disaggregated serving. The backend can be configured using the `TRTLLM_NIXL_KVCACHE_BACKEND` environment variable.

**Supported NIXL backends:**
- **UCX** (default)
- **LIBFABRIC** (available from v0.16.0)

If an unsupported backend is specified, NIXL will automatically fall back to UCX.
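
For example, to opt into the libfabric transport before launching the servers (a minimal sketch; UCX needs no explicit setting since it is the default):

```bash
# Select the underlying NIXL transport for KV cache exchange
export TRTLLM_NIXL_KVCACHE_BACKEND=LIBFABRIC
```
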
For detailed setup instructions and configuration examples, please refer to the [disaggregated serving examples documentation](../../../examples/disaggregated/README.md).

### Overlap Optimization

To optimize the overall performance of disaggregated serving, TensorRT LLM overlaps the KV cache transmission with computation for multiple independent requests. While one request is sending or receiving its KV cache blocks, other requests can proceed with computation, as illustrated in Figure 4. Furthermore, if context and generation instances are using multiple GPUs per instance, KV cache transmission between different sets of GPUs can occur in parallel.
@@ -124,7 +137,11 @@ cache_transceiver_config:
  max_tokens_in_buffer: <int>
```

`backend` specifies the communication backend for transferring the kvCache. Valid options include `DEFAULT`, `UCX`, `NIXL`, and `MPI`; the default backend is NIXL.

Note: NIXL supports multiple underlying backends configured via the `TRTLLM_NIXL_KVCACHE_BACKEND` environment variable:
- `UCX` (default)
- `LIBFABRIC` (available from v0.16.0)

`max_tokens_in_buffer` defines the buffer size for kvCache transfers. It is recommended to set this value greater than or equal to the maximum ISL (Input Sequence Length) of all requests for optimal performance.
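
Putting both fields together, a minimal sketch with an illustrative buffer size (tune it to your maximum ISL):

```yaml
cache_transceiver_config:
  backend: NIXL
  max_tokens_in_buffer: 4096  # illustrative; use >= your maximum ISL
```
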
@@ -193,6 +210,10 @@ Please refer to [Disaggregated Inference Benchmark Scripts](../../../examples/di

TRT-LLM uses several environment variables to control the behavior of disaggregated serving.

* `TRTLLM_NIXL_KVCACHE_BACKEND`: When using NIXL as the cache transceiver backend, this variable specifies the underlying communication backend for NIXL. Valid options are:
  - `UCX` (default)
  - `LIBFABRIC` (available from v0.16.0)
  - If an unsupported value is specified, NIXL will automatically fall back to UCX

* `TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP`: If set to `1`, generationExecutor will not overlap KV cache transfer with model inference. The default value is `0`.

@@ -240,6 +261,15 @@ A. Yes, it's recommended that different server instances use different GPUs. We

### Debugging FAQs

*Q. Why does NIXL fail to use the LIBFABRIC backend even when `TRTLLM_NIXL_KVCACHE_BACKEND=LIBFABRIC` is set?*

A: The TensorRT-LLM container doesn't include the NIXL LIBFABRIC plugin by default. You need to either:

1. **Rebuild NIXL**: Install libfabric and hwloc first, then rebuild NIXL following the installation instructions in the examples README
2. **Use a pre-compiled plugin**: If you have a compatible `libplugin_LIBFABRIC.so`, set `NIXL_PLUGINS_DIR` to point to its directory

Please see the [disaggregated serving examples documentation](../../../examples/disaggregated/README.md) for detailed installation and configuration instructions.

*Q. How to handle the error `Disaggregated serving is not enabled, please check the configuration`?*

A. Please set the `backendType` of `CacheTransceiverConfig`.

examples/disaggregated/README.md

Lines changed: 266 additions & 0 deletions
@@ -44,6 +44,175 @@ cache_transceiver_config:
  max_tokens_in_buffer: 2048
```

## NIXL Backend Configuration

NIXL supports multiple underlying communication backends for KV cache exchange. The backend can be configured using the `TRTLLM_NIXL_KVCACHE_BACKEND` environment variable.

**Supported NIXL backends:**
- **UCX** (default)
- **LIBFABRIC** (available from v0.16.0)

If an unsupported backend is specified, NIXL will automatically fall back to UCX.

### LIBFABRIC Backend Setup

**Important Note:** The TensorRT LLM container does not include libfabric or the NIXL-LIBFABRIC plugin by default. You must either rebuild NIXL with libfabric support or provide a pre-compiled plugin.

#### Prerequisites

##### For LIBFABRIC Backend

**Required Dependencies:**

**Libfabric**
- A custom libfabric installation is available via [https://ofiwg.github.io/libfabric/](https://ofiwg.github.io/libfabric/)
- **Minimum required version:** v1.21.0
- For EFA-enabled AWS instances, install through the [AWS EFA installer](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html) (the latest version is recommended)

**hwloc**
- hwloc is used to discover the underlying hardware topology so that application performance can be optimized for it
- **Suggested version:** 2.10.0 or newer

**Network Hardware Requirements:**
- Validated compatibility with AWS EFA (Elastic Fabric Adapter)
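
A quick way to confirm these prerequisites are in place (assuming libfabric's `fi_info` and hwloc's `lstopo` utilities are on your `PATH`):

```bash
# Print the installed libfabric version (must be >= 1.21.0)
fi_info --version

# Print the installed hwloc version
lstopo --version
```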

##### For UCX Backend

UCX is typically pre-installed in NVIDIA GPU containers. No additional installation is usually required.

#### Installation Options

##### Option 1: Rebuild NIXL with LIBFABRIC Support (Recommended)

1. **Install libfabric dependencies:**
   - Follow the installation instructions from the links above based on your system

2. **Install hwloc:**
   - Use your package manager or build from source

3. **Reinstall NIXL after installing libfabric:**
   - After installing libfabric and hwloc, you must rebuild NIXL to generate the LIBFABRIC plugin
   - You can base your installation on the TensorRT LLM NIXL installation script located at `docker/common/install_nixl.sh`
   - Modify the meson setup command in the script to include the libfabric path (the `-Dlibfabric_path` line is the addition):
     ```bash
     meson setup builddir \
         ... \
         -Dlibfabric_path=/path/to/libfabric \
         --buildtype=release
     ```
   - For more details, see the [NIXL LIBFABRIC Plugin documentation](https://github.com/ai-dynamo/nixl/tree/6ee64753605b3110f8ef96c7cfc2f1315675c9c7/src/plugins/libfabric#nixl-libfabric-plugin)
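
For reference, a typical meson flow after the `setup` step above would be along these lines (a sketch, not taken from `install_nixl.sh`):

```bash
# Compile and install NIXL from the configured build directory
ninja -C builddir
ninja -C builddir install   # may require elevated privileges depending on the install prefix
```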

##### Option 2: Use Pre-compiled LIBFABRIC Plugin

If you have a pre-compiled `libplugin_LIBFABRIC.so` that matches your NIXL version:

1. Place the plugin file in a directory of your choice
2. Set the environment variable to point to the plugin directory:
   ```bash
   export NIXL_PLUGINS_DIR=/path/to/plugin/directory
   export TRTLLM_NIXL_KVCACHE_BACKEND=LIBFABRIC
   ```
3. Ensure the plugin was built with the same NIXL version as in your container
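
A quick sanity check (using the variable set above) that the plugin sits where NIXL will look:

```bash
# List the plugin file to confirm NIXL can find it
ls "$NIXL_PLUGINS_DIR"/libplugin_LIBFABRIC.so
```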

### NIXL Configuration Examples

To use NIXL for KV cache exchange, configure the `cache_transceiver_config` with `backend: NIXL`. The underlying NIXL backend (UCX or LIBFABRIC) is selected via the `TRTLLM_NIXL_KVCACHE_BACKEND` environment variable.

**Context server configuration:**
```yaml
# context_config_nixl.yml
disable_overlap_scheduler: True
cache_transceiver_config:
  backend: NIXL
  max_tokens_in_buffer: 2048
```

**Generation server configuration:**
```yaml
# gen_config_nixl.yml
cache_transceiver_config:
  backend: NIXL
  max_tokens_in_buffer: 2048
```

#### Example 1: Using NIXL with UCX backend (default)

```bash
# UCX is the default, but can be explicitly set
export TRTLLM_NIXL_KVCACHE_BACKEND=UCX  # Optional, UCX is default

# Start Context servers with NIXL using UCX
CUDA_VISIBLE_DEVICES=0 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --host localhost --port 8001 --backend pytorch \
    --config ./context_config_nixl.yml &> log_ctx_0 &

CUDA_VISIBLE_DEVICES=1 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --host localhost --port 8002 --backend pytorch \
    --config ./context_config_nixl.yml &> log_ctx_1 &

# Start Generation server with NIXL using UCX
CUDA_VISIBLE_DEVICES=2 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --host localhost --port 8003 --backend pytorch \
    --config ./gen_config_nixl.yml &> log_gen_0 &
```

#### Example 2: Using NIXL with LIBFABRIC backend

```bash
# Configure NIXL to use LIBFABRIC backend
export TRTLLM_NIXL_KVCACHE_BACKEND=LIBFABRIC

# If using pre-compiled plugin:
# export NIXL_PLUGINS_DIR=/path/to/plugin/directory

# For AWS EFA (optional):
# export FI_PROVIDER=efa
# export FI_EFA_USE_DEVICE_RDMA=1
# export FI_LOG_LEVEL=warn

# Start Context servers with NIXL using LIBFABRIC
CUDA_VISIBLE_DEVICES=0 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --host localhost --port 8001 --backend pytorch \
    --config ./context_config_nixl.yml &> log_ctx_0 &

CUDA_VISIBLE_DEVICES=1 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --host localhost --port 8002 --backend pytorch \
    --config ./context_config_nixl.yml &> log_ctx_1 &

# Start Generation server with NIXL using LIBFABRIC
CUDA_VISIBLE_DEVICES=2 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --host localhost --port 8003 --backend pytorch \
    --config ./gen_config_nixl.yml &> log_gen_0 &
```

### Environment Variables for NIXL Backends

**NIXL Backend Selection:**
- `TRTLLM_NIXL_KVCACHE_BACKEND`: Selects the underlying backend for NIXL. Valid options:
  - `UCX` (default)
  - `LIBFABRIC` (available from v0.16.0)
  - If an unsupported value is provided, NIXL automatically falls back to UCX

**Additional Environment Variables by Backend:**

**For UCX backend:**
- `UCX_MAX_RNDV_RAILS`: Maximum number of InfiniBand NIC devices per GPU. Setting it to 1 can reduce contention in multi-GPU scenarios
- Standard UCX environment variables apply

**For LIBFABRIC backend:**
- `NIXL_PLUGINS_DIR`: Directory containing the NIXL LIBFABRIC plugin (`libplugin_LIBFABRIC.so`) when using a pre-compiled plugin
- `FI_PROVIDER`: Specifies the libfabric provider to use (e.g., `efa` for AWS EFA)
- `FI_EFA_USE_DEVICE_RDMA`: Set to `1` to enable GPU Direct RDMA on AWS EFA (if supported)
- `FI_LOG_LEVEL`: Controls libfabric logging verbosity (e.g., `warn`, `info`, `debug`)

**Example configuration for AWS EFA with LIBFABRIC:**
```bash
export TRTLLM_NIXL_KVCACHE_BACKEND=LIBFABRIC
export FI_PROVIDER=efa
export FI_EFA_USE_DEVICE_RDMA=1
export FI_LOG_LEVEL=warn
```

### Basic Usage

For non-SLURM clusters, particularly in single-node, multi-GPU setups, it is recommended to use standard mode. In such cases, the system does not enforce limits on process creation or termination.
@@ -205,6 +374,92 @@ srun -A <account> -p <partition> -t <time> \

Additionally, we offer a fully executable script; please refer to [Disaggregated SLURM Scripts](./slurm/simple_example/).

### Kubernetes Deployment with AWS EFA

The LIBFABRIC backend is particularly useful for Kubernetes deployments on AWS, where EFA (Elastic Fabric Adapter) provides high-performance networking between pods in disaggregated serving.

#### Prerequisites

- Kubernetes cluster with GPU nodes and EFA support
- TensorRT-LLM container with wheel package pre-installed

#### Deployment Steps

##### 1. Configure Pod Resources

When deploying on Kubernetes with EFA, ensure proper resource allocation in your pod specification:

```yaml
resources:
  limits:
    nvidia.com/gpu: 2           # Number of GPUs for this pod
    vpc.amazonaws.com/efa: 4    # Number of EFA network interfaces
```
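
For orientation, a minimal pod spec showing where this `resources` block sits (the pod name, container name, and image are illustrative placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trtllm-worker            # placeholder name
spec:
  containers:
    - name: trtllm               # placeholder name
      image: <tensorrt-llm-image>
      resources:
        limits:
          nvidia.com/gpu: 2
          vpc.amazonaws.com/efa: 4
```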

##### 2. Install EFA Libraries in Container

The AWS EFA library must be installed in the container for LIBFABRIC to work:

```bash
# Install AWS EFA library (required for LIBFABRIC with EFA)
curl -O https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz
tar -xf aws-efa-installer-latest.tar.gz
cd aws-efa-installer && ./efa_installer.sh --yes --skip-kmod
```
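
To confirm that libfabric now exposes the EFA provider (a common check; the installer places its binaries under `/opt/amazon/efa`):

```bash
# List libfabric interfaces provided by EFA
/opt/amazon/efa/bin/fi_info -p efa
```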

##### 3. Rebuild NIXL with EFA Support

Follow the NIXL rebuild instructions from the LIBFABRIC Backend Setup section, ensuring the libfabric path points to the EFA installation:

```bash
# -Dlibfabric_path points at the EFA libfabric installation
meson setup builddir \
    -Ducx_path=/usr/local/ucx \
    -Dlibfabric_path=/opt/amazon/efa \
    -Dcudapath_lib=/usr/local/cuda/lib64 \
    -Dcudapath_inc=/usr/local/cuda/include \
    --buildtype=release
```

##### 4. Configure and Launch Services

Use ConfigMaps to manage configurations for disaggregated serving:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: disagg-config
data:
  context.yaml: |
    disable_overlap_scheduler: true
    cache_transceiver_config:
      backend: NIXL
      max_tokens_in_buffer: 2048
  generation.yaml: |
    cache_transceiver_config:
      backend: NIXL
      max_tokens_in_buffer: 2048
```
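
The launch commands below read these files from `/configs`, so the ConfigMap must be mounted there. A sketch of the relevant pod-spec fragment (the volume and container names are illustrative):

```yaml
# Mount the disagg-config ConfigMap at /configs inside the container
spec:
  volumes:
    - name: disagg-config-vol
      configMap:
        name: disagg-config
  containers:
    - name: trtllm               # placeholder container name
      volumeMounts:
        - name: disagg-config-vol
          mountPath: /configs
```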

Launch services:

```bash
# For context servers
TRTLLM_NIXL_KVCACHE_BACKEND=LIBFABRIC \
trtllm-serve <model> \
    --host localhost --port 8001 \
    --config /configs/context.yaml

# For generation servers
TRTLLM_NIXL_KVCACHE_BACKEND=LIBFABRIC \
trtllm-serve <model> \
    --host localhost --port 8002 \
    --config /configs/generation.yaml

# For disaggregated proxy server
trtllm-serve disaggregated -c disagg_config.yaml
```

## Mixed Precision Context and Generation

In disaggregated serving, the context workers and generation workers have different performance characteristics: context workers are compute-bound while generation workers are memory-bound. Therefore, it may be beneficial to run context workers and generation workers in different precisions.
@@ -395,3 +650,14 @@ trtllm-serve disaggregated -c disagg_config.yaml
```

The MPI communication backend for KV cache transfer has been deprecated and may not be supported in the future. When using the MPI backend, the environment variable `TRTLLM_USE_MPI_KVCACHE=1` should be set to avoid conflicts between mpi4py and KV cache transfer.
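
For example, before launching any worker that still uses the MPI backend:

```bash
# Required when the deprecated MPI backend is used for KV cache transfer
export TRTLLM_USE_MPI_KVCACHE=1
```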

## Troubleshooting

### NIXL LIBFABRIC Backend Issues

**Q: Why does NIXL fail to use the LIBFABRIC backend even when `TRTLLM_NIXL_KVCACHE_BACKEND=LIBFABRIC` is set?**

A: The TensorRT-LLM container doesn't include the NIXL LIBFABRIC plugin by default. You need to either:

1. **Rebuild NIXL**: Install libfabric and hwloc first, then rebuild NIXL following the installation instructions above
2. **Use a pre-compiled plugin**: If you have a compatible `libplugin_LIBFABRIC.so`, set `NIXL_PLUGINS_DIR` to point to its directory
