NIXL supports multiple underlying communication backends for KV cache exchange in disaggregated serving. The backend can be configured using the `TRTLLM_NIXL_KVCACHE_BACKEND` environment variable.

**Supported NIXL backends:**

- **UCX** (default)
- **LIBFABRIC** (available from v0.16.0)
If an unsupported backend is specified, NIXL will automatically fall back to UCX.

For detailed setup instructions and configuration examples, please refer to the [disaggregated serving examples documentation](../../../examples/disaggregated/README.md).
### Overlap Optimization
To optimize the overall performance of disaggregated serving, TensorRT LLM overlaps the KV cache transmission with computation for multiple independent requests. While one request is sending or receiving its KV cache blocks, other requests can proceed with computation, as illustrated in Figure 4. Furthermore, if context and generation instances are using multiple GPUs per instance, KV cache transmission between different sets of GPUs can occur in parallel.
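The effect of this overlap can be sketched with plain Python threads. This is an illustrative model only, not TensorRT LLM's implementation: the function names and the thread-based scheme are assumptions, standing in for KV cache transfer and per-request computation.

```python
# Sketch: while one request transfers its KV cache, other requests compute.
import time
from concurrent.futures import ThreadPoolExecutor

def transfer_kv_cache(request_id):
    time.sleep(0.05)          # stand-in for sending/receiving KV blocks
    return f"req{request_id}: cache transferred"

def compute(request_id):
    time.sleep(0.05)          # stand-in for a forward pass
    return f"req{request_id}: step computed"

start = time.perf_counter()
with ThreadPoolExecutor() as pool:
    # While request 0 transfers its KV cache, requests 1 and 2 keep computing.
    futures = [pool.submit(transfer_kv_cache, 0),
               pool.submit(compute, 1),
               pool.submit(compute, 2)]
    results = [f.result() for f in futures]
elapsed = time.perf_counter() - start
# Overlapped, the three 50 ms operations finish in roughly 50 ms, not 150 ms.
print(results)
```
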
```yaml
cache_transceiver_config:
  backend: <str>
  max_tokens_in_buffer: <int>
```
`backend` specifies the communication backend for transferring the kvCache, valid options include `DEFAULT`, `UCX`, `NIXL`, and `MPI`. The default backend is NIXL.

Note: NIXL supports multiple underlying backends configured via the `TRTLLM_NIXL_KVCACHE_BACKEND` environment variable:

- `UCX` (default)
- `LIBFABRIC` (available from v0.16.0)

`max_tokens_in_buffer` defines the buffer size for kvCache transfers. It is recommended to set this value greater than or equal to the maximum ISL (Input Sequence Length) of all requests for optimal performance.
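For example, if the longest prompt expected in the deployment is 4096 tokens, a configuration like the following (illustrative values) satisfies that recommendation:

```yaml
cache_transceiver_config:
  backend: NIXL
  max_tokens_in_buffer: 4096   # >= maximum expected ISL
```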
TRT-LLM uses several environment variables to control the behavior of the disaggregated service.
* `TRTLLM_NIXL_KVCACHE_BACKEND`: When using NIXL as the cache transceiver backend, this variable specifies the underlying communication backend for NIXL. Valid options are:
  - `UCX` (default)
  - `LIBFABRIC` (available from v0.16.0)
  - If an unsupported value is specified, NIXL will automatically fall back to UCX.
* `TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP`: If set to `1`, generationExecutor will not overlap KV cache transfer with model inference. The default value is `0`.
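Taken together, a launch environment combining these variables might look like the following sketch (the values shown are illustrative; both variables are optional):

```shell
# Illustrative settings for a NIXL-based deployment.
export TRTLLM_NIXL_KVCACHE_BACKEND=UCX            # underlying NIXL backend
export TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP=0 # keep transfer/compute overlap on
echo "NIXL backend: $TRTLLM_NIXL_KVCACHE_BACKEND"
```
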
### Debugging FAQs
*Q. Why does NIXL fail to use the LIBFABRIC backend even when `TRTLLM_NIXL_KVCACHE_BACKEND=LIBFABRIC` is set?*

A: The TensorRT-LLM container doesn't include the NIXL LIBFABRIC plugin by default. You need to either:
1. **Rebuild NIXL**: Install libfabric and hwloc first, then rebuild NIXL following the installation instructions above
269
+
2. **Use a pre-compiled plugin**: If you have a compatible `libplugin_LIBFABRIC.so`, set `NIXL_PLUGINS_DIR` to point to its directory

Please see the [disaggregated serving examples documentation](../../../examples/disaggregated/README.md) for detailed installation and configuration instructions.

*Q. How to handle the error `Disaggregated serving is not enabled, please check the configuration`?*

A. Please set the `backendType` field of `CacheTransceiverConfig`.
---

# `examples/disaggregated/README.md`
## NIXL Backend Configuration
NIXL supports multiple underlying communication backends for KV cache exchange. The backend can be configured using the `TRTLLM_NIXL_KVCACHE_BACKEND` environment variable.

**Supported NIXL backends:**

- **UCX** (default)
- **LIBFABRIC** (available from v0.16.0)

If an unsupported backend is specified, NIXL will automatically fall back to UCX.
56
+
57
+
### LIBFABRIC Backend Setup
58
+
59
+
**Important Note:** The TensorRT LLM container does not include libfabric or the NIXL-LIBFABRIC plugin by default. You must either rebuild NIXL with libfabric support or provide a pre-compiled plugin.
60
+
61
+
#### Prerequisites
62
+
63
+
##### For LIBFABRIC Backend
64
+
65
+
**Required Dependencies:**

**Libfabric**
- Custom libfabric installation is available via [https://ofiwg.github.io/libfabric/](https://ofiwg.github.io/libfabric/)
69
+
- **Minimum required version:** v1.21.0
70
+
- For EFA-enabled AWS instances, install through the [AWS EFA installer](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html) (the latest version is recommended)

**hwloc**
- hwloc is used to understand the underlying architecture to optimize application performance
74
+
- **Suggested version:** 2.10.0 or newer

**Network Hardware Requirements:**
- Validated compatibility with AWS EFA (Elastic Fabric Adapter)
78
+
79
+
##### For UCX Backend
80
+
81
+
UCX is typically pre-installed in NVIDIA GPU containers. No additional installation is usually required.
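A quick way to confirm this is to query the `ucx_info` utility that ships with UCX. This check is a convenience sketch, not a required step:

```shell
# Report whether UCX is available in the current container.
if command -v ucx_info >/dev/null 2>&1; then
  UCX_STATUS="present: $(ucx_info -v 2>/dev/null | head -n 1)"
else
  UCX_STATUS="missing"
fi
echo "UCX ${UCX_STATUS}"
```
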
82
+
83
+
#### Installation Options
84
+
85
+
##### Option 1: Rebuild NIXL with LIBFABRIC Support (Recommended)
86
+
87
+
1. **Install libfabric dependencies:**
   - Follow the installation instructions from the links above based on your system
2. **Install hwloc:**
   - Use your package manager or build from source
3. **Reinstall NIXL after installing libfabric:**
   - After installing libfabric and hwloc, you must rebuild NIXL to generate the LIBFABRIC plugin
   - You can base your installation on the TensorRT LLM NIXL installation script located at `docker/common/install_nixl.sh`
   - Modify the meson setup command in the script to include the libfabric path:

     ```bash
     meson setup builddir \
       ...
       -Dlibfabric_path=/path/to/libfabric \  # Add this line
       --buildtype=release
     ```

   - For more details, see the [NIXL LIBFABRIC Plugin documentation](https://github.com/ai-dynamo/nixl/tree/6ee64753605b3110f8ef96c7cfc2f1315675c9c7/src/plugins/libfabric#nixl-libfabric-plugin)
##### Option 2: Use Pre-compiled LIBFABRIC Plugin
If you have a pre-compiled `libplugin_LIBFABRIC.so` that matches your NIXL version:
1. Place the plugin file in a directory of your choice
2. Set the environment variable to point to the plugin directory:

   ```bash
   export NIXL_PLUGINS_DIR=/path/to/plugin/directory
   export TRTLLM_NIXL_KVCACHE_BACKEND=LIBFABRIC
   ```

3. Ensure the plugin was built with the same NIXL version as in your container
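A quick sanity check before launching (the directory path is a placeholder, as above) is to confirm the plugin file is actually where `NIXL_PLUGINS_DIR` points; if it is missing, NIXL falls back to UCX:

```shell
# NIXL_PLUGINS_DIR should contain libplugin_LIBFABRIC.so (path is a placeholder).
export NIXL_PLUGINS_DIR=/path/to/plugin/directory
if [ -f "$NIXL_PLUGINS_DIR/libplugin_LIBFABRIC.so" ]; then
  echo "LIBFABRIC plugin found"
else
  echo "LIBFABRIC plugin not found; NIXL will fall back to UCX" >&2
fi
```
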
### NIXL Configuration Examples
To use NIXL for KV cache exchange, configure the `cache_transceiver_config` with `backend: NIXL`. The underlying NIXL backend (UCX or LIBFABRIC) is selected via the `TRTLLM_NIXL_KVCACHE_BACKEND` environment variable.

**Context server configuration:**
```yaml
# context_config_nixl.yml
disable_overlap_scheduler: True
cache_transceiver_config:
  backend: NIXL
  max_tokens_in_buffer: 2048
```

**Generation server configuration:**
```yaml
# gen_config_nixl.yml
cache_transceiver_config:
  backend: NIXL
  max_tokens_in_buffer: 2048
```
#### Example 1: Using NIXL with UCX backend (default)
```bash
# UCX is the default, but can be explicitly set
export TRTLLM_NIXL_KVCACHE_BACKEND=UCX  # Optional, UCX is default
```
**Example configuration for AWS EFA with LIBFABRIC:**
```bash
export TRTLLM_NIXL_KVCACHE_BACKEND=LIBFABRIC
export FI_PROVIDER=efa
export FI_EFA_USE_DEVICE_RDMA=1
export FI_LOG_LEVEL=warn
```
### Basic Usage
For non-SLURM clusters, particularly single-node, multi-GPU setups, standard mode is recommended. In such cases, the system does not enforce limits on process creation or termination.
Additionally, we offer a fully executable script; please refer to [Disaggregated SLURM Scripts](./slurm/simple_example/).
### Kubernetes Deployment with AWS EFA
LIBFABRIC backend is particularly useful for Kubernetes deployments on AWS with EFA (Elastic Fabric Adapter) for high-performance networking between pods in disaggregated serving.
#### Prerequisites
- Kubernetes cluster with GPU nodes and EFA support
- TensorRT-LLM container with wheel package pre-installed
#### Deployment Steps
##### 1. Configure Pod Resources
When deploying on Kubernetes with EFA, ensure proper resource allocation in your pod specification:
```yaml
resources:
  limits:
    nvidia.com/gpu: 2          # Number of GPUs for this pod
    vpc.amazonaws.com/efa: 4   # Number of EFA network interfaces
```
##### 2. Install EFA Libraries in Container
The AWS EFA library must be installed in the container for LIBFABRIC to work:

```bash
# Install AWS EFA library (required for LIBFABRIC with EFA)
# See the AWS EFA installer documentation linked above for the exact commands.
```
Use ConfigMaps to manage configurations for disaggregated serving:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: disagg-config
data:
  context.yaml: |
    disable_overlap_scheduler: true
    cache_transceiver_config:
      backend: NIXL
      max_tokens_in_buffer: 2048
  generation.yaml: |
    cache_transceiver_config:
      backend: NIXL
      max_tokens_in_buffer: 2048
```

Launch services:
```bash
# For context servers
TRTLLM_NIXL_KVCACHE_BACKEND=LIBFABRIC \
trtllm-serve <model> \
  --host localhost --port 8001 \
  --config /configs/context.yaml

# For generation servers
TRTLLM_NIXL_KVCACHE_BACKEND=LIBFABRIC \
trtllm-serve <model> \
  --host localhost --port 8002 \
  --config /configs/generation.yaml

# For disaggregated proxy server
trtllm-serve disaggregated -c disagg_config.yaml
```
## Mixed Precision Context and Generation
In disaggregated serving, the context workers and generation workers have different performance characteristics: context workers are compute-bound while generation workers are memory-bound. Therefore, it may be beneficial to run context workers and generation workers in different precisions.
The MPI communication backend for KV cache transfer has been deprecated and may not be supported in the future. When using the MPI backend, the environment variable `TRTLLM_USE_MPI_KVCACHE=1` should be set to avoid conflicts between mpi4py and KV cache transfer.
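For completeness, when the deprecated MPI backend is still used, that variable would be set before launching the servers:

```shell
# Required only with the deprecated MPI backend, to avoid mpi4py conflicts.
export TRTLLM_USE_MPI_KVCACHE=1
```
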
## Troubleshooting
### NIXL LIBFABRIC Backend Issues
**Q: Why does NIXL fail to use the LIBFABRIC backend even when `TRTLLM_NIXL_KVCACHE_BACKEND=LIBFABRIC` is set?**

A: The TensorRT-LLM container doesn't include the NIXL LIBFABRIC plugin by default. You need to either:
1. **Rebuild NIXL**: Install libfabric and hwloc first, then rebuild NIXL following the installation instructions above
2. **Use a pre-compiled plugin**: If you have a compatible `libplugin_LIBFABRIC.so`, set `NIXL_PLUGINS_DIR` to point to its directory