Open
Labels
feature request
🚀 The feature, motivation and pitch
Description
This RFC tracks the current state and planned improvements for Prefill-Decode (P/D) Disaggregation using the NixlConnector, which enables high-performance KV cache transfer between prefill and decode instances using the NIXL library.
Currently Supported Features
Core Infrastructure
- NIXL Integration - Core P/D disaggregation framework ([P/D] NIXL Integration #17751)
Async KV Cache Transfers
- Fully asynchronous KV cache transfers
- [Bugfix][Async][Connector] avoid vllm-side double free during async scheduling + request abort + async KV cache transfer #33377 - Bugfix for async scheduling + request abort + async KV transfer
- [Core] Simplify async KV output aggregation #28327 - Simplify async KV output aggregation
- [KV offload] Offloading connector async scheduling support #27648 - Async scheduling support
- [BugFix] scheduler: Fix resuming of preempted requests after async load #31583 - Fix resuming preempted requests after async load
Multi-Transport Backend Support
- Multi-transport backend support - UCX (default), LIBFABRIC, and other NIXL plugins
- ROCm support through RIXL library
- Support for OOT NIXL backends via kv_connector_extra_config (Document NixlConnector backend selection via kv_connector_extra_config #33552)
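As a rough illustration of backend selection, the sketch below builds a `--kv-transfer-config` payload in Python. The `backends` key inside `kv_connector_extra_config` and the `kv_both` role are assumptions based on the pattern referenced in #33552; check the NixlConnector documentation for your vLLM version before relying on them. `build_kv_transfer_config` is a hypothetical helper, not part of vLLM.

```python
import json
import shlex


def build_kv_transfer_config(backends):
    """Build a kv-transfer-config payload that selects NIXL backends.

    Hypothetical helper for illustration only.
    """
    return {
        "kv_connector": "NixlConnector",
        "kv_role": "kv_both",  # assumed role value
        # Assumed key: a list of NIXL backend/plugin names.
        "kv_connector_extra_config": {"backends": list(backends)},
    }


config = build_kv_transfer_config(["UCX"])
# Quote the JSON so it survives as a single shell argument.
cli_arg = "--kv-transfer-config " + shlex.quote(json.dumps(config))
print(cli_arg)
```

Passing a different list (for example `["LIBFABRIC"]`) would select another transport under the same assumptions.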
Tensor Parallelism
- Homogeneous Tensor Parallelism - P and D instances with matching TP sizes ([P/D] NIXL Integration #17751)
- Heterogeneous Tensor Parallelism - Support for different TP sizes between P and D
- [P/D] Heterogeneous TP #18833 - Base heterogeneous D TP > P TP support
- [Nixl] Heterogeneous TP support FlashInfer #20189 - Heterogeneous TP for FlashInfer
- [NIXL] Support P tensor-parallel-size > D tensor-parallel-size #27274 - P TP > D TP support (including MLA use-case)
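To make the two heterogeneous TP directions concrete, here is a minimal sketch of which prefill ranks a decode rank would read from. It assumes KV heads are sharded contiguously across TP ranks and that the larger TP size is a multiple of the smaller one; `remote_prefill_ranks` is a hypothetical name, and the connector's real mapping (especially for MLA) may differ.

```python
def remote_prefill_ranks(d_rank, d_tp, p_tp):
    """Return the prefill TP ranks that decode rank `d_rank` reads from.

    Illustrative only: assumes contiguous head sharding and evenly
    divisible TP sizes.
    """
    if not 0 <= d_rank < d_tp:
        raise ValueError("d_rank out of range")
    if d_tp >= p_tp:
        if d_tp % p_tp:
            raise ValueError("D TP must be a multiple of P TP")
        ratio = d_tp // p_tp
        # Several decode ranks share one prefill rank; each reads a slice.
        return [d_rank // ratio]
    if p_tp % d_tp:
        raise ValueError("P TP must be a multiple of D TP")
    ratio = p_tp // d_tp
    # One decode rank gathers from several consecutive prefill ranks.
    return list(range(d_rank * ratio, (d_rank + 1) * ratio))


print(remote_prefill_ranks(3, d_tp=4, p_tp=2))
print(remote_prefill_ranks(1, d_tp=2, p_tp=4))
```

The homogeneous case falls out of the first branch with a ratio of 1, so each decode rank reads from the prefill rank with the same index.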
MLA
- Support for MLA caches with a different latent dim (DeepSeek V3.2 Indexer)
CPU Host Buffer Transfers
- CPU host buffer transfers - Support for platforms without direct NIXL GPU-GPU transfer (D2H->H2D), for TPU, XPU and more.
- [P/D] Support CPU Transfer in NixlConnector #18293 - Base CPU transfer support
- [Cuda2CPU][P/D] Add cuda2cpu support in NixlConnector #24690 - CUDA to CPU memory transfers
- add cpu option for p/d in nixl_connector #28356 - Pure CPU environment support
Heterogeneous Configurations
The following features also partially enable hybrid hardware deployments, among other use cases.
- Support kernel_block_size != block_size (logical <> physical block_size mismatch)
- Heterogeneous block sizes - Different block sizes between P and D instances (cc @xuechendi)
- [NIXL] heterogeneous block_size support #26759 - Heterogeneous block_size support
- [NIXL] refine decoder side post process for heterogeneous BlockSize and kv_layout #30275 - Decoder-side post-processing for heterogeneous BlockSize
- Heterogeneous KV layout (experimental) - HND to NHD permutation via enable_permute_local_kv
- [KVConnector][Core] Support cross-layer KV blocks #27743 - Cross-layer KV blocks support
- [NIXL] refine decoder side post process for heterogeneous BlockSize and kv_layout #30275 - Heterogeneous layout handling
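The block-size mismatch can be pictured as a token-range mapping. The sketch below computes which remote (P-side) block positions cover one local (D-side) block when the two sides use different block sizes. It works on positions within a request's block table, not physical block IDs, and `remote_positions_for_local` is a hypothetical helper rather than the connector's implementation.

```python
def remote_positions_for_local(local_idx, local_block_size, remote_block_size):
    """Remote block-table positions whose tokens overlap one local block.

    Illustrative only: positions are indices into a request's block table.
    """
    if local_block_size <= 0 or remote_block_size <= 0:
        raise ValueError("block sizes must be positive")
    first_token = local_idx * local_block_size
    last_token = first_token + local_block_size - 1
    first_remote = first_token // remote_block_size
    last_remote = last_token // remote_block_size
    return list(range(first_remote, last_remote + 1))


# D uses 64-token blocks, P uses 16-token blocks.
print(remote_positions_for_local(1, local_block_size=64, remote_block_size=16))
# D uses 16-token blocks, P uses 64-token blocks.
print(remote_positions_for_local(5, local_block_size=16, remote_block_size=64))
```

When the local block is larger, one local block spans several remote blocks; when it is smaller, several local blocks fall inside the same remote block and each needs only a slice of it.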
Reliability & Observability
- Compatibility hash validation - Automatic P/D configuration compatibility checking
- [NIXL] Add compatibility checking to NIXL KV connector handshake #29503 - Compatibility checking in NIXL handshake
- [MISC]: change NIXL compatibility hash logging level to debug #30182 - Debug logging level for compatibility hash
- Transfer failure handling - Block invalidation and kv_load_failure_policy (fail/recompute)
- [NIXL][Bugfix] Failure logging overhaul + early metadata free on failure #32031 - Failure logging overhaul + early metadata free
- [bugfix] avoid NIXL_ERR_REMOTE_DISCONNECT in nixl_connector when Prefill dies #28120 - Avoid NIXL_ERR_REMOTE_DISCONNECT on prefill failure
- [Docs] Nixl Usage recommend fail kv_load_failure_policy #32198 - Document fail kv_load_failure_policy
- [NIXL] Add remote_request_id to kv_transfer_params #29665 - Add remote_request_id for better tracking
- NIXL telemetry and metrics - Transfer duration, throughput, failure counters (Prometheus)
- [NIXL][Misc] Expose metrics from NIXL for logging to CLI #25388 - Expose NIXL metrics for CLI logging
- [P/D][Nixl] Introduce KVTransferMetrics and aggregation strategy #22188 - KVTransferMetrics aggregation strategy
- [Nixl][Bugfix] Track nixl_num_kv_expired_reqs metric in Prometheus #32340 - Track nixl_num_kv_expired_reqs in Prometheus
- [KVConnector] Add KV events to KV Connectors #28309 - KV events infrastructure
- Request timeout/expiration - Automatic KV block release on P side via VLLM_NIXL_ABORT_REQUEST_TIMEOUT
- [PD][Nixl] Remote consumer READ timeout for clearing request blocks #20139 - Remote consumer READ timeout for clearing blocks
- [Nixl][Bugfix] Track nixl_num_kv_expired_reqs metric in Prometheus #32340 - Expired requests metric tracking
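The failure policy and request expiration items above can be sketched roughly as follows. This is a simplified model under stated assumptions, not vLLM's scheduler logic: `resolve_failed_load` and `expired_requests` are hypothetical helpers, the recompute path is assumed to keep tokens up to the first invalid block, and the fallback timeout value is illustrative rather than vLLM's default.

```python
import os
from enum import Enum


class KVLoadFailurePolicy(Enum):
    FAIL = "fail"
    RECOMPUTE = "recompute"


def resolve_failed_load(policy, failed_block_positions, block_size):
    """Return (action, reusable_tokens) for a request whose KV load failed."""
    if policy is KVLoadFailurePolicy.FAIL:
        return ("abort", 0)
    # Assumption: only tokens before the first invalid block stay usable.
    first_bad = min(failed_block_positions)
    return ("recompute", first_bad * block_size)


# Illustrative fallback only; not claimed to be vLLM's default value.
TIMEOUT_S = float(os.environ.get("VLLM_NIXL_ABORT_REQUEST_TIMEOUT", "120"))


def expired_requests(deadlines, now):
    """Request IDs whose prefill-side blocks may be released at time `now`."""
    return sorted(rid for rid, deadline in deadlines.items() if deadline <= now)


print(resolve_failed_load(KVLoadFailurePolicy.RECOMPUTE, [3, 5], block_size=16))
print(expired_requests({"req-a": 5.0, "req-b": 20.0}, now=10.0))
```

In a real deployment each deadline would be the time a request finished prefill plus the configured timeout, so that blocks never read by a decode instance are eventually freed.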
Deployment Configuration Guides
- Multi-instance deployments - Multiple P and D instances across hosts
- [Disagg] Support large batch size in proxy server and update NixlConnector doc for DP #28782 - Proxy server improvements for high concurrency
- [test/doc] make NixlConnector example more clear #24249 - Clearer deployment examples
- Data Parallel support - DP deployments with per-rank side channel ports
- [Disagg] Support large batch size in proxy server and update NixlConnector doc for DP #28782 - DP deployment documentation and proxy improvements
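For DP deployments, the idea of per-rank side channel ports can be sketched as below. The formula (base port plus DP rank) is an assumption made for illustration; the NixlConnector documentation updated in #28782 is the authority on the actual scheme, and `side_channel_ports` is a hypothetical helper.

```python
def side_channel_ports(base_port, dp_size):
    """Map each DP rank to a distinct handshake port.

    Assumed scheme for illustration: port = base_port + dp_rank.
    """
    if dp_size < 1:
        raise ValueError("dp_size must be at least 1")
    return {dp_rank: base_port + dp_rank for dp_rank in range(dp_size)}


ports = side_channel_ports(base_port=5600, dp_size=4)
for rank, port in ports.items():
    print(f"dp_rank={rank} -> side channel port {port}")
```

Whatever the exact formula, a proxy or firewall rule in front of the instances would need to allow the whole resulting port range on each host.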
Work in Progress
- NIXL+Hybrid Memory Allocator - [Core][KVConnector] Support HMA+NixlConnector #32204
- SSM (Mamba) support - draft coming soon (cc @roikoren755)
- Documentation improvements - Comprehensive usage guides and troubleshooting
- Enhanced error diagnostics - Structured logging with failure context for easier debugging
- Enable drain scaledown mode for single process deployments - [Frontend] Enable drain shutdown mode for non-DP deployments #32420
Upcoming
- Speculative decoding integration - P/D disaggregation with speculative decoding
- Pipeline parallelism support - P/D disaggregation with pipeline parallelism
- Multi-backend model support - Models with multiple attention backends (mostly validation of HMA feature coverage)
- Hybrid hardware deployment - Supported to the extent tested by @xuechendi and team
Backlog
- HTTP-based handshake endpoint - Replace ZMQ side channel with HTTP for better observability
RFC
- Bi-directional KV transfers with Nixl connector - [RFC]: [P/D] Prefill compute optimizations with bi-directional KV cache transfers between P and D nodes #32733
Related Projects
- Encoder-Prefill-Decode Disaggregation: [Core] Encoder separation for Encode-Prefill-Decode Disaggregation #25233
- Mooncake Transfer Engine: [P/D] Introduce Mooncake Transfer Engine as kv_connector #24718, [P/D] Refactor mooncake connector sender thread using async coroutines #31573
cc @robertgshaw2-redhat @tlrmchlsmth @markmc @njhill @orozery
Alternatives
No response
Additional context
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.