[Roadmap]: PD Disaggregation with `NixlConnector` Roadmap #33702

### 🚀 The feature, motivation and pitch

 ## Description

  This RFC tracks the current state and planned improvements for Prefill-Decode (P/D) Disaggregation using the NixlConnector, which enables high-performance KV cache transfer between prefill and decode instances using the NIXL library.

  Currently Supported Features

  Core Infrastructure

  - [x] NIXL Integration - Core P/D disaggregation framework (https://github.com/vllm-project/vllm/pull/17751)

  Async KV Cache Transfers

  - [x] Fully asynchronous KV cache transfers
    - https://github.com/vllm-project/vllm/pull/33377 - Bugfix for async scheduling + request abort + async KV transfer
    - https://github.com/vllm-project/vllm/pull/28327 - Simplify async KV output aggregation
    - https://github.com/vllm-project/vllm/pull/27648 - Async scheduling support
    - https://github.com/vllm-project/vllm/pull/31583 - Fix resuming preempted requests after async load

  Multi-Transport Backend Support

  - [x] Multi-transport backend support - UCX (default), LIBFABRIC, and other NIXL plugins
    - ROCm support through RIXL library 
    - Support for OOT NIXL backends via kv_connector_extra_config (https://github.com/vllm-project/vllm/pull/33552)
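As a rough illustration of how a transport backend might be selected, the sketch below builds the JSON payload for a `--kv-transfer-config` style flag. The `backends` key inside `kv_connector_extra_config` and the `kv_buffer_device` field are assumptions based on common usage; check the NixlConnector usage docs for the exact schema your vLLM version accepts.

```python
import json


def build_kv_transfer_config(backends, buffer_device="cuda"):
    """Return a JSON string describing a NixlConnector KV transfer setup.

    The "backends" key and "kv_buffer_device" field are assumptions here,
    not a confirmed schema.
    """
    config = {
        "kv_connector": "NixlConnector",
        "kv_role": "kv_both",
        "kv_buffer_device": buffer_device,
        "kv_connector_extra_config": {"backends": list(backends)},
    }
    return json.dumps(config)


# Example: request UCX plus LIBFABRIC as NIXL transport backends.
ucx_and_libfabric = build_kv_transfer_config(["UCX", "LIBFABRIC"])
print(ucx_and_libfabric)
```

The resulting string would be passed on the command line of both the prefill and the decode instance, so that the two sides agree on the transport.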

  Tensor Parallelism

  - [x] Homogeneous Tensor Parallelism - P and D instances with matching TP sizes (https://github.com/vllm-project/vllm/pull/17751)
  - [x] Heterogeneous Tensor Parallelism - Support for different TP sizes between P and D
    - https://github.com/vllm-project/vllm/pull/18833 - Base heterogeneous D TP > P TP support
    - https://github.com/vllm-project/vllm/pull/20189 - Heterogeneous TP for FlashInfer
    - https://github.com/vllm-project/vllm/pull/27274 - P TP > D TP support (including MLA use-case)
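To make the two heterogeneous TP directions concrete, here is a minimal sketch of which prefill ranks a given decode rank would read from. The function name and the contiguous rank mapping are illustrative assumptions, not the connector's actual code; it also assumes the larger TP size is a multiple of the smaller one.

```python
def remote_prefill_ranks(decode_rank, decode_tp, prefill_tp):
    """Return the prefill TP ranks a decode rank pulls KV cache from.

    Hypothetical helper: assumes a contiguous mapping between ranks.
    """
    if not 0 <= decode_rank < decode_tp:
        raise ValueError("decode_rank out of range")
    if decode_tp >= prefill_tp:
        # D TP > P TP: several decode ranks share one prefill rank,
        # each reading only its own slice of that rank's KV heads.
        if decode_tp % prefill_tp != 0:
            raise ValueError("decode_tp must be a multiple of prefill_tp")
        ratio = decode_tp // prefill_tp
        return [decode_rank // ratio]
    # P TP > D TP: one decode rank gathers from several prefill ranks.
    if prefill_tp % decode_tp != 0:
        raise ValueError("prefill_tp must be a multiple of decode_tp")
    ratio = prefill_tp // decode_tp
    return list(range(decode_rank * ratio, (decode_rank + 1) * ratio))


print(remote_prefill_ranks(3, decode_tp=4, prefill_tp=2))
print(remote_prefill_ranks(1, decode_tp=2, prefill_tp=4))
```

With matching TP sizes the mapping degenerates to the identity, which corresponds to the homogeneous case.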

  MLA

  - [x] Support for MLA caches with a different latent dimension (DeepSeek V3.2 Indexer)
    - https://github.com/vllm-project/vllm/pull/25902

  CPU Host Buffer Transfers

  - [x] CPU host buffer transfers - Staged D2H -> H2D transfers for platforms without direct NIXL GPU-to-GPU support, such as TPU and XPU.
    - https://github.com/vllm-project/vllm/pull/18293 - Base CPU transfer support
    - https://github.com/vllm-project/vllm/pull/24690 - CUDA to CPU memory transfers
    - https://github.com/vllm-project/vllm/pull/28356 - Pure CPU environment support

  Heterogeneous Configurations

  The following features also *partially* enable hybrid hardware deployments, among other use cases.

  - [x] Support kernel_block_size != block_size (mismatch between logical and physical block sizes)
    - https://github.com/vllm-project/vllm/pull/30692
  - [ ] Heterogeneous block sizes - Different block sizes between P and D instances (cc @xuechendi)
    - https://github.com/vllm-project/vllm/pull/26759 - Heterogeneous block_size support
    - https://github.com/vllm-project/vllm/pull/30275 - Decoder-side post-processing for heterogeneous BlockSize
  - [x] Heterogeneous KV layout (experimental) - HND to NHD permutation via enable_permute_local_kv
    - https://github.com/vllm-project/vllm/pull/27743 - Cross-layer KV blocks support
    - https://github.com/vllm-project/vllm/pull/30275 - Heterogeneous layout handling
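The logical/physical block size mismatch can be pictured as an index expansion: one logical block covers several smaller kernel blocks. The sketch below is illustrative only; the function name is hypothetical and it assumes `block_size` is an exact multiple of `kernel_block_size`.

```python
def to_kernel_block_ids(logical_block_ids, block_size, kernel_block_size):
    """Expand logical block ids into the kernel (physical) block ids they cover.

    Hypothetical helper: assumes kernel blocks are laid out contiguously
    inside each logical block.
    """
    if block_size % kernel_block_size != 0:
        raise ValueError("block_size must be a multiple of kernel_block_size")
    ratio = block_size // kernel_block_size
    return [b * ratio + i for b in logical_block_ids for i in range(ratio)]


# Logical blocks of 64 tokens backed by kernel blocks of 16 tokens.
print(to_kernel_block_ids([0, 2], block_size=64, kernel_block_size=16))
```

A similar expansion could describe P and D instances that use different block sizes, where the side with the larger block addresses several of the peer's blocks at once.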

  Reliability & Observability

  - [x] Compatibility hash validation - Automatic P/D configuration compatibility checking
    - https://github.com/vllm-project/vllm/pull/29503 - Compatibility checking in NIXL handshake
    - https://github.com/vllm-project/vllm/pull/30182 - Debug logging level for compatibility hash
  - [x] Transfer failure handling - Block invalidation and kv_load_failure_policy (fail/recompute)
    - https://github.com/vllm-project/vllm/pull/32031 - Failure logging overhaul + early metadata free
    - https://github.com/vllm-project/vllm/pull/28120 - Avoid NIXL_ERR_REMOTE_DISCONNECT on prefill failure
    - https://github.com/vllm-project/vllm/pull/32198 - Document fail kv_load_failure_policy
    - https://github.com/vllm-project/vllm/pull/29665 - Add remote_request_id for better tracking
  - [x] NIXL telemetry and metrics - Transfer duration, throughput, failure counters (Prometheus)
    - https://github.com/vllm-project/vllm/pull/25388 - Expose NIXL metrics for CLI logging
    - https://github.com/vllm-project/vllm/pull/22188 - KVTransferMetrics aggregation strategy
    - https://github.com/vllm-project/vllm/pull/32340 - Track nixl_num_kv_expired_reqs in Prometheus
    - https://github.com/vllm-project/vllm/pull/28309 - KV events infrastructure
  - [ ] Request timeout/expiration - Automatic KV block release on P side via VLLM_NIXL_ABORT_REQUEST_TIMEOUT
    - https://github.com/vllm-project/vllm/pull/20139 - Remote consumer READ timeout for clearing blocks
    - https://github.com/vllm-project/vllm/pull/32340 - Expired requests metric tracking
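The idea behind compatibility hash validation can be sketched as follows: each side hashes the configuration fields that must match, and the handshake compares the digests. The field set and names below are hypothetical; the real connector decides which settings go into the hash.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TransferConfig:
    # Hypothetical fields chosen for illustration only.
    model: str
    kv_cache_dtype: str
    attention_backend: str


def compatibility_hash(cfg):
    """Hash the config deterministically (sorted keys) with SHA-256."""
    payload = json.dumps(asdict(cfg), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def is_compatible(local_cfg, remote_hash):
    """Return True if the remote side reported the same config hash."""
    return compatibility_hash(local_cfg) == remote_hash


prefill = TransferConfig("my-model", "bfloat16", "FLASH_ATTN")
decode = TransferConfig("my-model", "fp8", "FLASH_ATTN")
print(is_compatible(decode, compatibility_hash(prefill)))
```

Comparing a single digest keeps the handshake payload small, at the cost of not saying which field differs; that is presumably why a debug logging level for the hash inputs is useful.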

  Deployment Configuration Guides

  - [x] Multi-instance deployments - Multiple P and D instances across hosts
    - https://github.com/vllm-project/vllm/pull/28782 - Proxy server improvements for high concurrency
    - https://github.com/vllm-project/vllm/pull/24249 - Clearer deployment examples
  - [x] Data Parallel support - DP deployments with per-rank side channel ports
    - https://github.com/vllm-project/vllm/pull/28782 - DP deployment documentation and proxy improvements
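One way to picture per-rank side channel ports is an offset from a shared base port. The formula below is an assumption made for illustration, not the connector's confirmed scheme; consult the deployment docs for the actual port layout.

```python
def side_channel_port(base_port, dp_rank, tp_size, tp_rank=0):
    """Derive a unique handshake port per (dp_rank, tp_rank) pair.

    Hypothetical scheme: each DP rank reserves tp_size consecutive ports.
    """
    if not 0 <= tp_rank < tp_size:
        raise ValueError("tp_rank out of range")
    return base_port + dp_rank * tp_size + tp_rank


# Two DP ranks with TP=2 each; the base port is an arbitrary example value.
ports = [
    side_channel_port(5600, dp, tp_size=2, tp_rank=tp)
    for dp in range(2)
    for tp in range(2)
]
print(ports)
```

Whatever the exact formula, the proxy or the decode side has to be able to derive the same port for a given rank, so the scheme must be deterministic.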

  Work in Progress

  - [ ] NIXL+Hybrid Memory Allocator - https://github.com/vllm-project/vllm/pull/32204
  - [ ] SSM (Mamba) support - draft PR coming soon (cc @roikoren755)
  - [ ] Documentation improvements - Comprehensive usage guides and troubleshooting
  - [x] Enhanced error diagnostics - Structured logging with failure context for easier debugging
  - [ ] Enable drain scaledown mode for single process deployments - https://github.com/vllm-project/vllm/pull/32420

  Upcoming

  - [ ] Speculative decoding integration - P/D disaggregation with speculative decoding
  - [ ] Pipeline parallelism support - P/D disaggregation with pipeline parallelism
  - [ ] Multi-backend model support - Models with multiple attention backends (mostly validation of HMA feature coverage)
  - [ ] Hybrid hardware deployment - Supported to the extent tested by @xuechendi and team

  Backlog
  - [ ] HTTP-based handshake endpoint - Replace ZMQ side channel with HTTP for better observability

  RFC
  - [ ] Bi-directional KV transfers with Nixl connector - https://github.com/vllm-project/vllm/issues/32733

  Related Projects

  - Encoder-Prefill-Decode Disaggregation: https://github.com/vllm-project/vllm/pull/25233
  - Mooncake Transfer Engine: https://github.com/vllm-project/vllm/pull/24718, https://github.com/vllm-project/vllm/pull/31573


cc @robertgshaw2-redhat @tlrmchlsmth @markmc @njhill @orozery 

### Alternatives

_No response_

### Additional context

_No response_

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
