
Commit 88caaf1

[Feature] Split vllm-project#1303 Part 1: PD disaggregation scaffolding (vllm-project#1863)
Signed-off-by: Jinheng Li <ahengljh@gmail.com>
1 parent 6a5fa58 commit 88caaf1

File tree

5 files changed: +723 -0 lines changed


docs/configuration/README.md

Lines changed: 4 additions & 0 deletions
```diff
@@ -14,6 +14,10 @@ For introduction, please check [Introduction for stage config](./stage_configs.m
 
 - **[GPU Memory Calculation and Configuration](./gpu_memory_utilization.md)** - Guide on how to calculate memory requirements and set up `gpu_memory_utilization` for optimal performance
 
+## Multi-Stage Recipes
+
+- **[Prefill-Decode Disaggregation](./pd_disaggregation.md)** - How to derive a PD-aware Qwen3-Omni stage config from the default config without introducing another bundled YAML
+
 ## Optimization Features
 
 - **[TeaCache Configuration](../user_guide/diffusion/teacache.md)** - Enable TeaCache adaptive caching for DiT models to achieve 1.5x-2.0x speedup with minimal quality loss
```
Lines changed: 176 additions & 0 deletions

# Prefill-Decode (PD) Disaggregation

PD disaggregation splits the Qwen3-Omni thinker into separate prefill and decode
stages so prompt processing and token generation can run on different workers.

This is documented as a stage-config recipe instead of a bundled YAML because the
deployment-specific values usually change per environment:

- GPU placement
- `tensor_parallel_size`
- connector backend and connector ports
- connector IPs or bootstrap addresses

Start from the [default Qwen3-Omni stage config](gh-file:vllm_omni/model_executor/stage_configs/qwen3_omni_moe.yaml)
and copy it to your own file, for example `qwen3_omni_pd.yaml`. Then apply the
changes below.

## Requirements

- 3+ GPUs for a basic layout: prefill, decode, and talker+code2wav
- A KV connector supported by vLLM, such as `MooncakeConnector`
- Matching `tensor_parallel_size` on the prefill and decode thinker stages
## 1. Split the thinker into prefill and decode stages

Replace the original thinker stage with two stages:

```yaml
stage_args:
  - stage_id: 0
    stage_type: llm
    is_prefill_only: true
    runtime:
      devices: "0"
      max_batch_size: 16
    engine_args:
      model_stage: thinker
      model_arch: Qwen3OmniMoeForConditionalGeneration
      worker_type: ar
      scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
      gpu_memory_utilization: 0.9
      enforce_eager: true
      trust_remote_code: true
      engine_output_type: latent
      distributed_executor_backend: "mp"
      enable_prefix_caching: false
      max_num_batched_tokens: 32768
      hf_config_name: thinker_config
      tensor_parallel_size: 1
      kv_transfer_config:
        kv_connector: "MooncakeConnector"
        kv_role: "kv_producer"
        kv_rank: 0
        kv_parallel_size: 2
        kv_connector_extra_config:
          mooncake_bootstrap_port: 25201
    final_output: false
    is_comprehension: true
    default_sampling_params:
      temperature: 0.4
      top_p: 0.9
      top_k: 1
      max_tokens: 2048
      seed: 42
      detokenize: True
      repetition_penalty: 1.05

  - stage_id: 1
    stage_type: llm
    is_decode_only: true
    runtime:
      devices: "1"
      max_batch_size: 64
    engine_args:
      model_stage: thinker
      model_arch: Qwen3OmniMoeForConditionalGeneration
      worker_type: ar
      scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
      gpu_memory_utilization: 0.9
      enforce_eager: true
      trust_remote_code: true
      engine_output_type: latent
      distributed_executor_backend: "mp"
      enable_prefix_caching: false
      max_num_batched_tokens: 32768
      hf_config_name: thinker_config
      tensor_parallel_size: 1
      kv_transfer_config:
        kv_connector: "MooncakeConnector"
        kv_role: "kv_consumer"
        kv_rank: 1
        kv_parallel_size: 2
        kv_connector_extra_config:
          mooncake_bootstrap_port: 25202
    engine_input_source: [0]
    final_output: true
    final_output_type: text
    is_comprehension: true
    default_sampling_params:
      temperature: 0.4
      top_p: 0.9
      top_k: 1
      max_tokens: 2048
      seed: 42
      detokenize: True
      repetition_penalty: 1.05
```
Notes:

- `is_prefill_only: true` marks the thinker stage that only saves KV.
- `is_decode_only: true` marks the thinker stage that resumes from remote KV.
- `kv_transfer_config` is required on both stages.
- The orchestrator forces the prefill stage to run with `max_tokens=1`, so the
  prefill side only processes the prompt and exports KV.
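The `max_tokens=1` behaviour can be pictured with a small helper. This is an illustration of the rule only, not vllm-omni's actual orchestrator code; the function name and dict shape are hypothetical.

```python
def effective_sampling_params(stage: dict) -> dict:
    """Illustrative only: a prefill-only stage is clamped to max_tokens=1,
    so it processes the prompt, emits a single token, and exports KV."""
    params = dict(stage.get("default_sampling_params", {}))
    if stage.get("is_prefill_only"):
        params["max_tokens"] = 1  # prompt processing + KV export only
    return params

prefill = {"is_prefill_only": True, "default_sampling_params": {"max_tokens": 2048}}
assert effective_sampling_params(prefill)["max_tokens"] == 1
```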
## 2. Shift the downstream stages by one index

After inserting the extra thinker stage, renumber the remaining stages:

```yaml
  - stage_id: 2
    runtime:
      devices: "2"
    engine_input_source: [1]
    custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.thinker2talker

  - stage_id: 3
    runtime:
      devices: "2"
      max_batch_size: 1
    engine_input_source: [2]
    custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.talker2code2wav
```

Compared with the default Qwen3-Omni config:

- the talker becomes stage `2` instead of stage `1`
- the code2wav stage becomes stage `3` instead of stage `2`
- the talker now reads from decode stage `1`
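The renumbering above is mechanical, so it can be scripted. A minimal sketch, assuming each downstream stage is a dict with `stage_id` and `engine_input_source` keys as in the YAML above; the helper itself is hypothetical, not part of vllm-omni:

```python
def shift_downstream_stages(stages: list[dict]) -> list[dict]:
    """Renumber downstream stages after inserting a decode-only thinker as
    stage 1: every old stage i becomes i + 1, and every engine_input_source
    entry is bumped by one (the consumable thinker output now comes from
    decode stage 1 instead of stage 0)."""
    shifted = []
    for stage in stages:
        stage = dict(stage)
        stage["stage_id"] += 1
        if "engine_input_source" in stage:
            stage["engine_input_source"] = [s + 1 for s in stage["engine_input_source"]]
        shifted.append(stage)
    return shifted

# The default talker/code2wav stages map exactly as described above.
talker = {"stage_id": 1, "engine_input_source": [0]}
code2wav = {"stage_id": 2, "engine_input_source": [1]}
assert shift_downstream_stages([talker, code2wav]) == [
    {"stage_id": 2, "engine_input_source": [1]},
    {"stage_id": 3, "engine_input_source": [2]},
]
```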
## 3. Add runtime edges for the four-stage pipeline

```yaml
runtime:
  enabled: true
  defaults:
    window_size: -1
    max_inflight: 1
  edges:
    - from: 0
      to: 1
      window_size: -1
    - from: 1
      to: 2
      window_size: -1
    - from: 2
      to: 3
      window_size: -1
```
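As a sanity check, the edges should wire the four stages into a single linear chain. A small hypothetical validator (not part of vllm-omni) that captures the invariant:

```python
def edges_form_chain(edges: list[dict], num_stages: int) -> bool:
    """True if the edges form the linear pipeline 0 -> 1 -> ... -> num_stages - 1."""
    got = sorted((e["from"], e["to"]) for e in edges)
    want = [(i, i + 1) for i in range(num_stages - 1)]
    return got == want

pd_edges = [{"from": 0, "to": 1}, {"from": 1, "to": 2}, {"from": 2, "to": 3}]
assert edges_form_chain(pd_edges, 4)
```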
## 4. Launch with your custom config

```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
    --stage-configs-path /path/to/qwen3_omni_pd.yaml
```
## Operational Notes

- `MooncakeConnector` does not support heterogeneous TP sizes across the PD
  pair. Keep prefill and decode at the same `tensor_parallel_size`.
- If the thinker requires TP=2, both thinker stages must use TP=2 and be given
  separate GPU sets, for example `"0,1"` for prefill and `"2,3"` for decode.
- Choose connector ports and addresses that match your deployment. The values
  shown above are examples only.
Lines changed: 13 additions & 0 deletions

```python
"""Patched KV transfer connectors for PD disaggregation.

This package provides monkey-patched versions of vLLM's native KV transfer
connectors (e.g. MooncakeConnector) that fix the request-ID mismatch problem
in prefill-decode disaggregation.

vLLM's ``InputProcessor.assign_request_id()`` appends a random 8-char suffix
to each request ID internally. The prefill engine stores KV under its own
suffix, but the decode engine generates a *different* suffix, so it can never
find the KV data. The patched connector threads the prefill engine's internal
``remote_request_id`` through ``kv_transfer_params`` so the decode side can
reference the correct KV entry.
"""
```
Lines changed: 197 additions & 0 deletions

```python
"""Monkey-patch vLLM's MooncakeConnector to fix request-ID mismatch in PD disaggregation.

vLLM's InputProcessor appends a random suffix to each request ID. The prefill
engine stores KV under its suffix, but the decode engine generates a different
suffix. This patch threads ``remote_request_id`` through ``kv_transfer_params``
so the decode side references the correct KV entry.
"""

from __future__ import annotations

import logging
import sys
from dataclasses import dataclass
from typing import Any

logger = logging.getLogger(__name__)

_patched: bool = False


@dataclass
class PatchedRecvReqMeta:
    """Receive-request metadata carrying the prefill engine's request ID."""

    request_id: str
    remote_request_id: str
    local_block_ids: list[int]
    kv_transfer_params: dict[str, Any]


def _import_mooncake_module():
    """Import MooncakeConnector module, supporting both vLLM >=0.16 and older."""
    try:
        from vllm.distributed.kv_transfer.kv_connector.v1.mooncake import mooncake_connector

        return mooncake_connector
    except ImportError:
        pass
    try:
        from vllm.distributed.kv_transfer.kv_connector.v1 import mooncake_connector

        return mooncake_connector
    except ImportError:
        return None


def _create_patched_mooncake_connector():
    """Return a subclass of MooncakeConnector with remote_request_id support."""
    try:
        from vllm.distributed.kv_transfer.kv_connector.v1.mooncake.mooncake_connector import (
            MooncakeConnector as _OriginalMooncakeConnector,
        )
    except (ImportError, AttributeError):
        from vllm.distributed.kv_transfer.kv_connector.v1.mooncake_connector import (
            MooncakeConnector as _OriginalMooncakeConnector,
        )

    class PatchedMooncakeConnector(_OriginalMooncakeConnector):
        """Fixes request-ID mismatch in PD disaggregation by injecting
        remote_request_id on the prefill side and using it for KV lookup
        on the decode side.
        """

        def __init__(self, *args: Any, **kwargs: Any) -> None:
            super().__init__(*args, **kwargs)
            self.remote_to_local_req: dict[str, str] = {}
            logger.info("[PatchedMooncakeConnector] Initialized")

        def request_finished(
            self,
            request: Any,
            block_ids: list[int],
        ) -> tuple[bool, dict[str, Any] | None]:
            result = super().request_finished(request, block_ids)

            if isinstance(result, tuple) and len(result) == 2:
                delay_free, kv_params = result
            else:
                delay_free, kv_params = False, result

            # Normalise _reqs_need_send values
            req_id = getattr(request, "request_id", None)
            if req_id and hasattr(self, "_reqs_need_send"):
                entry = self._reqs_need_send.get(req_id)
                if isinstance(entry, tuple) and len(entry) == 2:
                    self._reqs_need_send[req_id] = entry[1]

            # Inject remote_request_id into kv_transfer_params
            if kv_params is not None and isinstance(kv_params, dict):
                kv_params["remote_request_id"] = req_id or "NOT_SET"
                if hasattr(self, "side_channel_host"):
                    kv_params.setdefault("remote_host", self.side_channel_host)
                if hasattr(self, "side_channel_port"):
                    kv_params.setdefault("remote_port", self.side_channel_port)

            return delay_free, kv_params

        def add_new_req(
            self,
            request_id: str,
            local_block_ids: list[int],
            kv_transfer_params: dict[str, Any] | None = None,
            **kwargs: Any,
        ) -> None:
            super().add_new_req(request_id, local_block_ids, kv_transfer_params, **kwargs)

            kv_transfer_params = kv_transfer_params or {}
            load_remote_cache = kv_transfer_params.get(
                "do_remote_prefill",
                kv_transfer_params.get("load_remote_cache", False),
            )

            if load_remote_cache:
                remote_request_id = kv_transfer_params.get("remote_request_id", request_id)
                meta = PatchedRecvReqMeta(
                    request_id=request_id,
                    remote_request_id=remote_request_id,
                    local_block_ids=local_block_ids,
                    kv_transfer_params=kv_transfer_params,
                )
                if not hasattr(self, "_reqs_need_recv"):
                    self._reqs_need_recv = {}
                self._reqs_need_recv[request_id] = meta

        def group_kv_pull(self, metadata: Any | None = None) -> None:
            """Use remote_request_id as ZMQ lookup key via save-patch-restore."""
            if not hasattr(self, "_reqs_need_recv") or not self._reqs_need_recv:
                return

            original_recv = self._reqs_need_recv.copy()
            patched_recv: dict[str, Any] = {}

            for local_id, meta in original_recv.items():
                if isinstance(meta, PatchedRecvReqMeta):
                    remote_id = meta.remote_request_id
                    self.remote_to_local_req[remote_id] = local_id
                    patched_meta = type(meta)(
                        request_id=remote_id,
                        remote_request_id=remote_id,
                        local_block_ids=meta.local_block_ids,
                        kv_transfer_params=meta.kv_transfer_params,
                    )
                    patched_recv[remote_id] = patched_meta
                else:
                    patched_recv[local_id] = meta

            self._reqs_need_recv = patched_recv
            super().group_kv_pull(metadata)

            # Restore unconsumed entries to original local keys
            for remote_id, local_id in list(self.remote_to_local_req.items()):
                if remote_id in self._reqs_need_recv:
                    entry = self._reqs_need_recv.pop(remote_id)
                    self._reqs_need_recv[local_id] = original_recv.get(local_id, entry)

        def receive_kv(self, path: Any = None, req_blocks: Any = None) -> Any:
            result = super().receive_kv(path, req_blocks)

            if self.remote_to_local_req:
                completed = [
                    rid
                    for rid, lid in self.remote_to_local_req.items()
                    if not hasattr(self, "_reqs_need_recv") or lid not in self._reqs_need_recv
                ]
                for remote_id in completed:
                    self.remote_to_local_req.pop(remote_id, None)

            return result

    PatchedMooncakeConnector.__qualname__ = _OriginalMooncakeConnector.__qualname__

    return PatchedMooncakeConnector


def apply_mooncake_connector_patch() -> bool:
    """Replace vLLM's MooncakeConnector with the patched version."""
    global _patched
    if _patched:
        return True

    _mc_module = _import_mooncake_module()
    if _mc_module is None:
        logger.warning("[monkey_patch] Cannot import MooncakeConnector — patch NOT applied.")
        return False

    _OriginalClass = _mc_module.MooncakeConnector

    PatchedClass = _create_patched_mooncake_connector()

    _mc_module.MooncakeConnector = PatchedClass
    # Snapshot sys.modules to avoid mutation-during-iteration errors if an
    # import happens concurrently while we rebind the class.
    for module in list(sys.modules.values()):
        if hasattr(module, "MooncakeConnector") and module.MooncakeConnector is _OriginalClass:
            module.MooncakeConnector = PatchedClass

    _patched = True
    logger.info("[monkey_patch] MooncakeConnector patch applied")
    return True
```
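The class-swap loop at the end of `apply_mooncake_connector_patch` follows a standard patch-once idiom: rebind the class in its home module, then in every module that re-imported it by value. A toy, self-contained sketch of the same pattern (all names here are illustrative, not vLLM's):

```python
import sys
import types

# Toy module holding the "original" class, standing in for vLLM's
# mooncake_connector module.
mod = types.ModuleType("toy_connector_mod")

class Original:
    pass

mod.Connector = Original
sys.modules["toy_connector_mod"] = mod

# A second module that re-imported the class by value (a from-import).
alias = types.ModuleType("toy_alias_mod")
alias.Connector = Original
sys.modules["toy_alias_mod"] = alias

class Patched(Original):
    pass

# Rebind the class everywhere it is still the original, mirroring the loop
# in apply_mooncake_connector_patch().
for m in list(sys.modules.values()):
    if getattr(m, "Connector", None) is Original:
        m.Connector = Patched

assert mod.Connector is Patched and alias.Connector is Patched
```

Subclassing (rather than replacing the class wholesale) keeps `isinstance` checks against the original class working after the patch.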
