# Disaggregated Inference for Omni-Modality Models

This guide explains how to configure and use distributed connectors
(`vllm_omni/distributed/omni_connectors`) in vllm-omni for multi-stage pipelines.

Backend-specific setup lives in separate docs:

- [SharedMemoryConnector](omni_connectors/shared_memory_connector.md)
- [MooncakeConnector](omni_connectors/mooncake_connector.md)
- [YuanrongConnector](omni_connectors/yuanrong_connector.md)

## Overview

Connectors enable data transfer between pipeline stages (e.g., Thinker -> Talker).
Current connectors operate in D2H2D (device-to-host-to-device) mode.
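
Conceptually, one transfer follows this path. The sketch below is illustrative only: `pickle` stands in for the connector's serializer, and plain Python objects stand in for device tensors.

```python
import pickle

# Producer stage: (1) D2H - the device tensor is copied into host memory
# (a plain dict stands in for that host-side buffer here).
host_payload = {"request_id": "req-1", "hidden_states": [0.1, 0.2, 0.3]}

# (2) Host transport - the payload is serialized and carried between
# processes or nodes (shared memory, TCP, or RDMA, depending on backend).
wire_bytes = pickle.dumps(host_payload)

# Consumer stage: (3) deserialize on the consumer's host, then (4) H2D -
# copy back onto the consumer's device (elided here).
restored = pickle.loads(wire_bytes)
assert restored["hidden_states"] == [0.1, 0.2, 0.3]
```

The extra host hop is what the planned D2D connectors aim to remove.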

## Connector Choices

| Use Case | Recommended Connector | Notes |
| :--- | :--- | :--- |
| Single node | SharedMemoryConnector | Auto-configured if no connector is specified. |
| Multi node (Mooncake) | MooncakeConnector | Requires Mooncake Master + metadata server. |
| Multi node (Yuanrong) | YuanrongConnector | Requires Yuanrong Datasystem + etcd. |

## Core API

The connector system is built around the `OmniConnectorBase` abstraction, which decouples data transport from stage logic.

```python
class OmniConnectorBase(ABC):
    @abstractmethod
    def put(self, from_stage: str, to_stage: str, put_key: str, data: Any) -> tuple[bool, int, Optional[dict]]:
        """
        Store data.
        Returns: (success, serialized_size, metadata)
        """
        pass

    @abstractmethod
    def get(self, from_stage: str, to_stage: str, get_key: str, metadata: Optional[dict] = None) -> Optional[tuple[Any, int]]:
        """
        Retrieve data.
        Args: metadata - transport-specific handles returned by put() (e.g., SHM name).
        Returns: (object, serialized_size)
        """
        pass
```

### Metadata Passing

Some connectors (e.g., SharedMemoryConnector) generate transient resources during
`put()`, such as a shared memory block name. This `metadata` must be passed through
the control plane (e.g., HTTP headers or queue messages) from the producer stage to
the consumer stage so `get()` can locate the data.
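
To make the contract concrete, here is a minimal, hypothetical in-memory connector. It is not one of the shipped backends: a plain dict stands in for SHM or Mooncake storage, and the slot-naming scheme is invented for illustration. It shows how `put()` returns metadata that the control plane must carry to `get()`.

```python
from abc import ABC, abstractmethod
from typing import Any, Optional
import pickle


class OmniConnectorBase(ABC):
    """Interface sketch matching the abstract methods shown above."""

    @abstractmethod
    def put(self, from_stage: str, to_stage: str, put_key: str,
            data: Any) -> tuple[bool, int, Optional[dict]]: ...

    @abstractmethod
    def get(self, from_stage: str, to_stage: str, get_key: str,
            metadata: Optional[dict] = None) -> Optional[tuple[Any, int]]: ...


class InMemoryConnector(OmniConnectorBase):
    """Toy backend: a dict stands in for shared memory / Mooncake storage."""

    def __init__(self) -> None:
        self._store: dict[str, bytes] = {}

    def put(self, from_stage, to_stage, put_key, data):
        blob = pickle.dumps(data)
        slot = f"{put_key}/{from_stage}_{to_stage}"  # illustrative slot name
        self._store[slot] = blob
        # The metadata (here, the slot name) must travel over the control
        # plane (HTTP headers, queue messages) to the consumer stage.
        return True, len(blob), {"slot": slot}

    def get(self, from_stage, to_stage, get_key, metadata=None):
        if metadata is None or metadata.get("slot") not in self._store:
            return None  # without metadata, the consumer cannot locate the data
        blob = self._store[metadata["slot"]]
        return pickle.loads(blob), len(blob)


# Producer stage:
conn = InMemoryConnector()
ok, size, meta = conn.put("thinker", "talker", "req-42", {"hidden": [1.0, 2.0]})
# ... meta crosses the control plane to the consumer stage ...
obj, _ = conn.get("thinker", "talker", "req-42", metadata=meta)
assert ok and obj == {"hidden": [1.0, 2.0]}
```

Note that calling `get()` without the metadata returns nothing, which is why metadata propagation is part of the contract rather than an optimization.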

## Configuration Model

Define connectors in `runtime.connectors`:

```yaml
runtime:
  connectors:
    connector_of_shared_memory:
      name: SharedMemoryConnector
      extra:
        shm_threshold_bytes: 65536  # payloads >= 64 KiB go through SHM
```

Wire stages to connectors with `input_connectors` and `output_connectors`:

```yaml
stage_args:
  - stage_id: 0
    output_connectors:
      to_stage_1: connector_of_shared_memory

  - stage_id: 1
    input_connectors:
      from_stage_0: connector_of_shared_memory
```

If a pipeline edge has no explicit connector, the system auto-creates a
SharedMemoryConnector for that edge, inferring edges from the `runtime.edges`
list and from `engine_input_source` dependencies in `stage_args`.
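
The resolution logic can be sketched as follows. The helper name and data shapes are hypothetical, not the real loader's API; the point is the fallback order and the fail-fast check.

```python
DEFAULT = "SharedMemoryConnector"


def resolve_connectors(edges, configured, registry, default=DEFAULT):
    """Map each pipeline edge (from_stage, to_stage) to a connector name.

    Explicitly configured edges use their connector; unconfigured edges
    fall back to the default SharedMemoryConnector (the auto-create
    behavior). Referencing an unknown connector fails at load time.
    """
    resolved = {}
    for edge in edges:
        name = configured.get(edge, default)
        # Fail fast: a referenced connector must exist in runtime.connectors.
        if name != default and name not in registry:
            raise ValueError(f"edge {edge}: unknown connector {name!r}")
        resolved[edge] = name
    return resolved


registry = {"connector_of_shared_memory": {"name": "SharedMemoryConnector"}}
edges = [("stage_0", "stage_1"), ("stage_1", "stage_2")]
configured = {("stage_0", "stage_1"): "connector_of_shared_memory"}

resolved = resolve_connectors(edges, configured, registry)
# stage_1 -> stage_2 was not configured, so it gets the default connector.
assert resolved[("stage_1", "stage_2")] == "SharedMemoryConnector"
```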

## Relationship with vLLM

vLLM provides specialized distributed mechanisms for specific artifacts:

- KV Transfer (`vllm.distributed.kv_transfer`): optimized for transferring KV caches between prefill and decode instances.
- EC Transfer (`vllm.distributed.ec_transfer`): optimized for sharing encoder embeddings.
- Device Communicators (`vllm.distributed.device_communicators`): low-level primitives (NCCL, SHM) for tensor/pipeline parallelism.

vllm-omni complements these with a generalized connector abstraction:

1. Unifies transport: a single `put`/`get` API moves any stage artifact (input embeddings, hidden states, audio/image tensors, KV cache, final output) between arbitrary stages.
2. Extends connectivity: stages can form DAG-style pipelines across processes or nodes, with the most appropriate transport per edge.
3. Wraps and adapts: KV paths can internally use vLLM's `kv_transfer` while other data types use generic transports, behind one consistent interface.

## Operational Notes

- Fail-fast config validation: the loader raises at startup if any expected edge lacks a connector. Define `input_connectors`/`output_connectors` explicitly or rely on auto-created SharedMemoryConnectors.
- Missing payloads halt the stage: workers expect connector payloads, so missing metadata or connector config raises and stops the stage. Verify connector wiring and metadata propagation before production.

## Future Roadmap: D2D Transport

Current connectors use a D2H2D path: tensors are staged in host memory for
transport, which incurs PCIe overhead. Future versions will introduce direct
device-to-device connectors (via NCCL, UCX, or IPC) to reduce latency for
large tensor payloads.