You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
[Mooncake](https://github.com/kvcache-ai/Mooncake) is a high-performance distributed KV cache transfer engine that enables efficient cross-node data movement via TCP or RDMA, making it ideal for multi-node disaggregated inference.
183
+
184
+
By default, BAGEL uses `SharedMemoryConnector` for inter-stage communication. You can switch to the Mooncake connector for better performance on multi-GPU setups and to enable multi-node deployment.
185
+
186
+
### Prerequisites
187
+
188
+
Install the Mooncake transfer engine:
189
+
190
+
```bash
191
+
# For CUDA-enabled systems (recommended)
192
+
pip install mooncake-transfer-engine
193
+
194
+
# For non-CUDA systems
195
+
pip install mooncake-transfer-engine-non-cuda
196
+
```
197
+
198
+
### Step 1: Start the Mooncake Master
199
+
200
+
On the **primary node**, start the Mooncake master service (run in a separate terminal or background with `&`):
201
+
202
+
```bash
203
+
# Optional: enable disk-backed storage by creating a directory and passing --root_fs_dir.
204
+
# Without it, Mooncake runs in memory-only mode, which is sufficient for KV cache transfer.
205
+
mkdir -p ./mc_storage
206
+
207
+
mooncake_master \
208
+
--rpc_port=50051 \
209
+
--enable_http_metadata_server=true \
210
+
--http_metadata_server_host=0.0.0.0 \
211
+
--http_metadata_server_port=8080 \
212
+
--metrics_port=9003 \
213
+
--root_fs_dir=./mc_storage/ \
214
+
--cluster_id=mc-local-1 &
215
+
```
216
+
217
+
### Step 2: Run Offline Inference with Mooncake
218
+
219
+
Use the provided Mooncake stage config [`bagel_multiconnector.yaml`](../../../vllm_omni/model_executor/stage_configs/bagel_multiconnector.yaml). Before launching, update the `metadata_server` and `master` addresses in the YAML to match your Mooncake master node's IP (use `127.0.0.1` for single-node testing).
For more details on the Mooncake connector and multi-node setup, see the [Mooncake Store Connector documentation](../../../docs/design/feature/omni_connectors/mooncake_store_connector.md).
245
+
246
+
------
247
+
180
248
## FAQ
181
249
182
250
- If you encounter an error about the backend of librosa, try to install ffmpeg with the command below.
By default, BAGEL uses `SharedMemoryConnector` for inter-stage communication. You can use the [Mooncake](https://github.com/kvcache-ai/Mooncake) connector to transfer KV cache between stages, which also enables multi-node deployment.
51
+
52
+
**1. Install Mooncake**
53
+
54
+
```bash
55
+
# For CUDA-enabled systems (recommended)
56
+
pip install mooncake-transfer-engine
57
+
58
+
# For non-CUDA systems
59
+
pip install mooncake-transfer-engine-non-cuda
60
+
```
61
+
62
+
**2. Start Mooncake Master** on the primary node:
63
+
64
+
```bash
65
+
# Optional: enable disk-backed storage by creating a directory and passing --root_fs_dir.
66
+
# Without it, Mooncake runs in memory-only mode, which is sufficient for KV cache transfer.
67
+
mkdir -p ./mc_storage
68
+
69
+
mooncake_master \
70
+
--rpc_port=50051 \
71
+
--enable_http_metadata_server=true \
72
+
--http_metadata_server_host=0.0.0.0 \
73
+
--http_metadata_server_port=8080 \
74
+
--metrics_port=9003 \
75
+
--root_fs_dir=./mc_storage/ \
76
+
--cluster_id=mc-local-1 &
77
+
```
78
+
79
+
**3. Launch the server** with the Mooncake stage config:
> **Note**: Before launching, edit [`bagel_multiconnector.yaml`](../../../vllm_omni/model_executor/stage_configs/bagel_multiconnector.yaml) and replace the `metadata_server` and `master` addresses with your Mooncake master node's actual IP. For single-node testing, `127.0.0.1` works.
87
+
88
+
The client-side usage is identical to the default setup -- the Mooncake connector is transparent to the API. See the requests section below.
89
+
90
+
For more details on the Mooncake connector configuration, see the [Mooncake Store Connector documentation](../../../docs/design/feature/omni_connectors/mooncake_store_connector.md).
91
+
92
+
#### Multi-Node Deployment
93
+
94
+
You can deploy each stage on a **separate node** for better resource utilization. In this example, the orchestrator (Stage 0 / Thinker) and Stage 1 (DiT) run on different machines, connected via Mooncake.
95
+
96
+
Replace `<ORCHESTRATOR_IP>` below with the actual IP address of your orchestrator node (e.g., `10.244.227.244`).
97
+
98
+
> [!WARNING]
99
+
> **Before launching**, edit [`bagel_multiconnector.yaml`](../../../vllm_omni/model_executor/stage_configs/bagel_multiconnector.yaml) and replace the `metadata_server` and `master` addresses with your Mooncake master node's actual IP. Mismatched addresses will cause silent connection failures.
100
+
101
+
**1. Start Mooncake Master** (on the orchestrator node):
102
+
103
+
```bash
104
+
mooncake_master \
105
+
--rpc_port=50051 \
106
+
--enable_http_metadata_server=true \
107
+
--http_metadata_server_host=<ORCHESTRATOR_IP> \
108
+
--http_metadata_server_port=8080 \
109
+
--metrics_port=9003
110
+
```
111
+
112
+
**2. Launch Stage 0 (Thinker / Orchestrator)** on the orchestrator node:
113
+
114
+
```bash
115
+
vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni \
116
+
--port 8000 \ # API server port for client requests
> **Tip**: If nodes are behind a firewall or in different VPCs/security groups, make sure the above ports are allowed in ingress/egress rules. All nodes should be reachable via their IP addresses (no NAT). Using nodes on the same subnet or VPC is recommended to minimize latency for Mooncake KV cache transfers.
0 commit comments