Commit 82e1bf2

[Doc]: Add bagel single/multi node usage with mooncake document (vllm-project#1450)
1 parent c0908dd commit 82e1bf2

2 files changed: +191 -0 lines changed

examples/offline_inference/bagel/README.md

Lines changed: 68 additions & 0 deletions
@@ -177,6 +177,74 @@ Example configuration for TP=2 on GPUs 0 and 1:
| `max_inflight` | `1` | Maximum inflight requests |
| `shm_threshold_bytes` | `65536` | Shared memory threshold (64KB) |

## Using Mooncake Connector

[Mooncake](https://github.com/kvcache-ai/Mooncake) is a high-performance distributed KV cache transfer engine that enables efficient cross-node data movement via TCP or RDMA, making it ideal for multi-node disaggregated inference.

By default, BAGEL uses `SharedMemoryConnector` for inter-stage communication. You can switch to the Mooncake connector for better performance on multi-GPU setups and to enable multi-node deployment.

### Prerequisites

Install the Mooncake transfer engine:

```bash
# For CUDA-enabled systems (recommended)
pip install mooncake-transfer-engine

# For non-CUDA systems
pip install mooncake-transfer-engine-non-cuda
```

### Step 1: Start the Mooncake Master

On the **primary node**, start the Mooncake master service (run in a separate terminal or background with `&`):

```bash
# Optional: enable disk-backed storage by creating a directory and passing --root_fs_dir.
# Without it, Mooncake runs in memory-only mode, which is sufficient for KV cache transfer.
mkdir -p ./mc_storage

mooncake_master \
    --rpc_port=50051 \
    --enable_http_metadata_server=true \
    --http_metadata_server_host=0.0.0.0 \
    --http_metadata_server_port=8080 \
    --metrics_port=9003 \
    --root_fs_dir=./mc_storage/ \
    --cluster_id=mc-local-1 &
```

### Step 2: Run Offline Inference with Mooncake

Use the provided Mooncake stage config [`bagel_multiconnector.yaml`](../../../vllm_omni/model_executor/stage_configs/bagel_multiconnector.yaml). Before launching, update the `metadata_server` and `master` addresses in the YAML to match your Mooncake master node's IP (use `127.0.0.1` for single-node testing).

```bash
cd examples/offline_inference/bagel

# Text to Image with Mooncake
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
    --modality text2img \
    --prompts "A cute cat" \
    --stage-configs-path ../../../vllm_omni/model_executor/stage_configs/bagel_multiconnector.yaml

# Image to Text with Mooncake
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
    --modality img2text \
    --image-path /path/to/image.jpg \
    --prompts "Describe this image" \
    --stage-configs-path ../../../vllm_omni/model_executor/stage_configs/bagel_multiconnector.yaml

# Text to Text with Mooncake
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
    --modality text2text \
    --prompts "What is the capital of France?" \
    --stage-configs-path ../../../vllm_omni/model_executor/stage_configs/bagel_multiconnector.yaml
```

For more details on the Mooncake connector and multi-node setup, see the [Mooncake Store Connector documentation](../../../docs/design/feature/omni_connectors/mooncake_store_connector.md).

------

## FAQ

- If you encounter an error about the backend of librosa, try to install ffmpeg with the command below.

examples/online_serving/bagel/README.md

Lines changed: 123 additions & 0 deletions
@@ -45,6 +45,129 @@ For larger models or multi-GPU environments, you can enable Tensor Parallelism (
vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni --port 8091 --stage-configs-path /path/to/your/custom_bagel.yaml
```

#### Using Mooncake Connector

By default, BAGEL uses `SharedMemoryConnector` for inter-stage communication. You can use the [Mooncake](https://github.com/kvcache-ai/Mooncake) connector to transfer KV cache between stages, which also enables multi-node deployment.

**1. Install Mooncake**

```bash
# For CUDA-enabled systems (recommended)
pip install mooncake-transfer-engine

# For non-CUDA systems
pip install mooncake-transfer-engine-non-cuda
```

**2. Start Mooncake Master** on the primary node:

```bash
# Optional: enable disk-backed storage by creating a directory and passing --root_fs_dir.
# Without it, Mooncake runs in memory-only mode, which is sufficient for KV cache transfer.
mkdir -p ./mc_storage

mooncake_master \
    --rpc_port=50051 \
    --enable_http_metadata_server=true \
    --http_metadata_server_host=0.0.0.0 \
    --http_metadata_server_port=8080 \
    --metrics_port=9003 \
    --root_fs_dir=./mc_storage/ \
    --cluster_id=mc-local-1 &
```

**3. Launch the server** with the Mooncake stage config:

```bash
vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni --port 8091 \
    --stage-configs-path vllm_omni/model_executor/stage_configs/bagel_multiconnector.yaml
```

> **Note**: Before launching, edit [`bagel_multiconnector.yaml`](../../../vllm_omni/model_executor/stage_configs/bagel_multiconnector.yaml) and replace the `metadata_server` and `master` addresses with your Mooncake master node's actual IP. For single-node testing, `127.0.0.1` works.

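The address edit described in the note can be sketched with two small shell helpers. This is only an illustration: `show_mooncake_addresses` and `rewrite_host` are hypothetical names, and the sketch assumes the `metadata_server` and `master` entries contain a plain host string that can be swapped textually. Inspect the output of the first helper before relying on the second.

```bash
#!/usr/bin/env bash
# Hypothetical helpers. The field names come from the note above;
# the layout of the YAML file itself is an assumption.

# Print every line of a stage config that mentions the Mooncake address fields.
# Usage: show_mooncake_addresses <config>
show_mooncake_addresses() {
    grep -nE 'metadata_server|master' "$1"
}

# Write a copy of the config with one host string swapped for another.
# The original file is left untouched.
# Usage: rewrite_host <config> <old_host> <new_host> <output>
rewrite_host() {
    local config=$1 old=$2 new=$3 out=$4
    local old_re
    # Escape dots so that an IP such as 127.0.0.1 matches literally.
    old_re=$(printf '%s' "$old" | sed 's/\./\\./g')
    sed "s/${old_re}/${new}/g" "$config" > "$out"
}
```

For example, `show_mooncake_addresses vllm_omni/model_executor/stage_configs/bagel_multiconnector.yaml` lists the lines to review, and `rewrite_host <config> <old_host> <new_host> my_bagel_mooncake.yaml` writes an edited copy (the output file name is arbitrary) that can then be passed to `--stage-configs-path`.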
The client-side usage is identical to the default setup; the Mooncake connector is transparent to the API. See the requests section below.

For more details on the Mooncake connector configuration, see the [Mooncake Store Connector documentation](../../../docs/design/feature/omni_connectors/mooncake_store_connector.md).

#### Multi-Node Deployment

You can deploy each stage on a **separate node** for better resource utilization. In this example, the orchestrator (Stage 0 / Thinker) and Stage 1 (DiT) run on different machines, connected via Mooncake.

Replace `<ORCHESTRATOR_IP>` below with the actual IP address of your orchestrator node (e.g., `10.244.227.244`).

> [!WARNING]
> **Before launching**, edit [`bagel_multiconnector.yaml`](../../../vllm_omni/model_executor/stage_configs/bagel_multiconnector.yaml) and replace the `metadata_server` and `master` addresses with your Mooncake master node's actual IP. Mismatched addresses will cause silent connection failures.

**1. Start Mooncake Master** (on the orchestrator node):

```bash
mooncake_master \
    --rpc_port=50051 \
    --enable_http_metadata_server=true \
    --http_metadata_server_host=<ORCHESTRATOR_IP> \
    --http_metadata_server_port=8080 \
    --metrics_port=9003
```

**2. Launch Stage 0 (Thinker / Orchestrator)** on the orchestrator node:

```bash
# --port is the API server port for client requests.
# (A comment placed after a trailing backslash would break the line continuation.)
vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni \
    --port 8000 \
    --stage-configs-path vllm_omni/model_executor/stage_configs/bagel_multiconnector.yaml \
    --stage-id 0 \
    -oma <ORCHESTRATOR_IP> \
    -omp 8091
```

**3. Launch Stage 1 (DiT)** on the remote node in headless mode:

```bash
vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni \
    --stage-configs-path vllm_omni/model_executor/stage_configs/bagel_multiconnector.yaml \
    --stage-id 1 \
    --headless \
    -oma <ORCHESTRATOR_IP> \
    -omp 8091
```

**Mooncake Master arguments:**

| Argument | Description |
| :------- | :---------- |
| `--rpc_port` | Mooncake RPC port for control-plane coordination between stages |
| `--enable_http_metadata_server` | Enable the HTTP metadata server for service discovery |
| `--http_metadata_server_host` | IP address to bind the metadata server (use the orchestrator node's IP) |
| `--http_metadata_server_port` | Port for the HTTP metadata server |
| `--metrics_port` | Port for Prometheus-compatible metrics endpoint |

**vllm serve arguments:**

| Argument | Description |
| :------- | :---------- |
| `--stage-id` | Which stage this process runs (0 = Thinker, 1 = DiT) |
| `--headless` | Run without the API server (worker-only mode) |
| `-oma` | Orchestrator master address |
| `-omp` | Orchestrator master port for Stage 1 to connect to Stage 0 for task coordination |

> [!IMPORTANT]
> **Startup Order**: Stage 0 (orchestrator) must be launched **before** Stage 1 (headless).
> Stage 0 will appear to hang on startup until Stage 1 (worker) connects. This is expected behavior.

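On the worker node, the startup order can be enforced with a small wait loop before launching Stage 1. The sketch below is not part of vLLM-Omni: `wait_for_port` is a hypothetical helper, it assumes `bash` and coreutils `timeout` are installed, and it assumes Stage 0 starts listening on the `-omp` port (8091 in this example) while it waits for the worker.

```bash
#!/usr/bin/env bash
# Sketch: block until a TCP port accepts connections, or give up after a deadline.

# Usage: wait_for_port <host> <port> [max_wait_seconds] [interval_seconds]
# Returns 0 as soon as a connection succeeds, 1 if the deadline passes first.
wait_for_port() {
    local host=$1 port=$2 max_wait=${3:-300} interval=${4:-5}
    local waited=0
    while [ "$waited" -lt "$max_wait" ]; do
        # bash's /dev/tcp pseudo-device opens a TCP connection;
        # timeout bounds each attempt to 3 seconds.
        if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
            return 0
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 1
}

# Example (on the worker node), with ORCHESTRATOR_IP set by you:
#   wait_for_port "$ORCHESTRATOR_IP" 8091 600 5 && <Stage 1 command from step 3>
```

The loop only checks that the port accepts a TCP connection. It does not verify that the process behind it is Stage 0, so it complements rather than replaces the startup order above.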
**Network Requirements**

All nodes must have network connectivity to each other. Ensure the following ports are open **between all participating nodes**:

| Port | Protocol | Service | Direction |
| :--- | :------- | :------ | :-------- |
| 50051 | TCP | Mooncake Master RPC | Worker → Orchestrator |
| 8080 | TCP | Mooncake HTTP Metadata Server | Worker → Orchestrator |
| 8091 | TCP | Orchestrator Master (`-omp`) | Worker → Orchestrator |
| 8000 | TCP | API Server (`--port`) | Client → Orchestrator |
| 9003 | TCP | Metrics (optional) | Monitoring → Orchestrator |

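The ports in the table can be probed from another node with a short script. This is a minimal sketch rather than an official tool: `port_open` and `check_orchestrator_ports` are hypothetical helpers, `ORCHESTRATOR_IP` is an environment variable introduced here for illustration, and the script assumes `bash` and coreutils `timeout` are available. Port numbers are the ones used in this example deployment.

```bash
#!/usr/bin/env bash
# Sketch: report which orchestrator ports accept TCP connections.

# Returns 0 if a TCP connection to <host>:<port> succeeds within 3 seconds.
port_open() {
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Prints one "<port> <service> <status>" line per port in the table.
check_orchestrator_ports() {
    local host=$1 entry port name
    for entry in "50051:mooncake-rpc" "8080:mooncake-metadata" \
                 "8091:orchestrator-master" "8000:api-server" "9003:metrics"; do
        port=${entry%%:*}   # text before the first colon
        name=${entry#*:}    # text after the first colon
        if port_open "$host" "$port"; then
            echo "$port $name open"
        else
            echo "$port $name unreachable"
        fi
    done
}

# Falls back to localhost when ORCHESTRATOR_IP is not set.
check_orchestrator_ports "${ORCHESTRATOR_IP:-127.0.0.1}"
```

A port shows as unreachable both when a firewall blocks it and when the service behind it has not started yet, so run the probe after the Mooncake master and Stage 0 are up.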
> **Tip**: If nodes are behind a firewall or in different VPCs/security groups, make sure the above ports are allowed in ingress/egress rules. All nodes should be reachable via their IP addresses (no NAT). Using nodes on the same subnet or VPC is recommended to minimize latency for Mooncake KV cache transfers.

### Send Multi-modal Request

Get into the bagel folder:
