
Commit fc07213

Azure-Tang authored and nickyc975 committed
Add vllm v1 mooncake benchmark and launch guide (kvcache-ai#1223)
* Add vllm v1 mooncake benchmark; add vllm v1 mooncake launch guide
* Rename file
* add description
* Update index.md and user guide
1 parent 752bdf7 · commit fc07213

File tree

6 files changed: +248 −0 lines changed
docs/source/getting_started/examples/vllm-integration-v1.md

Lines changed: 137 additions & 0 deletions

@@ -0,0 +1,137 @@
# vLLM v1 backend Disaggregated Serving with MooncakeConnector

## Overview

This guide demonstrates how to use the MooncakeConnector with the vLLM v1 backend for disaggregated serving in a Prefill-Decode (PD) separation architecture. The integration enables efficient cross-node KV cache transfer using RDMA.

For more details about Mooncake, please refer to the [Mooncake project](https://github.com/kvcache-ai/Mooncake) and the [Mooncake documents](https://kvcache-ai.github.io/Mooncake/).

## Installation

### Prerequisites

Install `mooncake-transfer-engine` through pip:

```bash
pip install mooncake-transfer-engine
```

Note: If you encounter problems such as missing `lib*.so` files, uninstall this package with `pip3 uninstall mooncake-transfer-engine` and build the binaries manually according to the [instructions](../build.md).

### Install vLLM

Refer to the [vLLM official installation guide](https://docs.vllm.ai/en/latest/getting_started/installation.html) for the latest installation instructions.
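For a typical Linux + CUDA environment, installing vLLM is a single pip command; see the guide linked above for platform-specific builds and version pinning:

```bash
pip install vllm
```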
## Usage

### Basic Setup (Different Nodes)

#### Prefiller Node (192.168.0.2)

```bash
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --port 8010 \
    --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}'
```

#### Decoder Node (192.168.0.3)

```bash
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --port 8020 \
    --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer"}'
```

#### Proxy Server

```bash
# In the vLLM root directory.
python tests/v1/kv_connector/nixl_integration/toy_proxy_server.py \
    --prefiller-host 192.168.0.2 --prefiller-port 8010 \
    --decoder-host 192.168.0.3 --decoder-port 8020
```

> NOTE: The Mooncake Connector currently uses the proxy from `nixl_integration`. This will be replaced with a self-developed proxy in the future.

Now you can send requests to the proxy server through port 8000.

#### Test

```bash
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "user", "content": "Tell me a long story about artificial intelligence."}
    ]
  }'
```

### Advanced Configuration

#### With Tensor Parallelism

**Prefiller:**

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --port 8010 \
    --tensor-parallel-size 8 \
    --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}'
```

**Decoder:**

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --port 8020 \
    --tensor-parallel-size 8 \
    --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer"}'
```

#### Configuration Parameters

- `--kv-transfer-config`: JSON string that configures the KV transfer connector (see the example below)
  - `kv_connector`: Set to `"MooncakeConnector"`
  - `kv_role`: Role of the instance
    - `kv_producer`: for prefiller instances that generate KV caches
    - `kv_consumer`: for decoder instances that consume KV caches
    - `kv_both`: enables symmetric functionality (experimental)
  - `num_workers`: Thread pool size in each prefiller worker for sending KV cache (default: 10)
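Putting these fields together, the sketch below launches a prefiller with a larger send-side thread pool. Treat the key placement as an assumption: depending on your vLLM version, connector-specific options such as `num_workers` may need to be nested under vLLM's generic `kv_connector_extra_config` field, as shown here; consult your connector version for the exact placement.

```bash
# Sketch only: verify where your connector version expects num_workers.
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --port 8010 \
    --kv-transfer-config \
    '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer","kv_connector_extra_config":{"num_workers":16}}'
```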
## Environment Variables

The following environment variables can be used to customize Mooncake behavior:

- `VLLM_MOONCAKE_BOOTSTRAP_PORT`: Port for the Mooncake bootstrap server
  - Default: 8998
  - Required only for prefiller instances
  - Each vLLM worker needs a unique port on its host
  - For TP/DP deployments, each worker's port is computed as `base_port + dp_rank * tp_size + tp_rank` (see the sketch below)
- `VLLM_MOONCAKE_ABORT_REQUEST_TIMEOUT`: Timeout (in seconds) for automatically releasing KV cache
  - Default: 480
  - Used when a request is aborted, to prevent holding resources indefinitely
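For example, a TP=8, DP=1 prefiller with the default base port occupies eight consecutive ports on its host. This minimal sketch just evaluates the formula above:

```bash
base_port=8998   # default VLLM_MOONCAKE_BOOTSTRAP_PORT
tp_size=8; dp_size=1

for ((dp_rank = 0; dp_rank < dp_size; dp_rank++)); do
  for ((tp_rank = 0; tp_rank < tp_size; tp_rank++)); do
    # One bootstrap port per worker, unique on this host.
    echo "dp_rank=${dp_rank} tp_rank=${tp_rank} -> port $((base_port + dp_rank * tp_size + tp_rank))"
  done
done
# Prints ports 8998 through 9005.
```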
## Performance

For detailed performance benchmarks and results, see the [vLLM Benchmark](../../performance/vllm_benchmark.md) documentation.

## Notes

- Tensor parallelism (TP) is supported for both prefiller and decoder instances
- The proxy server should typically run on the decoder node
- Ensure network connectivity between the prefiller and decoder nodes for RDMA transfer
- For production deployments, consider using a more robust proxy solution

## Troubleshooting

- If you encounter connection issues, check that (a scripted version follows this list):
  - All nodes can reach each other over the network
  - Firewall rules allow traffic on the specified ports
  - RDMA devices are properly configured
- For missing library errors, rebuild `mooncake-transfer-engine` from source
- Enable debug logging with `VLLM_LOGGING_LEVEL=DEBUG` for detailed diagnostics
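A few of the connection checks can be scripted from the prefiller node. The host and ports below are placeholders taken from the examples above; substitute your own, and note that `nc` and `ibv_devinfo` (from rdma-core) must be installed:

```bash
DECODER_HOST=192.168.0.3   # placeholder from the example deployment above

ping -c 3 "$DECODER_HOST"             # basic reachability
nc -zv "$DECODER_HOST" 8020           # decoder API port reachable?
nc -zv "$DECODER_HOST" 8000           # proxy port reachable?
ibv_devinfo | grep -E 'hca_id|state'  # RDMA devices present and PORT_ACTIVE?
```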
Three benchmark figures (182 KB, 135 KB, 158 KB) were added as binary image files; they are the images referenced in the benchmark document below.

docs/source/index.md

Lines changed: 3 additions & 0 deletions
@@ -27,6 +27,7 @@ This repository also hosts its technical report and the open sourced traces.

<h2 id="updates">🔄 Updates</h2>

+- **Dec 18, 2025**: Mooncake has now implemented a vLLM connector, enabling direct support for the Prefill-Decode (PD) separation architecture in vLLM v1.
- **Sept 10, 2025**: SGLang officially supports Mooncake Store as a [hierarchical KV caching storage backend](https://lmsys.org/blog/2025-09-10-sglang-hicache/). The integration extends RadixAttention with multi-tier KV cache storage across device, host, and remote storage layers.
- **Sept 10, 2025**: The official & high-performance version of Mooncake P2P Store is open-sourced as [checkpoint-engine](https://github.com/MoonshotAI/checkpoint-engine/). It has been successfully applied in K1.5 and K2 production training, updating Kimi-K2 model (1T parameters) across thousands of GPUs in ~20s.
- **Aug 23, 2025**: [xLLM](https://github.com/jd-opensource/xllm) high-performance inference engine builds hybrid KV cache management based on Mooncake, supporting global KV cache management with intelligent offloading and prefetching.

@@ -61,6 +62,7 @@ getting_started/plugin-usage/3FS-USRBIO-Plugin
getting_started/examples/lmcache-integration
getting_started/examples/lmdeploy-integration-v0.9
getting_started/examples/sglang-integration-v1
+getting_started/examples/vllm-integration-v1
getting_started/examples/sglang-integration/index
getting_started/examples/vllm-integration/index
:::

@@ -75,6 +77,7 @@ performance/sglang-benchmark-results-v1
performance/vllm-benchmark-results-v0.2
performance/vllm-benchmark-results-v1
performance/sglang-hicache-benchmark-results-v1
+performance/vllm-v1-support-benchmark
performance/allocator-benchmark-result.md
:::

docs/source/performance/vllm-v1-support-benchmark.md

Lines changed: 108 additions & 0 deletions

@@ -0,0 +1,108 @@
# vLLM with Mooncake Transfer Engine Benchmark

Mooncake has now implemented a vLLM connector, enabling direct support for the Prefill-Decode (PD) separation architecture in vLLM v1. We evaluated the performance of this integration, focusing on the efficiency of cross-node KV cache transfer using RDMA.

## Benchmark Results

### Bandwidth Performance

We measured the actual transfer bandwidth during the execution of requests with varying prompt lengths.

![KV Transfer Bandwidth (Actual)](../image/vllm_benchmark_actual_bandwidth.png)

In a 1P1D (1 Prefiller, 1 Decoder) configuration using the Qwen3-8B model, Mooncake achieved a peak actual transfer bandwidth of **142.25 GB/s**. Given the theoretical maximum bandwidth of approximately 200 GB/s for the 8x RoCE connections, this represents a **71.1% bandwidth utilization rate**. This efficiency demonstrates that the custom transfer protocol and GPU Direct RDMA can effectively saturate high-performance networks.

### End-to-End Latency (TTFT)

We analyzed the Time To First Token (TTFT) to understand the impact of KV transfer overhead on end-to-end latency.

![TTFT Breakdown](../image/vllm_benchmark_ttft_breakdown.png)

![Transfer Time vs KV Size](../image/vllm_benchmark_transfer_time.png)

The results show that Mooncake's high-speed transfer makes the overhead of moving the KV cache negligible compared to the computation time. For a prompt length of 32,768 tokens (transferring 4.50 GB of data), the actual KV transfer took only **31.65 ms**, accounting for merely **4.2%** of the total TTFT.
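To put the 4.50 GB figure in context: the KV cache grows linearly with prompt length, at roughly 144 KB per token for this model. The back-of-the-envelope check below assumes Qwen3-8B's published configuration (36 layers, 8 KV heads, head dim 128) and bf16 KV caches; these config values are assumptions taken from the model card, not from this report:

```bash
awk 'BEGIN {
  # K and V planes x layers x kv_heads x head_dim x 2 bytes (bf16)
  per_token = 2 * 36 * 8 * 128 * 2
  printf "%d bytes/token\n", per_token
  printf "32768 tokens -> %.2f GiB\n", per_token * 32768 / 2^30   # ~4.50, matching the table below
}'
```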
**Detailed Performance Data:**

| Prompt Length | Mean TTFT (ms) | KV Size | Actual Transfer Time (ms) | Actual Bandwidth (GB/s) | Bandwidth Utilization |
|---------------|----------------|---------|---------------------------|-------------------------|-----------------------|
| 128 tokens    | 46.09          | 20 MB   | 0.54                      | 36.53                   | 18.3%                 |
| 256 tokens    | 48.04          | 38 MB   | 0.61                      | 60.78                   | 30.4%                 |
| 512 tokens    | 59.91          | 74 MB   | 0.92                      | 78.73                   | 39.4%                 |
| 1024 tokens   | 67.29          | 146 MB  | 1.50                      | 95.23                   | 47.6%                 |
| 2048 tokens   | 85.31          | 290 MB  | 2.51                      | 112.88                  | 56.4%                 |
| 4096 tokens   | 124.42         | 578 MB  | 4.75                      | 119.00                  | 59.5%                 |
| 8192 tokens   | 212.05         | 1.13 GB | 8.84                      | 127.57                  | 63.8%                 |
| 16384 tokens  | 387.52         | 2.25 GB | 16.43                     | 137.09                  | 68.5%                 |
| 32768 tokens  | 749.62         | 4.50 GB | 31.65                     | 142.25                  | 71.1%                 |
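The derived columns follow directly from the measured values; here is a quick sanity check of the last row, using the ~200 GB/s theoretical peak quoted above:

```bash
awk 'BEGIN {
  peak_gbps = 200.0                                     # theoretical peak of the 8x RoCE links
  kv_gb = 4.50; transfer_ms = 31.65; ttft_ms = 749.62   # 32768-token row

  bw = kv_gb / (transfer_ms / 1000)                     # ~142.2 GB/s
  printf "bandwidth:   %.2f GB/s\n", bw
  printf "utilization: %.1f%%\n", bw / peak_gbps * 100        # ~71.1%
  printf "TTFT share:  %.1f%%\n", transfer_ms / ttft_ms * 100 # ~4.2%
}'
```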
## Benchmark Setup

### H800 Cluster

Experimental environment:

- **Hardware Configuration**: NVIDIA H800 (81GB) x 16 (8 per node), 8x Mellanox ConnectX-7 (RoCE over Ethernet)
- **Topology**: Prefiller and Decoder connected via RoCE
- **Model**: Qwen3-8B
- **vLLM Version**: 0.11.2.dev358
- **KV Connector**: MooncakeConnector
- **Transfer Method**: Cross-Node RDMA
### Launch Commands

**Prefiller:**

```bash
VLLM_LOGGING_LEVEL=DEBUG CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -m vllm.entrypoints.openai.api_server \
    --model /work/models/Qwen3-8B \
    --host 0.0.0.0 --port 8010 \
    --tensor-parallel-size 8 \
    --no-enable-prefix-caching \
    --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}'
```

**Decoder:**

```bash
VLLM_LOGGING_LEVEL=DEBUG CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -m vllm.entrypoints.openai.api_server \
    --model /work/models/Qwen3-8B \
    --host 0.0.0.0 --port 8020 \
    --tensor-parallel-size 8 \
    --no-enable-prefix-caching \
    --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer"}'
```

**Proxy (Decoder Node):**

```bash
python tests/v1/kv_connector/nixl_integration/toy_proxy_server.py \
    --host 0.0.0.0 --port 8000 \
    --prefiller-host 10.0.28.193 --prefiller-port 8010 \
    --decoder-host 10.0.28.202 --decoder-port 8020
```
### Benchmark Script

We used `vllm bench serve` to generate traffic with varying prompt lengths.

```bash
for prompt_len in 128 256 512 1024 2048 4096 8192 16384 32768; do
  vllm bench serve \
    --model /work/models/Qwen3-8B \
    --num-prompts 50 \
    --random-input-len ${prompt_len} \
    --random-output-len 128 \
    --base-url http://127.0.0.1:8000 \
    --backend openai-chat \
    --endpoint /v1/chat/completions \
    --max-concurrency 1 \
    --dataset-name random
done
```

By the Mooncake Team

© Copyright 2025, Mooncake Team.
