
Commit fc07213

Azure-Tang authored and nickyc975 committed
Add vllm v1 mooncake benchmark and launch guide (kvcache-ai#1223)
* Add vllm v1 mooncake benchmark; add vllm v1 mooncake launch guide
* Rename file
* add description
* Update index.md and user guide
1 parent 752bdf7 · commit fc07213

File tree

6 files changed: +248 −0 lines changed
docs/source/getting_started/examples/vllm-integration-v1.md

Lines changed: 137 additions & 0 deletions

@@ -0,0 +1,137 @@
# vLLM v1 backend Disaggregated Serving with MooncakeConnector

## Overview

This guide demonstrates how to use the MooncakeConnector with the vLLM v1 backend for disaggregated serving in a Prefill-Decode (PD) separation architecture. The integration enables efficient cross-node KV cache transfer using RDMA.

For more details about Mooncake, please refer to the [Mooncake project](https://github.com/kvcache-ai/Mooncake) and the [Mooncake documents](https://kvcache-ai.github.io/Mooncake/).

## Installation

### Prerequisites

Install `mooncake-transfer-engine` through pip:

```bash
pip install mooncake-transfer-engine
```

Note: If you encounter problems such as missing `lib*.so` files, uninstall this package with `pip3 uninstall mooncake-transfer-engine` and build the binaries manually according to the [instructions](../build.md).

### Install vLLM

Refer to the [vLLM official installation guide](https://docs.vllm.ai/en/latest/getting_started/installation.html) for the latest installation instructions.
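For a typical Linux + CUDA environment, installing vLLM is a single pip command; see the guide linked above for platform-specific builds and version pinning:

```bash
pip install vllm
```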
## Usage

### Basic Setup (Different Nodes)

#### Prefiller Node (192.168.0.2)

```bash
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --port 8010 \
    --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}'
```

#### Decoder Node (192.168.0.3)

```bash
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --port 8020 \
    --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer"}'
```

#### Proxy Server

```bash
# In the vLLM root directory.
python tests/v1/kv_connector/nixl_integration/toy_proxy_server.py \
    --prefiller-host 192.168.0.2 --prefiller-port 8010 \
    --decoder-host 192.168.0.3 --decoder-port 8020
```

> NOTE: The Mooncake Connector currently uses the proxy from `nixl_integration`. This will be replaced with a self-developed proxy in the future.

Now you can send requests to the proxy server through port 8000.

#### Test

```bash
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "user", "content": "Tell me a long story about artificial intelligence."}
    ]
  }'
```

### Advanced Configuration

#### With Tensor Parallelism

**Prefiller:**

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --port 8010 \
    --tensor-parallel-size 8 \
    --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}'
```

**Decoder:**

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --port 8020 \
    --tensor-parallel-size 8 \
    --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer"}'
```

#### Configuration Parameters

- `--kv-transfer-config`: JSON string that configures the KV transfer connector (see the example below)
  - `kv_connector`: Set to `"MooncakeConnector"`
  - `kv_role`: Role of the instance
    - `kv_producer`: for prefiller instances that generate KV caches
    - `kv_consumer`: for decoder instances that consume KV caches
    - `kv_both`: enables symmetric functionality (experimental)
  - `num_workers`: Thread pool size in each prefiller worker for sending KV cache (default: 10)
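Putting these fields together, the sketch below launches a prefiller with a larger send-side thread pool. Treat the key placement as an assumption: depending on your vLLM version, connector-specific options such as `num_workers` may need to be nested under vLLM's generic `kv_connector_extra_config` field, as shown here; consult your connector version for the exact placement.

```bash
# Sketch only: verify where your connector version expects num_workers.
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --port 8010 \
    --kv-transfer-config \
    '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer","kv_connector_extra_config":{"num_workers":16}}'
```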
## Environment Variables

The following environment variables can be used to customize Mooncake behavior:

- `VLLM_MOONCAKE_BOOTSTRAP_PORT`: Port for the Mooncake bootstrap server
  - Default: 8998
  - Required only for prefiller instances
  - Each vLLM worker needs a unique port on its host
  - For TP/DP deployments, each worker's port is computed as `base_port + dp_rank * tp_size + tp_rank` (see the sketch below)
- `VLLM_MOONCAKE_ABORT_REQUEST_TIMEOUT`: Timeout (in seconds) for automatically releasing KV cache
  - Default: 480
  - Used when a request is aborted, to prevent holding resources indefinitely
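For example, a TP=8, DP=1 prefiller with the default base port occupies eight consecutive ports on its host. This minimal sketch just evaluates the formula above:

```bash
base_port=8998   # default VLLM_MOONCAKE_BOOTSTRAP_PORT
tp_size=8; dp_size=1

for ((dp_rank = 0; dp_rank < dp_size; dp_rank++)); do
  for ((tp_rank = 0; tp_rank < tp_size; tp_rank++)); do
    # One bootstrap port per worker, unique on this host.
    echo "dp_rank=${dp_rank} tp_rank=${tp_rank} -> port $((base_port + dp_rank * tp_size + tp_rank))"
  done
done
# Prints ports 8998 through 9005.
```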
## Performance

For detailed performance benchmarks and results, see the [vLLM Benchmark](../../performance/vllm_benchmark.md) documentation.

## Notes

- Tensor parallelism (TP) is supported for both prefiller and decoder instances
- The proxy server should typically run on the decoder node
- Ensure network connectivity between the prefiller and decoder nodes for RDMA transfer
- For production deployments, consider using a more robust proxy solution

## Troubleshooting

- If you encounter connection issues, check that (a scripted version follows this list):
  - All nodes can reach each other over the network
  - Firewall rules allow traffic on the specified ports
  - RDMA devices are properly configured
- For missing library errors, rebuild `mooncake-transfer-engine` from source
- Enable debug logging with `VLLM_LOGGING_LEVEL=DEBUG` for detailed diagnostics
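A few of the connection checks can be scripted from the prefiller node. The host and ports below are placeholders taken from the examples above; substitute your own, and note that `nc` and `ibv_devinfo` (from rdma-core) must be installed:

```bash
DECODER_HOST=192.168.0.3   # placeholder from the example deployment above

ping -c 3 "$DECODER_HOST"             # basic reachability
nc -zv "$DECODER_HOST" 8020           # decoder API port reachable?
nc -zv "$DECODER_HOST" 8000           # proxy port reachable?
ibv_devinfo | grep -E 'hca_id|state'  # RDMA devices present and PORT_ACTIVE?
```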
Three benchmark figures (182 KB, 135 KB, 158 KB) were added as binary image files; they are the images referenced in the benchmark document below.

docs/source/index.md

Lines changed: 3 additions & 0 deletions
@@ -27,6 +27,7 @@ This repository also hosts its technical report and the open sourced traces.

<h2 id="updates">🔄 Updates</h2>

+- **Dec 18, 2025**: Mooncake has now implemented a vLLM connector, enabling direct support for the Prefill-Decode (PD) separation architecture in vLLM v1.
- **Sept 10, 2025**: SGLang officially supports Mooncake Store as a [hierarchical KV caching storage backend](https://lmsys.org/blog/2025-09-10-sglang-hicache/). The integration extends RadixAttention with multi-tier KV cache storage across device, host, and remote storage layers.
- **Sept 10, 2025**: The official & high-performance version of Mooncake P2P Store is open-sourced as [checkpoint-engine](https://github.com/MoonshotAI/checkpoint-engine/). It has been successfully applied in K1.5 and K2 production training, updating Kimi-K2 model (1T parameters) across thousands of GPUs in ~20s.
- **Aug 23, 2025**: [xLLM](https://github.com/jd-opensource/xllm) high-performance inference engine builds hybrid KV cache management based on Mooncake, supporting global KV cache management with intelligent offloading and prefetching.

@@ -61,6 +62,7 @@ getting_started/plugin-usage/3FS-USRBIO-Plugin
getting_started/examples/lmcache-integration
getting_started/examples/lmdeploy-integration-v0.9
getting_started/examples/sglang-integration-v1
+getting_started/examples/vllm-integration-v1
getting_started/examples/sglang-integration/index
getting_started/examples/vllm-integration/index
:::

@@ -75,6 +77,7 @@ performance/sglang-benchmark-results-v1
performance/vllm-benchmark-results-v0.2
performance/vllm-benchmark-results-v1
performance/sglang-hicache-benchmark-results-v1
+performance/vllm-v1-support-benchmark
performance/allocator-benchmark-result.md
:::

docs/source/performance/vllm-v1-support-benchmark.md

Lines changed: 108 additions & 0 deletions

@@ -0,0 +1,108 @@
# vLLM with Mooncake Transfer Engine Benchmark

Mooncake has now implemented a vLLM connector, enabling direct support for the Prefill-Decode (PD) separation architecture in vLLM v1. We evaluated the performance of this integration, focusing on the efficiency of cross-node KV cache transfer using RDMA.

## Benchmark Results

### Bandwidth Performance

We measured the actual transfer bandwidth during the execution of requests with varying prompt lengths.

![KV Transfer Bandwidth (Actual)](../image/vllm_benchmark_actual_bandwidth.png)

In a 1P1D (1 Prefiller, 1 Decoder) configuration using the Qwen3-8B model, Mooncake achieved a peak actual transfer bandwidth of **142.25 GB/s**. Given the theoretical maximum bandwidth of approximately 200 GB/s for the 8x RoCE connections, this represents a **71.1% bandwidth utilization rate**. This efficiency demonstrates that the custom transfer protocol and GPU Direct RDMA can effectively saturate high-performance networks.

### End-to-End Latency (TTFT)

We analyzed the Time To First Token (TTFT) to understand the impact of KV transfer overhead on end-to-end latency.

![TTFT Breakdown](../image/vllm_benchmark_ttft_breakdown.png)

![Transfer Time vs KV Size](../image/vllm_benchmark_transfer_time.png)

The results show that Mooncake's high-speed transfer makes the overhead of moving the KV cache negligible compared to the computation time. For a prompt length of 32,768 tokens (transferring 4.50 GB of data), the actual KV transfer took only **31.65 ms**, accounting for merely **4.2%** of the total TTFT.
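To put the 4.50 GB figure in context: the KV cache grows linearly with prompt length, at roughly 144 KB per token for this model. The back-of-the-envelope check below assumes Qwen3-8B's published configuration (36 layers, 8 KV heads, head dim 128) and bf16 KV caches; these config values are assumptions taken from the model card, not from this report:

```bash
awk 'BEGIN {
  # K and V planes x layers x kv_heads x head_dim x 2 bytes (bf16)
  per_token = 2 * 36 * 8 * 128 * 2
  printf "%d bytes/token\n", per_token
  printf "32768 tokens -> %.2f GiB\n", per_token * 32768 / 2^30   # ~4.50, matching the table below
}'
```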
**Detailed Performance Data:**

| Prompt Length | Mean TTFT (ms) | KV Size | Actual Transfer Time (ms) | Actual Bandwidth (GB/s) | Bandwidth Utilization |
|---------------|----------------|---------|---------------------------|-------------------------|-----------------------|
| 128 tokens    | 46.09          | 20 MB   | 0.54                      | 36.53                   | 18.3%                 |
| 256 tokens    | 48.04          | 38 MB   | 0.61                      | 60.78                   | 30.4%                 |
| 512 tokens    | 59.91          | 74 MB   | 0.92                      | 78.73                   | 39.4%                 |
| 1024 tokens   | 67.29          | 146 MB  | 1.50                      | 95.23                   | 47.6%                 |
| 2048 tokens   | 85.31          | 290 MB  | 2.51                      | 112.88                  | 56.4%                 |
| 4096 tokens   | 124.42         | 578 MB  | 4.75                      | 119.00                  | 59.5%                 |
| 8192 tokens   | 212.05         | 1.13 GB | 8.84                      | 127.57                  | 63.8%                 |
| 16384 tokens  | 387.52         | 2.25 GB | 16.43                     | 137.09                  | 68.5%                 |
| 32768 tokens  | 749.62         | 4.50 GB | 31.65                     | 142.25                  | 71.1%                 |
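The derived columns follow directly from the measured values; here is a quick sanity check of the last row, using the ~200 GB/s theoretical peak quoted above:

```bash
awk 'BEGIN {
  peak_gbps = 200.0                                     # theoretical peak of the 8x RoCE links
  kv_gb = 4.50; transfer_ms = 31.65; ttft_ms = 749.62   # 32768-token row

  bw = kv_gb / (transfer_ms / 1000)                     # ~142.2 GB/s
  printf "bandwidth:   %.2f GB/s\n", bw
  printf "utilization: %.1f%%\n", bw / peak_gbps * 100        # ~71.1%
  printf "TTFT share:  %.1f%%\n", transfer_ms / ttft_ms * 100 # ~4.2%
}'
```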
## Benchmark Setup

### H800 Cluster

Experimental environment:

- **Hardware Configuration**: NVIDIA H800 (81GB) x 16 (8 per node), 8x Mellanox ConnectX-7 (RoCE over Ethernet)
- **Topology**: Prefiller and Decoder connected via RoCE
- **Model**: Qwen3-8B
- **vLLM Version**: 0.11.2.dev358
- **KV Connector**: MooncakeConnector
- **Transfer Method**: Cross-Node RDMA
### Launch Commands

**Prefiller:**

```bash
VLLM_LOGGING_LEVEL=DEBUG CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -m vllm.entrypoints.openai.api_server \
    --model /work/models/Qwen3-8B \
    --host 0.0.0.0 --port 8010 \
    --tensor-parallel-size 8 \
    --no-enable-prefix-caching \
    --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}'
```

**Decoder:**

```bash
VLLM_LOGGING_LEVEL=DEBUG CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -m vllm.entrypoints.openai.api_server \
    --model /work/models/Qwen3-8B \
    --host 0.0.0.0 --port 8020 \
    --tensor-parallel-size 8 \
    --no-enable-prefix-caching \
    --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer"}'
```

**Proxy (Decoder Node):**

```bash
python tests/v1/kv_connector/nixl_integration/toy_proxy_server.py \
    --host 0.0.0.0 --port 8000 \
    --prefiller-host 10.0.28.193 --prefiller-port 8010 \
    --decoder-host 10.0.28.202 --decoder-port 8020
```
### Benchmark Script

We used `vllm bench serve` to generate traffic with varying prompt lengths.

```bash
for prompt_len in 128 256 512 1024 2048 4096 8192 16384 32768; do
  vllm bench serve \
    --model /work/models/Qwen3-8B \
    --num-prompts 50 \
    --random-input-len ${prompt_len} \
    --random-output-len 128 \
    --base-url http://127.0.0.1:8000 \
    --backend openai-chat \
    --endpoint /v1/chat/completions \
    --max-concurrency 1 \
    --dataset-name random
done
```

By the Mooncake Team

© Copyright 2025, Mooncake Team.
