| 1 | +# vLLM v1 backend Disaggregated Serving with MooncakeConnector |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +This guide shows how to use MooncakeConnector with the vLLM v1 backend for disaggregated serving in a prefill-decode (PD) separation architecture. The integration transfers KV cache across nodes efficiently over RDMA. |
| 6 | + |
| 7 | +For more details about Mooncake, see the [Mooncake project](https://github.com/kvcache-ai/Mooncake) and the [Mooncake documentation](https://kvcache-ai.github.io/Mooncake/). |
| 8 | + |
| 9 | +## Installation |
| 10 | + |
| 11 | +### Prerequisites |
| 12 | + |
| 13 | +Install mooncake-transfer-engine through pip: |
| 14 | + |
| 15 | +```bash |
| 16 | +pip install mooncake-transfer-engine |
| 17 | +``` |
| 18 | + |
| 19 | +Note: If you hit errors such as a missing `lib*.so`, uninstall the package with `pip3 uninstall mooncake-transfer-engine` and build the binaries from source by following the [build instructions](../build.md). |
| 20 | + |
| 21 | +### Install vLLM |
| 22 | + |
| 23 | +Refer to the [official vLLM installation guide](https://docs.vllm.ai/en/latest/getting_started/installation.html) for the latest installation instructions. |
| 24 | + |
| 25 | +## Usage |
| 26 | + |
| 27 | +### Basic Setup (Different Nodes) |
| 28 | + |
| 29 | +#### Prefiller Node (192.168.0.2) |
| 30 | + |
| 31 | +```bash |
| 32 | +vllm serve Qwen/Qwen2.5-7B-Instruct \ |
| 33 | + --port 8010 \ |
| 34 | + --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}' |
| 35 | +``` |
| 36 | + |
| 37 | +#### Decoder Node (192.168.0.3) |
| 38 | + |
| 39 | +```bash |
| 40 | +vllm serve Qwen/Qwen2.5-7B-Instruct \ |
| 41 | + --port 8020 \ |
| 42 | + --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer"}' |
| 43 | +``` |
| 44 | + |
| 45 | +#### Proxy Server |
| 46 | + |
| 47 | +```bash |
| 48 | +# In vllm root directory. |
| 49 | +python tests/v1/kv_connector/nixl_integration/toy_proxy_server.py \ |
| 50 | + --prefiller-host 192.168.0.2 --prefiller-port 8010 \ |
| 51 | + --decoder-host 192.168.0.3 --decoder-port 8020 |
| 52 | +``` |
| 53 | + |
| 54 | +> NOTE: MooncakeConnector currently reuses the toy proxy from vLLM's `nixl_integration` tests. A dedicated proxy is planned to replace it. |
| 55 | + |
| 56 | +You can now send requests to the proxy server on port 8000. |
| 57 | + |
| 58 | +#### Test |
| 59 | + |
| 60 | +```bash |
| 61 | +curl http://127.0.0.1:8000/v1/chat/completions \ |
| 62 | + -H "Content-Type: application/json" \ |
| 63 | + -d '{ |
| 64 | + "model": "Qwen/Qwen2.5-7B-Instruct", |
| 65 | + "messages": [ |
| 66 | + {"role": "user", "content": "Tell me a long story about artificial intelligence."} |
| 67 | + ] |
| 68 | + }' |
| 69 | +``` |
| 70 | + |
| 71 | +### Advanced Configuration |
| 72 | + |
| 73 | +#### With Tensor Parallelism |
| 74 | + |
| 75 | +**Prefiller:** |
| 76 | + |
| 77 | +```bash |
| 78 | +CUDA_VISIBLE_DEVICES=0,1,2,3 \ |
| 79 | +vllm serve Qwen/Qwen2.5-7B-Instruct \ |
| 80 | + --port 8010 \ |
| 81 | + --tensor-parallel-size 4 \ |
| 82 | + --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}' |
| 83 | +``` |
| 84 | + |
| 85 | +**Decoder:** |
| 86 | + |
| 87 | +```bash |
| 88 | +CUDA_VISIBLE_DEVICES=0,1,2,3 \ |
| 89 | +vllm serve Qwen/Qwen2.5-7B-Instruct \ |
| 90 | + --port 8020 \ |
| 91 | + --tensor-parallel-size 4 \ |
| 92 | + --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer"}' |
| 93 | +``` |
| 94 | + |
| 95 | +#### Configuration Parameters |
| 96 | + |
| 97 | +- `--kv-transfer-config`: JSON string to configure the KV transfer connector |
| 98 | + - `kv_connector`: Set to "MooncakeConnector" |
| 99 | + - `kv_role`: Role of the instance |
| 100 | + - `kv_producer`: For prefiller instances that generate KV caches |
| 101 | + - `kv_consumer`: For decoder instances that consume KV caches |
| 102 | + - `kv_both`: For instances that act as both producer and consumer (experimental) |
| 103 | + - `num_workers`: Size of the thread pool that each prefiller worker uses to send KV cache (default: 10) |
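
Because the prefiller and decoder commands differ only in `kv_role`, the JSON string can be produced by a small shell helper. The sketch below is illustrative only: `kv_transfer_config` is a hypothetical function name, not part of vLLM or Mooncake, and it emits just the two keys used in the commands in this guide.

```bash
# Hypothetical helper (not part of vLLM or Mooncake): print the
# --kv-transfer-config JSON for a given role, rejecting unknown roles.
kv_transfer_config() (
  role=$1
  case "$role" in
    kv_producer|kv_consumer|kv_both) ;;
    *) echo "unknown kv_role: $role" >&2; exit 1 ;;
  esac
  printf '{"kv_connector":"MooncakeConnector","kv_role":"%s"}\n' "$role"
)

# Print the prefiller configuration.
kv_transfer_config kv_producer

# Possible use when launching an instance:
#   vllm serve Qwen/Qwen2.5-7B-Instruct --port 8010 \
#     --kv-transfer-config "$(kv_transfer_config kv_producer)"
```

The function body runs in a subshell, so `exit 1` on an invalid role only makes the helper return a non-zero status and does not close the calling shell.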
| 104 | + |
| 105 | +## Environment Variables |
| 106 | + |
| 107 | +The following environment variables can be used to customize Mooncake behavior: |
| 108 | + |
| 109 | +- `VLLM_MOONCAKE_BOOTSTRAP_PORT`: Port for Mooncake bootstrap server |
| 110 | + - Default: 8998 |
| 111 | + - Required only for prefiller instances |
| 112 | + - Each vLLM worker needs a unique port on its host |
| 113 | + - For TP/DP deployments, each worker's port is computed as: `base_port + dp_rank * tp_size + tp_rank` |
| 114 | + |
| 115 | +- `VLLM_MOONCAKE_ABORT_REQUEST_TIMEOUT`: Timeout (in seconds) for automatically releasing KV cache |
| 116 | + - Default: 480 |
| 117 | + - Used when a request is aborted to prevent holding resources indefinitely |
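
When opening firewall ports or checking for port conflicts, it can help to list the bootstrap ports that a deployment will occupy. The following sketch simply evaluates the formula `base_port + dp_rank * tp_size + tp_rank` given above; the function names `mooncake_worker_port` and `list_mooncake_ports` are hypothetical helpers introduced here, and the example values (DP rank 0, TP size 4) are assumptions chosen for illustration.

```bash
# Hypothetical helper: bootstrap port of one worker, per the formula
# base_port + dp_rank * tp_size + tp_rank.
mooncake_worker_port() (
  base_port=$1; dp_rank=$2; tp_size=$3; tp_rank=$4
  echo $(( base_port + dp_rank * tp_size + tp_rank ))
)

# Hypothetical helper: ports of all TP workers belonging to one DP rank.
list_mooncake_ports() (
  base_port=$1; dp_rank=$2; tp_size=$3
  tp_rank=0
  while [ "$tp_rank" -lt "$tp_size" ]; do
    mooncake_worker_port "$base_port" "$dp_rank" "$tp_size" "$tp_rank"
    tp_rank=$(( tp_rank + 1 ))
  done
)

# Example: DP rank 0 with TP size 4, falling back to the default base port.
list_mooncake_ports "${VLLM_MOONCAKE_BOOTSTRAP_PORT:-8998}" 0 4
```

With `VLLM_MOONCAKE_BOOTSTRAP_PORT` unset, the last line prints ports 8998 through 9001, one per line.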
| 118 | + |
| 119 | +## Performance |
| 120 | + |
| 121 | +For detailed performance benchmarks and results, see the [vLLM Benchmark](../../performance/vllm_benchmark.md) documentation. |
| 122 | + |
| 123 | +## Notes |
| 124 | + |
| 125 | +- Tensor parallelism (TP) is supported for both prefiller and decoder instances |
| 126 | +- The proxy server should typically run on the decoder node |
| 127 | +- Ensure network connectivity between prefiller and decoder nodes for RDMA transfer |
| 128 | +- For production deployments, consider using a more robust proxy solution |
| 129 | + |
| 130 | +## Troubleshooting |
| 131 | + |
| 132 | +- If you encounter connection issues, check that: |
| 133 | + - All nodes can reach each other over the network |
| 134 | + - Firewall rules allow traffic on the specified ports |
| 135 | + - RDMA devices are properly configured |
| 136 | +- For missing library errors, rebuild mooncake-transfer-engine from source |
| 137 | +- Enable debug logging with `VLLM_LOGGING_LEVEL=DEBUG` for detailed diagnostics |