
Commit 03ca2b2

[P/D] Mooncake Connector for v1 distributed (#1568)

Authored by LCAIZJ, jianzs, zzy-ContiLearn, fems14, and Dreamerleader.

### What this PR does / why we need it?
This PR adopts the Mooncake TransferEngine for KV cache registration and a pull_blocks-style disaggregated prefill implementation.

### Does this PR introduce any user-facing change?
No

### Dependencies
1. CANN: Using Mooncake TransferEngine with Ascend Transport requires CANN version 8.2.RC1 or higher (see kvcache-ai/Mooncake#502 for details).
2. vllm-ascend: This PR depends on changes introduced by #950 (modifications to `model_runner_v1`) and #1361 (updates to `schedule`), both of which have been merged into the `v0.9.1-dev` branch and are expected to land in `main` shortly.

### How was this patch tested?
- vLLM version: v0.10.0
- vLLM main: vllm-project/vllm@1c859a1

Signed-off-by: leichao.lc <[email protected]>
Co-authored-by: jianzs <[email protected]>
Co-authored-by: zzy-ContiLearn <[email protected]>
Co-authored-by: fems14 <[email protected]>
Co-authored-by: Dreamerleader <[email protected]>
Co-authored-by: chris668899 <[email protected]>
Co-authored-by: Pz1116 <[email protected]>

1 parent 2bb7e55 commit 03ca2b2

File tree

4 files changed: +2415 -0 lines changed
Lines changed: 163 additions & 0 deletions
# Mooncake Connector Deployment Guide

## Environmental Dependencies

* Software:
  * Python >= 3.9, < 3.12
  * CANN >= 8.2.RC1
  * PyTorch >= 2.7.1, torch-npu >= 2.7.1.dev20250724
  * vLLM (same version as vllm-ascend)
  * mooncake-transfer-engine (reference documentation: https://github.com/kvcache-ai/Mooncake/blob/main/doc/zh/ascend_transport.md)

The vLLM version must match the one required by the vllm-ascend main branch. As of 2025/07/30, the matching versions are:

* vllm: v0.10.1
* vllm-ascend: v0.10.1rc1
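
As a quick sanity check, you can print the installed versions before proceeding. This is a minimal sketch; it assumes the components were installed via pip under these package names:

```
# Show installed versions of the main components (assumed pip package names)
pip show vllm vllm-ascend mooncake-transfer-engine | grep -E "^(Name|Version)"
# Confirm the PyTorch / torch-npu versions
python -c "import torch, torch_npu; print(torch.__version__, torch_npu.__version__)"
```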

## Run

### 1. Run the `prefill` Node

```
bash run_prefill.sh
```

Content of the run_prefill.sh script:

```
export HCCL_EXEC_TIMEOUT=204
export HCCL_CONNECT_TIMEOUT=120
export HCCL_IF_IP=localhost
export GLOO_SOCKET_IFNAME="xxxxxx"
export TP_SOCKET_IFNAME="xxxxxx"
export HCCL_SOCKET_IFNAME="xxxxxx"
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3

vllm serve "/xxxxx/DeepSeek-V2-Lite-Chat" \
  --host localhost \
  --port 8100 \
  --tensor-parallel-size 2 \
  --seed 1024 \
  --max-model-len 2000 \
  --max-num-batched-tokens 2000 \
  --trust-remote-code \
  --enforce-eager \
  --data-parallel-size 2 \
  --data-parallel-address localhost \
  --data-parallel-rpc-port 9100 \
  --gpu-memory-utilization 0.8 \
  --kv-transfer-config \
  '{"kv_connector": "MooncakeConnectorV1",
    "kv_buffer_device": "npu",
    "kv_role": "kv_producer",
    "kv_parallel_size": 1,
    "kv_port": "20001",
    "engine_id": "0",
    "kv_rank": 0,
    "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
    "kv_connector_extra_config": {
      "prefill": {
        "dp_size": 2,
        "tp_size": 2
      },
      "decode": {
        "dp_size": 2,
        "tp_size": 2
      }
    }
  }'
```

`HCCL_EXEC_TIMEOUT`, `HCCL_CONNECT_TIMEOUT`, and `HCCL_IF_IP` are HCCL-related configurations.<br>
Set `GLOO_SOCKET_IFNAME`, `TP_SOCKET_IFNAME`, and `HCCL_SOCKET_IFNAME` to the corresponding NIC.<br>
`ASCEND_RT_VISIBLE_DEVICES` specifies the cards the node runs on. The total number of cards must equal `dp_size * tp_size`.<br>
`/xxxxx/DeepSeek-V2-Lite-Chat` is the path of the model to serve.<br>
`--host`: the IP address the node listens on.<br>
`--port`: the port the node listens on; it must match the corresponding port passed to the proxy in step 3.<br>
`--seed`, `--max-model-len`, and `--max-num-batched-tokens` are basic model-serving parameters; set them according to your deployment.<br>
`--tensor-parallel-size`: the TP size.<br>
`--data-parallel-size`: the DP size.<br>
`--data-parallel-address`: the IP address used for data parallelism; set it to the IP address of the node.<br>
`--data-parallel-rpc-port`: the RPC port used for communication within the DP group.<br>
`--trust-remote-code`: allows loading the model's custom code from the local path.<br>
`--enforce-eager`: disables graph mode and runs in eager mode.<br>
`--gpu-memory-utilization`: the fraction of device memory each card may use.<br>
`--kv-transfer-config`: `kv_connector` and `kv_connector_module_path` select the Mooncake connector, and `kv_buffer_device` must be `npu` to run on NPU cards. Set `kv_role` to `kv_producer` on the prefill (p) node and `kv_consumer` on the decode (d) node, `kv_parallel_size` to 1, and `kv_port` to the port the connector uses on this node. Set `engine_id` and `kv_rank` to 0 on the p node and 1 on the d node. In `kv_connector_extra_config`, describe the parallel layout of both the p and d nodes to match their `--tensor-parallel-size` and `--data-parallel-size`.
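
To fill in these values, you can list the host's network interfaces and visible NPUs with standard Linux and Ascend driver tools; which interface and card IDs to pick depends on your machine:

```
# Pick the NIC the nodes use to reach each other for the *_SOCKET_IFNAME variables
ip -brief addr show
# Confirm the card IDs referenced by ASCEND_RT_VISIBLE_DEVICES
npu-smi info
```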

### 2. Run the `decode` Node

```
bash run_decode.sh
```

Content of the run_decode.sh script:

```
export HCCL_EXEC_TIMEOUT=204
export HCCL_CONNECT_TIMEOUT=120
export HCCL_IF_IP=localhost
export GLOO_SOCKET_IFNAME="xxxxxx"
export TP_SOCKET_IFNAME="xxxxxx"
export HCCL_SOCKET_IFNAME="xxxxxx"
export ASCEND_RT_VISIBLE_DEVICES=4,5,6,7

vllm serve "/xxxxx/DeepSeek-V2-Lite-Chat" \
  --host localhost \
  --port 8200 \
  --tensor-parallel-size 2 \
  --seed 1024 \
  --max-model-len 2000 \
  --max-num-batched-tokens 2000 \
  --trust-remote-code \
  --enforce-eager \
  --data-parallel-size 2 \
  --data-parallel-address localhost \
  --data-parallel-rpc-port 9100 \
  --gpu-memory-utilization 0.8 \
  --kv-transfer-config \
  '{"kv_connector": "MooncakeConnectorV1",
    "kv_buffer_device": "npu",
    "kv_role": "kv_consumer",
    "kv_parallel_size": 1,
    "kv_port": "20002",
    "engine_id": "1",
    "kv_rank": 1,
    "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
    "kv_connector_extra_config": {
      "prefill": {
        "dp_size": 2,
        "tp_size": 2
      },
      "decode": {
        "dp_size": 2,
        "tp_size": 2
      }
    }
  }'
```
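
Before wiring up the proxy, you can verify that both vLLM instances are serving. vLLM exposes an OpenAI-compatible API, so `/v1/models` should answer once a node is ready (ports here match the example scripts above):

```
curl -s http://localhost:8100/v1/models   # prefill node
curl -s http://localhost:8200/v1/models   # decode node
```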

### 3. Start the Proxy Server

```
cd /vllm-ascend/examples/disaggregate_prefill_v1/
python load_balance_proxy_server_example.py --host localhost --prefiller-hosts host1 host2 --prefiller-ports 8100 8101 --decoder-hosts host3 host4 --decoder-ports 8200 8201
```

`--host`: the address the proxy listens on. The host used in the curl command in step 4 must match this value. The proxy serves on port 8000 by default.<br>
`--prefiller-hosts`: the IP addresses of all p nodes. In an xPyD deployment, append the additional addresses to this option, separated by spaces.<br>
`--prefiller-ports`: the ports of all p nodes, i.e. the `--port` values used when starting vLLM in step 1. List them space-separated, in one-to-one correspondence with the addresses in `--prefiller-hosts`.<br>
`--decoder-hosts`: the IP addresses of all d nodes. In an xPyD deployment, append the additional addresses to this option, separated by spaces.<br>
`--decoder-ports`: the ports of all d nodes, i.e. the `--port` values used when starting vLLM in step 2. List them space-separated, in one-to-one correspondence with the addresses in `--decoder-hosts`.<br>
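
For example, a 1P1D deployment on a single machine, matching the two scripts above (the hosts and ports are the ones used in steps 1 and 2), would be started as:

```
python load_balance_proxy_server_example.py --host localhost \
    --prefiller-hosts localhost --prefiller-ports 8100 \
    --decoder-hosts localhost --decoder-ports 8200
```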

### 4. Run Inference

Replace the IP address in the request below with the proxy's actual address, and set the `model` field to the model path, which must match the path used in the shell scripts.

```
curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
  "model": "model_path",
  "prompt": "Given the accelerating impacts of climate change—including rising sea levels, increasing frequency of extreme weather events, loss of biodiversity, and adverse effects on agriculture and human health—there is an urgent need for a robust, globally coordinated response. However, international efforts are complicated by a range of factors: economic disparities between high-income and low-income countries, differing levels of industrialization, varying access to clean energy technologies, and divergent political systems that influence climate policy implementation. In this context, how can global agreements like the Paris Accord be redesigned or strengthened to not only encourage but effectively enforce emission reduction targets? Furthermore, what mechanisms can be introduced to promote fair and transparent technology transfer, provide adequate financial support for climate adaptation in vulnerable regions, and hold nations accountable without exacerbating existing geopolitical tensions or disproportionately burdening those with historically lower emissions?",
  "max_tokens": 256
}'
```
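
To pull out just the generated text from the JSON response, you can pipe the same request through `jq` (assuming `jq` is installed; the OpenAI-compatible completions endpoint returns the text under `choices[0].text`):

```
curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
  "model": "model_path", "prompt": "Hello", "max_tokens": 16
}' | jq -r '.choices[0].text'
```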
