Commit 17c2884

add document for deepseek large EP (#2339)
### What this PR does / why we need it?
This PR presents a large-EP deployment solution based on vllm-ascend, using DeepSeek as an example. It outlines the end-to-end workflow for model deployment and serves as a reference for developers.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Signed-off-by: hust17yixuan <[email protected]>
1 parent cfed4b9 commit 17c2884

2 files changed: +254 −0 lines changed
Lines changed: 253 additions & 0 deletions
@@ -0,0 +1,253 @@

# Distributed DP Server With Large EP (DeepSeek)

## Getting Started

vLLM-Ascend now supports prefill-decode (PD) disaggregation in the large **Expert Parallelism (EP)** scenario. To achieve better performance, the distributed DP server is used in vLLM-Ascend. In the PD disaggregation scenario, different optimization strategies can be applied based on the distinct characteristics of the prefill and decode nodes, enabling more flexible model deployment.

## Verify Multi-Node Communication Environment

### Physical Layer Requirements:

- The physical machines must be located on the same WLAN and have network connectivity.
- All NPUs must be connected with optical modules, and the connection status must be normal.

### Verification Process:

Execute the following commands on each node in sequence. The results must all be `success` and the status must be `UP`:

```bash
# Check the remote switch ports
for i in {0..15}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..15}; do hccn_tool -i $i -link -g ; done
# Check the network health status
for i in {0..15}; do hccn_tool -i $i -net_health -g ; done
# View the network detected IP configuration
for i in {0..15}; do hccn_tool -i $i -netdetect -g ; done
# View gateway configuration
for i in {0..15}; do hccn_tool -i $i -gateway -g ; done
# View NPU network configuration
cat /etc/hccn.conf
```

### NPU Interconnect Verification:

#### 1. Get NPU IP Addresses

```bash
for i in {0..15}; do hccn_tool -i $i -vnic -g;done
```

#### 2. Get Super Pod ID and SDID

```bash
for i in {0..7}; do npu-smi info -t spod-info -i $i -c 0;npu-smi info -t spod-info -i $i -c 1;done
```

#### 3. Cross-Node PING Test

```bash
# Execute on the target node (replace with the actual IP)
for i in {0..15}; do hccn_tool -i $i -hccs_ping -g address x.x.x.x;done
```

## Generate Ranktable

You need to generate a ranktable so that multiple nodes can communicate with each other. For more details, please refer to the [vllm-ascend examples](https://github.com/vllm-project/vllm-ascend/blob/v0.9.1-dev/examples/disaggregate_prefill_v1/README.md). Execute the following commands for reference.

```shell
cd vllm-ascend/examples/disaggregate_prefill_v1/
bash gen_ranktable.sh --ips prefiller_node1_local_ip prefiller_node2_local_ip decoder_node1_local_ip decoder_node2_local_ip \
  --npus-per-node npu_clips --network-card-name nic_name --prefill-device-cnt prefiller_npu_clips --decode-device-cnt decode_npu_clips
```

| Parameter | Meaning |
| --- | --- |
| --ips | Each node's local IP (prefiller nodes must be listed before decoder nodes) |
| --npus-per-node | Number of NPU chips on each node |
| --network-card-name | The physical machine's NIC name |
| --prefill-device-cnt | Number of NPU chips used for prefill |
| --decode-device-cnt | Number of NPU chips used for decode |
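
As an illustration, here is a hypothetical invocation for two prefiller nodes and two decoder nodes, each with 16 NPU chips and a NIC named `enp23s0f3` (the IPs, chip counts, and NIC name are placeholders; substitute your own values):

```shell
cd vllm-ascend/examples/disaggregate_prefill_v1/
# 2 prefiller nodes + 2 decoder nodes, 16 NPUs per node
bash gen_ranktable.sh --ips 172.19.32.1 172.19.32.2 172.19.32.3 172.19.32.4 \
  --npus-per-node 16 --network-card-name enp23s0f3 --prefill-device-cnt 32 --decode-device-cnt 32
```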

## Use the Distributed DP Server

Execute the following commands to use the distributed DP server. (We recommend using this feature on the v0.9.1-dev branch.)

```python
import multiprocessing
import os
import sys

# Replace the placeholder values below with the values for your deployment.
dp_size = 8          # total number of DP workers for decode/prefill
dp_size_local = 4    # number of DP workers on the current node
dp_rank_start = 0    # starting DP rank for the current node
dp_ip = "x.x.x.x"    # master node IP
dp_port = 12345      # port used for DP communication
engine_port = 9000   # starting port for all DP groups on the current node

template_path = "./run_dp_template.sh"
if not os.path.exists(template_path):
    print(f"Template file {template_path} does not exist.")
    sys.exit(1)

def run_command(dp_rank_local, dp_rank, engine_port_):
    # Launch one DP worker through the shell template below.
    command = f"bash ./run_dp_template.sh {dp_size} {dp_ip} {dp_port} {dp_rank_local} {dp_rank} {engine_port_} {dp_size_local}"
    os.system(command)

processes = []
for i in range(dp_size_local):
    dp_rank = dp_rank_start + i
    dp_rank_local = i
    engine_port_ = engine_port + i
    process = multiprocessing.Process(target=run_command, args=(dp_rank_local, dp_rank, engine_port_))
    processes.append(process)
    process.start()
for process in processes:
    process.join()
```
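
For example (illustrative values), with `dp_size = 8` and `dp_size_local = 4`, the first node sets `dp_rank_start = 0` and the second node sets `dp_rank_start = 4`; with `engine_port = 9000`, each node then exposes engine ports 9000 through 9003.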

Note that the prefiller nodes and the decoder nodes may have different configurations. You can use the following shell script to configure the prefiller and decoder nodes respectively.

```shell
#!/bin/sh
# run_dp_template.sh

# local_ip is obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip
nic_name="xxxx"
local_ip="xxxx"

# basic configuration for HCCL and connection
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export HCCL_BUFFSIZE=256
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH='ranktable you generate'

# obtain parameters from the distributed DP server launcher
export VLLM_DP_SIZE=$1
export VLLM_DP_MASTER_IP=$2
export VLLM_DP_MASTER_PORT=$3
export VLLM_DP_RANK_LOCAL=$4
export VLLM_DP_RANK=$5
export VLLM_DP_SIZE_LOCAL=$7

# pytorch_npu settings and vllm settings
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export TASK_QUEUE_ENABLE=1
export VLLM_USE_V1=1

# enable the distributed DP server
export VLLM_WORKER_MULTIPROC_METHOD="fork"
export VLLM_ASCEND_EXTERNAL_DP_LB_ENABLED=1

# The W8A8 weights can be obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-R1-W8A8
# "--additional-config" is used to enable features from vllm-ascend
vllm serve /root/.cache/ds_r1 \
  --host 0.0.0.0 \
  --port $6 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --seed 1024 \
  --served-model-name deepseek_r1 \
  --max-model-len 17000 \
  --max-num-batched-tokens 16384 \
  --trust-remote-code \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.9 \
  --quantization ascend \
  --speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
  --kv-transfer-config \
  '{"kv_connector": "LLMDataDistCMgrConnector",
  "kv_buffer_device": "npu",
  "kv_role": "kv_consumer",
  "kv_parallel_size": "1",
  "kv_port": "20001",
  "engine_id": "0",
  "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
  }' \
  --additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
```
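
The launcher above invokes this template with positional arguments in the order `<dp_size> <dp_ip> <dp_port> <dp_rank_local> <dp_rank> <engine_port> <dp_size_local>`, so a single worker could also be started manually, e.g. `bash ./run_dp_template.sh 8 x.x.x.x 12345 0 0 9000 4` (values are illustrative).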

In the PD disaggregation scenario, we provide a recommended optimized configuration.

- **prefiller node**

1. Set HCCL_BUFFSIZE=256
2. Add the '--enforce-eager' option to 'vllm serve'
3. Set '--additional-config' as follows

```shell
--additional-config '{"ascend_scheduler_config":{"enabled":false}, "torchair_graph_config":{"enabled":false},"enable_weight_nz_layout":true,"enable_prefill_optimizations":true}'
```

- **decoder node**

1. Set HCCL_BUFFSIZE=1024
2. Set '--additional-config' as follows

```shell
--additional-config '{"ascend_scheduler_config":{"enabled":false}, "torchair_graph_config":{"enabled":true,"enable_multistream_mla":true,"enable_multistream_moe":true,"graph_batch_sizes":[28], "enable_super_kernel":true, "use_cached_graph":true},"enable_weight_nz_layout":true}'
```

<br>

'--additional-config' Parameter Introduction:

- **"torchair_graph_config":** The config options for torchair graph mode.
- **"ascend_scheduler_config":** The config options for the ascend scheduler.
- **"enable_weight_nz_layout":** Whether to convert quantized weights to NZ format to accelerate matrix multiplication.
- **"enable_prefill_optimizations":** Whether to enable DeepSeek models' prefill optimizations.

<br>

"torchair_graph_config" Parameter Introduction:

- **"enable_multistream_mla":** Whether to put vector ops of MLA on another stream. This option only takes effect on models using MLA.
- **"enable_multistream_moe":** Whether to enable the multistream shared expert. This option only takes effect on DeepSeek MoE models.
- **"graph_batch_sizes":** The batch sizes for the torchair graph cache.
- **"enable_super_kernel":** Whether to enable the super kernel.
- **"use_cached_graph":** Whether to use the cached graph.

## Toy Proxy for the Distributed DP Server

In the PD disaggregation scenario, we need a proxy to distribute requests. Execute the following commands to enable the toy proxy:

```shell
python load_balance_proxy_server_example.py \
  --port "proxy port" \
  --host 0.0.0.0 \
  --prefiller-hosts \
  prefiller_node1_local_ip \
  prefiller_node2_local_ip \
  --prefiller-ports \
  engine_port engine_port \
  --decoder-hosts \
  decoder_node1_local_ip \
  decoder_node1_local_ip \
  decoder_node2_local_ip \
  decoder_node2_local_ip \
  --decoder-ports \
  engine_port ... \  # increase by dp_size_local, e.g. 9000 9001
  engine_port ... \  # increase by dp_size_local, e.g. 9000 9001
```

:::{note}
Each node's local IP should be repeated the same number of times as its '**dp_size_local**'. Likewise, each node exposes the same number of ports as its '**dp_size_local**', and the ports increase sequentially starting from '**engine_port**'.
:::

You can get the proxy program from the repository's examples: [load\_balance\_proxy\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/v0.9.1-dev/examples/disaggregate_prefill_v1/load_balance_proxy_server_example.py)
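
Once the proxy is up, you can send requests to it as if it were a single vLLM instance. A minimal smoke test, assuming the proxy listens on port 8000 and forwards the OpenAI-compatible completions API (`deepseek_r1` is the name set via `--served-model-name` above; the port is a placeholder):

```shell
# simple end-to-end check through the proxy
curl -s http://127.0.0.1:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek_r1",
        "prompt": "Introduce expert parallelism in one sentence.",
        "max_tokens": 64,
        "temperature": 0
      }'
```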

## Recommended Configuration

For example, suppose the average input length is 3.5k, the average output length is 1.1k, the context length is 16k, and the maximum input length in the dataset is 7k. In this scenario, we give a recommended configuration for the distributed DP server with large EP. Here we use 4 nodes for prefill and 4 nodes for decode.
<br>

| node    | DP | TP | EP | max-model-len | max-num-batched-tokens | max-num-seqs | gpu-memory-utilization |
|---------|----|----|----|---------------|------------------------|--------------|------------------------|
| prefill | 2  | 8  | 16 | 17000         | 16384                  | 4            | 0.9                    |
| decode  | 64 | 1  | 64 | 17000         | 256                    | 28           | 0.9                    |
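
As a sketch derived from the table above (not a complete launch command), the decode nodes would change the following serving flags relative to the template, while the prefill nodes keep the template's values. The DP size is supplied through the launcher (`VLLM_DP_SIZE`) rather than a serve flag, and with `--enable-expert-parallel` the EP size equals DP × TP (2 × 8 = 16 for prefill, 64 × 1 = 64 for decode).

```shell
# decode node: per-engine flags implied by the recommended configuration table
--tensor-parallel-size 1 \
--max-model-len 17000 \
--max-num-batched-tokens 256 \
--max-num-seqs 28 \
--gpu-memory-utilization 0.9
```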

## FAQ

### 1. Prefiller nodes need to warm up

Since some NPU operators require several rounds of warm-up to reach their best performance, we recommend preheating the service with some requests before running performance tests in order to achieve the best end-to-end throughput.
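
A minimal warm-up sketch (assuming the toy proxy listens on port 8000; the port, prompt, and request count are placeholders to adjust for your setup):

```shell
# send a few short requests through the proxy before benchmarking
for i in $(seq 1 16); do
  curl -s http://127.0.0.1:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "deepseek_r1", "prompt": "warm up", "max_tokens": 32}' > /dev/null
done
```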

docs/source/developer_guide/performance/index.md

Lines changed: 1 addition & 0 deletions

@@ -6,4 +6,5 @@
 performance_benchmark
 profile_execute_duration
 optimization_and_tuning
+distributed_dp_server_with_large_ep
 :::
