# Multi-Node (DeepSeek V3.2)

:::{note}
Only aarch64 machines are currently supported; x86 support is coming soon. This guide uses A3 as the example.
:::

## Verify Multi-Node Communication Environment

### Physical Layer Requirements:

- The physical machines must be located on the same WLAN, with network connectivity.
- All NPUs are connected with optical modules, and the connection status must be normal.

### Verification Process:

Execute the following commands on each node in sequence. The results must all be `success` and the status must be `UP`:
| 17 | + |
| 18 | +:::::{tab-set} |
| 19 | +::::{tab-item} A2 series |
| 20 | + |
| 21 | +```bash |
| 22 | + # Check the remote switch ports |
| 23 | + for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done |
| 24 | + # Get the link status of the Ethernet ports (UP or DOWN) |
| 25 | + for i in {0..7}; do hccn_tool -i $i -link -g ; done |
| 26 | + # Check the network health status |
| 27 | + for i in {0..7}; do hccn_tool -i $i -net_health -g ; done |
| 28 | + # View the network detected IP configuration |
| 29 | + for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done |
| 30 | + # View gateway configuration |
| 31 | + for i in {0..7}; do hccn_tool -i $i -gateway -g ; done |
| 32 | + # View NPU network configuration |
| 33 | + cat /etc/hccn.conf |
| 34 | +``` |
| 35 | + |
| 36 | +:::: |
| 37 | +::::{tab-item} A3 series |
| 38 | + |
| 39 | +```bash |
| 40 | + # Check the remote switch ports |
| 41 | + for i in {0..15}; do hccn_tool -i $i -lldp -g | grep Ifname; done |
| 42 | + # Get the link status of the Ethernet ports (UP or DOWN) |
| 43 | + for i in {0..15}; do hccn_tool -i $i -link -g ; done |
| 44 | + # Check the network health status |
| 45 | + for i in {0..15}; do hccn_tool -i $i -net_health -g ; done |
| 46 | + # View the network detected IP configuration |
| 47 | + for i in {0..15}; do hccn_tool -i $i -netdetect -g ; done |
| 48 | + # View gateway configuration |
| 49 | + for i in {0..15}; do hccn_tool -i $i -gateway -g ; done |
| 50 | + # View NPU network configuration |
| 51 | + cat /etc/hccn.conf |
| 52 | +``` |
| 53 | + |
| 54 | +:::: |
| 55 | +::::: |
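The last command above prints `/etc/hccn.conf`, which stores each NPU's network settings as indexed `key_N=value` pairs. The sketch below writes a hypothetical sample to a temporary path and extracts the per-NPU IPs; the key names and addresses are assumptions for illustration, so check them against your actual file:

```bash
# Hypothetical sample of the hccn.conf key=value format (real file: /etc/hccn.conf)
cat > /tmp/hccn.conf <<'EOF'
address_0=10.20.0.20
netmask_0=255.255.0.0
gateway_0=10.20.0.1
address_1=10.20.0.21
netmask_1=255.255.0.0
gateway_1=10.20.0.1
EOF
# List the per-NPU IP addresses
grep '^address_' /tmp/hccn.conf | cut -d= -f2
```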

### NPU Interconnect Verification:
#### 1. Get NPU IP Addresses
:::::{tab-set}
::::{tab-item} A2 series

```bash
for i in {0..7}; do hccn_tool -i $i -ip -g | grep ipaddr; done
```

::::
::::{tab-item} A3 series

```bash
for i in {0..15}; do hccn_tool -i $i -ip -g | grep ipaddr; done
```

::::
:::::

#### 2. Cross-Node PING Test

```bash
# Execute on the target node (replace with the actual IP)
hccn_tool -i 0 -ping -g address 10.20.0.20
```
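To check every link in one pass, the ping can be wrapped in a loop. In this sketch, `echo` stands in for the real tool so the commands are only printed (a dry run); on a real node, set `HCCN_TOOL=hccn_tool` and fill in the IPs gathered in the previous step:

```bash
# Hypothetical remote NPU IPs; replace with the addresses gathered above.
REMOTE_IPS="10.20.0.20 10.20.0.21"
# Dry run: "echo" prefixes the command so nothing is actually pinged here.
HCCN_TOOL="echo hccn_tool"
for ip in $REMOTE_IPS; do
  $HCCN_TOOL -i 0 -ping -g address $ip
done
```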

## Deploy DeepSeek-V3.2-Exp with vLLM-Ascend:

Currently, we provide an all-in-one image (including CANN 8.2RC1 + [SparseFlashAttention/LightningIndexer](https://gitcode.com/cann/cann-recipes-infer/tree/master/ops/ascendc) + [MLAPO](https://github.com/vllm-project/vllm-ascend/pull/3226)). You can also build your own image by referring to [this issue](https://github.com/vllm-project/vllm-ascend/issues/3278).

- `DeepSeek-V3.2-Exp`: requires 2 Atlas 800 A3 (64G*16) nodes or 4 Atlas 800 A2 (64G*8) nodes. [Model weight link](https://modelers.cn/models/Modelers_Park/DeepSeek-V3.2-Exp-BF16)
- `DeepSeek-V3.2-Exp-w8a8`: requires 1 Atlas 800 A3 (64G*16) node or 2 Atlas 800 A2 (64G*8) nodes. [Model weight link](https://modelers.cn/models/Modelers_Park/DeepSeek-V3.2-Exp-w8a8)

Run the following command to start the container on each node (this guide assumes you have already downloaded the model weights to `/root/.cache`):
| 91 | + |
| 92 | +:::::{tab-set} |
| 93 | +::::{tab-item} A2 series |
| 94 | + |
| 95 | +```{code-block} bash |
| 96 | + :substitutions: |
| 97 | +# Update the vllm-ascend image |
| 98 | +# export IMAGE=quay.io/ascend/vllm-ascend:v0.11.0rc0-a3-deepseek-v3.2-exp |
| 99 | +export IMAGE=quay.nju.edu.cn/ascend/vllm-ascend:v0.11.0rc0-a3-deepseek-v3.2-exp |
| 100 | +export NAME=vllm-ascend |
| 101 | +
|
| 102 | +# Run the container using the defined variables |
| 103 | +# Note if you are running bridge network with docker, Please expose available ports |
| 104 | +# for multiple nodes communication in advance |
| 105 | +docker run --rm \ |
| 106 | +--name $NAME \ |
| 107 | +--net=host \ |
| 108 | +--device /dev/davinci0 \ |
| 109 | +--device /dev/davinci1 \ |
| 110 | +--device /dev/davinci2 \ |
| 111 | +--device /dev/davinci3 \ |
| 112 | +--device /dev/davinci4 \ |
| 113 | +--device /dev/davinci5 \ |
| 114 | +--device /dev/davinci6 \ |
| 115 | +--device /dev/davinci7 \ |
| 116 | +--device /dev/davinci_manager \ |
| 117 | +--device /dev/devmm_svm \ |
| 118 | +--device /dev/hisi_hdc \ |
| 119 | +-v /usr/local/dcmi:/usr/local/dcmi \ |
| 120 | +-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \ |
| 121 | +-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \ |
| 122 | +-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \ |
| 123 | +-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \ |
| 124 | +-v /etc/ascend_install.info:/etc/ascend_install.info \ |
| 125 | +-v /root/.cache:/root/.cache \ |
| 126 | +-it $IMAGE bash |
| 127 | +``` |
| 128 | + |
| 129 | +:::: |
::::{tab-item} A3 series

```{code-block} bash
 :substitutions:
# Update the vllm-ascend image
# export IMAGE=quay.io/ascend/vllm-ascend:v0.11.0rc0-a3-deepseek-v3.2-exp
export IMAGE=quay.nju.edu.cn/ascend/vllm-ascend:v0.11.0rc0-a3-deepseek-v3.2-exp
export NAME=vllm-ascend

# Run the container using the defined variables
# Note: if you are running a bridge network with docker, please expose the ports
# needed for multi-node communication in advance
docker run --rm \
--name $NAME \
--net=host \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci8 \
--device /dev/davinci9 \
--device /dev/davinci10 \
--device /dev/davinci11 \
--device /dev/davinci12 \
--device /dev/davinci13 \
--device /dev/davinci14 \
--device /dev/davinci15 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
```

::::
:::::

:::{note}
We also provide an openEuler-based image; just replace `IMAGE` with `quay.io/ascend/vllm-ascend:v0.11.0rc0-a3-openeuler-deepseek-v3.2-exp`.
:::

:::::{tab-set}
::::{tab-item} DeepSeek-V3.2-Exp A3 series

Run the following scripts on the two nodes respectively.

:::{note}
Before launching the inference server, ensure the following environment variables are set for multi-node communication.
:::

**node0**

```shell
#!/bin/sh

# Obtain these via ifconfig:
# nic_name is the network interface name corresponding to local_ip
nic_name="xxxx"
local_ip="xxxx"

export VLLM_USE_MODELSCOPE=True
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export HCCL_BUFFSIZE=1024

vllm serve /root/.cache/Modelers_Park/DeepSeek-V3.2-Exp \
--host 0.0.0.0 \
--port 8000 \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-address $local_ip \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 16 \
--seed 1024 \
--served-model-name deepseek_v3.2 \
--enable-expert-parallel \
--max-num-seqs 16 \
--max-model-len 32768 \
--max-num-batched-tokens 32768 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.9 \
--additional-config '{"torchair_graph_config":{"enabled":true,"graph_batch_sizes":[16]}}'
```

**node1**

```shell
#!/bin/sh

nic_name="xxx"
local_ip="xxx"

export VLLM_USE_MODELSCOPE=True
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export HCCL_BUFFSIZE=1024

vllm serve /root/.cache/Modelers_Park/DeepSeek-V3.2-Exp \
--host 0.0.0.0 \
--port 8000 \
--headless \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-start-rank 1 \
--data-parallel-address <node0_ip> \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 16 \
--seed 1024 \
--served-model-name deepseek_v3.2 \
--max-num-seqs 16 \
--max-model-len 32768 \
--max-num-batched-tokens 32768 \
--enable-expert-parallel \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--additional-config '{"torchair_graph_config":{"enabled":true,"graph_batch_sizes":[16]}}'
```

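As a sanity check for the flags above: `--data-parallel-size` times `--tensor-parallel-size` must equal the total NPU count across all nodes (here 2 × 16 = 32 NPUs on two A3 machines). A minimal arithmetic sketch, using the values from these scripts:

```bash
# Sanity check: dp * tp must equal the total NPU count across nodes.
DP=2            # --data-parallel-size
TP=16           # --tensor-parallel-size
NODES=2         # number of A3 nodes
NPUS_PER_NODE=16
if [ $((DP * TP)) -eq $((NODES * NPUS_PER_NODE)) ]; then
  echo "parallelism OK: $((DP * TP)) workers over $((NODES * NPUS_PER_NODE)) NPUs"
else
  echo "mismatch: adjust dp or tp"
fi
```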
::::

::::{tab-item} DeepSeek-V3.2-Exp-W8A8 A3 series

```shell
#!/bin/sh

vllm serve /root/.cache/Modelers_Park/DeepSeek-V3.2-Exp-W8A8 \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 16 \
--seed 1024 \
--quantization ascend \
--served-model-name deepseek_v3.2 \
--max-num-seqs 16 \
--max-model-len 32768 \
--max-num-batched-tokens 32768 \
--enable-expert-parallel \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--additional-config '{"torchair_graph_config":{"enabled":true,"graph_batch_sizes":[16]}}'
```

::::
::::{tab-item} A2 series
Just like the A3 series, the only difference is setting `--data-parallel-size` to the appropriate value on each node.
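For instance, with hypothetical A2 numbers (4 nodes × 8 NPUs and an assumed tensor-parallel size of 8), the values could be derived like this:

```bash
# Hypothetical A2 layout; adjust TP to your actual deployment.
NODES=4; NPUS_PER_NODE=8; TP=8
TOTAL=$((NODES * NPUS_PER_NODE))
DP=$((TOTAL / TP))        # value for --data-parallel-size
DP_LOCAL=$((DP / NODES))  # value for --data-parallel-size-local
echo "data-parallel-size=$DP data-parallel-size-local=$DP_LOCAL"
```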

::::
:::::

Once your server is started, you can query the model with input prompts:

```shell
curl http://<node0_ip>:<port>/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek_v3.2",
    "prompt": "The future of AI is",
    "max_tokens": 50,
    "temperature": 0
  }'
```
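The server replies with an OpenAI-compatible JSON body, with the generated text under `choices[0].text`. A sketch of extracting it, using a hypothetical response payload (real responses contain more fields):

```bash
# Hypothetical response body for illustration only.
RESPONSE='{"id":"cmpl-1","object":"text_completion","model":"deepseek_v3.2","choices":[{"index":0,"text":" bright."}]}'
# Pull out the generated text with the Python stdlib json module.
echo "$RESPONSE" | python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["text"])'
```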