@@ -26,11 +26,21 @@ It is recommended to download the model weight to the shared directory of multip
2626
2727vLLM and vLLM-Ascend support GLM-5 only on their main branches. You can use our official docker images and upgrade vllm and vllm-ascend for inference.
2828
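If the weights are not already available, a minimal download sketch using the ModelScope CLI is shown below. The model id `vllm-ascend/GLM5-w4a8` and the target path are assumptions inferred from the serve commands later in this guide; adjust them to the variant you actually deploy.

```bash
# Hypothetical example: place the weights where the containers below expect them.
# The model id and local path are assumptions; substitute the variant you deploy.
pip install -U modelscope
modelscope download \
  --model vllm-ascend/GLM5-w4a8 \
  --local_dir /root/.cache/modelscope/hub/models/vllm-ascend/GLM5-w4a8
```
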
29+ :::::{tab-set}
30+ :sync-group: install
31+
32+ ::::{tab-item} A3 series
33+ :sync: A3
34+
35+ Start the docker container on each of your nodes.
36+
2937```{code-block} bash
38+ :substitutions:
39+
3040# Update --device according to your device (Atlas A3:/dev/davinci[0-15]).
3141# Update the vllm-ascend image according to your environment.
3242# Note you should download the weight to /root/.cache in advance.
33- # Update the vllm-ascend image, alm5 -a3 can be replaced by: glm5;glm5-openeuler;glm5-a3-openeuler
43+ # Update the vllm-ascend image, glm5-a3 can be replaced by: glm5;glm5-openeuler;glm5-a3-openeuler
3444export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:glm5-a3
3545export NAME=vllm-ascend
3646
@@ -69,6 +79,44 @@ docker run --rm \
6979-it $IMAGE bash
7080```
7181
82+ ::::
83+ ::::{tab-item} A2 series
84+ :sync: A2
85+
86+ Start the docker container on each of your nodes.
87+
88+ ```{code-block} bash
89+ :substitutions:
90+
91+ export IMAGE=quay.io/ascend/vllm-ascend:glm5
92+ docker run --rm \
93+ --name vllm-ascend \
94+ --shm-size=1g \
95+ --net=host \
96+ --device /dev/davinci0 \
97+ --device /dev/davinci1 \
98+ --device /dev/davinci2 \
99+ --device /dev/davinci3 \
100+ --device /dev/davinci4 \
101+ --device /dev/davinci5 \
102+ --device /dev/davinci6 \
103+ --device /dev/davinci7 \
104+ --device /dev/davinci_manager \
105+ --device /dev/devmm_svm \
106+ --device /dev/hisi_hdc \
107+ -v /usr/local/dcmi:/usr/local/dcmi \
108+ -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
109+ -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
110+ -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
111+ -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
112+ -v /etc/ascend_install.info:/etc/ascend_install.info \
113+ -v /root/.cache:/root/.cache \
114+ -it $IMAGE bash
115+ ```
116+
117+ ::::
118+ :::::
119+
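Once you are inside the container, a quick sanity check confirms that the NPUs are visible; this assumes `npu-smi` was mounted into the container as shown above.

```bash
# List the NPU devices visible inside the container and their health status.
npu-smi info
```
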
72120In addition, if you don't want to use the docker image above, you can also build everything from source:
73121
74122- Install `vllm-ascend` from source, refer to [installation](https://docs.vllm.ai/projects/ascend/en/latest/installation.html).
@@ -99,17 +147,18 @@ If you want to deploy multi-node environment, you need to set up environment on
99147
100148### Single-node Deployment
101149
102- ** A2 series**
103-
104- Not test yet.
150+ :::::{tab-set}
151+ :sync-group: install
105152
106- ** A3 series**
153+ ::::{tab-item} A3 series
154+ :sync: A3
107155
108156- Quantized model `glm-5-w4a8` can be deployed on 1 Atlas 800 A3 (64G × 16).
109157
110158Run the following script to start online inference.
111159
112- ``` shell
160+ ```{code-block} bash
161+ :substitutions:
113162export HCCL_OP_EXPANSION_MODE="AIV"
114163export OMP_PROC_BIND=false
115164export OMP_NUM_THREADS=10
@@ -140,6 +189,49 @@ vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/GLM5-w4a8 \
140189--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
141190```
142191
192+ ::::
193+ ::::{tab-item} A2 series
194+ :sync: A2
195+
196+ - Quantized model `glm-5-w4a8` can be deployed on 1 Atlas 800 A2 (64G × 8).
197+
198+ Run the following script to start online inference.
199+
200+ ```{code-block} bash
201+ :substitutions:
202+ export HCCL_OP_EXPANSION_MODE="AIV"
203+ export OMP_PROC_BIND=false
204+ export OMP_NUM_THREADS=10
205+ export VLLM_USE_V1=1
206+ export HCCL_BUFFSIZE=200
207+ export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
208+ export VLLM_ASCEND_BALANCE_SCHEDULING=1
209+
210+ vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/GLM-5-w4a8 \
211+ --host 0.0.0.0 \
212+ --port 8077 \
213+ --data-parallel-size 1 \
214+ --tensor-parallel-size 8 \
215+ --enable-expert-parallel \
216+ --seed 1024 \
217+ --served-model-name glm-5 \
218+ --max-num-seqs 2 \
219+ --max-model-len 32768 \
220+ --max-num-batched-tokens 4096 \
221+ --trust-remote-code \
222+ --gpu-memory-utilization 0.95 \
223+ --quantization ascend \
224+ --enable-chunked-prefill \
225+ --enable-prefix-caching \
226+ --async-scheduling \
227+ --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
228+ --additional-config '{"multistream_overlap_shared_expert":true}' \
229+ --speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
230+ ```
231+
232+ ::::
233+ :::::
234+
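After the server is up, you can verify it through the OpenAI-compatible API; the port `8077` and the served model name `glm-5` below simply mirror the flags used above.

```bash
# List the registered models.
curl http://localhost:8077/v1/models

# Send a simple chat completion request.
curl http://localhost:8077/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "glm-5",
        "messages": [{"role": "user", "content": "Hello, who are you?"}],
        "max_tokens": 64
      }'
```
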
143235**Notice:**
144236The parameters are explained as follows:
145237
@@ -148,19 +240,20 @@ The parameters are explained as follows:
148240
149241### Multi-node Deployment
150242
151- ** A2 series**
243+ :::::{tab-set}
244+ :sync-group: install
152245
153- Not test yet.
154-
155- ** A3 series**
246+ ::::{tab-item} A3 series
247+ :sync: A3
156248
157249- `glm-5-bf16`: requires at least 2 Atlas 800 A3 (64G × 16).
158250
159251Run the following scripts on two nodes respectively.
160252
161253** node 0**
162254
163- ``` shell
255+ ```{code-block} bash
256+ :substitutions:
164257# these values are obtained through ifconfig
165258# nic_name is the network interface name corresponding to local_ip of the current node
166259nic_name="xxx"
@@ -204,7 +297,8 @@ vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/GLM5-bf16 \
204297
205298**node 1**
206299
207- ``` shell
300+ ```{code-block} bash
301+ :substitutions:
208302# these values are obtained through ifconfig
209303# nic_name is the network interface name corresponding to local_ip of the current node
210304nic_name="xxx"
@@ -248,6 +342,111 @@ vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/GLM5-bf16 \
248342--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
249343```
250344
345+ ::::
346+ ::::{tab-item} A2 series
347+ :sync: A2
348+
349+ Run the following scripts on two nodes respectively.
350+
351+ **node 0**
352+
353+ ```{code-block} bash
354+ :substitutions:
355+ # these values are obtained through ifconfig
356+ # nic_name is the network interface name corresponding to local_ip of the current node
357+ nic_name="xxx"
358+ local_ip="xxx"
359+
360+ # The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
361+ node0_ip="xxx"
362+
363+ export HCCL_OP_EXPANSION_MODE="AIV"
364+
365+ export HCCL_IF_IP=$local_ip
366+ export GLOO_SOCKET_IFNAME=$nic_name
367+ export TP_SOCKET_IFNAME=$nic_name
368+ export HCCL_SOCKET_IFNAME=$nic_name
369+ export OMP_PROC_BIND=false
370+ export OMP_NUM_THREADS=10
371+ export VLLM_USE_V1=1
372+ export HCCL_BUFFSIZE=200
373+ export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
374+
375+ vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/GLM-5-w4a8 \
376+ --host 0.0.0.0 \
377+ --port 8077 \
378+ --data-parallel-size 2 \
379+ --data-parallel-size-local 1 \
380+ --data-parallel-address $node0_ip \
381+ --data-parallel-rpc-port 13389 \
382+ --tensor-parallel-size 8 \
383+ --quantization ascend \
384+ --seed 1024 \
385+ --served-model-name glm-5 \
386+ --enable-expert-parallel \
387+ --max-num-seqs 2 \
388+ --max-model-len 131072 \
389+ --max-num-batched-tokens 4096 \
390+ --trust-remote-code \
391+ --no-enable-prefix-caching \
392+ --gpu-memory-utilization 0.95 \
393+ --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
394+ --additional-config '{"multistream_overlap_shared_expert":true}' \
395+ --speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
396+ ```
397+
398+ **node 1**
399+
400+ ```{code-block} bash
401+ :substitutions:
402+ # these values are obtained through ifconfig
403+ # nic_name is the network interface name corresponding to local_ip of the current node
404+ nic_name="xxx"
405+ local_ip="xxx"
406+
407+ # The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
408+ node0_ip="xxx"
409+
410+ export HCCL_OP_EXPANSION_MODE="AIV"
411+
412+ export HCCL_IF_IP=$local_ip
413+ export GLOO_SOCKET_IFNAME=$nic_name
414+ export TP_SOCKET_IFNAME=$nic_name
415+ export HCCL_SOCKET_IFNAME=$nic_name
416+ export OMP_PROC_BIND=false
417+ export OMP_NUM_THREADS=10
418+ export VLLM_USE_V1=1
419+ export HCCL_BUFFSIZE=200
420+ export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
421+
422+ vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/GLM-5-w4a8 \
423+ --host 0.0.0.0 \
424+ --port 8077 \
425+ --headless \
426+ --data-parallel-size 2 \
427+ --data-parallel-size-local 1 \
428+ --data-parallel-start-rank 1 \
429+ --data-parallel-address $node0_ip \
430+ --data-parallel-rpc-port 13389 \
431+ --tensor-parallel-size 8 \
432+ --quantization ascend \
433+ --seed 1024 \
434+ --served-model-name glm-5 \
435+ --enable-expert-parallel \
436+ --max-num-seqs 2 \
437+ --max-model-len 131072 \
438+ --max-num-batched-tokens 4096 \
439+ --trust-remote-code \
440+ --no-enable-prefix-caching \
441+ --gpu-memory-utilization 0.95 \
442+ --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
443+ --additional-config '{"multistream_overlap_shared_expert":true}' \
444+ --speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
445+ ```
446+
447+ ::::
448+ :::::
449+
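If the two nodes fail to establish HCCL links, it is worth checking that the NPU NICs can reach each other across nodes. A minimal check with `hccn_tool` (mounted into the container above) is sketched below; the peer address is a placeholder you need to fill in.

```bash
# Query the IP address configured on NPU NIC 0 of the local node.
hccn_tool -i 0 -ip -g

# Ping the peer node's NPU NIC from device 0 (replace with the real address).
hccn_tool -i 0 -ping -g address <peer_npu_nic_ip>
```
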
251450### Prefill-Decode Disaggregation
252451
253452Not tested yet.