
Commit b5ccef6

[Doc] Add doc for Qwen3 Next (#2916)

### What this PR does / why we need it?
Add doc for Qwen3 Next

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Doc CI passed

Related: #2884

- vLLM version: v0.10.2
- vLLM main: vllm-project/vllm@01413e0

Signed-off-by: Yikun Jiang <[email protected]>
1 parent aa3c456 commit b5ccef6

File tree: 2 files changed (+157, -0 lines)

docs/source/tutorials/index.md

Lines changed: 1 addition & 0 deletions
@@ -8,6 +8,7 @@ single_npu_multimodal
 single_npu_audio
 single_npu_qwen3_embedding
 single_npu_qwen3_quantization
+multi_npu_qwen3_next
 multi_npu
 multi_npu_moge
 multi_npu_qwen3_moe

Lines changed: 156 additions & 0 deletions
@@ -0,0 +1,156 @@

# Multi-NPU (Qwen3-Next)

```{note}
Qwen3 Next uses [Triton Ascend](https://gitee.com/ascend/triton-ascend), which is currently experimental. Future versions may introduce behavioral changes related to stability, accuracy, and performance.
```

## Run vllm-ascend on Multi-NPU with Qwen3 Next

Run the docker container:

```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
    --name vllm-ascend-qwen3 \
    --device /dev/davinci0 \
    --device /dev/davinci1 \
    --device /dev/davinci2 \
    --device /dev/davinci3 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -p 8000:8000 \
    -it $IMAGE bash
```

Set up environment variables:

```bash
# Load the model from ModelScope to speed up the download
export VLLM_USE_MODELSCOPE=True
```
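
With `VLLM_USE_MODELSCOPE=True`, the weights are downloaded from ModelScope on first use. If you prefer to pre-fetch them into the mounted cache (`/root/.cache`) before starting the server, a minimal sketch using the ModelScope Python API (assuming the `modelscope` package is installed in the image) looks like this:

```python
# Optional pre-download step: fetch the Qwen3-Next weights into the local
# ModelScope cache ahead of serving, so the first server start is faster.
from modelscope import snapshot_download

model_dir = snapshot_download("Qwen/Qwen3-Next-80B-A3B-Instruct")
print(f"Model downloaded to: {model_dir}")
```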

### Install Triton Ascend

:::::{tab-set}
::::{tab-item} Linux (aarch64)

[Triton Ascend](https://gitee.com/ascend/triton-ascend) is required to run Qwen3 Next. Please follow the instructions below to install it and its dependencies.

Install the Ascend BiSheng toolkit:

```bash
wget https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/Ascend-BiSheng-toolkit_aarch64.run
chmod a+x Ascend-BiSheng-toolkit_aarch64.run
./Ascend-BiSheng-toolkit_aarch64.run --install
source /usr/local/Ascend/8.3.RC1/bisheng_toolkit/set_env.sh
```

Install Triton Ascend:

```bash
wget https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/triton_ascend-3.2.0.dev20250914-cp311-cp311-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl
pip install triton_ascend-3.2.0.dev20250914-cp311-cp311-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl
```
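
Optionally, you can sanity-check the installation before moving on. This is only a quick sketch and assumes the wheel exposes the standard `triton` module name:

```python
# Quick sanity check (assumption: the triton_ascend wheel installs the
# standard `triton` Python module). If the import fails, re-check the wheel
# installation and the BiSheng environment setup above.
import triton

print(triton.__version__)
```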

::::

::::{tab-item} Linux (x86_64)

Coming soon ...

::::
:::::

### Inference on Multi-NPU

Please make sure you have already executed the following command:

```bash
source /usr/local/Ascend/8.3.RC1/bisheng_toolkit/set_env.sh
```

:::::{tab-set}
::::{tab-item} Online Inference

Run the following command to start the vLLM server on multiple NPUs.

For an Atlas A2 with 64 GB of memory per NPU card, `--tensor-parallel-size` should be at least 4; with 32 GB per card, it should be at least 8.

```bash
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct --tensor-parallel-size 4 --max-model-len 4096 --gpu-memory-utilization 0.7 --enforce-eager
```
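
Loading the 80B checkpoint can take a while. As a convenience (not part of the original guide), here is a small polling sketch that waits for vLLM's standard `/health` endpoint to report ready before you send requests:

```python
# Poll the vLLM OpenAI-compatible server until it is ready to accept requests.
# Assumes the server started above is listening on localhost:8000.
import time
import urllib.error
import urllib.request


def wait_for_server(url: str = "http://localhost:8000/health", timeout_s: int = 1800) -> None:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url) as resp:
                if resp.status == 200:
                    print("Server is ready.")
                    return
        except (urllib.error.URLError, ConnectionError):
            pass  # Server not up yet; keep waiting.
        time.sleep(5)
    raise TimeoutError(f"Server did not become ready within {timeout_s} seconds")


if __name__ == "__main__":
    wait_for_server()
```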

Once your server is started, you can query the model with input prompts:

```bash
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "Qwen/Qwen3-Next-80B-A3B-Instruct",
    "messages": [
        {"role": "user", "content": "Give me a short introduction to large language models."}
    ],
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "max_tokens": 4096
}'
```
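
The same request can also be issued from Python through the OpenAI-compatible API. This is a minimal sketch, assuming the `openai` Python package is installed in the container:

```python
# Query the vLLM OpenAI-compatible endpoint started above.
from openai import OpenAI

# vLLM does not require a real API key; any placeholder string works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",
    messages=[
        {"role": "user", "content": "Give me a short introduction to large language models."}
    ],
    temperature=0.6,
    top_p=0.95,
    max_tokens=4096,
    extra_body={"top_k": 20},  # top_k is a vLLM extension, passed via extra_body
)
print(response.choices[0].message.content)
```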

::::

::::{tab-item} Offline Inference

Run the following script to execute offline inference on multiple NPUs:

```python
import gc
import torch

from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
                                             destroy_model_parallel)


def clean_up():
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.npu.empty_cache()


if __name__ == '__main__':
    prompts = [
        "Who are you?",
    ]
    sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_tokens=32)
    llm = LLM(model="Qwen/Qwen3-Next-80B-A3B-Instruct",
              tensor_parallel_size=4,
              enforce_eager=True,
              distributed_executor_backend="mp",
              gpu_memory_utilization=0.7,
              max_model_len=4096)

    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

    del llm
    clean_up()
```

If the script runs successfully, you will see output similar to the following:

```bash
Prompt: 'Who are you?', Generated text: ' What do you know about me?\n\nHello! I am Qwen, a large-scale language model independently developed by the Tongyi Lab under Alibaba Group. I am'
```

::::
:::::
