
Commit 440d28a

[Tutorial] Add qwen3 8b w4a8 tutorial (#2249)
### What this PR does / why we need it?
Add a new single-NPU quantization tutorial using the latest Qwen3 model.

- vLLM version: v0.10.0
- vLLM main: vllm-project/vllm@8e8e0b6

Signed-off-by: 22dimensions <[email protected]>
1 parent bcd0b53 commit 440d28a

File tree

2 files changed (+132, -0 lines)


docs/source/tutorials/index.md

Lines changed: 1 addition & 0 deletions
@@ -7,6 +7,7 @@ single_npu
 single_npu_multimodal
 single_npu_audio
 single_npu_qwen3_embedding
+single_npu_qwen3_quantization
 multi_npu
 multi_npu_moge
 multi_npu_qwen3_moe
docs/source/tutorials/single_npu_qwen3_quantization.md

Lines changed: 131 additions & 0 deletions
@@ -0,0 +1,131 @@

# Single-NPU (Qwen3 8B W4A8)

## Run docker container

:::{note}
The w4a8 quantization feature is supported by v0.9.1rc2 or higher.
:::

```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```

## Install modelslim and convert model

:::{note}
You can either convert the model yourself or use the quantized model we uploaded;
see https://www.modelscope.cn/models/vllm-ascend/Qwen3-8B-W4A8
:::

```bash
# Optional, this commit has been verified
git clone https://gitee.com/ascend/msit
cd msit
git checkout f8ab35a772a6c1ee7675368a2aa4bafba3bedd1a

cd msmodelslim
# Install by running this script
bash install.sh

cd example/Qwen
# Original weight path, replace with your local model path
MODEL_PATH=/home/models/Qwen3-8B
# Path to save the converted weight, replace with your local path
SAVE_PATH=/home/models/Qwen3-8B-w4a8

python quant_qwen.py \
--model_path $MODEL_PATH \
--save_directory $SAVE_PATH \
--device_type npu \
--model_type qwen3 \
--calib_file None \
--anti_method m6 \
--anti_calib_file ./calib_data/mix_dataset.json \
--w_bit 4 \
--a_bit 8 \
--is_lowbit True \
--open_outlier False \
--group_size 256 \
--is_dynamic True \
--trust_remote_code True \
--w_method HQQ
```

## Verify the quantized model

The converted model files look like this:

```bash
.
|-- config.json
|-- configuration.json
|-- generation_config.json
|-- merges.txt
|-- quant_model_description.json
|-- quant_model_weight_w4a8_dynamic-00001-of-00003.safetensors
|-- quant_model_weight_w4a8_dynamic-00002-of-00003.safetensors
|-- quant_model_weight_w4a8_dynamic-00003-of-00003.safetensors
|-- quant_model_weight_w4a8_dynamic.safetensors.index.json
|-- README.md
|-- tokenizer.json
`-- tokenizer_config.json
```

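If you prefer to check the output programmatically, the short sketch below verifies that every shard referenced by the safetensors index is present in the save directory. It only touches the filesystem and assumes the `SAVE_PATH` from the conversion step above and the standard Hugging Face `weight_map` layout of the index file.

```python
import json
from pathlib import Path

# Path passed as --save_directory during conversion (adjust to your setup)
SAVE_PATH = Path("/home/models/Qwen3-8B-w4a8")

# The index file is assumed to follow the usual Hugging Face layout,
# mapping each tensor name to the shard file that stores it.
index_file = SAVE_PATH / "quant_model_weight_w4a8_dynamic.safetensors.index.json"
index = json.loads(index_file.read_text())
shards = sorted(set(index["weight_map"].values()))

for shard in shards:
    assert (SAVE_PATH / shard).is_file(), f"missing shard: {shard}"
print(f"Found all {len(shards)} weight shards listed in {index_file.name}.")
```
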
Run the following script to start the vLLM server with the quantized model:

```bash
vllm serve /home/models/Qwen3-8B-w4a8 --served-model-name "qwen3-8b-w4a8" --max-model-len 4096 --quantization ascend
```

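Loading the weights can take a while, so it can help to wait until the server reports ready before sending requests. Below is a minimal polling sketch using only the Python standard library; it assumes the server started above is reachable on `localhost:8000` and exposes vLLM's `/health` endpoint.

```python
import time
import urllib.error
import urllib.request

# Poll the health endpoint until the server answers with HTTP 200
HEALTH_URL = "http://localhost:8000/health"

for _ in range(120):  # wait up to ~10 minutes
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            if resp.status == 200:
                print("Server is ready.")
                break
    except (urllib.error.URLError, OSError):
        pass  # server not up yet, retry
    time.sleep(5)
else:
    raise RuntimeError("Server did not become ready in time.")
```
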
Once your server is started, you can query the model with input prompts:

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-8b-w4a8",
    "prompt": "what is large language model?",
    "max_tokens": 128,
    "top_p": 0.95,
    "top_k": 40,
    "temperature": 0.0
  }'
```

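The same completion request can also be sent from Python through the OpenAI-compatible client, which is handy in scripts and notebooks. This is a sketch assuming the `openai` (v1.x) package is installed and the server above is listening on `localhost:8000`; vLLM-specific sampling parameters such as `top_k` are passed via `extra_body`.

```python
from openai import OpenAI

# The vLLM OpenAI-compatible server does not require a real API key by default
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="qwen3-8b-w4a8",
    prompt="what is large language model?",
    max_tokens=128,
    temperature=0.0,
    top_p=0.95,
    extra_body={"top_k": 40},  # vLLM-specific parameter, not in the OpenAI schema
)
print(completion.choices[0].text)
```
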
Run the following script to execute offline inference on a single NPU with the quantized model:

:::{note}
To enable quantization on Ascend, the quantization method must be set to "ascend".
:::

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)

llm = LLM(model="/home/models/Qwen3-8B-w4a8",
          max_model_len=4096,
          quantization="ascend")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
