
Commit 8d6e05b

[0.9.1][doc]Update doc for 0.9.1 (#2648)
Refresh doc for 0.9.1 release Signed-off-by: wangxiyuan <[email protected]>
1 parent 40c2c05 commit 8d6e05b

17 files changed: +74 −49 lines

docs/source/developer_guide/performance/optimization_and_tuning.md

Lines changed: 2 additions & 2 deletions
@@ -57,10 +57,10 @@ pip install modelscope pandas datasets gevent sacrebleu rouge_score pybind11 pyt
  VLLM_USE_MODELSCOPE=true
  ```

- Please follow the [Installation Guide](https://vllm-ascend.readthedocs.io/en/latest/installation.html) to make sure vllm, vllm-ascend and mindie-turbo is installed correctly.
+ Please follow the [Installation Guide](https://vllm-ascend.readthedocs.io/en/v0.9.1-dev/installation.html) to make sure vllm and vllm-ascend are installed correctly.

  :::{note}
- Make sure your vllm and vllm-ascend are installed after your python configuration completed, because these packages will build binary files using the python in current environment. If you install vllm, vllm-ascend and mindie-turbo before chapter 1.1, the binary files will not use the optimized python.
+ Make sure vllm and vllm-ascend are installed after your Python configuration is completed, because these packages build binary files using the Python in the current environment. If you install vllm and vllm-ascend before chapter 1.1, the binary files will not use the optimized Python.
  :::

  ## Optimizations

docs/source/faqs.md

Lines changed: 24 additions & 25 deletions
@@ -2,8 +2,7 @@

  ## Version Specific FAQs

- - [[v0.7.3.post1] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/1007)
- - [[v0.9.1rc3] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/2410)
+ - [[v0.9.1] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/2643)

  ## General FAQs

@@ -12,6 +11,7 @@
  Currently, **ONLY Atlas A2 series** (Ascend-cann-kernels-910b) are supported:

  - Atlas A2 Training series (Atlas 800T A2, Atlas 900 A2 PoD, Atlas 200T A2 Box16, Atlas 300T A2)
+ - Atlas A3 Training series
  - Atlas 800I A2 Inference series (Atlas 800I A2)

  Below series are NOT supported yet:
@@ -29,13 +29,13 @@ If you are in China, you can use `daocloud` to accelerate your downloading:

  ```bash
  # Replace with tag you want to pull
- TAG=v0.7.3rc2
+ TAG=v0.9.1
  docker pull m.daocloud.io/quay.io/ascend/vllm-ascend:$TAG
  ```

  ### 3. What models does vllm-ascend supports?

- Find more details [<u>here</u>](https://vllm-ascend.readthedocs.io/en/latest/user_guide/support_matrix/supported_models.html).
+ Find more details [<u>here</u>](https://vllm-ascend.readthedocs.io/en/v0.9.1-dev/user_guide/support_matrix/supported_models.html).

  ### 4. How to get in touch with our community?

@@ -48,7 +48,7 @@ There are many channels that you can communicate with our community developers /

  ### 5. What features does vllm-ascend V1 supports?

- Find more details [<u>here</u>](https://vllm-ascend.readthedocs.io/en/latest/user_guide/support_matrix/supported_features.html).
+ Find more details [<u>here</u>](https://vllm-ascend.readthedocs.io/en/v0.9.1-dev/user_guide/support_matrix/supported_features.html).

  ### 6. How to solve the problem of "Failed to infer device type" or "libatb.so: cannot open shared object file"?

@@ -69,43 +69,39 @@ If all above steps are not working, feel free to submit a GitHub issue.

  ### 7. How does vllm-ascend perform?

- Currently, only some models are improved. Such as `Qwen2.5 VL`, `Qwen3`, `Deepseek V3`. Others are not good enough. From 0.9.0rc2, Qwen and Deepseek works with graph mode to play a good performance. What's more, you can install `mindie-turbo` with `vllm-ascend v0.7.3` to speed up the inference as well.
+ Currently, only some models are optimized, such as `Qwen2.5 VL`, `Qwen3` and `Deepseek V3`; others do not perform as well yet. Since 0.9.0rc2, Qwen and Deepseek work with graph mode to achieve good performance.

  ### 8. How vllm-ascend work with vllm?
- vllm-ascend is a plugin for vllm. Basically, the version of vllm-ascend is the same as the version of vllm. For example, if you use vllm 0.7.3, you should use vllm-ascend 0.7.3 as well. For main branch, we will make sure `vllm-ascend` and `vllm` are compatible by each commit.
+ vllm-ascend is a plugin for vllm. Basically, the version of vllm-ascend matches the version of vllm. For example, if you use vllm 0.9.1, you should use vllm-ascend 0.9.1 as well. For the main branch, we make sure `vllm-ascend` and `vllm` are compatible on each commit.

  ### 9. Does vllm-ascend support Prefill Disaggregation feature?

- Currently, only 1P1D is supported on V0 Engine. For V1 Engine or NPND support, We will make it stable and supported by vllm-ascend in the future.
+ Yes, the Prefill Disaggregation feature is supported on the V1 Engine, including NPND deployments.

  ### 10. Does vllm-ascend support quantization method?

- Currently, w8a8 quantization is already supported by vllm-ascend originally on v0.8.4rc2 or higher, If you're using vllm 0.7.3 version, w8a8 quantization is supporeted with the integration of vllm-ascend and mindie-turbo, please use `pip install vllm-ascend[mindie-turbo]`.
+ w8a8 and w4a8 quantization are natively supported by vllm-ascend on v0.8.4rc2 or higher.

  ### 11. How to run w8a8 DeepSeek model?

- Please following the [inferencing tutorail](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node.html) and replace model to DeepSeek.
+ Please follow the [inference tutorial](https://vllm-ascend.readthedocs.io/en/v0.9.1-dev/tutorials/multi_node.html) and replace the model with DeepSeek.

- ### 12. There is no output in log when loading models using vllm-ascend, How to solve it?
-
- If you're using vllm 0.7.3 version, this is a known progress bar display issue in VLLM, which has been resolved in [this PR](https://github.com/vllm-project/vllm/pull/12428), please cherry-pick it locally by yourself. Otherwise, please fill up an issue.
-
- ### 13. How vllm-ascend is tested
+ ### 12. How vllm-ascend is tested

  vllm-ascend is tested by functional test, performance test and accuracy test.

- - **Functional test**: we added CI, includes portion of vllm's native unit tests and vllm-ascend's own unit tests,on vllm-ascend's test, we test basic functionality、popular models availability and [supported features](https://vllm-ascend.readthedocs.io/en/latest/user_guide/support_matrix/supported_features.html) via e2e test
+ - **Functional test**: we added CI that includes a portion of vllm's native unit tests and vllm-ascend's own unit tests; on the vllm-ascend side, we test basic functionality, popular model availability and [supported features](https://vllm-ascend.readthedocs.io/en/v0.9.1-dev/user_guide/support_matrix/supported_features.html) via e2e tests

  - **Performance test**: we provide [benchmark](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks) tools for end-to-end performance benchmark which can easily to re-route locally, we'll publish a perf website to show the performance test results for each pull request

  - **Accuracy test**: we're working on adding accuracy test to CI as well.

- Finnall, for each release, we'll publish the performance test and accuracy test report in the future.
+ Finally, we plan to publish the performance test and accuracy test reports for each release.

- ### 14. How to fix the error "InvalidVersion" when using vllm-ascend?
+ ### 13. How to fix the error "InvalidVersion" when using vllm-ascend?
  It's usually because you have installed an dev/editable version of vLLM package. In this case, we provide the env variable `VLLM_VERSION` to let users specify the version of vLLM package to use. Please set the env variable `VLLM_VERSION` to the version of vLLM package you have installed. The format of `VLLM_VERSION` should be `X.Y.Z`.

- ### 15. How to handle Out Of Memory?
+ ### 14. How to handle Out Of Memory?
  OOM errors typically occur when the model exceeds the memory capacity of a single NPU. For general guidance, you can refer to [vLLM's OOM troubleshooting documentation](https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#out-of-memory).

  In scenarios where NPUs have limited HBM (High Bandwidth Memory) capacity, dynamic memory allocation/deallocation during inference can exacerbate memory fragmentation, leading to OOM. To address this:
@@ -114,7 +110,7 @@ In scenarios where NPUs have limited HBM (High Bandwidth Memory) capacity, dynam

  - **Configure `PYTORCH_NPU_ALLOC_CONF`**: Set this environment variable to optimize NPU memory management. For example, you can `export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True` to enable virtual memory feature to mitigate memory fragmentation caused by frequent dynamic memory size adjustments during runtime, see more note in: [PYTORCH_NPU_ALLOC_CONF](https://www.hiascend.com/document/detail/zh/Pytorch/700/comref/Envvariables/Envir_012.html).
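For reference, a minimal sketch of how these OOM mitigations could be combined in an offline run; the model name and the specific values are illustrative assumptions, not part of this commit:

```python
import os

# Enable expandable segments before torch / vllm are imported (see the bullet above).
os.environ["PYTORCH_NPU_ALLOC_CONF"] = "expandable_segments:True"

from vllm import LLM

# Leave some headroom on the NPU and cap the context length; both values are
# placeholders and should be tuned for the actual model and hardware.
llm = LLM(
    model="Qwen/Qwen3-8B",
    gpu_memory_utilization=0.85,
    max_model_len=8192,
)
```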

- ### 16. Failed to enable NPU graph mode when running DeepSeek?
+ ### 15. Failed to enable NPU graph mode when running DeepSeek?
  You may encounter the following error if running DeepSeek with NPU graph mode enabled. The allowed number of queries per kv when enabling both MLA and Graph mode only support {32, 64, 128}, **Thus this is not supported for DeepSeek-V2-Lite**, as it only has 16 attention heads. The NPU graph mode support on DeepSeek-V2-Lite will be done in the future.

  And if you're using DeepSeek-V3 or DeepSeek-R1, please make sure after the tensor parallel split, num_heads / num_kv_heads in {32, 64, 128}.
@@ -124,15 +120,18 @@ And if you're using DeepSeek-V3 or DeepSeek-R1, please make sure after the tenso
  [rank0]: EZ9999: [PID: 62938] 2025-05-27-06:52:12.455.807 numHeads / numKvHeads = 8, MLA only support {32, 64, 128}.[FUNC:CheckMlaAttrs][FILE:incre_flash_attention_tiling_check.cc][LINE:1218]
  ```

- ### 17. Failed to reinstall vllm-ascend from source after uninstalling vllm-ascend?
+ ### 16. Failed to reinstall vllm-ascend from source after uninstalling vllm-ascend?
  You may encounter the problem of C compilation failure when reinstalling vllm-ascend from source using pip. If the installation fails, it is recommended to use `python setup.py install` to install, or use `python setup.py clean` to clear the cache.

- ### 18. How to generate determinitic results when using vllm-ascend?
+ ### 17. How to generate deterministic results when using vllm-ascend?
  There are several factors that affect output certainty:

  1. Sampler Method: using **Greedy sample** by setting `temperature=0` in `SamplingParams`, e.g.:

  ```python
+ import os
+ os.environ["VLLM_USE_V1"] = "1"
+
  from vllm import LLM, SamplingParams

  prompts = [
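The hunk above cuts the snippet off at `prompts = [`; for reference, a complete, self-contained version of this greedy-sampling setup might look like the following (the prompts and the model name are illustrative assumptions, not taken from the original file):

```python
import os
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]

# temperature=0 selects greedy sampling, which removes sampling randomness.
sampling_params = SamplingParams(temperature=0, max_tokens=64)

# The model below is only a placeholder; use whichever model the tutorial targets.
llm = LLM(model="Qwen/Qwen3-8B")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated: {output.outputs[0].text!r}")
```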
@@ -164,11 +163,11 @@ export ATB_MATMUL_SHUFFLE_K_ENABLE=0
  export ATB_LLM_LCOC_ENABLE=0
  ```

- ### 19. How to fix the error "ImportError: Please install vllm[audio] for audio support" for Qwen2.5-Omni model?
+ ### 18. How to fix the error "ImportError: Please install vllm[audio] for audio support" for Qwen2.5-Omni model?
  The `Qwen2.5-Omni` model requires the `librosa` package to be installed, you need to install the `qwen-omni-utils` package to ensure all dependencies are met `pip install qwen-omni-utils`,
  this package will install `librosa` and its related dependencies, resolving the `ImportError: No module named 'librosa'` issue and ensuring audio processing functionality works correctly.

- ### 20. Failed to run with `ray` distributed backend?
+ ### 19. Failed to run with `ray` distributed backend?
  You might facing the following errors when running with ray backend in distributed scenarios:

  ```
@@ -185,7 +184,7 @@ This has been solved in `ray>=2.47.1`, thus we could solve this as following:
  python3 -m pip install modelscope 'ray>=2.47.1' 'protobuf>3.20.0'
  ```

- ### 21. Failed with inferencing Qwen3 MoE due to `Alloc sq cq fail` issue?
+ ### 20. Failed with inferencing Qwen3 MoE due to `Alloc sq cq fail` issue?

  When running Qwen3 MoE with tp/dp/ep, etc., you may encounter an error shown in [#2629](https://github.com/vllm-project/vllm-ascend/issues/2629).

docs/source/installation.md

Lines changed: 4 additions & 1 deletion
@@ -214,7 +214,7 @@ docker run --rm \
  -it $IMAGE bash
  ```

- The default workdir is `/workspace`, vLLM and vLLM Ascend code are placed in `/vllm-workspace` and installed in [development mode](https://setuptools.pypa.io/en/latest/userguide/development_mode.html)(`pip install -e`) to help developer immediately take place changes without requiring a new installation.
+ The default workdir is `/workspace`; vLLM and vLLM Ascend code are placed in `/vllm-workspace` and installed in [development mode](https://setuptools.pypa.io/en/v0.9.1-dev/userguide/development_mode.html) (`pip install -e`) so that developers can pick up changes immediately without requiring a new installation.
  ::::

  :::::
@@ -226,6 +226,9 @@ The default workdir is `/workspace`, vLLM and vLLM Ascend code are placed in `/v
  Create and run a simple inference test. The `example.py` can be like:

  ```python
+ import os
+ os.environ["VLLM_USE_V1"] = "1"
+
  from vllm import LLM, SamplingParams

  prompts = [

docs/source/quick_start.md

Lines changed: 4 additions & 1 deletion
@@ -68,7 +68,7 @@ yum update -y && yum install -y curl
  ::::
  :::::

- The default workdir is `/workspace`, vLLM and vLLM Ascend code are placed in `/vllm-workspace` and installed in [development mode](https://setuptools.pypa.io/en/latest/userguide/development_mode.html)(`pip install -e`) to help developer immediately take place changes without requiring a new installation.
+ The default workdir is `/workspace`; vLLM and vLLM Ascend code are placed in `/vllm-workspace` and installed in [development mode](https://setuptools.pypa.io/en/v0.9.1-dev/userguide/development_mode.html) (`pip install -e`) so that developers can pick up changes immediately without requiring a new installation.

  ## Usage

@@ -92,6 +92,9 @@ Try to run below Python script directly or use `python3` shell to generate texts
  <!-- tests/e2e/doctest/001-quickstart-test.sh should be considered updating as well -->

  ```python
+ import os
+ os.environ["VLLM_USE_V1"] = "1"
+
  from vllm import LLM, SamplingParams

  prompts = [

docs/source/tutorials/multi_node.md

Lines changed: 2 additions & 1 deletion
@@ -87,6 +87,7 @@ docker run --rm \
  -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
  -v /etc/ascend_install.info:/etc/ascend_install.info \
  -v /mnt/sfs_turbo/.cache:/root/.cache \
+ -e VLLM_USE_V1=1 \
  -it $IMAGE bash
  ```

@@ -115,7 +116,7 @@ export OMP_NUM_THREADS=100
  export HCCL_BUFFSIZE=1024

  # The w8a8 weight can obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3-W8A8
- # If you want to the quantization manually, please refer to https://vllm-ascend.readthedocs.io/en/latest/user_guide/feature_guide/quantization.html
+ # If you want to do the quantization manually, please refer to https://vllm-ascend.readthedocs.io/en/v0.9.1-dev/user_guide/feature_guide/quantization.html
  vllm serve /root/.cache/ds_v3 \
  --host 0.0.0.0 \
  --port 8004 \
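The hunk above truncates the `vllm serve` command. Once the multi-node server is up, a minimal hypothetical client check against its OpenAI-compatible API could look like the sketch below (host, port and model path are assumptions taken from the truncated command, not part of this commit):

```python
import requests

# Query the OpenAI-compatible completions endpoint exposed by `vllm serve`.
resp = requests.post(
    "http://127.0.0.1:8004/v1/completions",
    json={
        "model": "/root/.cache/ds_v3",   # must match the model path/name passed to `vllm serve`
        "prompt": "The capital of France is",
        "max_tokens": 32,
        "temperature": 0,
    },
    timeout=600,
)
print(resp.json()["choices"][0]["text"])
```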

docs/source/tutorials/multi_npu.md

Lines changed: 3 additions & 0 deletions
@@ -35,6 +35,9 @@ export VLLM_USE_MODELSCOPE=True

  # Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
  export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
+
+ # Enable V1 Engine
+ export VLLM_USE_V1=1
  ```

  ### Online Inference on Multi-NPU
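For reference, the environment variables added in the hunk above can also be set from Python for an offline multi-NPU run; the sketch below is illustrative only (the model and `tensor_parallel_size` are assumptions, not part of the tutorial):

```python
import os

# Mirror the exported settings from the tutorial's shell snippet.
os.environ["VLLM_USE_MODELSCOPE"] = "True"
os.environ["PYTORCH_NPU_ALLOC_CONF"] = "max_split_size_mb:256"
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM, SamplingParams

# tensor_parallel_size shards the model across NPUs; pick a value matching your hardware.
llm = LLM(model="Qwen/Qwen3-8B", tensor_parallel_size=2)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```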

docs/source/tutorials/multi_npu_quantization.md

Lines changed: 1 addition & 8 deletions
@@ -1,10 +1,6 @@
  # Multi-NPU (QwQ 32B W8A8)

  ## Run docker container
- :::{note}
- w8a8 quantization feature is supported by v0.8.4rc2 or higher
- :::
-
  ```{code-block} bash
  :substitutions:
  # Update the vllm-ascend image
@@ -24,6 +20,7 @@ docker run --rm \
  -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
  -v /etc/ascend_install.info:/etc/ascend_install.info \
  -v /root/.cache:/root/.cache \
+ -e VLLM_USE_V1=1 \
  -p 8000:8000 \
  -it $IMAGE bash
  ```
@@ -70,10 +67,6 @@ The converted model files looks like:

  Run the following script to start the vLLM server with quantized model:

- :::{note}
- The value "ascend" for "--quantization" argument will be supported after [a specific PR](https://github.com/vllm-project/vllm-ascend/pull/877) is merged and released, you can cherry-pick this commit for now.
- :::
-
  ```bash
  vllm serve /home/models/QwQ-32B-w8a8 --tensor-parallel-size 4 --served-model-name "qwq-32b-w8a8" --max-model-len 4096 --quantization ascend
  ```
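Beyond the `vllm serve` command shown in the hunk, the same quantized checkpoint can also be loaded offline; this is a hedged sketch rather than part of the tutorial, and it assumes the converted w8a8 weights sit at the path used above:

```python
from vllm import LLM, SamplingParams

# Load the converted w8a8 checkpoint with the Ascend quantization method,
# mirroring the `--quantization ascend` flag of the serve command above.
llm = LLM(
    model="/home/models/QwQ-32B-w8a8",
    tensor_parallel_size=4,
    max_model_len=4096,
    quantization="ascend",
)

outputs = llm.generate(["What is deep learning?"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```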

docs/source/tutorials/multi_npu_qwen3_moe.md

Lines changed: 4 additions & 1 deletion
@@ -35,6 +35,9 @@ export VLLM_USE_MODELSCOPE=True

  # Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
  export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
+
+ # Enable V1 Engine
+ export VLLM_USE_V1=1
  ```

  ### Online Inference on Multi-NPU
@@ -44,7 +47,7 @@ Run the following script to start the vLLM server on Multi-NPU:
  For an Atlas A2 with 64GB of NPU card memory, tensor-parallel-size should be at least 2, and for 32GB of memory, tensor-parallel-size should be at least 4.

  ```bash
- vllm serve Qwen/Qwen3-30B-A3B --tensor-parallel-size 4 --enable_expert_parallel
+ vllm serve Qwen/Qwen3-30B-A3B --tensor-parallel-size 4
  ```

  Once your server is started, you can query the model with input prompts
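For example, a minimal hypothetical client using the `openai` package might look like this (the base URL assumes the default `vllm serve` port 8000 on the same host; this snippet is not part of the commit):

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server does not check the API key by default.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B",
    messages=[{"role": "user", "content": "Give me a one-sentence summary of MoE models."}],
    max_tokens=64,
)
print(completion.choices[0].message.content)
```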

docs/source/tutorials/single_npu.md

Lines changed: 6 additions & 0 deletions
@@ -48,6 +48,8 @@ Run the following script to execute offline inference on a single NPU:
  ```{code-block} python
  :substitutions:
  import os
+ os.environ["VLLM_USE_V1"] = "1"
+
  from vllm import LLM, SamplingParams

  prompts = [
@@ -74,6 +76,8 @@ for output in outputs:
  ```{code-block} python
  :substitutions:
  import os
+ os.environ["VLLM_USE_V1"] = "1"
+
  from vllm import LLM, SamplingParams

  prompts = [
@@ -130,6 +134,7 @@ docker run --rm \
  -p 8000:8000 \
  -e VLLM_USE_MODELSCOPE=True \
  -e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
+ -e VLLM_USE_V1=1 \
  -it $IMAGE \
  vllm serve Qwen/Qwen3-8B --max_model_len 26240
  ```
@@ -156,6 +161,7 @@ docker run --rm \
  -p 8000:8000 \
  -e VLLM_USE_MODELSCOPE=True \
  -e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
+ -e VLLM_USE_V1=1 \
  -it $IMAGE \
  vllm serve Qwen/Qwen3-8B --max_model_len 26240 --enforce-eager
  ```

docs/source/tutorials/single_npu_multimodal.md

Lines changed: 4 additions & 0 deletions
@@ -47,6 +47,9 @@ pip install torchvision==0.20.1 qwen_vl_utils --extra-index-url https://download
  ```

  ```python
+ import os
+ os.environ["VLLM_USE_V1"] = "1"
+
  from transformers import AutoProcessor
  from vllm import LLM, SamplingParams
  from qwen_vl_utils import process_vision_info
@@ -141,6 +144,7 @@ docker run --rm \
  -p 8000:8000 \
  -e VLLM_USE_MODELSCOPE=True \
  -e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
+ -e VLLM_USE_V1=1 \
  -it $IMAGE \
  vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
  --dtype bfloat16 \
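The serve command in the last hunk is truncated here; once the Qwen2.5-VL server is listening on port 8000, a hypothetical multimodal chat request could be sent like this (the image URL is a placeholder assumption):

```python
import requests

# Send one image plus a text question to the OpenAI-compatible chat endpoint.
resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-VL-7B-Instruct",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
                {"type": "text", "text": "What is in this image?"},
            ],
        }],
        "max_tokens": 64,
    },
    timeout=600,
)
print(resp.json()["choices"][0]["message"]["content"])
```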
