-Please follow the [Installation Guide](https://vllm-ascend.readthedocs.io/en/latest/installation.html) to make sure vllm, vllm-ascend and mindie-turbo is installed correctly.
+Please follow the [Installation Guide](https://vllm-ascend.readthedocs.io/en/v0.9.1-dev/installation.html) to make sure vllm and vllm-ascend are installed correctly.
:::{note}
-Make sure your vllm and vllm-ascend are installed after your python configuration completed, because these packages will build binary files using the python in current environment. If you install vllm, vllm-ascend and mindie-turbo before chapter 1.1, the binary files will not use the optimized python.
+Make sure vllm and vllm-ascend are installed after your Python configuration is completed, because these packages build binary files using the Python in the current environment. If you install vllm and vllm-ascend before chapter 1.1, the binary files will not use the optimized Python.
-Find more details [<u>here</u>](https://vllm-ascend.readthedocs.io/en/latest/user_guide/support_matrix/supported_models.html).
+Find more details [<u>here</u>](https://vllm-ascend.readthedocs.io/en/v0.9.1-dev/user_guide/support_matrix/supported_models.html).
### 4. How to get in touch with our community?
@@ -48,7 +48,7 @@ There are many channels that you can communicate with our community developers /
### 5. What features does vllm-ascend V1 support?

-Find more details [<u>here</u>](https://vllm-ascend.readthedocs.io/en/latest/user_guide/support_matrix/supported_features.html).
+Find more details [<u>here</u>](https://vllm-ascend.readthedocs.io/en/v0.9.1-dev/user_guide/support_matrix/supported_features.html).
### 6. How to solve the problem of "Failed to infer device type" or "libatb.so: cannot open shared object file"?
@@ -69,43 +69,39 @@ If all above steps are not working, feel free to submit a GitHub issue.
### 7. How does vllm-ascend perform?
-Currently, only some models are improved. Such as `Qwen2.5 VL`, `Qwen3`, `Deepseek V3`. Others are not good enough. From 0.9.0rc2, Qwen and Deepseek works with graph mode to play a good performance. What's more, you can install `mindie-turbo` with `vllm-ascend v0.7.3` to speed up the inference as well.
+Currently, only some models, such as `Qwen2.5 VL`, `Qwen3` and `Deepseek V3`, are optimized; others do not perform as well yet. Starting from 0.9.0rc2, Qwen and Deepseek work with graph mode to achieve good performance.
### 8. How does vllm-ascend work with vllm?
-vllm-ascend is a plugin for vllm. Basically, the version of vllm-ascend is the same as the version of vllm. For example, if you use vllm 0.7.3, you should use vllm-ascend 0.7.3 as well. For main branch, we will make sure `vllm-ascend` and `vllm` are compatible by each commit.
+vllm-ascend is a plugin for vllm. Basically, the version of vllm-ascend is the same as the version of vllm. For example, if you use vllm 0.9.1, you should use vllm-ascend 0.9.1 as well. For the main branch, we make sure `vllm-ascend` and `vllm` are compatible on each commit.
### 9. Does vllm-ascend support the Prefill Disaggregation feature?
-Currently, only 1P1D is supported on V0 Engine. For V1 Engine or NPND support, We will make it stable and supported by vllm-ascend in the future.
+Yes, the Prefill Disaggregation feature is supported on the V1 Engine with NPND support.
### 10. Does vllm-ascend support quantization methods?
-Currently, w8a8 quantization is already supported by vllm-ascend originally on v0.8.4rc2 or higher, If you're using vllm 0.7.3 version, w8a8 quantization is supporeted with the integration of vllm-ascend and mindie-turbo, please use `pip install vllm-ascend[mindie-turbo]`.
+w8a8 and w4a8 quantization are natively supported by vllm-ascend on v0.8.4rc2 or higher.
### 11. How to run the w8a8 DeepSeek model?
-Please following the [inferencing tutorail](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node.html) and replace model to DeepSeek.
+Please follow the [inference tutorial](https://vllm-ascend.readthedocs.io/en/v0.9.1-dev/tutorials/multi_node.html) and replace the model with DeepSeek.
-### 12. There is no output in log when loading models using vllm-ascend, How to solve it?
-
-If you're using vllm 0.7.3 version, this is a known progress bar display issue in VLLM, which has been resolved in [this PR](https://github.com/vllm-project/vllm/pull/12428), please cherry-pick it locally by yourself. Otherwise, please fill up an issue.
-
-### 13. How vllm-ascend is tested
+### 12. How vllm-ascend is tested
vllm-ascend is tested by functional tests, performance tests and accuracy tests.
--**Functional test**: we added CI, includes portion of vllm's native unit tests and vllm-ascend's own unit tests,on vllm-ascend's test, we test basic functionality、popular models availability and [supported features](https://vllm-ascend.readthedocs.io/en/latest/user_guide/support_matrix/supported_features.html) via e2e test
+-**Functional test**: we added CI that includes a portion of vllm's native unit tests and vllm-ascend's own unit tests; in vllm-ascend's tests, we cover basic functionality, popular model availability and [supported features](https://vllm-ascend.readthedocs.io/en/v0.9.1-dev/user_guide/support_matrix/supported_features.html) via e2e tests.
-**Performance test**: we provide [benchmark](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks) tools for end-to-end performance benchmarks which can easily be re-run locally; we'll publish a perf website to show the performance test results for each pull request.
-**Accuracy test**: we're working on adding accuracy tests to CI as well.
-Finnall, for each release, we'll publish the performance test and accuracy test report in the future.
+Finally, we'll publish the performance test and accuracy test reports for each release in the future.
-### 14. How to fix the error "InvalidVersion" when using vllm-ascend?
+### 13. How to fix the error "InvalidVersion" when using vllm-ascend?
It's usually because you have installed a dev/editable version of the vLLM package. In this case, we provide the env variable `VLLM_VERSION` to let users specify the version of the vLLM package to use. Please set the env variable `VLLM_VERSION` to the version of the vLLM package you have installed. The format of `VLLM_VERSION` should be `X.Y.Z`.
-### 15. How to handle Out Of Memory?
+### 14. How to handle Out Of Memory?
OOM errors typically occur when the model exceeds the memory capacity of a single NPU. For general guidance, you can refer to [vLLM's OOM troubleshooting documentation](https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#out-of-memory).
In scenarios where NPUs have limited HBM (High Bandwidth Memory) capacity, dynamic memory allocation/deallocation during inference can exacerbate memory fragmentation, leading to OOM. To address this:
@@ -114,7 +110,7 @@ In scenarios where NPUs have limited HBM (High Bandwidth Memory) capacity, dynam
-**Configure `PYTORCH_NPU_ALLOC_CONF`**: Set this environment variable to optimize NPU memory management. For example, you can `export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True` to enable the virtual memory feature to mitigate memory fragmentation caused by frequent dynamic memory size adjustments during runtime; see more notes in [PYTORCH_NPU_ALLOC_CONF](https://www.hiascend.com/document/detail/zh/Pytorch/700/comref/Envvariables/Envir_012.html).
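Beyond the allocator setting above, a common mitigation is to shrink the engine's memory footprint itself. The following is a minimal, illustrative sketch only (the model name and values are placeholders, not recommendations from this guide) of the knobs vLLM exposes for this:

```python
from vllm import LLM

# Hypothetical values for illustration only: a lower gpu_memory_utilization,
# a shorter max_model_len and fewer concurrent sequences all reduce the
# amount of HBM the engine tries to reserve on each NPU.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",   # placeholder model
    gpu_memory_utilization=0.85,        # leave more headroom than the default
    max_model_len=8192,                 # cap context length to shrink the KV cache
    max_num_seqs=64,                    # limit concurrent sequences
    tensor_parallel_size=2,             # spread weights across more NPUs if available
)
```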
-### 16. Failed to enable NPU graph mode when running DeepSeek?
+### 15. Failed to enable NPU graph mode when running DeepSeek?
You may encounter the following error when running DeepSeek with NPU graph mode enabled. The allowed number of queries per kv when enabling both MLA and Graph mode only supports {32, 64, 128}, **thus this is not supported for DeepSeek-V2-Lite**, as it only has 16 attention heads. NPU graph mode support for DeepSeek-V2-Lite will be added in the future.
And if you're using DeepSeek-V3 or DeepSeek-R1, please make sure that after the tensor parallel split, num_heads / num_kv_heads is in {32, 64, 128}.
@@ -124,15 +120,18 @@ And if you're using DeepSeek-V3 or DeepSeek-R1, please make sure after the tenso
```
[rank0]: EZ9999: [PID: 62938] 2025-05-27-06:52:12.455.807 numHeads / numKvHeads = 8, MLA only support {32, 64, 128}.[FUNC:CheckMlaAttrs][FILE:incre_flash_attention_tiling_check.cc][LINE:1218]
```
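As a quick sanity check before enabling graph mode, you can compute the per-rank ratio yourself. This is an illustrative sketch only; the head counts below are placeholders, so read the real values from your model's config:

```python
# Illustrative helper: check whether num_heads / num_kv_heads per rank after the
# tensor-parallel split falls into the set supported by MLA + graph mode.
SUPPORTED_RATIOS = {32, 64, 128}

def mla_graph_mode_ok(num_attention_heads: int, num_kv_heads: int, tp_size: int) -> bool:
    heads_per_rank = num_attention_heads // tp_size
    kv_heads_per_rank = max(num_kv_heads // tp_size, 1)
    return heads_per_rank // kv_heads_per_rank in SUPPORTED_RATIOS

# Placeholder numbers, not taken from any official model config:
print(mla_graph_mode_ok(num_attention_heads=128, num_kv_heads=1, tp_size=2))  # True  (ratio 64)
print(mla_graph_mode_ok(num_attention_heads=16, num_kv_heads=1, tp_size=2))   # False (ratio 8)
```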
-### 17. Failed to reinstall vllm-ascend from source after uninstalling vllm-ascend?
+### 16. Failed to reinstall vllm-ascend from source after uninstalling vllm-ascend?
You may encounter the problem of C compilation failure when reinstalling vllm-ascend from source using pip. If the installation fails, it is recommended to use `python setup.py install` to install, or use `python setup.py clean` to clear the cache.
-### 18. How to generate determinitic results when using vllm-ascend?
+### 17. How to generate deterministic results when using vllm-ascend?
There are several factors that affect output certainty:
1. Sampler Method: using **Greedy sample** by setting `temperature=0` in `SamplingParams`, e.g.:
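   A minimal sketch of such a greedy-sampling setup (the model name and prompt are placeholders):

   ```python
   from vllm import LLM, SamplingParams

   prompts = ["Hello, my name is"]                   # placeholder prompt
   sampling_params = SamplingParams(temperature=0)   # greedy sampling: always pick the top token
   llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")       # placeholder model
   outputs = llm.generate(prompts, sampling_params)
   for output in outputs:
       print(output.outputs[0].text)
   ```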
-### 19. How to fix the error "ImportError: Please install vllm[audio] for audio support" for Qwen2.5-Omni model?
+### 18. How to fix the error "ImportError: Please install vllm[audio] for audio support" for the Qwen2.5-Omni model?
The `Qwen2.5-Omni` model requires the `librosa` package to be installed. You need to install the `qwen-omni-utils` package to ensure all dependencies are met (`pip install qwen-omni-utils`); this package will install `librosa` and its related dependencies, resolving the `ImportError: No module named 'librosa'` issue and ensuring audio processing functionality works correctly.
-### 20. Failed to run with `ray` distributed backend?
+### 19. Failed to run with `ray` distributed backend?
You might face the following errors when running with the ray backend in distributed scenarios:
@@ -185,7 +184,7 @@ This has been solved in `ray>=2.47.1`, thus we could solve this as following:
docs/source/installation.md: 4 additions & 1 deletion
@@ -214,7 +214,7 @@ docker run --rm \
```
    -it $IMAGE bash
```
-The default workdir is `/workspace`, vLLM and vLLM Ascend code are placed in `/vllm-workspace` and installed in [development mode](https://setuptools.pypa.io/en/latest/userguide/development_mode.html)(`pip install -e`) to help developer immediately take place changes without requiring a new installation.
+The default workdir is `/workspace`; vLLM and vLLM Ascend code are placed in `/vllm-workspace` and installed in [development mode](https://setuptools.pypa.io/en/v0.9.1-dev/userguide/development_mode.html) (`pip install -e`) so that developers can immediately pick up changes without requiring a new installation.
::::
:::::
@@ -226,6 +226,9 @@ The default workdir is `/workspace`, vLLM and vLLM Ascend code are placed in `/v
Create and run a simple inference test. The `example.py` can be like:
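A minimal sketch of what `example.py` could look like (the model name is a placeholder; substitute any model you have access to):

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
# Placeholder sampling settings for a quick smoke test.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # placeholder model

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated: {output.outputs[0].text!r}")
```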
-The default workdir is `/workspace`, vLLM and vLLM Ascend code are placed in `/vllm-workspace` and installed in [development mode](https://setuptools.pypa.io/en/latest/userguide/development_mode.html)(`pip install -e`) to help developer immediately take place changes without requiring a new installation.
+The default workdir is `/workspace`; vLLM and vLLM Ascend code are placed in `/vllm-workspace` and installed in [development mode](https://setuptools.pypa.io/en/v0.9.1-dev/userguide/development_mode.html) (`pip install -e`) so that developers can immediately pick up changes without requiring a new installation.
## Usage
@@ -92,6 +92,9 @@ Try to run below Python script directly or use `python3` shell to generate texts
<!-- tests/e2e/doctest/001-quickstart-test.sh should be considered updating as well -->
# The w8a8 weight can be obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3-W8A8
-# If you want to the quantization manually, please refer to https://vllm-ascend.readthedocs.io/en/latest/user_guide/feature_guide/quantization.html
+# If you want to do the quantization manually, please refer to https://vllm-ascend.readthedocs.io/en/v0.9.1-dev/user_guide/feature_guide/quantization.html
@@ -70,10 +67,6 @@ The converted model files looks like:
Run the following script to start the vLLM server with the quantized model:
-:::{note}
-The value "ascend" for "--quantization" argument will be supported after [a specific PR](https://github.com/vllm-project/vllm-ascend/pull/877) is merged and released, you can cherry-pick this commit for now.
@@ -44,7 +47,7 @@ Run the following script to start the vLLM server on Multi-NPU:
For an Atlas A2 with 64GB of NPU card memory, tensor-parallel-size should be at least 2, and for 32GB of memory, tensor-parallel-size should be at least 4.
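For example, a minimal offline-inference sketch that follows this sizing guidance might look like the following (the model path and values are placeholders, not part of the tutorial itself):

```python
from vllm import LLM, SamplingParams

# Placeholder example: on 32GB Atlas A2 cards, spread the quantized model
# across at least 4 NPUs as suggested above.
llm = LLM(
    model="/path/to/DeepSeek-V3-W8A8",   # placeholder path to the converted w8a8 weights
    quantization="ascend",               # Ascend quantization backend, per this tutorial
    tensor_parallel_size=4,
    max_model_len=8192,                  # placeholder context length
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(temperature=0, max_tokens=32))
print(outputs[0].outputs[0].text)
```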