
Commit 8d6e05b

[0.9.1][doc]Update doc for 0.9.1 (#2648)
Refresh doc for 0.9.1 release Signed-off-by: wangxiyuan <[email protected]>
1 parent 40c2c05 commit 8d6e05b

17 files changed: +74 −49 lines

docs/source/developer_guide/performance/optimization_and_tuning.md

Lines changed: 2 additions & 2 deletions
@@ -57,10 +57,10 @@ pip install modelscope pandas datasets gevent sacrebleu rouge_score pybind11 pyt
  VLLM_USE_MODELSCOPE=true
  ```

- Please follow the [Installation Guide](https://vllm-ascend.readthedocs.io/en/latest/installation.html) to make sure vllm, vllm-ascend and mindie-turbo is installed correctly.
+ Please follow the [Installation Guide](https://vllm-ascend.readthedocs.io/en/v0.9.1-dev/installation.html) to make sure vllm and vllm-ascend are installed correctly.

  :::{note}
- Make sure your vllm and vllm-ascend are installed after your python configuration completed, because these packages will build binary files using the python in current environment. If you install vllm, vllm-ascend and mindie-turbo before chapter 1.1, the binary files will not use the optimized python.
+ Make sure vllm and vllm-ascend are installed after your Python configuration is completed, because these packages build binary files using the Python in the current environment. If you install vllm and vllm-ascend before chapter 1.1, the binary files will not use the optimized Python.
  :::

  ## Optimizations

docs/source/faqs.md

Lines changed: 24 additions & 25 deletions
@@ -2,8 +2,7 @@

  ## Version Specific FAQs

- - [[v0.7.3.post1] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/1007)
- - [[v0.9.1rc3] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/2410)
+ - [[v0.9.1] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/2643)

  ## General FAQs

@@ -12,6 +11,7 @@
  Currently, **ONLY Atlas A2 series** (Ascend-cann-kernels-910b) are supported:

  - Atlas A2 Training series (Atlas 800T A2, Atlas 900 A2 PoD, Atlas 200T A2 Box16, Atlas 300T A2)
+ - Atlas A3 Training series
  - Atlas 800I A2 Inference series (Atlas 800I A2)

  Below series are NOT supported yet:
@@ -29,13 +29,13 @@ If you are in China, you can use `daocloud` to accelerate your downloading:

  ```bash
  # Replace with tag you want to pull
- TAG=v0.7.3rc2
+ TAG=v0.9.1
  docker pull m.daocloud.io/quay.io/ascend/vllm-ascend:$TAG
  ```

  ### 3. What models does vllm-ascend supports?

- Find more details [<u>here</u>](https://vllm-ascend.readthedocs.io/en/latest/user_guide/support_matrix/supported_models.html).
+ Find more details [<u>here</u>](https://vllm-ascend.readthedocs.io/en/v0.9.1-dev/user_guide/support_matrix/supported_models.html).

  ### 4. How to get in touch with our community?

@@ -48,7 +48,7 @@ There are many channels that you can communicate with our community developers /

  ### 5. What features does vllm-ascend V1 supports?

- Find more details [<u>here</u>](https://vllm-ascend.readthedocs.io/en/latest/user_guide/support_matrix/supported_features.html).
+ Find more details [<u>here</u>](https://vllm-ascend.readthedocs.io/en/v0.9.1-dev/user_guide/support_matrix/supported_features.html).

  ### 6. How to solve the problem of "Failed to infer device type" or "libatb.so: cannot open shared object file"?

@@ -69,43 +69,39 @@ If all above steps are not working, feel free to submit a GitHub issue.

  ### 7. How does vllm-ascend perform?

- Currently, only some models are improved. Such as `Qwen2.5 VL`, `Qwen3`, `Deepseek V3`. Others are not good enough. From 0.9.0rc2, Qwen and Deepseek works with graph mode to play a good performance. What's more, you can install `mindie-turbo` with `vllm-ascend v0.7.3` to speed up the inference as well.
+ Currently, only some models are optimized, such as `Qwen2.5 VL`, `Qwen3` and `Deepseek V3`; others do not perform as well yet. Since 0.9.0rc2, Qwen and Deepseek work with graph mode to achieve good performance.

  ### 8. How vllm-ascend work with vllm?
- vllm-ascend is a plugin for vllm. Basically, the version of vllm-ascend is the same as the version of vllm. For example, if you use vllm 0.7.3, you should use vllm-ascend 0.7.3 as well. For main branch, we will make sure `vllm-ascend` and `vllm` are compatible by each commit.
+ vllm-ascend is a plugin for vllm. Basically, the version of vllm-ascend matches the version of vllm. For example, if you use vllm 0.9.1, you should use vllm-ascend 0.9.1 as well. For the main branch, we make sure `vllm-ascend` and `vllm` are compatible on each commit.

  ### 9. Does vllm-ascend support Prefill Disaggregation feature?

- Currently, only 1P1D is supported on V0 Engine. For V1 Engine or NPND support, We will make it stable and supported by vllm-ascend in the future.
+ Yes, the Prefill Disaggregation feature is supported on the V1 Engine, including NPND deployments.

  ### 10. Does vllm-ascend support quantization method?

- Currently, w8a8 quantization is already supported by vllm-ascend originally on v0.8.4rc2 or higher, If you're using vllm 0.7.3 version, w8a8 quantization is supporeted with the integration of vllm-ascend and mindie-turbo, please use `pip install vllm-ascend[mindie-turbo]`.
+ w8a8 and w4a8 quantization are natively supported by vllm-ascend on v0.8.4rc2 or higher.

  ### 11. How to run w8a8 DeepSeek model?

- Please following the [inferencing tutorail](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node.html) and replace model to DeepSeek.
+ Please follow the [inference tutorial](https://vllm-ascend.readthedocs.io/en/v0.9.1-dev/tutorials/multi_node.html) and replace the model with DeepSeek.

- ### 12. There is no output in log when loading models using vllm-ascend, How to solve it?
-
- If you're using vllm 0.7.3 version, this is a known progress bar display issue in VLLM, which has been resolved in [this PR](https://github.com/vllm-project/vllm/pull/12428), please cherry-pick it locally by yourself. Otherwise, please fill up an issue.
-
- ### 13. How vllm-ascend is tested
+ ### 12. How vllm-ascend is tested

  vllm-ascend is tested by functional test, performance test and accuracy test.

- - **Functional test**: we added CI, includes portion of vllm's native unit tests and vllm-ascend's own unit tests,on vllm-ascend's test, we test basic functionality、popular models availability and [supported features](https://vllm-ascend.readthedocs.io/en/latest/user_guide/support_matrix/supported_features.html) via e2e test
+ - **Functional test**: we added CI that includes a portion of vllm's native unit tests and vllm-ascend's own unit tests; on the vllm-ascend side, we test basic functionality, popular model availability and [supported features](https://vllm-ascend.readthedocs.io/en/v0.9.1-dev/user_guide/support_matrix/supported_features.html) via e2e tests

  - **Performance test**: we provide [benchmark](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks) tools for end-to-end performance benchmark which can easily to re-route locally, we'll publish a perf website to show the performance test results for each pull request

  - **Accuracy test**: we're working on adding accuracy test to CI as well.

- Finnall, for each release, we'll publish the performance test and accuracy test report in the future.
+ Finally, we plan to publish the performance test and accuracy test reports for each release.

- ### 14. How to fix the error "InvalidVersion" when using vllm-ascend?
+ ### 13. How to fix the error "InvalidVersion" when using vllm-ascend?
  It's usually because you have installed an dev/editable version of vLLM package. In this case, we provide the env variable `VLLM_VERSION` to let users specify the version of vLLM package to use. Please set the env variable `VLLM_VERSION` to the version of vLLM package you have installed. The format of `VLLM_VERSION` should be `X.Y.Z`.

- ### 15. How to handle Out Of Memory?
+ ### 14. How to handle Out Of Memory?
  OOM errors typically occur when the model exceeds the memory capacity of a single NPU. For general guidance, you can refer to [vLLM's OOM troubleshooting documentation](https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#out-of-memory).

  In scenarios where NPUs have limited HBM (High Bandwidth Memory) capacity, dynamic memory allocation/deallocation during inference can exacerbate memory fragmentation, leading to OOM. To address this:
@@ -114,7 +110,7 @@ In scenarios where NPUs have limited HBM (High Bandwidth Memory) capacity, dynam

  - **Configure `PYTORCH_NPU_ALLOC_CONF`**: Set this environment variable to optimize NPU memory management. For example, you can `export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True` to enable virtual memory feature to mitigate memory fragmentation caused by frequent dynamic memory size adjustments during runtime, see more note in: [PYTORCH_NPU_ALLOC_CONF](https://www.hiascend.com/document/detail/zh/Pytorch/700/comref/Envvariables/Envir_012.html).
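For reference, a minimal sketch of how these OOM mitigations could be combined in an offline run; the model name and the specific values are illustrative assumptions, not part of this commit:

```python
import os

# Enable expandable segments before torch / vllm are imported (see the bullet above).
os.environ["PYTORCH_NPU_ALLOC_CONF"] = "expandable_segments:True"

from vllm import LLM

# Leave some headroom on the NPU and cap the context length; both values are
# placeholders and should be tuned for the actual model and hardware.
llm = LLM(
    model="Qwen/Qwen3-8B",
    gpu_memory_utilization=0.85,
    max_model_len=8192,
)
```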

- ### 16. Failed to enable NPU graph mode when running DeepSeek?
+ ### 15. Failed to enable NPU graph mode when running DeepSeek?
  You may encounter the following error if running DeepSeek with NPU graph mode enabled. The allowed number of queries per kv when enabling both MLA and Graph mode only support {32, 64, 128}, **Thus this is not supported for DeepSeek-V2-Lite**, as it only has 16 attention heads. The NPU graph mode support on DeepSeek-V2-Lite will be done in the future.

  And if you're using DeepSeek-V3 or DeepSeek-R1, please make sure after the tensor parallel split, num_heads / num_kv_heads in {32, 64, 128}.
@@ -124,15 +120,18 @@ And if you're using DeepSeek-V3 or DeepSeek-R1, please make sure after the tenso
  [rank0]: EZ9999: [PID: 62938] 2025-05-27-06:52:12.455.807 numHeads / numKvHeads = 8, MLA only support {32, 64, 128}.[FUNC:CheckMlaAttrs][FILE:incre_flash_attention_tiling_check.cc][LINE:1218]
  ```

- ### 17. Failed to reinstall vllm-ascend from source after uninstalling vllm-ascend?
+ ### 16. Failed to reinstall vllm-ascend from source after uninstalling vllm-ascend?
  You may encounter the problem of C compilation failure when reinstalling vllm-ascend from source using pip. If the installation fails, it is recommended to use `python setup.py install` to install, or use `python setup.py clean` to clear the cache.

- ### 18. How to generate determinitic results when using vllm-ascend?
+ ### 17. How to generate deterministic results when using vllm-ascend?
  There are several factors that affect output certainty:

  1. Sampler Method: using **Greedy sample** by setting `temperature=0` in `SamplingParams`, e.g.:

  ```python
+ import os
+ os.environ["VLLM_USE_V1"] = "1"
+
  from vllm import LLM, SamplingParams

  prompts = [
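The hunk above cuts the snippet off at `prompts = [`; for reference, a complete, self-contained version of this greedy-sampling setup might look like the following (the prompts and the model name are illustrative assumptions, not taken from the original file):

```python
import os
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]

# temperature=0 selects greedy sampling, which removes sampling randomness.
sampling_params = SamplingParams(temperature=0, max_tokens=64)

# The model below is only a placeholder; use whichever model the tutorial targets.
llm = LLM(model="Qwen/Qwen3-8B")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated: {output.outputs[0].text!r}")
```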
@@ -164,11 +163,11 @@ export ATB_MATMUL_SHUFFLE_K_ENABLE=0
  export ATB_LLM_LCOC_ENABLE=0
  ```

- ### 19. How to fix the error "ImportError: Please install vllm[audio] for audio support" for Qwen2.5-Omni model?
+ ### 18. How to fix the error "ImportError: Please install vllm[audio] for audio support" for Qwen2.5-Omni model?
  The `Qwen2.5-Omni` model requires the `librosa` package to be installed, you need to install the `qwen-omni-utils` package to ensure all dependencies are met `pip install qwen-omni-utils`,
  this package will install `librosa` and its related dependencies, resolving the `ImportError: No module named 'librosa'` issue and ensuring audio processing functionality works correctly.

- ### 20. Failed to run with `ray` distributed backend?
+ ### 19. Failed to run with `ray` distributed backend?
  You might facing the following errors when running with ray backend in distributed scenarios:

  ```
@@ -185,7 +184,7 @@ This has been solved in `ray>=2.47.1`, thus we could solve this as following:
  python3 -m pip install modelscope 'ray>=2.47.1' 'protobuf>3.20.0'
  ```

- ### 21. Failed with inferencing Qwen3 MoE due to `Alloc sq cq fail` issue?
+ ### 20. Failed with inferencing Qwen3 MoE due to `Alloc sq cq fail` issue?

  When running Qwen3 MoE with tp/dp/ep, etc., you may encounter an error shown in [#2629](https://github.com/vllm-project/vllm-ascend/issues/2629).

docs/source/installation.md

Lines changed: 4 additions & 1 deletion
@@ -214,7 +214,7 @@ docker run --rm \
  -it $IMAGE bash
  ```

- The default workdir is `/workspace`, vLLM and vLLM Ascend code are placed in `/vllm-workspace` and installed in [development mode](https://setuptools.pypa.io/en/latest/userguide/development_mode.html)(`pip install -e`) to help developer immediately take place changes without requiring a new installation.
+ The default workdir is `/workspace`; vLLM and vLLM Ascend code are placed in `/vllm-workspace` and installed in [development mode](https://setuptools.pypa.io/en/v0.9.1-dev/userguide/development_mode.html) (`pip install -e`) so that developers can pick up changes immediately without requiring a new installation.
  ::::

  :::::
@@ -226,6 +226,9 @@ The default workdir is `/workspace`, vLLM and vLLM Ascend code are placed in `/v
  Create and run a simple inference test. The `example.py` can be like:

  ```python
+ import os
+ os.environ["VLLM_USE_V1"] = "1"
+
  from vllm import LLM, SamplingParams

  prompts = [

docs/source/quick_start.md

Lines changed: 4 additions & 1 deletion
@@ -68,7 +68,7 @@ yum update -y && yum install -y curl
  ::::
  :::::

- The default workdir is `/workspace`, vLLM and vLLM Ascend code are placed in `/vllm-workspace` and installed in [development mode](https://setuptools.pypa.io/en/latest/userguide/development_mode.html)(`pip install -e`) to help developer immediately take place changes without requiring a new installation.
+ The default workdir is `/workspace`; vLLM and vLLM Ascend code are placed in `/vllm-workspace` and installed in [development mode](https://setuptools.pypa.io/en/v0.9.1-dev/userguide/development_mode.html) (`pip install -e`) so that developers can pick up changes immediately without requiring a new installation.

  ## Usage

@@ -92,6 +92,9 @@ Try to run below Python script directly or use `python3` shell to generate texts
  <!-- tests/e2e/doctest/001-quickstart-test.sh should be considered updating as well -->

  ```python
+ import os
+ os.environ["VLLM_USE_V1"] = "1"
+
  from vllm import LLM, SamplingParams

  prompts = [

docs/source/tutorials/multi_node.md

Lines changed: 2 additions & 1 deletion
@@ -87,6 +87,7 @@ docker run --rm \
  -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
  -v /etc/ascend_install.info:/etc/ascend_install.info \
  -v /mnt/sfs_turbo/.cache:/root/.cache \
+ -e VLLM_USE_V1=1 \
  -it $IMAGE bash
  ```

@@ -115,7 +116,7 @@ export OMP_NUM_THREADS=100
  export HCCL_BUFFSIZE=1024

  # The w8a8 weight can obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3-W8A8
- # If you want to the quantization manually, please refer to https://vllm-ascend.readthedocs.io/en/latest/user_guide/feature_guide/quantization.html
+ # If you want to do the quantization manually, please refer to https://vllm-ascend.readthedocs.io/en/v0.9.1-dev/user_guide/feature_guide/quantization.html
  vllm serve /root/.cache/ds_v3 \
  --host 0.0.0.0 \
  --port 8004 \
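The hunk above truncates the `vllm serve` command. Once the multi-node server is up, a minimal hypothetical client check against its OpenAI-compatible API could look like the sketch below (host, port and model path are assumptions taken from the truncated command, not part of this commit):

```python
import requests

# Query the OpenAI-compatible completions endpoint exposed by `vllm serve`.
resp = requests.post(
    "http://127.0.0.1:8004/v1/completions",
    json={
        "model": "/root/.cache/ds_v3",   # must match the model path/name passed to `vllm serve`
        "prompt": "The capital of France is",
        "max_tokens": 32,
        "temperature": 0,
    },
    timeout=600,
)
print(resp.json()["choices"][0]["text"])
```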

docs/source/tutorials/multi_npu.md

Lines changed: 3 additions & 0 deletions
@@ -35,6 +35,9 @@ export VLLM_USE_MODELSCOPE=True

  # Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
  export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
+
+ # Enable V1 Engine
+ export VLLM_USE_V1=1
  ```

  ### Online Inference on Multi-NPU
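For reference, the environment variables added in the hunk above can also be set from Python for an offline multi-NPU run; the sketch below is illustrative only (the model and `tensor_parallel_size` are assumptions, not part of the tutorial):

```python
import os

# Mirror the exported settings from the tutorial's shell snippet.
os.environ["VLLM_USE_MODELSCOPE"] = "True"
os.environ["PYTORCH_NPU_ALLOC_CONF"] = "max_split_size_mb:256"
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM, SamplingParams

# tensor_parallel_size shards the model across NPUs; pick a value matching your hardware.
llm = LLM(model="Qwen/Qwen3-8B", tensor_parallel_size=2)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```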

docs/source/tutorials/multi_npu_quantization.md

Lines changed: 1 addition & 8 deletions
@@ -1,10 +1,6 @@
  # Multi-NPU (QwQ 32B W8A8)

  ## Run docker container
- :::{note}
- w8a8 quantization feature is supported by v0.8.4rc2 or higher
- :::
-
  ```{code-block} bash
  :substitutions:
  # Update the vllm-ascend image
@@ -24,6 +20,7 @@ docker run --rm \
  -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
  -v /etc/ascend_install.info:/etc/ascend_install.info \
  -v /root/.cache:/root/.cache \
+ -e VLLM_USE_V1=1 \
  -p 8000:8000 \
  -it $IMAGE bash
  ```
@@ -70,10 +67,6 @@ The converted model files looks like:

  Run the following script to start the vLLM server with quantized model:

- :::{note}
- The value "ascend" for "--quantization" argument will be supported after [a specific PR](https://github.com/vllm-project/vllm-ascend/pull/877) is merged and released, you can cherry-pick this commit for now.
- :::
-
  ```bash
  vllm serve /home/models/QwQ-32B-w8a8 --tensor-parallel-size 4 --served-model-name "qwq-32b-w8a8" --max-model-len 4096 --quantization ascend
  ```
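Beyond the `vllm serve` command shown in the hunk, the same quantized checkpoint can also be loaded offline; this is a hedged sketch rather than part of the tutorial, and it assumes the converted w8a8 weights sit at the path used above:

```python
from vllm import LLM, SamplingParams

# Load the converted w8a8 checkpoint with the Ascend quantization method,
# mirroring the `--quantization ascend` flag of the serve command above.
llm = LLM(
    model="/home/models/QwQ-32B-w8a8",
    tensor_parallel_size=4,
    max_model_len=4096,
    quantization="ascend",
)

outputs = llm.generate(["What is deep learning?"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```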

docs/source/tutorials/multi_npu_qwen3_moe.md

Lines changed: 4 additions & 1 deletion
@@ -35,6 +35,9 @@ export VLLM_USE_MODELSCOPE=True

  # Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
  export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
+
+ # Enable V1 Engine
+ export VLLM_USE_V1=1
  ```

  ### Online Inference on Multi-NPU
@@ -44,7 +47,7 @@ Run the following script to start the vLLM server on Multi-NPU:
  For an Atlas A2 with 64GB of NPU card memory, tensor-parallel-size should be at least 2, and for 32GB of memory, tensor-parallel-size should be at least 4.

  ```bash
- vllm serve Qwen/Qwen3-30B-A3B --tensor-parallel-size 4 --enable_expert_parallel
+ vllm serve Qwen/Qwen3-30B-A3B --tensor-parallel-size 4
  ```

  Once your server is started, you can query the model with input prompts
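For example, a minimal hypothetical client using the `openai` package might look like this (the base URL assumes the default `vllm serve` port 8000 on the same host; this snippet is not part of the commit):

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server does not check the API key by default.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B",
    messages=[{"role": "user", "content": "Give me a one-sentence summary of MoE models."}],
    max_tokens=64,
)
print(completion.choices[0].message.content)
```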

docs/source/tutorials/single_npu.md

Lines changed: 6 additions & 0 deletions
@@ -48,6 +48,8 @@ Run the following script to execute offline inference on a single NPU:
  ```{code-block} python
  :substitutions:
  import os
+ os.environ["VLLM_USE_V1"] = "1"
+
  from vllm import LLM, SamplingParams

  prompts = [
@@ -74,6 +76,8 @@ for output in outputs:
  ```{code-block} python
  :substitutions:
  import os
+ os.environ["VLLM_USE_V1"] = "1"
+
  from vllm import LLM, SamplingParams

  prompts = [
@@ -130,6 +134,7 @@ docker run --rm \
  -p 8000:8000 \
  -e VLLM_USE_MODELSCOPE=True \
  -e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
+ -e VLLM_USE_V1=1 \
  -it $IMAGE \
  vllm serve Qwen/Qwen3-8B --max_model_len 26240
  ```
@@ -156,6 +161,7 @@ docker run --rm \
  -p 8000:8000 \
  -e VLLM_USE_MODELSCOPE=True \
  -e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
+ -e VLLM_USE_V1=1 \
  -it $IMAGE \
  vllm serve Qwen/Qwen3-8B --max_model_len 26240 --enforce-eager
  ```

docs/source/tutorials/single_npu_multimodal.md

Lines changed: 4 additions & 0 deletions
@@ -47,6 +47,9 @@ pip install torchvision==0.20.1 qwen_vl_utils --extra-index-url https://download
  ```

  ```python
+ import os
+ os.environ["VLLM_USE_V1"] = "1"
+
  from transformers import AutoProcessor
  from vllm import LLM, SamplingParams
  from qwen_vl_utils import process_vision_info
@@ -141,6 +144,7 @@ docker run --rm \
  -p 8000:8000 \
  -e VLLM_USE_MODELSCOPE=True \
  -e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
+ -e VLLM_USE_V1=1 \
  -it $IMAGE \
  vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
  --dtype bfloat16 \
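The serve command in the last hunk is truncated here; once the Qwen2.5-VL server is listening on port 8000, a hypothetical multimodal chat request could be sent like this (the image URL is a placeholder assumption):

```python
import requests

# Send one image plus a text question to the OpenAI-compatible chat endpoint.
resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-VL-7B-Instruct",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
                {"type": "text", "text": "What is in this image?"},
            ],
        }],
        "max_tokens": 64,
    },
    timeout=600,
)
print(resp.json()["choices"][0]["message"]["content"])
```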
