v0.10.0rc1
Pre-release
This is the 1st release candidate of v0.10.0 for vLLM Ascend. Please follow the official doc to get started. V0 is completely removed in this version.
Highlights
- Disaggregated prefill works with the V1 engine now. You can try it with the DeepSeek model by following this tutorial. #950
- The W4A8 quantization method is supported for dense and MoE models now. #2060 #2172
Core
- Ascend PyTorch adapter (torch_npu) has been upgraded to 2.7.1.dev20250724 #1562, and CANN has been upgraded to 8.2.RC1 #1653. Don't forget to update them in your environment, or use the latest images.
- vLLM Ascend works on Atlas 800I A3 now, and the A3 image will be released from this version on. #1582
- Kimi-K2 with w8a8 quantization, Qwen3-Coder, and GLM-4.5 are supported in vLLM Ascend; please follow this tutorial to try them. #2162
- Pipeline Parallelism is supported in V1 now (see the sketch after this list). #1800
- The prefix cache feature now works with the Ascend Scheduler. #1446
- Torchair graph mode works with tp > 4 now. #1508
- MTP supports torchair graph mode now. #2145
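To make the new parallelism option concrete, here is a minimal sketch of launching an engine with V1 pipeline parallelism. The model name and parallel sizes are placeholders, not a recommendation from this release; tensor_parallel_size and pipeline_parallel_size are standard vLLM engine arguments.

```python
# Minimal sketch (placeholder model and sizes): V1 pipeline parallelism on vLLM Ascend.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-32B",        # placeholder model name
    tensor_parallel_size=4,        # split each layer across 4 NPUs
    pipeline_parallel_size=2,      # split the layer stack across 2 stages (now supported in V1)
)

outputs = llm.generate(["Hello, Ascend!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```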
Other
- Bug fixes:
- Performance improved through a lot of PRs:
- Caching sin/cos instead of calculating them every layer. #1890
- Improve shared expert multi-stream parallelism #1891
- Implement the fusion of allreduce and matmul in the prefill phase when TP is enabled. Enable this feature by setting VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE to 1. #1926
- Optimize quantized MoE performance by reducing All2All communication. #2195
- Use AddRmsNormQuant ops in the custom model to optimize Qwen3's performance #1806
- Use multicast to avoid padding decode requests to prefill size. #1555
- The performance of LoRA has been improved. #1884
- A batch of refactoring PRs to enhance the code architecture:
- Parameters changes (a usage sketch follows at the end of this section):
- expert_tensor_parallel_size in additional_config is removed now, and EP and TP are aligned with vLLM now. #1681
- Add VLLM_ASCEND_MLA_PA to the environment variables; use it to enable the MLA paged attention operator for DeepSeek MLA decode.
- Add VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE to the environment variables; it enables the MatmulAllReduce fusion kernel when tensor parallelism is enabled. This feature is supported on A2, and eager mode gets better performance with it.
- Add VLLM_ASCEND_ENABLE_MOE_ALL2ALL_SEQ to the environment variables; it controls whether to enable MoE all2all seq, which provides a basic framework on the basis of alltoall for easy expansion.
- UT coverage reached 76.34% after a batch of PRs following this RFC: #1298
- Sequence Parallelism works for Qwen3 MoE. #2209
- Chinese online documentation is added now. #1870
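As a quick reference for the new switches listed above, here is a minimal sketch of setting them before creating the engine. It assumes each variable is toggled with "1"; the model name and parallel size are placeholders.

```python
# Minimal sketch, assuming each new switch is toggled with "1" (placeholder model name).
import os

# Fuse matmul + allreduce in the prefill phase when TP is enabled (A2, eager mode).
os.environ["VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE"] = "1"
# Use the MLA paged attention operator for DeepSeek MLA decode.
os.environ["VLLM_ASCEND_MLA_PA"] = "1"
# Enable MoE all2all seq, which builds on alltoall for easier extension.
os.environ["VLLM_ASCEND_ENABLE_MOE_ALL2ALL_SEQ"] = "1"

from vllm import LLM  # import after the environment is configured

llm = LLM(model="deepseek-ai/DeepSeek-V2-Lite", tensor_parallel_size=2)  # placeholder
```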
Known Issues
- Aclgraph does not work with DP + EP currently; the main gap is that the number of NPU streams Aclgraph needs to capture the graph is not enough. #2229
- There is an accuracy issue on W8A8 dynamic quantized DeepSeek with multistream enabled. This will be fixed in the next release. #2232
- In Qwen3 MoE, SP cannot be incorporated into the Aclgraph. #2246
- MTP does not support the V1 scheduler currently; this will be fixed in Q3. #2254
- When running MTP with DP > 1, the metrics logger needs to be disabled due to an issue in vLLM. #2254
- The GLM 4.5 model has an accuracy problem in long-output-length scenarios.
New Contributors
- @pkking made their first contribution in #1792
- @lianyiibo made their first contribution in #1811
- @nuclearwu made their first contribution in #1867
- @aidoczh made their first contribution in #1870
- @shiyuan680 made their first contribution in #1930
- @ZrBac made their first contribution in #1964
- @Ronald1995 made their first contribution in #1988
- @taoxudonghaha made their first contribution in #1884
- @hongfugui made their first contribution in #1583
- @YuanCheng-coder made their first contribution in #2067
- @Liccol made their first contribution in #2127
- @1024daniel made their first contribution in #2037
- @yangqinghao-cmss made their first contribution in #2121
Full Changelog: v0.9.2rc1...v0.10.0rc1