
v0.10.0rc1

Pre-release

@wangxiyuan released this 07 Aug 06:48

This is the 1st release candidate of v0.10.0 for vLLM Ascend. Please follow the official doc to get started. The V0 engine is completely removed in this version.

Highlights

  • Disaggregated prefill now works with the V1 engine. You can try it with the DeepSeek model (#950) by following this tutorial.
  • The W4A8 quantization method is now supported for dense and MoE models. #2060 #2172
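
Below is a minimal sketch of loading a W4A8-quantized checkpoint through the vLLM Python API. The local model path is hypothetical, and the `quantization="ascend"` argument is an assumption carried over from vLLM Ascend's existing quantization workflow rather than something specified in this release note.

```python
# Minimal sketch: running a W4A8-quantized model on Ascend NPUs.
# Assumptions: ./DeepSeek-V3-W4A8 is a hypothetical path to weights produced
# by an Ascend quantization tool, and quantization="ascend" is the loader
# hint used by vLLM Ascend for its quantized formats.
from vllm import LLM, SamplingParams

llm = LLM(
    model="./DeepSeek-V3-W4A8",   # hypothetical local W4A8 checkpoint
    quantization="ascend",        # assumption: vLLM Ascend quantization backend
    tensor_parallel_size=4,
    max_model_len=8192,
)

outputs = llm.generate(
    ["Summarize what W4A8 quantization changes for inference."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```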

Core

  • The Ascend PyTorch adapter (torch_npu) has been upgraded to 2.7.1.dev20250724 (#1562), and CANN has been upgraded to 8.2.RC1 (#1653). Don't forget to update them in your environment or use the latest images.
  • vLLM Ascend now works on Atlas 800I A3, and A3 images will be released starting from this version. #1582
  • Kimi-K2 with w8a8 quantization, Qwen3-Coder, and GLM-4.5 are now supported in vLLM Ascend; please follow this tutorial to try them. #2162
  • Pipeline parallelism is now supported in V1. #1800
  • The prefix cache feature now works with the Ascend Scheduler. #1446
  • Torchair graph mode now works with tp > 4. #1508
  • MTP now supports torchair graph mode. #2145
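
As a rough sketch of how several of these features are switched on together, the snippet below enables prefix caching with the Ascend Scheduler and turns on torchair graph mode through `additional_config`. The nested key names (`ascend_scheduler_config`, `torchair_graph_config`) and the example model are assumptions based on vLLM Ascend's documented configuration style, not part of this release note.

```python
# Sketch: prefix caching with the Ascend Scheduler plus torchair graph mode.
# The additional_config key names are assumptions; check the vLLM Ascend docs
# for the exact schema in your version.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-32B",        # example model
    tensor_parallel_size=8,        # torchair graph mode now works with tp > 4
    enable_prefix_caching=True,    # prefix cache, now usable with the Ascend Scheduler
    additional_config={
        "ascend_scheduler_config": {"enabled": True},  # assumed key name
        "torchair_graph_config": {"enabled": True},    # assumed key name
    },
)
```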

Other

  • Bug fixes:

    • Fix a functional problem with multi-modality models such as Qwen2-Audio when using Aclgraph. #1803
    • Fix the process group creation error in the external launch scenario. #1681
    • Fix a functional problem with guided decoding. #2022
    • Fix an accuracy issue with common MoE models in the DP scenario. #1856
  • Performance improved through a number of PRs:

    • Cache sin/cos instead of recalculating them in every layer. #1890
    • Improve shared expert multi-stream parallelism. #1891
    • Implement the fusion of allreduce and matmul in the prefill phase when TP is enabled. Enable this feature by setting VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE to 1. #1926
    • Optimize quantized MoE performance by reducing All2All communication. #2195
    • Use AddRmsNormQuant ops in the custom model to optimize Qwen3's performance. #1806
    • Use multicast to avoid padding decode requests to prefill size. #1555
    • Improve LoRA performance. #1884
  • A batch of refactoring PRs to enhance the code architecture:

    • Refactor the torchair model runner. #2205
    • Refactor forward_context and model_runner_v1. #1979
    • Refactor AscendMetaData comments. #1967
    • Refactor torchair utils. #1892
    • Refactor torchair worker. #1885
    • Register the activation CustomOp instead of overwriting forward_oot. #1841
  • Parameter changes (a sketch of setting the new environment variables follows after this list):

    • expert_tensor_parallel_size in additional_config has been removed, and EP and TP are now aligned with vLLM. #1681
    • Add VLLM_ASCEND_MLA_PA to the environment variables; use it to enable the MLA paged attention operator for DeepSeek MLA decode.
    • Add VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE to the environment variables to enable the MatmulAllReduce fusion kernel when tensor parallelism is enabled. This feature is supported on A2, and eager mode gets better performance from it.
    • Add VLLM_ASCEND_ENABLE_MOE_ALL2ALL_SEQ to the environment variables to enable MoE all2all seq; this provides a basic framework on top of alltoall for easy expansion.
  • UT coverage reached 76.34% after a batch of PRs following this RFC: #1298

  • Sequence parallelism now works for Qwen3 MoE. #2209

  • Chinese online documentation is now available. #1870
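
The new environment variables listed under the parameter changes can be set before the engine is created; a short Python sketch follows. The variable names come from this release note, while the model and the rest of the launch code are only illustrative.

```python
import os

# Set the new switches before importing vLLM so they are picked up at engine
# initialization (illustrative ordering; variable names are from this release).
os.environ["VLLM_ASCEND_MLA_PA"] = "1"                   # MLA paged attention for DeepSeek MLA decode
os.environ["VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE"] = "1"  # MatmulAllReduce fusion when TP is enabled
os.environ["VLLM_ASCEND_ENABLE_MOE_ALL2ALL_SEQ"] = "1"   # MoE all2all seq framework

from vllm import LLM

# Example launch; model name and parallel size are placeholders.
llm = LLM(model="deepseek-ai/DeepSeek-V2-Lite", tensor_parallel_size=2)
```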

Known Issues

  • Aclgraph currently does not work with DP + EP; the main gap is that the number of NPU streams Aclgraph needs to capture the graph is not enough. #2229
  • There is an accuracy issue with W8A8 dynamic quantized DeepSeek when multistream is enabled. This will be fixed in the next release. #2232
  • In Qwen3 MoE, SP cannot be incorporated into the Aclgraph. #2246
  • MTP does not support the V1 scheduler currently; this will be fixed in Q3. #2254
  • When running MTP with DP > 1, the metrics logger needs to be disabled due to an issue in vLLM (see the sketch after this list). #2254
  • The GLM-4.5 model has an accuracy problem in long-output-length scenarios.
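
For the MTP with DP > 1 workaround mentioned above, the sketch below disables the metrics logger at engine construction. `disable_log_stats` is the standard vLLM engine argument for turning off the stats logger; the model, the parallel sizes, and the `data_parallel_size` argument are illustrative assumptions, and the MTP/speculative-decoding configuration itself is omitted.

```python
# Illustrative workaround for the MTP + DP > 1 known issue: disable the
# metrics/stats logger when constructing the engine. The speculative (MTP)
# configuration is omitted here; only the logger workaround is shown.
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",  # example model
    tensor_parallel_size=8,
    data_parallel_size=2,             # assumption: DP configured via engine args
    disable_log_stats=True,           # turn off the metrics logger
)
```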

New Contributors

Full Changelog: v0.9.2rc1...v0.10.0rc1