v0.11.0rc1
Pre-release · 49 commits to v0.11.0-dev since this release
This is the first release candidate of v0.11.0 for vLLM Ascend. Please follow the official doc to get started.
v0.11.0 will be the next official release of vLLM Ascend. We'll release it in the next few days. Any feedback is welcome to help us improve v0.11.0.
Highlights
- CANN is upgraded to 8.3.RC1 and torch-npu to 2.7.1. #3945 #3896
- Prefix caching and chunked prefill are enabled by default (see the sketch after this list). #3967
- W4A4 quantization is now supported. #3427 The official tutorial is available here.
- The official documentation has now been switched to https://docs.vllm.ai/projects/ascend.
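
For quick reference, here is a minimal sketch of opting out of the new prefix caching / chunked prefill defaults via the vLLM Python API. The flags shown are standard vLLM engine arguments; the model name is only a placeholder:

```python
from vllm import LLM

# Prefix caching and chunked prefill are now on by default; both can
# still be disabled explicitly if a workload needs the old behavior.
llm = LLM(
    model="Qwen/Qwen3-8B",        # placeholder model
    enable_prefix_caching=False,  # opt out of the new default
    enable_chunked_prefill=False, # opt out of the new default
)
```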
Core
- Performance of the Qwen3 and DeepSeek V3 series models is improved.
- The Mooncake layerwise connector is now supported. #2602 Find the tutorial here.
- MTP > 1 (multi-token prediction with more than one speculative token per step) is now supported. #2708
- [Experimental] Graph mode `FULL_DECODE_ONLY` is now supported, and `FULL` will land in the next few weeks (see the sketch after this list). #2128
- Pooling models, such as bge-m3, are now supported. #3171
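
A minimal sketch of enabling the experimental graph mode, assuming it is selected through vLLM's standard `compilation_config` option; see the PR above for the authoritative switch:

```python
from vllm import LLM

# Assumption: FULL_DECODE_ONLY is selected via vLLM's standard
# compilation_config knob; the model name is only a placeholder.
llm = LLM(
    model="Qwen/Qwen3-8B",
    compilation_config={"cudagraph_mode": "FULL_DECODE_ONLY"},
)
```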
Other
- Refactored the MoE module to make it clearer and easier to understand; performance is improved in both quantized and non-quantized scenarios.
- Refactored the model registration module to make it easier to maintain. We'll remove this module in Q4 2025. #3004
- LLMDatadist KV Connector is deprecated. We'll remove it in Q1 2026.
- Refactored the linear module to support the flashcomm1 and flashcomm2 features from the FlashComm paper. #3004 #3334
Known issues
- In the PD disaggregation + full graph case, memory may leak and the service may hang after serving for a long time. This is a torch-npu bug; we'll upgrade torch-npu and fix it soon.
- The accuracy of Qwen2.5 VL with BF16 is degraded on the VideoBench dataset. This is caused by a CANN bug; we'll fix it soon.
- For long sequence inputs (>32k), there is sometimes no response and KV cache usage keeps growing. This is a bug in the vLLM scheduler; we are working on it. A temporary workaround is to set `max-model-len` to a suitable value (see the sketch after this list).
- Qwen2-audio doesn't work by default; we're fixing it. A temporary workaround is to set `--gpu-memory-utilization` to a suitable value, such as 0.8.
- When running Qwen3-Next with expert parallel enabled, set the `HCCL_BUFFSIZE` environment variable to a suitable value, such as 1024.
- The accuracy of DeepSeek V3.2 with aclgraph is incorrect. A temporary workaround is to set `cudagraph_capture_sizes` to a suitable value depending on the input batch size.
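
A combined sketch of the workarounds above via the vLLM Python API. The model name and the concrete values are illustrative, not tuned recommendations; all settings shown are standard vLLM engine arguments:

```python
import os

from vllm import LLM

# Qwen3-Next + expert parallel workaround: enlarge the HCCL buffer.
# This must be set before the engine (and HCCL) initializes; 1024 is
# the example value from the note above.
os.environ["HCCL_BUFFSIZE"] = "1024"

llm = LLM(
    model="Qwen/Qwen3-8B",       # placeholder model
    max_model_len=32768,         # long-sequence workaround: cap input length
    gpu_memory_utilization=0.8,  # Qwen2-audio workaround
    # DeepSeek V3.2 + aclgraph workaround: pin the capture sizes to the
    # batch sizes you expect to serve (values here are illustrative only).
    compilation_config={"cudagraph_capture_sizes": [1, 2, 4, 8]},
)
```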
New Contributors
- @huangdong2022 made their first contribution in #3205
- @kiscad made their first contribution in #3226
- @dsxsteven made their first contribution in #3381
- @elilzhu made their first contribution in #3426
- @yuzhup made their first contribution in #3203
- @DreamerLeader made their first contribution in #3476
- @yechao237 made their first contribution in #3473
- @leijie-cn made their first contribution in #3519
- @Anionex made their first contribution in #3311
- @Semmer2 made their first contribution in #4041
Full Changelog: v0.11.0rc0...v0.11.0rc1