v0.11.0rc1
Pre-release · 49 commits to v0.11.0-dev since this release
This is the first release candidate of v0.11.0 for vLLM Ascend. Please follow the official doc to get started.
v0.11.0 will be the next official release of vLLM Ascend. We'll release it in the next few days. Any feedback is welcome to help us improve v0.11.0.
Highlights
- CANN is upgraded to 8.3.RC1 and torch-npu to 2.7.1. #3945 #3896
- Prefix caching and chunked prefill are enabled by default (see the sketch after this list). #3967
- W4A4 quantization is now supported. #3427 The official tutorial is available here.
- The official documentation has now been switched to https://docs.vllm.ai/projects/ascend.
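
For quick reference, here is a minimal sketch of opting out of the new prefix caching / chunked prefill defaults via the vLLM Python API. The flags shown are standard vLLM engine arguments; the model name is only a placeholder:

```python
from vllm import LLM

# Prefix caching and chunked prefill are now on by default; both can
# still be disabled explicitly if a workload needs the old behavior.
llm = LLM(
    model="Qwen/Qwen3-8B",        # placeholder model
    enable_prefix_caching=False,  # opt out of the new default
    enable_chunked_prefill=False, # opt out of the new default
)
```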
Core
- Performance of the Qwen3 and DeepSeek V3 series models is improved.
- The Mooncake layerwise connector is now supported. #2602 Find the tutorial here.
- MTP > 1 (multi-token prediction with more than one speculative token per step) is now supported. #2708
- [Experimental] Graph mode `FULL_DECODE_ONLY` is now supported, and `FULL` will land in the next few weeks (see the sketch after this list). #2128
- Pooling models, such as bge-m3, are now supported. #3171
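
A minimal sketch of enabling the experimental graph mode, assuming it is selected through vLLM's standard `compilation_config` option; see the PR above for the authoritative switch:

```python
from vllm import LLM

# Assumption: FULL_DECODE_ONLY is selected via vLLM's standard
# compilation_config knob; the model name is only a placeholder.
llm = LLM(
    model="Qwen/Qwen3-8B",
    compilation_config={"cudagraph_mode": "FULL_DECODE_ONLY"},
)
```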
Other
- Refactored the MoE module to make it clearer and easier to understand; performance is improved in both quantized and non-quantized scenarios.
- Refactored the model registration module to make it easier to maintain. We'll remove this module in Q4 2025. #3004
- LLMDatadist KV Connector is deprecated. We'll remove it in Q1 2026.
- Refactored the linear module to support the flashcomm1 and flashcomm2 features from the FlashComm paper. #3004 #3334
Known issues
- In the PD disaggregation + full graph case, memory may leak and the service may hang after serving for a long time. This is a torch-npu bug; we'll upgrade torch-npu and fix it soon.
- The accuracy of Qwen2.5 VL with BF16 is degraded on the VideoBench dataset. This is caused by a CANN bug; we'll fix it soon.
- For long sequence inputs (>32k), there is sometimes no response and KV cache usage keeps growing. This is a bug in the vLLM scheduler; we are working on it. A temporary workaround is to set `max-model-len` to a suitable value (see the sketch after this list).
- Qwen2-audio doesn't work by default; we're fixing it. A temporary workaround is to set `--gpu-memory-utilization` to a suitable value, such as 0.8.
- When running Qwen3-Next with expert parallel enabled, set the `HCCL_BUFFSIZE` environment variable to a suitable value, such as 1024.
- The accuracy of DeepSeek V3.2 with aclgraph is incorrect. A temporary workaround is to set `cudagraph_capture_sizes` to a suitable value depending on the input batch size.
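
A combined sketch of the workarounds above via the vLLM Python API. The model name and the concrete values are illustrative, not tuned recommendations; all settings shown are standard vLLM engine arguments:

```python
import os

from vllm import LLM

# Qwen3-Next + expert parallel workaround: enlarge the HCCL buffer.
# This must be set before the engine (and HCCL) initializes; 1024 is
# the example value from the note above.
os.environ["HCCL_BUFFSIZE"] = "1024"

llm = LLM(
    model="Qwen/Qwen3-8B",       # placeholder model
    max_model_len=32768,         # long-sequence workaround: cap input length
    gpu_memory_utilization=0.8,  # Qwen2-audio workaround
    # DeepSeek V3.2 + aclgraph workaround: pin the capture sizes to the
    # batch sizes you expect to serve (values here are illustrative only).
    compilation_config={"cudagraph_capture_sizes": [1, 2, 4, 8]},
)
```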
New Contributors
- @huangdong2022 made their first contribution in #3205
- @kiscad made their first contribution in #3226
- @dsxsteven made their first contribution in #3381
- @elilzhu made their first contribution in #3426
- @yuzhup made their first contribution in #3203
- @DreamerLeader made their first contribution in #3476
- @yechao237 made their first contribution in #3473
- @leijie-cn made their first contribution in #3519
- @Anionex made their first contribution in #3311
- @Semmer2 made their first contribution in #4041
Full Changelog: v0.11.0rc0...v0.11.0rc1