
v0.11.0rc1

Pre-release

@wangxiyuan released this 10 Nov 13:02

This is the first release candidate of v0.11.0 for vLLM Ascend. Please follow the official doc to get started.
v0.11.0 will be the next official release of vLLM Ascend. We'll release it in the next few days. Any feedback is welcome to help us improve v0.11.0.

Highlights

  • CANN is upgraded to 8.3.RC1 and torch-npu is upgraded to 2.7.1. #3945 #3896
  • Prefix caching and chunked prefill are enabled by default (see the sketch after this list). #3967
  • W4A4 quantization is supported now. #3427 The official tutorial is available here.
  • The official documentation has now been switched to https://docs.vllm.ai/projects/ascend.
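
Since prefix caching and chunked prefill are now on by default, no extra configuration is normally needed; the sketch below only shows how to toggle them explicitly. It is a minimal illustration with the standard vLLM offline API; the model name is a placeholder and the keyword arguments are the upstream vLLM `LLM` options, not Ascend-specific ones.

```python
from vllm import LLM, SamplingParams

# Prefix caching and chunked prefill are enabled by default in this release;
# pass the flags explicitly only if you need to override the defaults.
llm = LLM(
    model="Qwen/Qwen3-8B",          # placeholder model name
    enable_prefix_caching=True,     # default; set to False to disable
    enable_chunked_prefill=True,    # default; set to False to disable
)

outputs = llm.generate(["Hello from vLLM Ascend!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```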

Core

  • Performance of the Qwen3 and DeepSeek V3 series models is improved.
  • The Mooncake layerwise connector is supported now. #2602 Find the tutorial here.
  • MTP > 1 is supported now. #2708
  • [Experimental] The FULL_DECODE_ONLY graph mode is supported now! The FULL mode will land in the next few weeks. #2128
  • Pooling models, such as bge-m3, are supported now (see the sketch after this list). #3171
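
For the newly supported pooling models, here is a minimal embedding sketch using the standard vLLM offline API. The `task="embed"` argument and `llm.embed()` call are the upstream vLLM interface; the bge-m3 model id is taken from the item above.

```python
from vllm import LLM

# Run bge-m3 as an embedding (pooling) model.
llm = LLM(model="BAAI/bge-m3", task="embed")

outputs = llm.embed(["vLLM Ascend now supports pooling models."])
embedding = outputs[0].outputs.embedding  # one embedding vector per prompt
print(len(embedding))
```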

Other

  • Refactored the MoE module to make it clearer and easier to understand; performance is improved in both quantized and non-quantized scenarios.
  • Refactored the model registration module to make it easier to maintain. We'll remove this module in Q4 2025. #3004
  • LLMDatadist KV Connector is deprecated. We'll remove it in Q1 2026.
  • Refactored the linear module to support the flashcomm1 and flashcomm2 features described in the FlashComm paper. #3004 #3334

Known issues

  • In the PD disaggregation + full graph case, memory may leak and the service may hang after serving for a long time. This is a torch-npu bug; we'll upgrade torch-npu and fix it soon.
  • The accuracy of Qwen2.5-VL with BF16 is degraded on the VideoBench dataset. This is caused by a CANN bug; we'll fix it soon.
  • For long-sequence inputs (>32k), there is sometimes no response and KV cache usage keeps growing. This is a vLLM scheduler bug; we are working on it. A temporary workaround is to set max-model-len to a suitable value.
  • Qwen2-Audio doesn't work by default; we're fixing it. A temporary workaround is to set --gpu-memory-utilization to a suitable value, such as 0.8.
  • When running Qwen3-Next with expert parallelism enabled, please set the HCCL_BUFFSIZE environment variable to a suitable value, such as 1024.
  • The accuracy of DeepSeek3.2 with aclgraph is not correct. A temporary workaround is to set cudagraph_capture_sizes to a suitable value depending on the input batch size. A combined sketch of these workarounds follows this list.
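
The workarounds above are plain configuration knobs. The sketch below shows how they might be applied together with the offline `LLM` API; the specific values (an HCCL_BUFFSIZE of 1024, 0.8 GPU memory utilization, a 32k max model length, and the capture sizes) are only the illustrative values mentioned in the list, not tuned recommendations, and the model name is a placeholder.

```python
import os

# Qwen3-Next + expert parallelism workaround: enlarge the HCCL buffer
# before the engine (and HCCL) is initialized.
os.environ["HCCL_BUFFSIZE"] = "1024"

from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-8B",           # placeholder model name
    max_model_len=32768,             # cap long-sequence inputs (scheduler issue workaround)
    gpu_memory_utilization=0.8,      # Qwen2-Audio workaround
    # DeepSeek3.2 + aclgraph workaround: pin graph capture sizes to the
    # batch sizes you actually serve.
    compilation_config={"cudagraph_capture_sizes": [1, 2, 4, 8]},
)
```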


Full Changelog: v0.11.0rc0...v0.11.0rc1