
v0.10.0rc1

Pre-release

@wangxiyuan released this 07 Aug 06:48

This is the 1st release candidate of v0.10.0 for vLLM Ascend. Please follow the official doc to get started. The V0 engine is completely removed in this version.

Highlights

  • Disaggregated prefill now works with the V1 engine. You can try it with the DeepSeek model (#950) by following this tutorial.
  • The W4A8 quantization method is now supported for dense and MoE models. #2060 #2172
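
Below is a minimal sketch of loading a W4A8-quantized checkpoint through the vLLM Python API. The local model path is hypothetical, and the `quantization="ascend"` argument is an assumption carried over from vLLM Ascend's existing quantization workflow rather than something specified in this release note.

```python
# Minimal sketch: running a W4A8-quantized model on Ascend NPUs.
# Assumptions: ./DeepSeek-V3-W4A8 is a hypothetical path to weights produced
# by an Ascend quantization tool, and quantization="ascend" is the loader
# hint used by vLLM Ascend for its quantized formats.
from vllm import LLM, SamplingParams

llm = LLM(
    model="./DeepSeek-V3-W4A8",   # hypothetical local W4A8 checkpoint
    quantization="ascend",        # assumption: vLLM Ascend quantization backend
    tensor_parallel_size=4,
    max_model_len=8192,
)

outputs = llm.generate(
    ["Summarize what W4A8 quantization changes for inference."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```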

Core

  • The Ascend PyTorch adapter (torch_npu) has been upgraded to 2.7.1.dev20250724 (#1562), and CANN has been upgraded to 8.2.RC1 (#1653). Don't forget to update them in your environment or use the latest images.
  • vLLM Ascend now works on Atlas 800I A3, and A3 images will be released starting from this version. #1582
  • Kimi-K2 with w8a8 quantization, Qwen3-Coder, and GLM-4.5 are now supported in vLLM Ascend; please follow this tutorial to try them. #2162
  • Pipeline parallelism is now supported in V1. #1800
  • The prefix cache feature now works with the Ascend Scheduler. #1446
  • Torchair graph mode now works with tp > 4. #1508
  • MTP now supports torchair graph mode. #2145
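
As a rough sketch of how several of these features are switched on together, the snippet below enables prefix caching with the Ascend Scheduler and turns on torchair graph mode through `additional_config`. The nested key names (`ascend_scheduler_config`, `torchair_graph_config`) and the example model are assumptions based on vLLM Ascend's documented configuration style, not part of this release note.

```python
# Sketch: prefix caching with the Ascend Scheduler plus torchair graph mode.
# The additional_config key names are assumptions; check the vLLM Ascend docs
# for the exact schema in your version.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-32B",        # example model
    tensor_parallel_size=8,        # torchair graph mode now works with tp > 4
    enable_prefix_caching=True,    # prefix cache, now usable with the Ascend Scheduler
    additional_config={
        "ascend_scheduler_config": {"enabled": True},  # assumed key name
        "torchair_graph_config": {"enabled": True},    # assumed key name
    },
)
```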

Other

  • Bug fixes:

    • Fix a functional problem with multi-modality models such as Qwen2-Audio when using Aclgraph. #1803
    • Fix the process group creation error in the external launch scenario. #1681
    • Fix a functional problem with guided decoding. #2022
    • Fix an accuracy issue with common MoE models in the DP scenario. #1856
  • Performance improved through a number of PRs:

    • Cache sin/cos instead of recalculating them in every layer. #1890
    • Improve shared expert multi-stream parallelism. #1891
    • Implement the fusion of allreduce and matmul in the prefill phase when TP is enabled. Enable this feature by setting VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE to 1. #1926
    • Optimize quantized MoE performance by reducing All2All communication. #2195
    • Use AddRmsNormQuant ops in the custom model to optimize Qwen3's performance. #1806
    • Use multicast to avoid padding decode requests to prefill size. #1555
    • Improve LoRA performance. #1884
  • A batch of refactoring PRs to enhance the code architecture:

    • Refactor the torchair model runner. #2205
    • Refactor forward_context and model_runner_v1. #1979
    • Refactor AscendMetaData comments. #1967
    • Refactor torchair utils. #1892
    • Refactor torchair worker. #1885
    • Register the activation CustomOp instead of overwriting forward_oot. #1841
  • Parameter changes (a sketch of setting the new environment variables follows after this list):

    • expert_tensor_parallel_size in additional_config has been removed, and EP and TP are now aligned with vLLM. #1681
    • Add VLLM_ASCEND_MLA_PA to the environment variables; use it to enable the MLA paged attention operator for DeepSeek MLA decode.
    • Add VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE to the environment variables to enable the MatmulAllReduce fusion kernel when tensor parallelism is enabled. This feature is supported on A2, and eager mode gets better performance from it.
    • Add VLLM_ASCEND_ENABLE_MOE_ALL2ALL_SEQ to the environment variables to enable MoE all2all seq; this provides a basic framework on top of alltoall for easy expansion.
  • UT coverage reached 76.34% after a batch of PRs following this RFC: #1298

  • Sequence parallelism now works for Qwen3 MoE. #2209

  • Chinese online documentation is now available. #1870
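
The new environment variables listed under the parameter changes can be set before the engine is created; a short Python sketch follows. The variable names come from this release note, while the model and the rest of the launch code are only illustrative.

```python
import os

# Set the new switches before importing vLLM so they are picked up at engine
# initialization (illustrative ordering; variable names are from this release).
os.environ["VLLM_ASCEND_MLA_PA"] = "1"                   # MLA paged attention for DeepSeek MLA decode
os.environ["VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE"] = "1"  # MatmulAllReduce fusion when TP is enabled
os.environ["VLLM_ASCEND_ENABLE_MOE_ALL2ALL_SEQ"] = "1"   # MoE all2all seq framework

from vllm import LLM

# Example launch; model name and parallel size are placeholders.
llm = LLM(model="deepseek-ai/DeepSeek-V2-Lite", tensor_parallel_size=2)
```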

Known Issues

  • Aclgraph currently does not work with DP + EP; the main gap is that the number of NPU streams Aclgraph needs to capture the graph is not enough. #2229
  • There is an accuracy issue with W8A8 dynamic quantized DeepSeek when multistream is enabled. This will be fixed in the next release. #2232
  • In Qwen3 MoE, SP cannot be incorporated into the Aclgraph. #2246
  • MTP does not support the V1 scheduler currently; this will be fixed in Q3. #2254
  • When running MTP with DP > 1, the metrics logger needs to be disabled due to an issue in vLLM (see the sketch after this list). #2254
  • The GLM-4.5 model has an accuracy problem in long-output-length scenarios.
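
For the MTP with DP > 1 workaround mentioned above, the sketch below disables the metrics logger at engine construction. `disable_log_stats` is the standard vLLM engine argument for turning off the stats logger; the model, the parallel sizes, and the `data_parallel_size` argument are illustrative assumptions, and the MTP/speculative-decoding configuration itself is omitted.

```python
# Illustrative workaround for the MTP + DP > 1 known issue: disable the
# metrics/stats logger when constructing the engine. The speculative (MTP)
# configuration is omitted here; only the logger workaround is shown.
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",  # example model
    tensor_parallel_size=8,
    data_parallel_size=2,             # assumption: DP configured via engine args
    disable_log_stats=True,           # turn off the metrics logger
)
```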

New Contributors

Full Changelog: v0.9.2rc1...v0.10.0rc1