Releases: vllm-project/vllm-ascend

v0.10.1rc1

04 Sep 03:30
7e16b4a
Pre-release

This is the 1st release candidate of v0.10.1 for vLLM Ascend. Please follow the official doc to get started.

Highlights

  • LoRA performance is significantly improved by adding custom kernels, contributed by China Merchants Bank. #2325
  • Support Mooncake TransferEngine for KV cache registration and a pull_blocks style disaggregated prefill implementation. #1568
  • Custom ops can now be captured into aclgraph. #2113

Core

  • Add MLP tensor parallel to improve performance, but note that this will increase memory usage. #2120
  • openEuler is upgraded to 24.03. #2631
  • Add custom lmhead tensor parallel to reduce memory consumption and improve TPOT performance. #2309
  • Qwen3 MoE and Qwen2.5 support torchair graph mode now. #2403
  • Support Sliding Window Attention with AscendScheduler, fixing the Gemma3 accuracy issue. #2528

Other

  • Bug fixes:
    • Update the graph capture size calculation, which alleviates the problem of NPU streams running out in some scenarios. #2511
    • Fix bugs and refactor cached mask generation logic. #2442
    • Fix the NZ format not working in quantization scenarios. #2549
    • Fix accuracy issue on Qwen series models caused by enabling enable_shared_expert_dp by default. #2457
    • Fix accuracy issue on models whose rope dim is not equal to head dim, e.g., GLM4.5. #2601
  • Performance improved through a number of PRs:
    • Remove torch.cat and replace it with List[0]. #2153
    • Convert the format of GMM to NZ. #2474
    • Optimize parallel strategies to reduce communication overhead. #2198
    • Optimize the reject sampler in the greedy situation. #2137
  • A batch of refactoring PRs to enhance the code architecture:
    • Refactor on MLA. #2465
    • Refactor on torchair fused_moe. #2438
    • Refactor on allgather/mc2-related fused_experts. #2369
    • Refactor on torchair model runner. #2208
    • Refactor on CI. #2276
  • Parameter changes:
    • Add lmhead_tensor_parallel_size to additional_config; set it to enable lmhead tensor parallel. #2309
    • The unused environment variables HCCN_PATH, PROMPT_DEVICE_ID, DECODE_DEVICE_ID, LLMDATADIST_COMM_PORT and LLMDATADIST_SYNC_CACHE_WAIT_TIME are removed. #2448
    • The environment variable VLLM_LLMDD_RPC_PORT is renamed to VLLM_ASCEND_LLMDD_RPC_PORT. #2450
    • Add the environment variable VLLM_ASCEND_ENABLE_MLP_OPTIMIZE, which enables the MLP optimization when tensor parallel is enabled; this feature gives better performance in eager mode. #2120
    • Remove the environment variables MOE_ALL2ALL_BUFFER and VLLM_ASCEND_ENABLE_MOE_ALL2ALL_SEQ. #2612
    • Add enable_prefetch to additional_config, which controls whether weight prefetch is enabled. #2465
    • Add mode to additional_config.torchair_graph_config; it must be set when using the reduce-overhead mode for torchair. #2461
    • enable_shared_expert_dp in additional_config is disabled by default now; it is recommended to enable it when inferencing with DeepSeek. #2457 A hedged configuration sketch follows this list.
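
A hedged sketch of how these options might be wired together with the offline LLM API. The option keys come from the notes above, but the model name, tensor_parallel_size value and the exact shape of torchair_graph_config are assumptions for illustration; check the official doc before relying on them.

```python
import os

# Assumed toggle from this release (#2120); intended for eager mode with tensor parallel.
os.environ["VLLM_ASCEND_ENABLE_MLP_OPTIMIZE"] = "1"

from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",          # example model, not prescribed by the notes
    tensor_parallel_size=8,                   # example value
    additional_config={
        "lmhead_tensor_parallel_size": 2,     # enables lmhead tensor parallel (#2309)
        "enable_prefetch": True,              # weight prefetch (#2465)
        "enable_shared_expert_dp": True,      # off by default; recommended for DeepSeek (#2457)
        "torchair_graph_config": {
            "enabled": True,                  # assumed key; see the torchair docs
            "mode": "reduce-overhead",        # 'mode' must be set for reduce-overhead torchair (#2461)
        },
    },
)
```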

Known Issues

  • Sliding window attention does not support chunked prefill currently, so AscendScheduler must be enabled to run with it. #2729
  • There is a bug in creating mc2_mask when MultiStream is enabled; it will be fixed in the next release. #2681

New Contributors

Full Changelog: v0.10.0rc1...v0.10.1rc1

v0.9.1

03 Sep 10:05
0740d10

We are excited to announce the newest official release of vLLM Ascend. This release includes many new features, performance improvements and bug fixes. We recommend that users upgrade from 0.7.3 to this version. Please always set VLLM_USE_V1=1 to use the V1 engine.

In this release, we added many enhancements for the large scale expert parallelism case. It's recommended to follow the official guide.

Please note that this release note lists all the important changes since the last official release (v0.7.3).

Highlights

  • DeepSeek V3/R1 is supported with high quality and performance. MTP can work with DeepSeek as well. Please refer to the multi-node tutorials and Large Scale Expert Parallelism.
  • Qwen series models work with graph mode now. It works by default with V1 Engine. Please refer to Qwen tutorials.
  • Disaggregated Prefilling support for V1 Engine. Please refer to Large Scale Expert Parallelism tutorials.
  • Automatic prefix caching and chunked prefill features are supported.
  • The speculative decoding feature works with the Ngram and MTP methods.
  • MoE and dense W4A8 quantization are supported now. Please refer to the quantization guide.
  • The Sleep Mode feature is supported for the V1 engine. Please refer to the Sleep mode tutorials; a minimal sketch follows this list.
  • Dynamic and Static EPLB support is added. This feature is still experimental.
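
For the Sleep Mode highlight above, a minimal sketch based on vLLM's generic sleep/wake_up API; the model name and the sleep level are assumptions for illustration, so follow the Sleep mode tutorials for the supported usage on Ascend.

```python
from vllm import LLM

# Sleep mode has to be requested when the engine is built; example model only.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enable_sleep_mode=True)

print(llm.generate(["Hello"])[0].outputs[0].text)

llm.sleep(level=1)   # level 1: offload weights to host memory and discard the KV cache
llm.wake_up()        # restore the engine before serving requests again

print(llm.generate(["Hello again"])[0].outputs[0].text)
```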

Note

The following notes are especially for reference when upgrading from last final release (v0.7.3):

  • V0 Engine is not supported from this release. Please always set VLLM_USE_V1=1 to use the V1 engine with vLLM Ascend (a minimal sketch follows this list).
  • Mindie Turbo is not needed with this release, and the old version of Mindie Turbo is not compatible; please do not install it. Currently all of its functionality and enhancements are already included in vLLM Ascend. We may consider adding it back in the future if needed.
  • Torch-npu is upgraded to 2.5.1.post1. CANN is upgraded to 8.2.RC1. Don't forget to upgrade them.
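
A minimal offline-inference sketch with the V1 engine forced on, as noted above. The environment variable must be set before vllm is imported; the model name and sampling parameters are only examples.

```python
import os

# V0 is not supported from this release; always force the V1 engine.
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")                        # example model
outputs = llm.generate(["Hello, vLLM Ascend!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```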

Core

  • The Ascend scheduler is added for the V1 engine. This scheduler is better suited to Ascend hardware.
  • Structured output feature works now on V1 Engine.
  • A batch of custom ops are added to improve the performance.

Changes

  • EPLB support for Qwen3-moe model. #2000
  • Fix the bug that MTP doesn't work well with Prefill Decode Disaggregation. #2610 #2554 #2531
  • Fix a few bugs to make sure Prefill Decode Disaggregation works well. #2538 #2509 #2502
  • Fix file not found error with shutil.rmtree in torchair mode. #2506

Known Issues

  • When running MoE models, aclgraph mode only works with tensor parallel; DP/EP doesn't work in this release.
  • Pipeline parallelism is not supported in this release for V1 engine.
  • If you use w4a8 quantization with eager mode, please set VLLM_ASCEND_MLA_PARALLEL=1 to avoid OOM errors.
  • Accuracy tests with some tools may not be correct. This doesn't affect real user cases. We'll fix it in the next post release. #2654
  • We notice that there are still some problems when running vLLM Ascend with Prefill Decode Disaggregation. For example, memory may leak and the service may get stuck. These are caused by known issues in vLLM and vLLM Ascend. We'll fix them in the next post release. #2650 #2604 vLLM#22736 vLLM#23554 vLLM#23981

v0.9.1rc3

22 Aug 10:48
763ed69
Pre-release

This is the 3rd release candidate of v0.9.1 for vLLM Ascend. Please follow the official doc to get started.

Core

  • MTP supports V1 scheduler #2371
  • Add LMhead TP communication groups #1956
  • Fix the bug that qwen3 moe doesn't work with aclgraph #2478
  • Fix grammar_bitmask IndexError caused by outdated apply_grammar_bitmask method #2314
  • Remove chunked_prefill_for_mla #2177
  • Fix bugs and refactor cached mask generation logic #2326
  • Fix configuration check logic about ascend scheduler #2327
  • Cancel the verification between deepseek-mtp and non-ascend scheduler in disaggregated-prefill deployment #2368
  • Fix issue that failed with ray distributed backend #2306
  • Fix incorrect req block length in ascend scheduler #2394
  • Fix header include issue in rope #2398
  • Fix mtp config bug #2412
  • Fix error info and adapt to the attn_metadata refactor #2402
  • Fix torchair runtime error caused by configuration mismatches and a missing .kv_cache_bytes file #2312
  • Move with_prefill allreduce from cpu to npu #2230

Docs

  • Add document for deepseek large EP #2339

Known Issues

  • Full graph mode is not yet available for some cases with full_cuda_graph enabled. #2182

Full Changelog: v0.9.1rc2...v0.9.1rc3

v0.10.0rc1

07 Aug 06:48
4604882
Pre-release

This is the 1st release candidate of v0.10.0 for vLLM Ascend. Please follow the official doc to get started. V0 is completely removed from this version.

Highlights

  • Disaggregated prefill works with the V1 engine now. You can try it with the DeepSeek model #950, following this tutorial.
  • The W4A8 quantization method is supported for dense and MoE models now. #2060 #2172

Core

  • Ascend PyTorch adapter (torch_npu) has been upgraded to 2.7.1.dev20250724. #1562 CANN has been upgraded to 8.2.RC1. #1653 Don't forget to update them in your environment or use the latest images.
  • vLLM Ascend works on Atlas 800I A3 now, and the image on A3 will be released from this version on. #1582
  • Kimi-K2 with W8A8 quantization, Qwen3-Coder and GLM-4.5 are supported in vLLM Ascend; please follow this tutorial to try them. #2162
  • Pipeline Parallelism is supported in V1 now. #1800
  • The prefix cache feature now works with the Ascend Scheduler. #1446
  • Torchair graph mode works with tp > 4 now. #1508
  • MTP supports torchair graph mode now. #2145

Other

  • Bug fixes:

    • Fix functional problem of multi-modality models like Qwen2-audio with Aclgraph. #1803
    • Fix the process group creating error with external launch scenario. #1681
    • Fix the functional problem with guided decoding. #2022
    • Fix the accuracy issue with common MoE models in DP scenario. #1856
  • Performance improved through a number of PRs:

    • Cache sin/cos instead of calculating them in every layer. #1890
    • Improve shared expert multi-stream parallelism. #1891
    • Implement the fusion of allreduce and matmul in prefill phase when tp is enabled. Enable this feature by setting VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE to 1. #1926
    • Optimize Quantized MoE Performance by Reducing All2All Communication. #2195
    • Use AddRmsNormQuant ops in the custom model to optimize Qwen3's performance. #1806
    • Use multicast to avoid padding decode requests to prefill size. #1555
    • The performance of LoRA has been improved. #1884
  • A batch of refactoring PRs to enhance the code architecture:

    • Torchair model runner refactor #2205
    • Refactoring forward_context and model_runner_v1. #1979
    • Refactor AscendMetaData Comments. #1967
    • Refactor torchair utils. #1892
    • Refactor torchair worker. #1885
    • Register activation customop instead of overwrite forward_oot. #1841
  • Parameter changes:

    • expert_tensor_parallel_size in additional_config is removed now, and EP and TP are aligned with vLLM. #1681
    • Add the environment variable VLLM_ASCEND_MLA_PA; use it to enable the MLA paged attention operator for DeepSeek MLA decode.
    • Add the environment variable VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE, which enables the MatmulAllReduce fusion kernel when tensor parallel is enabled. This feature is supported on A2, and eager mode gets better performance.
    • Add the environment variable VLLM_ASCEND_ENABLE_MOE_ALL2ALL_SEQ, which enables MoE all2all-seq; this provides a basic framework on top of alltoall for easy expansion. A hedged sketch of setting these variables follows this list.
  • UT coverage reached 76.34% after a batch of prs followed by this rfc: #1298

  • Sequence Parallelism works for Qwen3 MoE. #2209

  • Chinese online document is added now. #1870
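
As referenced under the parameter changes above, a hedged sketch of setting the new environment variables before building the engine. Whether each toggle helps depends on the model and parallel setup; the model name and tensor_parallel_size here are assumptions for illustration.

```python
import os

# Optional toggles introduced in this release; set them before importing vllm.
os.environ["VLLM_ASCEND_MLA_PA"] = "1"                   # MLA paged attention for DeepSeek MLA decode
os.environ["VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE"] = "1"  # MatmulAllReduce fusion with tensor parallel (A2, eager mode)
os.environ["VLLM_ASCEND_ENABLE_MOE_ALL2ALL_SEQ"] = "1"   # MoE all2all-seq framework

from vllm import LLM

llm = LLM(model="deepseek-ai/DeepSeek-V3", tensor_parallel_size=8)  # example model / TP size
```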

Known Issues

  • Aclgraph can not work with DP + EP currently; the main gap is that the number of NPU streams Aclgraph needs to capture the graph is not enough. #2229
  • There is an accuracy issue on W8A8 dynamic quantized DeepSeek with multistream enabled. This will be fixed in the next release. #2232
  • In Qwen3 MoE, SP cannot be incorporated into the Aclgraph. #2246
  • MTP does not support the V1 scheduler currently; this will be fixed in Q3. #2254
  • When running MTP with DP > 1, the metrics logger needs to be disabled due to an issue in vLLM. #2254
  • The GLM-4.5 model has an accuracy problem in long output length scenarios.

New Contributors

Full Changelog: v0.9.2rc1...v0.10.0rc1

v0.9.1rc2

06 Aug 01:15
b9f715d
Pre-release

This is the 2nd release candidate of v0.9.1 for vLLM Ascend. Please follow the official doc to get started.

Highlights

  • MoE and dense W4A8 quantization are supported now: #1320 #1910 #1275 #1480
  • Dynamic EPLB support in #1943
  • Disaggregated Prefilling support and improvements for the V1 Engine: continued development and stabilization of the disaggregated prefill feature, including performance enhancements and bug fixes for single-machine setups. #1953 #1612 #1361 #1746 #1552 #1801 #2083 #1989

Models improvement:

Graph mode improvement:

  • Fix DeepSeek with MC2 in #1269
  • Fix accuracy problem for deepseek V3/R1 models with torchair graph in long sequence predictions in #1332
  • Fix torchair_graph_batch_sizes bug in #1570
  • Enable the limit of tp <= 4 for torchair graph mode in #1404
  • Fix rope accuracy bug #1887
  • Support multistream of shared experts in FusedMoE #997
  • Enable kvcache_nz for the decode process in torchair graph mode #1098
  • Fix chunked-prefill with torchair case to resolve UnboundLocalError: local variable 'decode_hs_or_q_c' issue in #1378
  • Improve shared experts multi-stream perf for w8a8 dynamic in #1561
  • Repair MoE error when multistream is set in #1882
  • Round up graph batch size to tp size in EP case #1610
  • Fix torchair bug when DP is enabled in #1727
  • Add extra checking to torchair_graph_config in #1675
  • Fix rope bug in torchair+chunk-prefill scenario in #1693
  • Fix torchair_graph bug when chunked_prefill is true in #1748
  • Improve prefill optimization to support torchair graph mode in #2090
  • Fix rank set in DP scenario #1247
  • Reset all unused positions to prevent out-of-bounds access and resolve the GatherV3 bug in #1397
  • Remove duplicate multimodal codes in ModelRunner in #1393
  • Fix block table shape to resolve accuracy issue in #1297
  • Implement primal full graph with limited scenario in #1503
  • Restore paged attention kernel in Full Graph for performance in #1677
  • Fix DeepSeek OOM issue in extreme --gpu-memory-utilization scenario in #1829
  • Turn off aclgraph when enabling TorchAir in #2154

Ops improvement:

  • Add custom AscendC kernel vocabparallelembedding #796
  • Fix rope sin/cos cache bug in #1267
  • Refactoring AscendFusedMoE (#1229) in #1264
  • Use fused ops npu_top_k_top_p in sampler #1920

Core:

  • Upgrade CANN to 8.2.rc1 in #2036
  • Upgrade torch-npu to 2.5.1.post1 in #2135
  • Upgrade python to 3.11 in #2136
  • Disable quantization in mindie_turbo in #1749
  • Fix V0 spec decode in #1323
  • Enable ACL_OP_INIT_MODE=1 directly only when using V0 spec decode in #1271
  • Refactoring forward_context and model_runner_v1 in #1422
  • Fix sampling params in #1423
  • Add a switch for enabling NZ layout in weights and enable NZ for GMM in #1409
  • Resolved bug in ascend_forward_context in #1449 #1554 #1598
  • Address PrefillCacheHit state to fix prefix cache accuracy bug in #1492
  • Fix load weight error and add new e2e case in #1651
  • Optimize the number of rope-related index selections in DeepSeek in #1614
  • Add mc2 mask in #1642
  • Fix static EPLB log2phy condition and improve unit test in #1667 #1896 #2003
  • Add chunk mc2 for prefill in #1703
  • Fix mc2 op GroupCoordinator bug in #1711
  • Fix the failure to recognize the actual type of quantization i...

v0.9.2rc1

11 Jul 09:51
b5b7e0e
Pre-release

This is the 1st release candidate of v0.9.2 for vLLM Ascend. Please follow the official doc to get started. From this release, the V1 engine is enabled by default, so there is no need to set VLLM_USE_V1=1 any more. This release is the last version to support the V0 engine; V0 code will be cleaned up in the future.

Highlights

  • Pooling models work with the V1 engine now. You can try it with the Qwen3 embedding model #1359 (a minimal sketch follows this list).
  • The performance on Atlas 300I series has been improved. #1591
  • Aclgraph mode works with MoE models now. Currently, only Qwen3 MoE is well tested. #1381
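
For the pooling-model highlight above, a minimal embedding sketch using vLLM's generic pooling API; the exact model name and the task argument are assumptions for illustration rather than part of the release notes.

```python
from vllm import LLM

# Example embedding model; the notes only reference "Qwen3 embedding model" (#1359).
llm = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")

outputs = llm.embed(["vLLM Ascend now runs pooling models on the V1 engine."])
print(len(outputs[0].outputs.embedding))   # embedding dimensionality
```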

Core

  • Ascend PyTorch adapter (torch_npu) has been upgraded to 2.5.1.post1.dev20250619. Don’t forget to update it in your environment. #1347
  • The GatherV3 error has been fixed with aclgraph mode. #1416
  • W8A8 quantization works on Atlas 300I series now. #1560
  • Fix the accuracy problem when deploying models with parallel parameters. #1678
  • The pre-built wheel package now requires a lower version of glibc. Users can install it via pip install vllm-ascend directly. #1582

Other

  • The official doc has been updated for a better reading experience. For example, more deployment tutorials were added and user/developer docs were updated. More guides are coming soon.
  • Fix accuracy problem for DeepSeek V3/R1 models with torchair graph in long sequence predictions. #1331
  • A new env variable VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP has been added. It enables the fused allgather-experts kernel for DeepSeek V3/R1 models. The default value is 0. #1335
  • A new env variable VLLM_ASCEND_ENABLE_TOPK_TOPP_OPTIMIZATION has been added to improve the performance of topk-topp sampling. The default value is 0; we'll consider enabling it by default in the future. #1732
  • A batch of bugs has been fixed for the Data Parallelism case. #1273 #1322 #1275 #1478
  • The DeepSeek performance has been improved. #1194 #1395 #1380
  • Ascend scheduler works with prefix cache now. #1446
  • DeepSeek works with prefix cache now. #1498
  • Support prompt logprobs to recover ceval accuracy in V1. #1483

Known Issues

  • Pipeline parallel does not work with ray and graph mode: #1751 #1754

New Contributors

Full Changelog: v0.9.1rc1...v0.9.2rc1

v0.9.1rc1

22 Jun 07:08
c30ddb8
Pre-release

This is the 1st release candidate of v0.9.1 for vLLM Ascend. Please follow the official doc to get started.

Experimental

  • Atlas 300I series is experimentally supported in this release (functional tests passed with Qwen2.5-7b-instruct/Qwen2.5-0.5b/Qwen3-0.6B/Qwen3-4B/Qwen3-8B). #1333
  • Support EAGLE-3 for speculative decoding. #1032

After careful consideration, the above features will NOT be included in the v0.9.1-dev branch (v0.9.1 final release), taking into account the v0.9.1 release quality and the rapid iteration of these features. We will improve this from 0.9.2rc1 onward.

Core

  • Ascend PyTorch adapter (torch_npu) has been upgraded to 2.5.1.post1.dev20250528. Don’t forget to update it in your environment. #1235
  • Support Atlas 300I series container image. You can get it from quay.io.
  • Fix token-wise padding mechanism to make multi-card graph mode work. #1300
  • Upgrade vLLM to 0.9.1. #1165

Other Improvements

  • Initial support of Chunked Prefill for MLA. #1172
  • An example of best practices to run DeepSeek with ETP has been added. #1101
  • Performance improvements for DeepSeek using the TorchAir graph. #1098, #1131
  • Supports the speculative decoding feature with AscendScheduler. #943
  • Improve VocabParallelEmbedding custom op performance. It will be enabled in the next release. #796
  • Fixed a device discovery and setup bug when running vLLM Ascend on Ray #884
  • DeepSeek with MC2 (Merged Compute and Communication) now works properly. #1268
  • Fixed log2phy NoneType bug with static EPLB feature. #1186
  • Improved performance for DeepSeek with DBO enabled. #997, #1135
  • Refactoring AscendFusedMoE #1229
  • Add initial user stories page (include LLaMA-Factory/TRL/verl/MindIE Turbo/GPUStack) #1224
  • Add unit test framework #1201

Known Issues

  • In some cases, the vLLM process may crash with a GatherV3 error when aclgraph is enabled. We are working on this issue and will fix it in the next release. #1038
  • The prefix cache feature does not work with the Ascend Scheduler when chunked prefill is not enabled. This will be fixed in the next release. #1350

New Contributors

Full Changelog: v0.9.0rc2...v0.9.1rc1

v0.9.0rc2

10 Jun 14:29
8dd686d
Pre-release

This is the 2nd official release candidate of v0.9.0 for vllm-ascend. Please follow the official doc to start the journey. From this release, the V1 Engine is recommended. The code of the V0 Engine is frozen and will not be maintained any more. Please set the environment variable VLLM_USE_V1=1 to enable the V1 Engine.

Highlights

  • DeepSeek works with graph mode now. Follow the official doc to try it. #789
  • Qwen series models work with graph mode now. It is on by default with the V1 Engine. Please note that in this release only Qwen series models are well tested with graph mode; we'll make it stable and general in the next release. If you hit any issues, please feel free to open an issue on GitHub and fall back to eager mode temporarily by setting enforce_eager=True when initializing the model (see the sketch after this list).
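
A minimal sketch of the eager-mode fallback mentioned above; the model name is only an example.

```python
from vllm import LLM

# Graph mode is used by default with the V1 Engine; if a model misbehaves under it,
# temporarily fall back to eager mode while the issue is investigated.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enforce_eager=True)   # example model
print(llm.generate(["Hello"])[0].outputs[0].text)
```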

Core

  • The performance of multi-step scheduler has been improved. Thanks for the contribution from China Merchants Bank. #814
  • LoRA, Multi-LoRA and Dynamic Serving are supported for the V1 Engine now. Thanks for the contribution from China Merchants Bank. #893
  • Prefix cache and chunked prefill features work now. #782 #844
  • Spec decode and MTP features work with V1 Engine now. #874 #890
  • DP feature works with DeepSeek now. #1012
  • Input embedding feature works with V0 Engine now. #916
  • Sleep mode feature works with V1 Engine now. #1084

Model

  • Qwen2.5 VL works with V1 Engine now. #736
  • Llama4 works now. #740
  • A new variant of the DeepSeek model with dual-batch overlap (DBO) is added. Please set VLLM_ASCEND_ENABLE_DBO=1 to use it. #941

Other

Known Issue

  • In some cases, the vLLM process may crash with aclgraph enabled. We're working on this issue and it'll be fixed in the next release. #1038
  • Multi-node data parallel doesn't work with this release. This is a known issue in vLLM and has been fixed on the main branch. #18981

New Contributors

v0.9.0rc1

10 Jun 01:17
706de02
Pre-release

Just a pre-release for 0.9.0. There are still some known bugs in this release.

v0.7.3.post1

29 May 09:50
c69ceac

This is the first post release of 0.7.3. Please follow the official doc to start the journey. It includes the following changes:

Highlights

  • Qwen3 and Qwen3MoE are supported now. The performance and accuracy of Qwen3 are well tested; you can try it now. Mindie Turbo is recommended to improve the performance of Qwen3. #903 #915
  • Added a new performance guide. The guide aims to help users improve vllm-ascend performance at the system level. It covers OS configuration, library optimization, deployment guidance and so on. #878 Doc Link

Bug Fix

  • Qwen2.5-VL works for RLHF scenarios now. #928
  • Users can launch models from online weights now, e.g. directly from Hugging Face or ModelScope. #858 #918
  • The meaningless log message UserWorkspaceSize0 has been cleaned up. #911
  • The log level for Failed to import vllm_ascend_C has been changed to warning instead of error. #956
  • DeepSeek MLA now works with chunked prefill in the V1 Engine. Please note that the V1 engine in 0.7.3 is just experimental and only for test usage. #849 #936

Docs

  • The benchmark doc is updated for Qwen2.5 and Qwen2.5-VL #792
  • Add a note to clarify that only "modelscope<1.23.0" works with 0.7.3. #954