v0.10.2rc1
Pre-release
This is the 1st release candidate of v0.10.2 for vLLM Ascend. Please follow the official doc to get started.
Highlights
- Add support for Qwen3 Next. Please note that the expert parallel and MTP features don't work with this release; we'll enable them soon. Follow the official guide to get started (a minimal launch sketch follows this list). #2917
- Add quantization support for aclgraph #2841
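For reference, here is a minimal offline-inference sketch for trying Qwen3 Next on vLLM Ascend. The model ID, tensor-parallel size, and sampling settings are placeholders rather than values from this release; follow the official guide linked above for the supported configuration.

```python
# Minimal offline-inference sketch for Qwen3 Next on vLLM Ascend.
# Expert parallel and MTP are not enabled here because they don't work in this release.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",  # assumed model ID, adjust as needed
    tensor_parallel_size=4,                    # assumed parallel config
)
outputs = llm.generate(
    ["Give me a one-line summary of vLLM Ascend."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```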
Core
- Aclgraph now works with Ray backend. #2589
- MTP now works with more than one speculative token. #2708
- Qwen2.5 VL now works with quantization. #2778
- Improved performance when the async scheduler is enabled. #2783
- Fixed a performance regression for non-MLA models when using the default scheduler. #2894
Other
- The performance of w8a8 quantization is improved. #2275
- The performance of MoE models is improved. #2689 #2842
- Fixed a resource limit error when applying speculative decoding together with aclgraph. #2472
- Fixed the git config error in docker images. #2746
- Fixed a sliding window attention bug in prefill. #2758
- The official doc for Prefill Decode Disaggregation with Qwen3 is added. #2751
- `VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP` env works again. #2740
- A new improvement for oproj in DeepSeek is added. Set `oproj_tensor_parallel_size` to enable this feature. #2167
- Fixed a bug where DeepSeek with torchair doesn't work as expected when `graph_batch_sizes` is set. #2760
- Avoid duplicate generation of sin_cos_cache in rope when kv_seqlen > 4k. #2744
- The performance of the Qwen3 dense model is improved with flashcomm_v1. Set `VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE=1` and `VLLM_ASCEND_ENABLE_FLASHCOMM=1` to enable it (see the sketch after this list). #2779
- The performance of the Qwen3 dense model is improved with the prefetch feature. Set `VLLM_ASCEND_ENABLE_PREFETCH_MLP=1` to enable it. #2816
- The performance of the Qwen3 MoE model is improved with a rope ops update. #2571
- Fixed a weight loading error in the RLHF case. #2756
- Added a warm_up_atb step to speed up inference. #2823
- Fixed the aclgraph stream error for MoE models. #2827
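For the flashcomm_v1 and prefetch items above, here is a minimal sketch of how the flags could be enabled for an offline run. The model ID is a placeholder, and setting the variables from Python before importing vLLM is just one way to export them; exporting them in the shell before `vllm serve` works as well.

```python
import os

# Enable the Qwen3 dense-model optimizations described above. Set the flags
# before vLLM is imported so they are visible when the engine initializes.
os.environ["VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE"] = "1"  # dense-model optimization path
os.environ["VLLM_ASCEND_ENABLE_FLASHCOMM"] = "1"       # flashcomm_v1
os.environ["VLLM_ASCEND_ENABLE_PREFETCH_MLP"] = "1"    # MLP prefetch feature

from vllm import LLM, SamplingParams

# "Qwen/Qwen3-8B" is a placeholder; use whichever Qwen3 dense model you run.
llm = LLM(model="Qwen/Qwen3-8B")
print(llm.generate(["Hello"], SamplingParams(max_tokens=8))[0].outputs[0].text)
```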
Known issue
- The server hangs when running Prefill Decode Disaggregation with different TP sizes for P and D. It is fixed by a vLLM commit that is not included in v0.10.2; you can cherry-pick that commit to resolve the issue.
- The HBM usage of Qwen3 Next is higher than expected. It's a known issue and we're working on it. You can set `max_model_len` and `gpu_memory_utilization` to suitable values based on your parallel config to avoid OOM errors (see the sketch after this list).
- We noticed that LoRA doesn't work with this release due to the refactor of kv cache. We'll fix it soon. #2941
- Please do not enable chunked prefill together with prefix cache when running with the Ascend scheduler; performance and accuracy suffer. #2943
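For the Qwen3 Next HBM issue above, a minimal sketch of bounding memory usage is shown below. The model ID and the specific values of `max_model_len` and `gpu_memory_utilization` are placeholders; tune them to your parallel configuration.

```python
from vllm import LLM

# Work around the higher-than-expected HBM usage of Qwen3 Next by capping the
# context length and the fraction of device memory vLLM may claim.
# All values are placeholders; adjust them to your parallel configuration.
llm = LLM(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",  # assumed model ID
    tensor_parallel_size=4,                    # assumed parallel config
    max_model_len=8192,                        # cap the context length
    gpu_memory_utilization=0.85,               # leave HBM headroom to avoid OOM
)
```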
New Contributors
- @WithHades made their first contribution in #2589
- @vllm-ascend-ci made their first contribution in #2755
- @1092626063 made their first contribution in #2708
- @marcobarlo made their first contribution in #2039
- @realliujiaxu made their first contribution in #2719
- @machenglong2025 made their first contribution in #2805
- @fffrog made their first contribution in #2815
- @anon189Ty made their first contribution in #2619
- @zhaozx-cn made their first contribution in #2787
- @wenba0 made their first contribution in #2778
- @wuweiqiang24 made their first contribution in #2814
- @wyu0-0 made their first contribution in #2857
- @nwpu-zxr made their first contribution in #2824
Full Changelog: v0.10.1rc1...v0.10.2rc1