v0.10.2rc1
Pre-release
This is the 1st release candidate of v0.10.2 for vLLM Ascend. Please follow the official doc to get started.
Highlights
- Add support for Qwen3 Next. Please note that the expert parallel and MTP features don't work with this release; we'll enable them soon. Follow the official guide to get started (a minimal launch sketch follows this list). #2917
- Add quantization support for aclgraph #2841
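For reference, here is a minimal offline-inference sketch for trying Qwen3 Next on vLLM Ascend. The model ID, tensor-parallel size, and sampling settings are placeholders rather than values from this release; follow the official guide linked above for the supported configuration.

```python
# Minimal offline-inference sketch for Qwen3 Next on vLLM Ascend.
# Expert parallel and MTP are not enabled here because they don't work in this release.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",  # assumed model ID, adjust as needed
    tensor_parallel_size=4,                    # assumed parallel config
)
outputs = llm.generate(
    ["Give me a one-line summary of vLLM Ascend."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```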
Core
- Aclgraph now works with Ray backend. #2589
- MTP now works with more than one speculative token. #2708
- Qwen2.5 VL now works with quantization. #2778
- Improved performance when the async scheduler is enabled. #2783
- Fixed a performance regression for non-MLA models when using the default scheduler. #2894
Other
- The performance of w8a8 quantization is improved. #2275
- The performance of MoE models is improved. #2689 #2842
- Fixed a resource limit error when applying speculative decoding together with aclgraph. #2472
- Fixed the git config error in docker images. #2746
- Fixed a sliding window attention bug in prefill. #2758
- The official doc for Prefill Decode Disaggregation with Qwen3 is added. #2751
- `VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP` env works again. #2740
- A new improvement for oproj in DeepSeek is added. Set `oproj_tensor_parallel_size` to enable this feature. #2167
- Fixed a bug where DeepSeek with torchair doesn't work as expected when `graph_batch_sizes` is set. #2760
- Avoid duplicate generation of sin_cos_cache in rope when kv_seqlen > 4k. #2744
- The performance of the Qwen3 dense model is improved with flashcomm_v1. Set `VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE=1` and `VLLM_ASCEND_ENABLE_FLASHCOMM=1` to enable it (see the sketch after this list). #2779
- The performance of the Qwen3 dense model is improved with the prefetch feature. Set `VLLM_ASCEND_ENABLE_PREFETCH_MLP=1` to enable it. #2816
- The performance of the Qwen3 MoE model is improved with a rope ops update. #2571
- Fixed a weight loading error in the RLHF case. #2756
- Added a warm_up_atb step to speed up inference. #2823
- Fixed the aclgraph stream error for MoE models. #2827
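For the flashcomm_v1 and prefetch items above, here is a minimal sketch of how the flags could be enabled for an offline run. The model ID is a placeholder, and setting the variables from Python before importing vLLM is just one way to export them; exporting them in the shell before `vllm serve` works as well.

```python
import os

# Enable the Qwen3 dense-model optimizations described above. Set the flags
# before vLLM is imported so they are visible when the engine initializes.
os.environ["VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE"] = "1"  # dense-model optimization path
os.environ["VLLM_ASCEND_ENABLE_FLASHCOMM"] = "1"       # flashcomm_v1
os.environ["VLLM_ASCEND_ENABLE_PREFETCH_MLP"] = "1"    # MLP prefetch feature

from vllm import LLM, SamplingParams

# "Qwen/Qwen3-8B" is a placeholder; use whichever Qwen3 dense model you run.
llm = LLM(model="Qwen/Qwen3-8B")
print(llm.generate(["Hello"], SamplingParams(max_tokens=8))[0].outputs[0].text)
```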
Known issue
- The server hangs when running Prefill Decode Disaggregation with different TP sizes for P and D. It is fixed by a vLLM commit that is not included in v0.10.2; you can cherry-pick that commit to resolve the issue.
- The HBM usage of Qwen3 Next is higher than expected. It's a known issue and we're working on it. You can set `max_model_len` and `gpu_memory_utilization` to suitable values based on your parallel config to avoid OOM errors (see the sketch after this list).
- We noticed that LoRA doesn't work with this release due to the refactor of kv cache. We'll fix it soon. #2941
- Please do not enable chunked prefill together with prefix cache when running with the Ascend scheduler; performance and accuracy suffer. #2943
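For the Qwen3 Next HBM issue above, a minimal sketch of bounding memory usage is shown below. The model ID and the specific values of `max_model_len` and `gpu_memory_utilization` are placeholders; tune them to your parallel configuration.

```python
from vllm import LLM

# Work around the higher-than-expected HBM usage of Qwen3 Next by capping the
# context length and the fraction of device memory vLLM may claim.
# All values are placeholders; adjust them to your parallel configuration.
llm = LLM(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",  # assumed model ID
    tensor_parallel_size=4,                    # assumed parallel config
    max_model_len=8192,                        # cap the context length
    gpu_memory_utilization=0.85,               # leave HBM headroom to avoid OOM
)
```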
New Contributors
- @WithHades made their first contribution in #2589
- @vllm-ascend-ci made their first contribution in #2755
- @1092626063 made their first contribution in #2708
- @marcobarlo made their first contribution in #2039
- @realliujiaxu made their first contribution in #2719
- @machenglong2025 made their first contribution in #2805
- @fffrog made their first contribution in #2815
- @anon189Ty made their first contribution in #2619
- @zhaozx-cn made their first contribution in #2787
- @wenba0 made their first contribution in #2778
- @wuweiqiang24 made their first contribution in #2814
- @wyu0-0 made their first contribution in #2857
- @nwpu-zxr made their first contribution in #2824
Full Changelog: v0.10.1rc1...v0.10.2rc1