Skip to content

Commit 1b5d5ab

Browse files
authored
[ReleaseNote] Add release note for v0.13.0rc1 (#5334)
### What this PR does / why we need it? Add release note for v0.13.0rc1 - vLLM version: release/v0.13.0 - vLLM main: vllm-project/vllm@bc0a5a0 --------- Signed-off-by: MengqingCao <cmq0113@163.com>
1 parent 58adf7c commit 1b5d5ab

File tree

6 files changed

+69
-8
lines changed

6 files changed

+69
-8
lines changed

README.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -53,7 +53,7 @@ Please use the following recommended versions to get started quickly:
5353

5454
| Version | Release type | Doc |
5555
|------------|--------------|--------------------------------------|
56-
|v0.12.0rc1|Latest release candidate|[QuickStart](https://vllm-ascend.readthedocs.io/en/latest/quick_start.html) and [Installation](https://vllm-ascend.readthedocs.io/en/latest/installation.html) for more details|
56+
|v0.13.0rc1|Latest release candidate|[QuickStart](https://vllm-ascend.readthedocs.io/en/latest/quick_start.html) and [Installation](https://vllm-ascend.readthedocs.io/en/latest/installation.html) for more details|
5757
|v0.11.0|Latest stable version|[QuickStart](https://vllm-ascend.readthedocs.io/en/v0.11.0-dev/quick_start.html) and [Installation](https://vllm-ascend.readthedocs.io/en/v0.11.0-dev/installation.html) for more details|
5858

5959
## Contributing

README.zh.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -54,7 +54,7 @@ vLLM 昇腾插件 (`vllm-ascend`) 是一个由社区维护的让vLLM在Ascend NP
5454

5555
| Version | Release type | Doc |
5656
|------------|--------------|--------------------------------------|
57-
|v0.12.0rc1| 最新RC版本 |请查看[快速开始](https://vllm-ascend.readthedocs.io/en/latest/quick_start.html)[安装指南](https://vllm-ascend.readthedocs.io/en/latest/installation.html)了解更多|
57+
|v0.13.0rc1| 最新RC版本 |请查看[快速开始](https://vllm-ascend.readthedocs.io/en/latest/quick_start.html)[安装指南](https://vllm-ascend.readthedocs.io/en/latest/installation.html)了解更多|
5858
|v0.11.0| 最新正式/稳定版本 |[快速开始](https://vllm-ascend.readthedocs.io/en/v0.11.0-dev/quick_start.html) and [安装指南](https://vllm-ascend.readthedocs.io/en/v0.11.0-dev/installation.html)了解更多|
5959

6060
## 贡献

docs/source/community/versioning_policy.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -22,6 +22,7 @@ The table below is the release compatibility matrix for vLLM Ascend release.
2222

2323
| vLLM Ascend | vLLM | Python | Stable CANN | PyTorch/torch_npu |
2424
|-------------|-------------------|-----------------|-------------|---------------------------------|
25+
| v0.13.0rc1 | v0.13.0 | >= 3.10, < 3.12 | 8.3.RC2 | 2.8.0 / 2.8.0 |
2526
| v0.11.0 | v0.11.0 | >= 3.9 , < 3.12 | 8.3.RC2 | 2.7.1 / 2.7.1.post1 |
2627
| v0.12.0rc1 | v0.12.0 | >= 3.10, < 3.12 | 8.3.RC2 | 2.8.0 / 2.8.0 |
2728
| v0.11.0rc3 | v0.11.0 | >= 3.9, < 3.12 | 8.3.RC2 | 2.7.1 / 2.7.1.post1 |

docs/source/conf.py

Lines changed: 5 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -65,19 +65,19 @@
6565
# the branch of vllm, used in vllm clone
6666
# - main branch: 'main'
6767
# - vX.Y.Z branch: 'vX.Y.Z'
68-
'vllm_version': 'v0.12.0',
68+
'vllm_version': 'v0.13.0',
6969
# the branch of vllm-ascend, used in vllm-ascend clone and image tag
7070
# - main branch: 'main'
7171
# - vX.Y.Z branch: latest vllm-ascend release tag
72-
'vllm_ascend_version': 'v0.12.0rc1',
72+
'vllm_ascend_version': 'v0.13.0rc1',
7373
# the newest release version of vllm-ascend and matched vLLM, used in pip install.
7474
# This value should be updated when cut down release.
75-
'pip_vllm_ascend_version': "0.12.0rc1",
76-
'pip_vllm_version': "0.12.0",
75+
'pip_vllm_ascend_version': "0.13.0rc1",
76+
'pip_vllm_version': "0.13.0",
7777
# CANN image tag
7878
'cann_image_tag': "8.3.rc2-910b-ubuntu22.04-py3.11",
7979
# vllm version in ci
80-
'ci_vllm_version': 'release/v0.13.0',
80+
'ci_vllm_version': 'v0.13.0',
8181
}
8282

8383
# For cross-file header anchors

docs/source/faqs.md

Lines changed: 13 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -3,7 +3,7 @@
33
## Version Specific FAQs
44

55
- [[v0.11.0] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/4808)
6-
- [[v0.12.0rc1] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/4984)
6+
- [[v0.13.0rc1] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/5333)
77

88
## General FAQs
99

@@ -243,3 +243,15 @@ Copy the `vllm_ascend_<tag>.tar` file (where `<tag>` is the image tag you used)
243243

244244
### 21. Why am I getting an error when executing the script to start a Docker container? The error message is: "operation not permitted".
245245
When using `--shm-size`, you may need to add the `--privileged=true` flag to your `docker run` command to grant the container necessary permissions. Please be aware that using `--privileged=true` grants the container extensive privileges on the host system, which can be a security risk. Only use this option if you understand the implications and trust the container's source.
246+
247+
### 22. How to achieve low latency in a small batch scenario?
248+
The performance of `torch_npu.npu_fused_infer_attention_score` in small batch scenario is not satisfactory, mainly due to the lack of flash decoding function. We offer an alternative operator in `tools/install_flash_infer_attention_score_ops_a2.sh` and `tools/install_flash_infer_attention_score_ops_a3.sh`, you can install it by the following instruction:
249+
250+
```bash
251+
bash tools/install_flash_infer_attention_score_ops_a2.sh
252+
## change to run the following instruction if you're using A3 machine
253+
# bash tools/install_flash_infer_attention_score_ops_a3.sh
254+
```
255+
256+
**NOTE**: Don't set `additional_config.pa_shape_list` when using this method, otherwise it will lead to another attention operator.
257+
**Important**: Please make sure you're using the **official image** of vllm-ascend, otherwise you **must change** the directory `/vllm-workspace` in `tools/install_flash_infer_attention_score_ops_a2.sh` or `tools/install_flash_infer_attention_score_ops_a3.sh` to your own or create one. If you're not in root user, you need `sudo` permission to run this script.

docs/source/user_guide/release_notes.md

Lines changed: 48 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,52 @@
11
# Release Notes
2+
## v0.13.0rc1 - 2025.12.27
3+
This is the first release candidate of v0.13.0 for vLLM Ascend. We landed lots of bug fix, performance improvement and feature support in this release. Any feedback is welcome to help us to improve vLLM Ascend. Please follow the [official doc](https://vllm-ascend.readthedocs.io/en/latest) to get started.
4+
5+
### Highlights
6+
- Improved the performance of DeepSeek V3.2, please refer to [tutorials](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/DeepSeek-V3.2.html)
7+
- Qwen3-Next MTP with chunked prefill is supported now [#4770](https://github.com/vllm-project/vllm-ascend/pull/4770), please refer to [tutorials](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/Qwen3-Next.html)
8+
- [Experimental] Prefill Context Parallel and Decode Context Parallel are supported, but notice that it is an experimental feature now, welcome any feedback. please refer to [context parallel feature guide](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/context_parallel.html)
9+
10+
### Features
11+
- Support openPangu Ultra MoE [4615](https://github.com/vllm-project/vllm-ascend/pull/4615)
12+
- A new quantization method W8A16 is supported now. [#4541](https://github.com/vllm-project/vllm-ascend/pull/4541)
13+
- Cross-machine Disaggregated Prefill is supported now. [#5008](https://github.com/vllm-project/vllm-ascend/pull/5008)
14+
- Add UCMConnector for KV Cache Offloading. [#4411](https://github.com/vllm-project/vllm-ascend/pull/4411)
15+
- Support async_scheduler and disable_padded_drafter_batch in eagle. [#4893](https://github.com/vllm-project/vllm-ascend/pull/4893)
16+
- Support pcp + mtp in full graph mode. [#4572](https://github.com/vllm-project/vllm-ascend/pull/4572)
17+
- Enhance all-reduce skipping logic for MoE models in NPUModelRunner [#5329](https://github.com/vllm-project/vllm-ascend/pull/5329)
18+
19+
### Performance
20+
Some general performance improvement:
21+
- Add l2norm triton kernel [#4595](https://github.com/vllm-project/vllm-ascend/pull/4595)
22+
- Add new pattern for AddRmsnormQuant with SP, which could only take effect in graph mode. [#5077](https://github.com/vllm-project/vllm-ascend/pull/5077)
23+
- Add async exponential while model executing. [#4501](https://github.com/vllm-project/vllm-ascend/pull/4501)
24+
- Remove the transpose step after attention and switch to transpose_batchmatmul [#5390](https://github.com/vllm-project/vllm-ascend/pull/5390)
25+
- To optimize the performance in small batch size scenario, an attention operator with flash decoding function is offered, please refer to item 22 in [FAQs](https://docs.vllm.ai/projects/ascend/en/latest/faqs.html) to enable it.
26+
27+
### Other
28+
- OOM error on VL models is fixed now. We're keeping observing it, if you hit OOM problem again, please submit an issue. [#5136](https://github.com/vllm-project/vllm-ascend/pull/5136)
29+
- Fixed an accuracy bug of Qwen3-Next-MTP when batched inferring. [#4932](https://github.com/vllm-project/vllm-ascend/pull/4932)
30+
- Fix npu-cpu offloading interface change bug. [#5290](https://github.com/vllm-project/vllm-ascend/pull/5290)
31+
- Fix MHA model runtime error in aclgraph mode [#5397](https://github.com/vllm-project/vllm-ascend/pull/5397)
32+
- Fix unsuitable moe_comm_type under ep=1 scenario [#5388](https://github.com/vllm-project/vllm-ascend/pull/5388)
33+
34+
### Deprecation & Breaking Changes
35+
- `VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE` is removed and `VLLM_ASCEND_ENABLE_PREFETCH_MLP` is recommend to replace as they always be enabled together. [#5272](https://github.com/vllm-project/vllm-ascend/pull/5272)
36+
- `VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP` is dropped now. [#5270](https://github.com/vllm-project/vllm-ascend/pull/5270)
37+
- `VLLM_ASCEND_ENABLE_NZ` is disabled for float weight case, since we notice that the performance is not good in some float case. Feel free to set it to 2 if you make sure it works for your case. [#4878](https://github.com/vllm-project/vllm-ascend/pull/4878)
38+
- `chunked_prefill_for_mla` in `additional_config` is dropped now. [#5296](https://github.com/vllm-project/vllm-ascend/pull/5296)
39+
- `dump_config` in `additional_config` is renamed to `dump_config_path` and the type is change from `dict` to `string`. [#5296](https://github.com/vllm-project/vllm-ascend/pull/5296)
40+
41+
### Dependencies
42+
- vLLM version has been upgraded to 0.13.0 and drop 0.12.0 support. [#5146](https://github.com/vllm-project/vllm-ascend/pull/5146)
43+
- Transformer version has been upgraded >= 4.57.3 [#5250](https://github.com/vllm-project/vllm-ascend/pull/5250)
44+
45+
### Known Issues
46+
- Qwen3-Next doesn't support long sequence scenario, and we should limit `gpu-memory-utilization` according to the doc to run Qwen3-Next. We'll improve it in the next release
47+
- The functional break on Qwen3-Next when the input/output is around 3.5k/1.5k is fixed, but it introduces a regression on performance. We'll fix it in next release. [#5357](https://github.com/vllm-project/vllm-ascend/issues/5357)
48+
- There is a precision issue with curl on ultra-short sequences in DeepSeek-V3.2. We'll fix it in next release. [#5370](https://github.com/vllm-project/vllm-ascend/issues/5370)
49+
250
## v0.11.0 - 2025.12.16
351
We're excited to announce the release of v0.11.0 for vLLM Ascend. This is the official release for v0.11.0. Please follow the [official doc](https://vllm-ascend.readthedocs.io/en/v0.11.0-dev) to get started. We'll consider to release post version in the future if needed. This release note will only contain the important change and note from v0.11.0rc3.
452

0 commit comments

Comments
 (0)