v0.11.0rc3
Pre-release
This is the third release candidate of v0.11.0 for vLLM Ascend. For quality reasons, we are publishing one more release candidate before the official release. Thanks for all your feedback. Please follow the official doc to get started.
Highlights
- torch-npu is upgraded to 2.7.1.post1. Please note that the package is published to a PyPI mirror, so it cannot be pulled in as an automatic dependency. Please install it manually.
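A minimal install sketch for the manual torch-npu step above. The `--extra-index-url` value is an assumption for illustration; substitute the mirror URL given in the official vLLM Ascend doc.

```shell
# torch-npu is not declared as an automatic dependency because it lives on a
# PyPI mirror; install the pinned version manually before installing vLLM Ascend.
# NOTE: the mirror URL below is a placeholder, not an official value.
pip install torch-npu==2.7.1.post1 --extra-index-url <your-pypi-mirror-url>
```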
- Disable the NZ weight loader to speed up dense models. Please note that this is a temporary solution; if you see a performance regression, please let us know. We'll keep improving it. #4495
- mooncake is now installed in the official Docker image, so you can use it directly in the container. #4506
Other
- Fix an OOM issue for MoE models. #4367
- Fix a hang issue with multimodal models when running with DP>1. #4393
- Fix some bugs in EPLB. #4416
- Fix a bug in the mtp>1 + lm_head_tp>1 case. #4360
- Fix an accuracy issue when running vLLM serve for a long time. #4117
- Fix a function bug when running Qwen2.5-VL under high concurrency. #4553
Full Changelog: v0.11.0rc2...v0.11.0rc3