Releases: baidu/vLLM-Kunlun
v0.11.0
vLLM-Kunlun v0.11.0
vLLM-Kunlun v0.11.0 featured 154 commits from 26 contributors (including new contributors)!
✨ Highlights
🤖 DeepSeek-V3/R1/V3.2 Full Support
vLLM-Kunlun v0.11.0 delivers complete support for the DeepSeek model family on Kunlun hardware:
- 🆕 Full inference support for DeepSeek-V3, R1, and V3.2-Exp (#78)
- 🚀 Multi-Token Prediction (MTP) support for DeepSeek-V3.2, with performance improvements in both Full and PieceWise modes (#164)
- ⚡ Enabled full CUDA Graph for DeepSeek models (#106)
- 🔧 Removed MLA patch; `--compilation-config` is no longer required for DeepSeek-V3.1 (#145)
- ⚡ Added kernels to optimize RoPE and the decoding stage for DeepSeek-V3.2 (#143)
🔀 Multi-LoRA Inference Optimization
- 🆕 Full multi-LoRA inference support on Kunlun hardware (#133)
- 🚀 Further optimized multi-LoRA performance; LoRA-enabled inference now achieves 80%+ of non-LoRA performance (#190)
🔍 Embedding Model Support
- 🆕 Support for BGE embedding models on Kunlun hardware, enabling vector retrieval and RAG use cases (#267)
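Embedding support enables vector retrieval: documents and queries are embedded, then ranked by similarity. As a pure-Python sketch of that downstream use (toy 3-d vectors standing in for real BGE embeddings; not vLLM-Kunlun API):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k_retrieve(query_vec, doc_vecs, k=2):
    """Return indices of the k documents most similar to the query."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy "embeddings" — a real pipeline would obtain these from the model.
docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
query = [1.0, 0.05, 0.0]
print(top_k_retrieve(query, docs, k=2))  # -> [0, 1]
```

In a RAG pipeline the retrieved documents would then be injected into the generation prompt.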
🗜️ Quantization Enhancements
- 🆕 Support for Compressed-Tensors W8A8 quantization (#75)
- 🆕 Support for Compressed-Tensors W4A16 quantization (#154)
- 🆕 Support for AWQ MoE W4A16 quantization (#142)
- 🆕 Support for Mixed-Precision Quantization for MoE models (#112)
- 🆕 Added INT8 quantized model list for DeepSeek, Qwen, and MiniMax series (#254, #264)
⚡ Kernel Optimizations
- 🔄 Migrated XTorch operations to native Kunlun operations, accelerating iteration (#177)
- 🚀 Added `topk_per_row` kernel to optimize Top-K index calculation (#168)
- 🚀 Added `flashinfer_rotary_embedding` and `fast_topkv2` kernels; optimized `int8_paged_mqa_logits` with parallelism (#134, #143)
- 🚀 Enabled fast random sampling on the Kunlun3 platform via hardware generators (#73)
- 🚀 Optimized Fused MoE kernels for small batch inference (#196)
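For reference, the computation that a per-row Top-K kernel like `topk_per_row` fuses on-device can be sketched in pure Python (the real kernel avoids a full sort per row; this is only the semantics, with toy inputs):

```python
def topk_per_row(scores, k):
    """For each row of `scores`, return the indices of its k largest
    values in descending order — the reference behavior a fused
    per-row Top-K kernel is expected to reproduce."""
    out = []
    for row in scores:
        idx = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        out.append(idx[:k])
    return out

print(topk_per_row([[0.1, 0.9, 0.5], [3.0, 1.0, 2.0]], k=2))
# -> [[1, 2], [0, 2]]
```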
🆕 New Models
- 🤖 DeepSeek-V3 / R1 / V3.2-Exp: Full support for the DeepSeek series on Kunlun (#78, #164)
- 🤖 Qwen3-Next: Support for Qwen3-Next and Qwen-next architecture (#222)
- 🤖 GLM-4.7 / GLM-5: Support for GLM-4.7 with MTP, and GLM-x model family (#187, #194)
- 🖼️ InternVL2.5: Multimodal InternVL2.5 support on vLLM-Kunlun v0.11.0 (#72)
- 🖼️ XiaoMi MIMO Flash V2: Support for XiaoMi MIMO Flash V2 (#62)
- 🤖 MiniMax-M2.1 / MiniMax-M2.5: Support for MiniMax series with INT8 quantization (#264, #275)
- 🤖 GPT-OSS: Support for GPT-OSS and updated model list (#71)
- 🔍 BGE Embedding Models: Support for BGE embedding models for vector retrieval and RAG (#267)
- 🛠️ GLM-4.7 Tool Parser: Added GLM-4.7 tool parser with thinking/non-thinking mode toggle (#151)
🔧 Features
🔍 Embedding
- 🆕 Support BGE embedding models on Kunlun; remove unnecessary params in attention implementation interfaces (#267) by @lishaobing448
🗜️ Quantization
- 🆕 Support Compressed-Tensors W8A8 quantization (#75) by @liwei109
- 🆕 Support Compressed-Tensors W4A16 quantization (#154) by @liwei109
- 🆕 Support AWQ MoE W4A16 quantization (#142) by @tangshiwen
- 🆕 Support Mixed-Precision Quantization for MoE (#112) by @tangshiwen
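All of the W8A8/W4A16 schemes above rest on the same idea: approximate floating-point tensors with low-bit integers plus a scale. A minimal sketch of symmetric per-tensor INT8 quantization (illustrative only — the actual Compressed-Tensors formats use per-channel/per-group scales and packed storage):

```python
def quantize_int8(values):
    """Symmetric INT8 quantization: x ≈ scale * q, with q in [-127, 127]."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0  # guard all-zero input
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    """Recover the floating-point approximation."""
    return [scale * v for v in q]

weights = [0.5, -1.27, 0.02, 1.0]
q, s = quantize_int8(weights)
approx = dequantize_int8(q, s)
print(q)  # -> [50, -127, 2, 100]
```

The memory win comes from storing `q` in one byte per value; W8A8 additionally quantizes activations at runtime so matmuls run on integer units.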
⚡ Kernels
- 🚀 Add kernels to optimize RoPE and the decoding stage for DeepSeek-V3.2 (#143) by @fromck
- 🚀 Add `topk_per_row` to optimize Top-K index calculation (#168) by @fromck
- 🚀 Add 2 kernels (`flashinfer_rotary_embedding`, `fast_topkv2`) and optimize `topk_indices` calculation (#134) by @fromck
- 🚀 Enable fast random sampling on Kunlun3 platform with hardware generators (#73) by @yuqilinaa
- 🚀 Optimize Fused MoE kernels for small batch scenarios (#196) by @ldh2020
- 🆕 Add `gemma_rmsnorm`, `moe_pre_small`, and `split_norm_rope` kernels (#180) by @Hanyu-Jin
- 🆕 Add rejection sampler kernel (#215) by @Hanyu-Jin
- 🆕 Enable INT8 BMM (#91) by @zhihui96
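Several kernels above implement rotary position embedding (RoPE). The per-token math they fuse can be sketched in pure Python (reference semantics only, standard `base=10000` convention assumed; real kernels operate on batched device tensors):

```python
import math

def rotary_embed(x, pos, base=10000.0):
    """Apply rotary position embedding to one head vector of even length.

    Each pair (x[2i], x[2i+1]) is rotated by angle pos / base**(2i/d);
    rotation preserves the vector's norm.
    """
    d = len(x)
    out = [0.0] * d
    for i in range(0, d, 2):
        theta = pos / (base ** (i / d))
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out

vec = [1.0, 0.0, 0.0, 1.0]
print(rotary_embed(vec, pos=0))  # position 0 rotates by 0: unchanged
```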
🔀 Multi-LoRA
- 🆕 Full multi-LoRA inference support; requires the latest xspeedgate (#133) by @15050188022
- 🚀 Further optimize multi-LoRA inference; LoRA performance achieves 80%+ of non-LoRA (#190) by @15050188022
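A LoRA adapter adds a scaled low-rank update on top of a frozen base projection, which is why many adapters can share one base weight. A toy pure-Python sketch of the forward math (hypothetical shapes and values; not the batched Kunlun implementation):

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(x, W, A, B, alpha, r):
    """y = W x + (alpha / r) * B (A x).

    W is the shared base weight (out x in); A (r x in) and B (out x r)
    form the per-adapter low-rank delta, scaled by alpha / r.
    """
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

# Identity base weight with a rank-1 adapter (toy example).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B = [[0.5], [0.5]]
print(lora_forward([1.0, 2.0], W, A, B, alpha=1.0, r=1))  # -> [2.5, 3.5]
```

The 80%+ figure above reflects how cheap the extra rank-`r` matmuls are relative to the base projection when `r` is small.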
🔮 MTP (Multi-Token Prediction)
- 🆕 MTP support for DeepSeek-V3.2 in Full and PieceWise modes (#164) by @15050188022
- 🆕 MTP support for GLM-4.7 (#187) by @fromck
- 🆕 MTP support for Qwen3-Next; optimize `apply_top_k_top_p` (#268) by @ldh2020
- 🚀 Optimize MTP (#232) by @fromck
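MTP drafts several tokens ahead, and the target model then verifies them. A minimal greedy-verification sketch of that acceptance loop (deliberately simplified: argmax matching only, without the probabilistic acceptance that the rejection sampler kernel implements):

```python
def verify_draft(draft_tokens, target_argmax):
    """Accept draft tokens while they match the target model's argmax
    at each position; on the first mismatch, emit the target's own
    token as the correction and stop."""
    accepted = []
    for d, t in zip(draft_tokens, target_argmax):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)  # correction token from the target model
            break
    return accepted

print(verify_draft([5, 9, 3], [5, 9, 7]))  # -> [5, 9, 7]
```

Every accepted draft token saves one sequential decode step, which is where MTP's speedup comes from.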
🏗️ Infrastructure
- 🔄 Migrate XTorch operations to Kunlun operations (#177) by @xyDong0223
- 🔄 Unify custom operator registration to `torch.ops` using the OOT method (#203, #209) by @xyDong0223
- 🔄 Register `layernorm`, `rotary_embedding`, and `vocab_parallel_embedding` via `@CustomOp.register_oot` (#234) by @lishiyong110
- ⚡ Enable full CUDA Graph for DeepSeek models (#106) by @baoqian426
- 🆕 Use data parallelism (DP) for distributed inference (#90) by @baoqian426
- 🆕 Eager mode support for expert parallelism (#260) by @Wfd567
- 🚀 Reduce host-device sync overhead in Qwen3.5 (#265) by @xyDong0223
- 🔄 Recover use of reshape-and-cache kernel to update Mamba cache (#261) by @xyDong0223
- 🛠️ Add `collect_env` feature for environment diagnostics (#218) by @Lidang-Jiang
🐛 Bug Fixes
- 🐛 Fix Kunlun Graph failure (#193) by @xyDong0223
- 🐛 Fix long-context chunked attention crash (#117) by @baoqian426
- 🐛 Fix `kunlun_scale_mm` bias bug (#126) by @liwei109
- 🐛 Fix `cutlass_scaled_mm` inference error (#82) by @tangshiwen
- 🐛 Fix MoE when bias is absent (#76) by @xyDong0223
- 🐛 Fix InternVL `KeyError: ((1, 1, 3), '<i8')` (#108) by @Lidang-Jiang
- 🐛 Fix `apply_top_k_top_p` not being applied (#101) by @Hanyu-Jin
- 🐛 Fix Qwen2-VL for v0.11.0 (#94) by @roger-lcc
- 🐛 Fix `compressed_tensors` import error (#87) by @baoqian426
- 🐛 Fix `cocopod` ops not found (#242) by @liwei109
- 🐛 Fix missing `xspeedgate_ops` import in Kunlun ops and FLA chunk (#237, #238) by @xyDong0223
- 🐛 Fix distributed environment initialization issue (#231) by @xyDong0223
- 🐛 Adapt GLM5 config for `transformers` 4.57 (#207) by @tangshiwen
- 🐛 Fix eager mode LayerNorm failure (#247) by @Hyfreadom
- 🐛 Register `apply_repetition_penalties_` in custom op (#110) by @roger-lcc
- 🐛 Fix expert parallelism bug in eager mode (#260) by @Wfd567
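Several fixes above touch sampling ops. As a reference for the standard semantics behind the repetition-penalty op (pure-Python sketch with toy logits; not the fused Kunlun kernel):

```python
def apply_repetition_penalty(logits, prev_token_ids, penalty):
    """Standard repetition penalty: for every token already generated,
    divide its logit by `penalty` if positive, multiply if negative,
    making repeats less likely when penalty > 1. Modifies in place."""
    for t in set(prev_token_ids):
        if logits[t] > 0:
            logits[t] /= penalty
        else:
            logits[t] *= penalty
    return logits

print(apply_repetition_penalty([2.0, -1.0, 0.5], [0, 1], 2.0))
# -> [1.0, -2.0, 0.5]
```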
🔬 CI / Build
- 🆕 Add CI end-to-end (E2E) tests (#139) by @1916hcc
- 🆕 Add Unit Test (UT) CI (#157) by @Joeegin
- 🔄 Refactor E2E CI: split monolithic workflow into modular scripts (#162) by @1916hcc
- 🔧 Update `.pre-commit-config.yaml`, add `_pylint.yml` (#155) by @WeiJie-520
- 🆕 Add foundational GitHub Actions configuration (#57) by @tanjunchen
- 🆕 Add `PULL_REQUEST_TEMPLATE.md` and `ISSUE_TEMPLATE` (#56) by @tanjunchen
- 🆕 Add `CODE_OF_CONDUCT.md`, `MAINTAINERS.md`, and contributing guide (#55) by @tanjunchen
📝 Documentation
- 📖 Add vLLM-Kunlun New Model Adaptation Manual and update model support list (#211) by @xyDong0223
- 📖 Add XPU tutorials for Qwen and InternVL (#140) by @Joeegin
- 📖 Add DeepSeek-V3.2-Exp-w8a8 to installation guide and tutorials (#186) by @WeiJie-520
- 🔧 Update base image URL: replace conda with `uv`; integrate xpytorch and ops into the image (#146) by @WeiJie-520
- 📖 Update quantization guide documentation (#88) by @liwei109
- 📖 Optimize documentation structure (#136) by @Lidang-Jiang
- 📖 Update `xspeedgate_ops` documentation (#188) by @WeiJie-520
- 🐛 Fix Read the Docs build configuration (#210, #251) by @xyDong0223, @Lidang-Jiang
- 📖 Update README with latest model support and environment information (#206) by @xyDong0223
- 📖 Add INT8 quantized model list for DeepSeek, Qwen, MiniMax series (#254) by @liwei109
- 🔧 Remove `--compilation-config` from all documentation; P800 no longer requires this parameter (#253) by @Lidang-Jiang
📋 What's Changed
| PR | Title | Author |
|---|---|---|
| #268 | [Model] Support Qwen3-Next MTP | @ldh2020 |
| #267 | [Feature] Support BGE embedding models | @lishaobing448 |
| #265 | [Misc] Reduce Host and device sync in Qwen3.5 | @xyDong0223 |
| #264 | [Models] Add Qwen3.5 and MiniMax INT8 models | @liwei109 |
| #262 | [Bugfix] Fix function call invoking xgrammar failed | @xyDong0223 |
| #261 | [Model] Recover use reshape and cache kernel to update mamba cache | @xyDong0223 |
| #260 | Eager mode support expert parallel | @Wfd567 |
| #257 | [Bugfix] Update Qwen3.5 reasoning parser | @ljayx |
| #254 | [Doc] Add INT8 model list | @liwei109 |
| #253 | [Doc] Remove --compilation-config from all docs | @Lidang-Jiang |
| #251 | [Doc] Fix 5 Sphinx warnings causing Read the Docs build failure | @Lidang-Jiang |
| #252 | [Bugfix] Fix cache indices problem for Qwen3.5-MoE | @xyDong0223 |
| #247 | [Bugfix] Fix eager mode layernorm failed | @Hyfreadom |
| #244 | [Bugfix] use cuda visible | @lishaobing448 |
| #242 | [Bugfix] cocopod ops can't be finded | @liwei109 |
| #241 | [Model] Support qwen3.5 moe | @roger-lcc |
| #240 | [Misc] Remove qwen3 and qwen3moe redundant code | @xyDong0223 |
| #239 | [Doc] Update dependencies for Feb | @Joeegin |
| #238 | [Bugfix] Fix miss import xspeedgate_ops in kunlun ops | @xyDong0223 |
| #237 | [Bugfix] Fix miss import xspeedgate_ops in fla chunk | @xyDong0223 |
| #234 | [Feature] Register layernorm/rotary_embedding via @CustomOp.register_oot | @lishiyong110 |
| #233 | [Model] Support qwen3-next model | @xyDong0223 |
| #232 | [Misc] Optimize mtp | @fromck |
| #231 | [Bugfix] Fixed distributed environment initialization issue | @xyDong0223 |
| #229 | [Misc] Temporarily work around Torch compatibility issues | @xyDong0223 |
| #228 | [Update] Update dependencies for v0.15.1 | @xyDong0223 |
| #227 | [Update] Partially supports torch compile | @xyDong0223 |
| #225 | [Doc] Update dependencies | @Joeegin |
| #224 | [Kernel] Register custom_op for kunlun graph (torch compile) | @xyDong0223 |
| #222 | [Feature] Support Qwen3-Next | @chanzhennan |
| #... |
v0.11.0rc1
vLLM-Kunlun v0.11.0rc1
Hi, vLLM-Kunlun v0.11.0rc1 has been officially released!
Going forward, you can continue submitting code based on the main branch.
v0.11.0rc1 Release
Supported models
- Qwen3-Omni
- Qwen3-Next
- Seed-OSS
Coming soon
- DeepSeek V3
- DeepSeek R1
- DeepSeek V3.1
- DeepSeek V3.2
Operator updates🚀
BUG FIX❤️🩹
Known issues⚠️
v0.10.1.1
We are very pleased to announce the official release of vLLM Kunlun v0.10.1.1!
Going forward, if there is demand, we will continue to release patch updates and feature enhancement versions, and will periodically share the latest features and models supported by vLLM Kunlun. Stay tuned.
0.10.1.1 Release
Highlights✨
- Comprehensive enhancements to multimodal capabilities: 5+ series of multimodal models are now supported, with overall inference throughput reaching up to 90% of the Axx platform.
- A major breakthrough in sampling performance completely eliminates the Top-K sorting bottleneck; when enabled, end-to-end throughput can improve by up to 10× compared to the native implementation.
- Quantized inference is now fully production-ready, with support for AWQ / GPTQ quantization for dense models, delivering significant gains compared to FP16:
  - Significant reduction in GPU memory usage.
  - Doubled compute throughput.
- Support for multi-LoRA inference.
- Support for Piecewise CUDA Graph, significantly reducing scheduling and kernel launch overhead.
- Support for the vLLM V1 inference engine.
Supported models
- Qwen2.5
- Qwen2.5-VL
- Qwen3
- Qwen3-MoE
- GLM4.1v
- GLM4.5
- GLM4.5Air
- GLM4.5v
- InternVL2.5
- InternVL3.5
- QiFanVL
Operator updates🚀
- KLX xtorch_ops operator library
- Added Flash-Infer Top-K / Top-P sampling operators. Compared to the original sorting-based logic, sampling-stage performance is improved by tens to hundreds of times.
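For context on what the sorting-based logic computes: Top-K keeps the k most likely tokens, Top-P (nucleus) then keeps the smallest high-probability prefix whose mass reaches p. A pure-Python sketch of that filtering step with toy probabilities (the Flash-Infer operators instead sample directly, avoiding the sort):

```python
def top_k_top_p_filter(probs, k, p):
    """Apply Top-K then Top-P filtering to a probability list and
    renormalize the surviving tokens; returns {token_index: prob}."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:  # nucleus reached: stop adding tokens
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

print(top_k_top_p_filter([0.5, 0.3, 0.1, 0.1], k=3, p=0.7))
```

The per-token sort is why the naive implementation dominates sampling time at large vocabularies, and why replacing it yields such large speedups.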
BUG FIX❤️🩹
- Fixed issues with YaRN positional encoding, resolving garbled outputs in some models when exceeding the native context length.
- Fixed Rotary Positional Encoding (RoPE) precision issues.
- Fixed abnormal errors when `repetition_penalty > 1`.
- Fixed XPU INT4 data layout issues, significantly improving the performance of AWQ / GPTQ-related operators on XPU.
Known issues⚠️
- Errors may occur when invoking xgrammar in Function Call scenarios.
- Cause: The relevant operators are not yet supported.
- Future: Support will be gradually added in upcoming releases.