Releases: baidu/vLLM-Kunlun
v0.11.0
vLLM-Kunlun v0.11.0
vLLM-Kunlun v0.11.0 featured 154 commits from 26 contributors (including new contributors)!
✨ Highlights
🤖 DeepSeek-V3/R1/V3.2 Full Support
vLLM-Kunlun v0.11.0 delivers complete support for the DeepSeek model family on Kunlun hardware:
- 🆕 Full inference support for DeepSeek-V3, R1, and V3.2-Exp (#78)
- 🚀 Multi-Token Prediction (MTP) support for DeepSeek-V3.2, with performance improvements in both Full and PieceWise modes (#164)
- ⚡ Enabled full CUDA Graph for DeepSeek models (#106)
- 🔧 Removed MLA patch; `--compilation-config` is no longer required for DeepSeek-V3.1 (#145)
- ⚡ Added kernels to optimize RoPE and the decoding stage for DeepSeek-V3.2 (#143)
🔀 Multi-LoRA Inference Optimization
- 🆕 Full multi-LoRA inference support on Kunlun hardware (#133)
- 🚀 Further optimized multi-LoRA performance; LoRA-enabled inference now achieves 80%+ of non-LoRA performance (#190)
🔍 Embedding Model Support
- 🆕 Support for BGE embedding models on Kunlun hardware, enabling vector retrieval and RAG use cases (#267)
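Embedding support enables vector retrieval: documents and queries are embedded, then ranked by similarity. As a pure-Python sketch of that downstream use (toy 3-d vectors standing in for real BGE embeddings; not vLLM-Kunlun API):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k_retrieve(query_vec, doc_vecs, k=2):
    """Return indices of the k documents most similar to the query."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy "embeddings" — a real pipeline would obtain these from the model.
docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
query = [1.0, 0.05, 0.0]
print(top_k_retrieve(query, docs, k=2))  # -> [0, 1]
```

In a RAG pipeline the retrieved documents would then be injected into the generation prompt.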
🗜️ Quantization Enhancements
- 🆕 Support for Compressed-Tensors W8A8 quantization (#75)
- 🆕 Support for Compressed-Tensors W4A16 quantization (#154)
- 🆕 Support for AWQ MoE W4A16 quantization (#142)
- 🆕 Support for Mixed-Precision Quantization for MoE models (#112)
- 🆕 Added INT8 quantized model list for DeepSeek, Qwen, and MiniMax series (#254, #264)
⚡ Kernel Optimizations
- 🔄 Migrated XTorch operations to native Kunlun operations, accelerating iteration (#177)
- 🚀 Added `topk_per_row` kernel to optimize Top-K index calculation (#168)
- 🚀 Added `flashinfer_rotary_embedding` and `fast_topkv2` kernels; optimized `int8_paged_mqa_logits` with parallelism (#134, #143)
- 🚀 Enabled fast random sampling on the Kunlun3 platform via hardware generators (#73)
- 🚀 Optimized Fused MoE kernels for small batch inference (#196)
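For reference, the computation that a per-row Top-K kernel like `topk_per_row` fuses on-device can be sketched in pure Python (the real kernel avoids a full sort per row; this is only the semantics, with toy inputs):

```python
def topk_per_row(scores, k):
    """For each row of `scores`, return the indices of its k largest
    values in descending order — the reference behavior a fused
    per-row Top-K kernel is expected to reproduce."""
    out = []
    for row in scores:
        idx = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        out.append(idx[:k])
    return out

print(topk_per_row([[0.1, 0.9, 0.5], [3.0, 1.0, 2.0]], k=2))
# -> [[1, 2], [0, 2]]
```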
🆕 New Models
- 🤖 DeepSeek-V3 / R1 / V3.2-Exp: Full support for the DeepSeek series on Kunlun (#78, #164)
- 🤖 Qwen3-Next: Support for Qwen3-Next and Qwen-next architecture (#222)
- 🤖 GLM-4.7 / GLM-5: Support for GLM-4.7 with MTP, and GLM-x model family (#187, #194)
- 🖼️ InternVL2.5: Multimodal InternVL2.5 support on vLLM-Kunlun v0.11.0 (#72)
- 🖼️ XiaoMi MIMO Flash V2: Support for XiaoMi MIMO Flash V2 (#62)
- 🤖 MiniMax-M2.1 / MiniMax-M2.5: Support for MiniMax series with INT8 quantization (#264, #275)
- 🤖 GPT-OSS: Support for GPT-OSS and updated model list (#71)
- 🔍 BGE Embedding Models: Support for BGE embedding models for vector retrieval and RAG (#267)
- 🛠️ GLM-4.7 Tool Parser: Added GLM-4.7 tool parser with thinking/non-thinking mode toggle (#151)
🔧 Features
🔍 Embedding
- 🆕 Support BGE embedding models on Kunlun; remove unnecessary params in attention implementation interfaces (#267) by @lishaobing448
🗜️ Quantization
- 🆕 Support Compressed-Tensors W8A8 quantization (#75) by @liwei109
- 🆕 Support Compressed-Tensors W4A16 quantization (#154) by @liwei109
- 🆕 Support AWQ MoE W4A16 quantization (#142) by @tangshiwen
- 🆕 Support Mixed-Precision Quantization for MoE (#112) by @tangshiwen
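All of the W8A8/W4A16 schemes above rest on the same idea: approximate floating-point tensors with low-bit integers plus a scale. A minimal sketch of symmetric per-tensor INT8 quantization (illustrative only — the actual Compressed-Tensors formats use per-channel/per-group scales and packed storage):

```python
def quantize_int8(values):
    """Symmetric INT8 quantization: x ≈ scale * q, with q in [-127, 127]."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0  # guard all-zero input
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    """Recover the floating-point approximation."""
    return [scale * v for v in q]

weights = [0.5, -1.27, 0.02, 1.0]
q, s = quantize_int8(weights)
approx = dequantize_int8(q, s)
print(q)  # -> [50, -127, 2, 100]
```

The memory win comes from storing `q` in one byte per value; W8A8 additionally quantizes activations at runtime so matmuls run on integer units.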
⚡ Kernels
- 🚀 Add kernels to optimize RoPE and the decoding stage for DeepSeek-V3.2 (#143) by @fromck
- 🚀 Add `topk_per_row` to optimize Top-K index calculation (#168) by @fromck
- 🚀 Add 2 kernels (`flashinfer_rotary_embedding`, `fast_topkv2`) and optimize `topk_indices` calculation (#134) by @fromck
- 🚀 Enable fast random sampling on Kunlun3 platform with hardware generators (#73) by @yuqilinaa
- 🚀 Optimize Fused MoE kernels for small batch scenarios (#196) by @ldh2020
- 🆕 Add `gemma_rmsnorm`, `moe_pre_small`, and `split_norm_rope` kernels (#180) by @Hanyu-Jin
- 🆕 Add rejection sampler kernel (#215) by @Hanyu-Jin
- 🆕 Enable INT8 BMM (#91) by @zhihui96
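Several kernels above implement rotary position embedding (RoPE). The per-token math they fuse can be sketched in pure Python (reference semantics only, standard `base=10000` convention assumed; real kernels operate on batched device tensors):

```python
import math

def rotary_embed(x, pos, base=10000.0):
    """Apply rotary position embedding to one head vector of even length.

    Each pair (x[2i], x[2i+1]) is rotated by angle pos / base**(2i/d);
    rotation preserves the vector's norm.
    """
    d = len(x)
    out = [0.0] * d
    for i in range(0, d, 2):
        theta = pos / (base ** (i / d))
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out

vec = [1.0, 0.0, 0.0, 1.0]
print(rotary_embed(vec, pos=0))  # position 0 rotates by 0: unchanged
```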
🔀 Multi-LoRA
- 🆕 Full multi-LoRA inference support; requires the latest xspeedgate (#133) by @15050188022
- 🚀 Further optimize multi-LoRA inference; LoRA performance achieves 80%+ of non-LoRA (#190) by @15050188022
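A LoRA adapter adds a scaled low-rank update on top of a frozen base projection, which is why many adapters can share one base weight. A toy pure-Python sketch of the forward math (hypothetical shapes and values; not the batched Kunlun implementation):

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(x, W, A, B, alpha, r):
    """y = W x + (alpha / r) * B (A x).

    W is the shared base weight (out x in); A (r x in) and B (out x r)
    form the per-adapter low-rank delta, scaled by alpha / r.
    """
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

# Identity base weight with a rank-1 adapter (toy example).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B = [[0.5], [0.5]]
print(lora_forward([1.0, 2.0], W, A, B, alpha=1.0, r=1))  # -> [2.5, 3.5]
```

The 80%+ figure above reflects how cheap the extra rank-`r` matmuls are relative to the base projection when `r` is small.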
🔮 MTP (Multi-Token Prediction)
- 🆕 MTP support for DeepSeek-V3.2 in Full and PieceWise modes (#164) by @15050188022
- 🆕 MTP support for GLM-4.7 (#187) by @fromck
- 🆕 MTP support for Qwen3-Next; optimize `apply_top_k_top_p` (#268) by @ldh2020
- 🚀 Optimize MTP (#232) by @fromck
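MTP drafts several tokens ahead, and the target model then verifies them. A minimal greedy-verification sketch of that acceptance loop (deliberately simplified: argmax matching only, without the probabilistic acceptance that the rejection sampler kernel implements):

```python
def verify_draft(draft_tokens, target_argmax):
    """Accept draft tokens while they match the target model's argmax
    at each position; on the first mismatch, emit the target's own
    token as the correction and stop."""
    accepted = []
    for d, t in zip(draft_tokens, target_argmax):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)  # correction token from the target model
            break
    return accepted

print(verify_draft([5, 9, 3], [5, 9, 7]))  # -> [5, 9, 7]
```

Every accepted draft token saves one sequential decode step, which is where MTP's speedup comes from.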
🏗️ Infrastructure
- 🔄 Migrate XTorch operations to Kunlun operations (#177) by @xyDong0223
- 🔄 Unify custom operator registration to `torch.ops` using the OOT method (#203, #209) by @xyDong0223
- 🔄 Register `layernorm`, `rotary_embedding`, and `vocab_parallel_embedding` via `@CustomOp.register_oot` (#234) by @lishiyong110
- ⚡ Enable full CUDA Graph for DeepSeek models (#106) by @baoqian426
- 🆕 Use data parallelism (DP) for distributed inference (#90) by @baoqian426
- 🆕 Eager mode support for expert parallelism (#260) by @Wfd567
- 🚀 Reduce host-device sync overhead in Qwen3.5 (#265) by @xyDong0223
- 🔄 Recover use of reshape-and-cache kernel to update Mamba cache (#261) by @xyDong0223
- 🛠️ Add `collect_env` feature for environment diagnostics (#218) by @Lidang-Jiang
🐛 Bug Fixes
- 🐛 Fix Kunlun Graph failure (#193) by @xyDong0223
- 🐛 Fix long-context chunked attention crash (#117) by @baoqian426
- 🐛 Fix `kunlun_scale_mm` bias bug (#126) by @liwei109
- 🐛 Fix `cutlass_scaled_mm` inference error (#82) by @tangshiwen
- 🐛 Fix MoE when bias is absent (#76) by @xyDong0223
- 🐛 Fix InternVL `KeyError: ((1, 1, 3), '<i8')` (#108) by @Lidang-Jiang
- 🐛 Fix `apply_top_k_top_p` not being applied (#101) by @Hanyu-Jin
- 🐛 Fix Qwen2-VL for v0.11.0 (#94) by @roger-lcc
- 🐛 Fix `compressed_tensors` import error (#87) by @baoqian426
- 🐛 Fix `cocopod` ops not found (#242) by @liwei109
- 🐛 Fix missing `xspeedgate_ops` import in Kunlun ops and FLA chunk (#237, #238) by @xyDong0223
- 🐛 Fix distributed environment initialization issue (#231) by @xyDong0223
- 🐛 Adapt GLM5 config for `transformers` 4.57 (#207) by @tangshiwen
- 🐛 Fix eager mode LayerNorm failure (#247) by @Hyfreadom
- 🐛 Register `apply_repetition_penalties_` in custom op (#110) by @roger-lcc
- 🐛 Fix expert parallelism bug in eager mode (#260) by @Wfd567
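Several fixes above touch sampling ops. As a reference for the standard semantics behind the repetition-penalty op (pure-Python sketch with toy logits; not the fused Kunlun kernel):

```python
def apply_repetition_penalty(logits, prev_token_ids, penalty):
    """Standard repetition penalty: for every token already generated,
    divide its logit by `penalty` if positive, multiply if negative,
    making repeats less likely when penalty > 1. Modifies in place."""
    for t in set(prev_token_ids):
        if logits[t] > 0:
            logits[t] /= penalty
        else:
            logits[t] *= penalty
    return logits

print(apply_repetition_penalty([2.0, -1.0, 0.5], [0, 1], 2.0))
# -> [1.0, -2.0, 0.5]
```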
🔬 CI / Build
- 🆕 Add CI end-to-end (E2E) tests (#139) by @1916hcc
- 🆕 Add Unit Test (UT) CI (#157) by @Joeegin
- 🔄 Refactor E2E CI: split monolithic workflow into modular scripts (#162) by @1916hcc
- 🔧 Update `.pre-commit-config.yaml`, add `_pylint.yml` (#155) by @WeiJie-520
- 🆕 Add foundational GitHub Actions configuration (#57) by @tanjunchen
- 🆕 Add `PULL_REQUEST_TEMPLATE.md` and `ISSUE_TEMPLATE` (#56) by @tanjunchen
- 🆕 Add `CODE_OF_CONDUCT.md`, `MAINTAINERS.md`, and contributing guide (#55) by @tanjunchen
📝 Documentation
- 📖 Add vLLM-Kunlun New Model Adaptation Manual and update model support list (#211) by @xyDong0223
- 📖 Add XPU tutorials for Qwen and InternVL (#140) by @Joeegin
- 📖 Add DeepSeek-V3.2-Exp-w8a8 to installation guide and tutorials (#186) by @WeiJie-520
- 🔧 Update base image URL: replace conda with `uv`; integrate xpytorch and ops into the image (#146) by @WeiJie-520
- 📖 Update quantization guide documentation (#88) by @liwei109
- 📖 Optimize documentation structure (#136) by @Lidang-Jiang
- 📖 Update `xspeedgate_ops` documentation (#188) by @WeiJie-520
- 🐛 Fix Read the Docs build configuration (#210, #251) by @xyDong0223, @Lidang-Jiang
- 📖 Update README with latest model support and environment information (#206) by @xyDong0223
- 📖 Add INT8 quantized model list for DeepSeek, Qwen, MiniMax series (#254) by @liwei109
- 🔧 Remove `--compilation-config` from all documentation; P800 no longer requires this parameter (#253) by @Lidang-Jiang
📋 What's Changed
| PR | Title | Author |
|---|---|---|
| #268 | [Model] Support Qwen3-Next MTP | @ldh2020 |
| #267 | [Feature] Support BGE embedding models | @lishaobing448 |
| #265 | [Misc] Reduce Host and device sync in Qwen3.5 | @xyDong0223 |
| #264 | [Models] Add Qwen3.5 and MiniMax INT8 models | @liwei109 |
| #262 | [Bugfix] Fix function call invoking xgrammar failed | @xyDong0223 |
| #261 | [Model] Recover use reshape and cache kernel to update mamba cache | @xyDong0223 |
| #260 | Eager mode support expert parallel | @Wfd567 |
| #257 | [Bugfix] Update Qwen3.5 reasoning parser | @ljayx |
| #254 | [Doc] Add INT8 model list | @liwei109 |
| #253 | [Doc] Remove --compilation-config from all docs | @Lidang-Jiang |
| #251 | [Doc] Fix 5 Sphinx warnings causing Read the Docs build failure | @Lidang-Jiang |
| #252 | [Bugfix] Fix cache indices problem for Qwen3.5-MoE | @xyDong0223 |
| #247 | [Bugfix] Fix eager mode layernorm failed | @Hyfreadom |
| #244 | [Bugfix] use cuda visible | @lishaobing448 |
| #242 | [Bugfix] cocopod ops can't be finded | @liwei109 |
| #241 | [Model] Support qwen3.5 moe | @roger-lcc |
| #240 | [Misc] Remove qwen3 and qwen3moe redundant code | @xyDong0223 |
| #239 | [Doc] Update dependencies for Feb | @Joeegin |
| #238 | [Bugfix] Fix miss import xspeedgate_ops in kunlun ops | @xyDong0223 |
| #237 | [Bugfix] Fix miss import xspeedgate_ops in fla chunk | @xyDong0223 |
| #234 | [Feature] Register layernorm/rotary_embedding via @CustomOp.register_oot | @lishiyong110 |
| #233 | [Model] Support qwen3-next model | @xyDong0223 |
| #232 | [Misc] Optimize mtp | @fromck |
| #231 | [Bugfix] Fixed distributed environment initialization issue | @xyDong0223 |
| #229 | [Misc] Temporarily work around Torch compatibility issues | @xyDong0223 |
| #228 | [Update] Update dependencies for v0.15.1 | @xyDong0223 |
| #227 | [Update] Partially supports torch compile | @xyDong0223 |
| #225 | [Doc] Update dependencies | @Joeegin |
| #224 | [Kernel] Register custom_op for kunlun graph (torch compile) | @xyDong0223 |
| #222 | [Feature] Support Qwen3-Next | @chanzhennan |
| #... |
v0.11.0rc1
vLLM-Kunlun v0.11.0rc1
Hi, vLLM-Kunlun v0.11.0rc1 has been officially released!
Going forward, you can continue submitting code based on the main branch.
v0.11.0rc1 Release
Supported models
- Qwen3-Omni
- Qwen3-Next
- Seed-OSS
Coming soon
- DeepSeek V3
- DeepSeek R1
- DeepSeek V3.1
- DeepSeek V3.2
Operator updates🚀
BUG FIX❤️🩹
Known issues⚠️
v0.10.1.1
We are very pleased to announce the official release of vLLM Kunlun v0.10.1.1!
Going forward, if there is demand, we will continue to release patch updates and feature enhancement versions, and will periodically share the latest features and models supported by vLLM Kunlun. Stay tuned.
0.10.1.1 Release
Highlights✨
- Comprehensive enhancements to multimodal capabilities: 5+ series of multimodal models are now supported, with overall inference throughput reaching up to 90% of the Axx platform.
- A major breakthrough in sampling performance completely eliminates the Top-K sorting bottleneck; when enabled, end-to-end throughput can improve by up to 10× compared to the native implementation.
- Quantized inference is now fully production-ready, with support for AWQ / GPTQ quantization for dense models, delivering significant gains compared to FP16:
  - Significant reduction in GPU memory usage.
  - Doubled compute throughput.
- Support for multi-LoRA inference.
- Support for Piecewise CUDA Graph, significantly reducing scheduling and kernel launch overhead.
- Support for the vLLM V1 inference engine.
Supported models
- Qwen2.5
- Qwen2.5-VL
- Qwen3
- Qwen3-MoE
- GLM4.1v
- GLM4.5
- GLM4.5Air
- GLM4.5v
- InternVL2.5
- InternVL3.5
- QiFanVL
Operator updates🚀
- KLX xtorch_ops operator library
- Added Flash-Infer Top-K / Top-P sampling operators. Compared to the original sorting-based logic, sampling-stage performance is improved by tens to hundreds of times.
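For context on what the sorting-based logic computes: Top-K keeps the k most likely tokens, Top-P (nucleus) then keeps the smallest high-probability prefix whose mass reaches p. A pure-Python sketch of that filtering step with toy probabilities (the Flash-Infer operators instead sample directly, avoiding the sort):

```python
def top_k_top_p_filter(probs, k, p):
    """Apply Top-K then Top-P filtering to a probability list and
    renormalize the surviving tokens; returns {token_index: prob}."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:  # nucleus reached: stop adding tokens
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

print(top_k_top_p_filter([0.5, 0.3, 0.1, 0.1], k=3, p=0.7))
```

The per-token sort is why the naive implementation dominates sampling time at large vocabularies, and why replacing it yields such large speedups.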
BUG FIX❤️🩹
- Fixed issues with YaRN positional encoding, resolving garbled outputs in some models when exceeding the native context length.
- Fixed Rotary Positional Encoding (RoPE) precision issues.
- Fixed abnormal errors when `repetition_penalty > 1`.
- Fixed XPU INT4 data layout issues, significantly improving the performance of AWQ / GPTQ-related operators on XPU.
Known issues⚠️
- Errors may occur when invoking xgrammar in Function Call scenarios.
- Cause: The relevant operators are not yet supported.
- Future: Support will be gradually added in upcoming releases.