add qwen3-next #83
Merged (+85 −0)
Commits (9):
* ef79969 add qwen3-next (heheda12345)
* 1897600 update (heheda12345)
* a2c093d update moe (heheda12345)
* a907842 update figure (heheda12345)
* f0d9ada update figure (heheda12345)
* fc3e742 update (heheda12345)
* 439995b fix grammar (heheda12345)
* 2ab327e add pd (heheda12345)
* 3952ece fix (heheda12345)
---
layout: post
title: "vLLM Now Supports Qwen3-Next: Hybrid Architecture with Extreme Efficiency"
author: "The vLLM Team"
image: /assets/figures/qwen3-next/qwen.png
thumbnail-img: /assets/figures/qwen3-next/qwen.png
share-img: /assets/figures/qwen3-next/qwen.png
---

We’re excited to announce that **vLLM now supports Qwen3-Next**, the latest generation of foundation models from the Qwen team. Qwen3-Next introduces a **hybrid architecture with extreme efficiency for long-context support**, and vLLM fully supports its features.

<p align="center">
<picture>
<img src="/assets/figures/qwen3-next/qwen.png" width="30%">
</picture>
</p>

In this post, we’ll explore Qwen3-Next’s innovations (hybrid attention, high-sparsity MoE, and multi-token prediction) and show how vLLM efficiently supports them.

## **Quickstart**

You can run Qwen3-Next with the vLLM nightly build:

`uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly --torch-backend=auto`

Then launch the server:

`vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct -tp 4`

Please refer to the [vLLM Model Recipes](https://docs.vllm.ai/projects/recipes/en/latest/Qwen/Qwen3-Next.html) for a more detailed installation and usage guide.

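Once the server is up, any OpenAI-compatible client can talk to it. Below is a minimal sketch, assuming the `vllm serve` command above is running with vLLM's defaults (`http://localhost:8000/v1`, no API key required) and that the `openai` Python package is installed:

```python
# Minimal client sketch: talks to the OpenAI-compatible endpoint that
# `vllm serve` exposes by default at http://localhost:8000/v1.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # vLLM does not require a real key by default
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",
    messages=[
        {"role": "user", "content": "Summarize the benefits of hybrid attention in two sentences."}
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```
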
## **Hybrid Attention: Efficient Context Modeling**

At the core of Qwen3-Next is its **Hybrid Attention** design, which replaces standard attention with a combination of:

* **Gated DeltaNet** (linear attention for long-context efficiency)
* **Full Attention** (standard attention for high-fidelity reasoning)

The model interleaves these two forms of attention across layers, enabling efficient scaling to **65K context lengths** and beyond.

To support this, vLLM integrates Triton kernels from [Flash Linear Attention](https://github.com/fla-org/flash-linear-attention) and adopts a [hybrid KV cache manager](https://arxiv.org/abs/2503.18292) to manage both linear and full attention layers, avoiding fragmentation and maximizing GPU utilization.

To manage state for hybrid models like Qwen3-Next, vLLM automatically tunes the “logical” block size of the full attention layers so that the state for the full attention layers and the linear attention layers occupies the same amount of “physical” GPU memory. This enables simple and efficient paged memory management for hybrid models, increasing throughput for heavy workloads once GPU memory becomes fully utilized.

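To make the idea concrete, here is a toy sketch (not vLLM's actual implementation) of how a logical block size could be chosen so that one full-attention KV block and one linear-attention state page occupy roughly the same physical memory; all shapes and sizes below are hypothetical.

```python
# Toy illustration of the block-size alignment used for hybrid models.
# All shapes and sizes are hypothetical; vLLM derives the real values from
# the model configuration at startup.

def attn_block_bytes(block_size: int, num_kv_heads: int, head_dim: int,
                     dtype_bytes: int = 2) -> int:
    """Memory needed to cache K and V for `block_size` tokens in one attention layer."""
    return block_size * num_kv_heads * head_dim * 2 * dtype_bytes  # 2 = K and V

def pick_logical_block_size(linear_state_bytes: int, num_kv_heads: int,
                            head_dim: int, base_block: int = 16) -> int:
    """Smallest multiple of `base_block` whose KV block is at least as large as
    one linear-attention (Gated DeltaNet) state page, so both layer types can
    share a single physical page size in the paged allocator."""
    block = base_block
    while attn_block_bytes(block, num_kv_heads, head_dim) < linear_state_bytes:
        block += base_block
    return block

if __name__ == "__main__":
    # Hypothetical recurrent-state size per request for a linear-attention layer.
    linear_state_bytes = 2 * 1024 * 1024
    print(pick_logical_block_size(linear_state_bytes, num_kv_heads=8, head_dim=128))
    # -> 512 tokens per logical block with these made-up numbers
```
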
<p align="center">
<picture>
<img src="/assets/figures/qwen3-next/hybrid.png" width="100%">
</picture>
</p>

In addition, Flash Linear Attention is based on Triton, and launching Triton kernels can incur significant CPU overhead that disproportionately affects decode-only batches. To overcome this, vLLM enables full CUDA graph mode by default, ensuring good performance in low-latency scenarios.

## **High-Sparsity MoE: Extreme Efficiency**

Qwen3-Next pushes sparsity further with **MoE layers at a 1:50 activation ratio**. In the flagship **80B-A3B model**, only **3B parameters are active per token**. vLLM delivers strong throughput and low latency for this architecture with its built-in, efficient MoE implementation.

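As a quick back-of-the-envelope check using only the headline numbers above (the 1:50 figure refers to the expert activation ratio inside the MoE layers, while the ~3B active parameters also include attention and other shared components; the FLOPs rule of thumb is a rough approximation):

```python
# Back-of-the-envelope sparsity math using the headline numbers from the post.
# The "~2 * active parameters" FLOPs-per-token rule of thumb is only a rough
# approximation, shown to illustrate why high sparsity matters for serving.
total_params = 80e9    # 80B total parameters
active_params = 3e9    # ~3B parameters active per token (the "A3B" suffix)

print(f"Overall fraction of parameters active per token: {active_params / total_params:.1%}")

dense_flops_per_token = 2 * total_params   # hypothetical dense 80B model
moe_flops_per_token = 2 * active_params    # Qwen3-Next's active compute
print(f"Approximate per-token compute reduction: {dense_flops_per_token / moe_flops_per_token:.0f}x")
```
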
## **Multi-Token Prediction (MTP)**

Another innovation in Qwen3-Next is **multi-token prediction**, which boosts both pretraining efficiency and inference speed. vLLM natively supports this mode, allowing Qwen3-Next to decode multiple tokens per step without modifying application code. See the recipe linked above for how to enable it.

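As a hedged example of what enabling MTP can look like, the offline-inference sketch below follows the speculative-decoding configuration shown in the vLLM recipe; the `speculative_config` keys, in particular the `qwen3_next_mtp` method name and the number of speculative tokens, should be treated as assumptions here and verified against the recipe.

```python
# Sketch: offline inference with Qwen3-Next's MTP used for speculative decoding.
# The speculative_config values mirror the vLLM recipe and may change between
# releases; treat them as assumptions and check the recipe for current syntax.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",
    tensor_parallel_size=4,
    speculative_config={
        "method": "qwen3_next_mtp",     # MTP-based drafting (assumed name)
        "num_speculative_tokens": 2,    # tokens proposed per decoding step
    },
)

outputs = llm.generate(
    ["Explain multi-token prediction in one paragraph."],
    SamplingParams(temperature=0.7, max_tokens=200),
)
print(outputs[0].outputs[0].text)
```
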
## **Looking Ahead**

Our Qwen3-Next integration is just the beginning. On the roadmap:

* Further kernel optimizations for Gated DeltaNet layers.
* Better memory management and prefix caching for hybrid models.
* Continued CPU-overhead reductions and throughput improvements.

## **Acknowledgements**

This effort was made possible thanks to close collaboration with many partners:

* **Qwen Team**, including Tao He and Jianwei Zhang, for open-sourcing the model.
* **Flash Linear Attention team**, including Yu Zhang and others, for reviewing the Gated DeltaNet attention kernels and improving the numerics.
* **NVIDIA**, including Vadim Gimpelson, for testing the models.
* **IBM Research**, including Thomas Parnell, for hybrid memory management and CUDA graph optimizations.
* **Red Hat**, including Tyler Michael Smith, Doug Smith, Tarun Kumar, and Elvir Crncevic, for testing the model and tuning MoE kernels.
* **Community partners**, including Roblox and Meta, for testing, feedback, and scaling insights.

vLLM team members who contributed to this effort are Jie Li, Kaichao You, Chen Zhang, and Simon Mo.

👉 Qwen3-Next is now available in **vLLM**. Try it out today and experience **ultra-efficient long-context inference** with the latest hybrid MoE architecture.
Review comment: Should we specifically call out automatic prefix caching (which is a prerequisite for P/D disaggregation)? We could reference this WIP PR: vllm-project/vllm#23941.
Reply: Added P/D, but I'd prefer not to reference in-progress PRs, since the author may close that PR and open another.