add qwen3-next #83
Merged (+85 −0)
Commits (9):
* ef79969 add qwen3-next (heheda12345)
* 1897600 update (heheda12345)
* a2c093d update moe (heheda12345)
* a907842 update figure (heheda12345)
* f0d9ada update figure (heheda12345)
* fc3e742 update (heheda12345)
* 439995b fix grammar (heheda12345)
* 2ab327e add pd (heheda12345)
* 3952ece fix (heheda12345)
---
layout: post
title: "vLLM Now Supports Qwen3-Next: Hybrid Architecture with Extreme Efficiency"
author: "The vLLM Team"
image: /assets/figures/qwen3-next/qwen.png
thumbnail-img: /assets/figures/qwen3-next/qwen.png
share-img: /assets/figures/qwen3-next/qwen.png
---

We’re excited to announce that **vLLM now supports Qwen3-Next**, the latest generation of foundation models from the Qwen team. Qwen3-Next introduces a **hybrid architecture with extreme efficiency for long-context support**, and vLLM fully supports its features.

<p align="center">
<picture>
<img src="/assets/figures/qwen3-next/qwen.png" width="30%">
</picture>
</p>

In this post, we’ll explore Qwen3-Next’s innovations (hybrid attention, high-sparsity MoE, and multi-token prediction) and show how vLLM efficiently supports them.

## **Quickstart**

You can run Qwen3-Next with the vLLM nightly build:

`uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly --torch-backend=auto`

Then launch the server:

`vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct -tp 4`

Please refer to the [vLLM Model Recipes](https://docs.vllm.ai/projects/recipes/en/latest/Qwen/Qwen3-Next.html) for a more detailed installation and usage guide.

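Once the server is up, any OpenAI-compatible client can talk to it. Below is a minimal sketch, assuming the `vllm serve` command above is running with vLLM's defaults (`http://localhost:8000/v1`, no API key required) and that the `openai` Python package is installed:

```python
# Minimal client sketch: talks to the OpenAI-compatible endpoint that
# `vllm serve` exposes by default at http://localhost:8000/v1.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # vLLM does not require a real key by default
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",
    messages=[
        {"role": "user", "content": "Summarize the benefits of hybrid attention in two sentences."}
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```
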
## **Hybrid Attention: Efficient Context Modeling**

At the core of Qwen3-Next is its **Hybrid Attention** design, which replaces standard attention with a combination of:

* **Gated DeltaNet** (linear attention for long-context efficiency)
* **Full Attention** (standard attention for high-fidelity reasoning)

The model interleaves these two forms of attention across layers, enabling efficient scaling to **65K context lengths** and beyond.

To support this, vLLM integrates Triton kernels from [Flash Linear Attention](https://github.com/fla-org/flash-linear-attention) and adopts a [hybrid KV cache manager](https://arxiv.org/abs/2503.18292) to manage both linear and full attention layers, avoiding fragmentation and maximizing GPU utilization.

To manage state for hybrid models like Qwen3-Next, vLLM automatically tunes the “logical” block size of the full attention layers so that the state for the full attention layers and the linear attention layers occupies the same amount of “physical” GPU memory. This enables simple and efficient paged memory management for hybrid models, increasing throughput for heavy workloads once GPU memory becomes fully utilized.

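To make the idea concrete, here is a toy sketch (not vLLM's actual implementation) of how a logical block size could be chosen so that one full-attention KV block and one linear-attention state page occupy roughly the same physical memory; all shapes and sizes below are hypothetical.

```python
# Toy illustration of the block-size alignment used for hybrid models.
# All shapes and sizes are hypothetical; vLLM derives the real values from
# the model configuration at startup.

def attn_block_bytes(block_size: int, num_kv_heads: int, head_dim: int,
                     dtype_bytes: int = 2) -> int:
    """Memory needed to cache K and V for `block_size` tokens in one attention layer."""
    return block_size * num_kv_heads * head_dim * 2 * dtype_bytes  # 2 = K and V

def pick_logical_block_size(linear_state_bytes: int, num_kv_heads: int,
                            head_dim: int, base_block: int = 16) -> int:
    """Smallest multiple of `base_block` whose KV block is at least as large as
    one linear-attention (Gated DeltaNet) state page, so both layer types can
    share a single physical page size in the paged allocator."""
    block = base_block
    while attn_block_bytes(block, num_kv_heads, head_dim) < linear_state_bytes:
        block += base_block
    return block

if __name__ == "__main__":
    # Hypothetical recurrent-state size per request for a linear-attention layer.
    linear_state_bytes = 2 * 1024 * 1024
    print(pick_logical_block_size(linear_state_bytes, num_kv_heads=8, head_dim=128))
    # -> 512 tokens per logical block with these made-up numbers
```
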
<p align="center">
<picture>
<img src="/assets/figures/qwen3-next/hybrid.png" width="100%">
</picture>
</p>

In addition, Flash Linear Attention is based on Triton, and launching Triton kernels can incur significant CPU overhead that disproportionately affects decode-only batches. To overcome this, vLLM enables full CUDA graph mode by default, ensuring good performance in low-latency scenarios.

## **High-Sparsity MoE: Extreme Efficiency**

Qwen3-Next pushes sparsity further with **MoE layers at a 1:50 activation ratio**. In the flagship **80B-A3B model**, only **3B parameters are active per token**. vLLM delivers strong throughput and low latency for this architecture with its built-in, efficient MoE implementation.

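As a quick back-of-the-envelope check using only the headline numbers above (the 1:50 figure refers to the expert activation ratio inside the MoE layers, while the ~3B active parameters also include attention and other shared components; the FLOPs rule of thumb is a rough approximation):

```python
# Back-of-the-envelope sparsity math using the headline numbers from the post.
# The "~2 * active parameters" FLOPs-per-token rule of thumb is only a rough
# approximation, shown to illustrate why high sparsity matters for serving.
total_params = 80e9    # 80B total parameters
active_params = 3e9    # ~3B parameters active per token (the "A3B" suffix)

print(f"Overall fraction of parameters active per token: {active_params / total_params:.1%}")

dense_flops_per_token = 2 * total_params   # hypothetical dense 80B model
moe_flops_per_token = 2 * active_params    # Qwen3-Next's active compute
print(f"Approximate per-token compute reduction: {dense_flops_per_token / moe_flops_per_token:.0f}x")
```
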
## **Multi-Token Prediction (MTP)**

Another innovation in Qwen3-Next is **multi-token prediction**, which boosts both pretraining efficiency and inference speed. vLLM natively supports this mode, allowing Qwen3-Next to decode multiple tokens per step without modifying application code. See the recipe linked above for how to enable it.

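As a hedged example of what enabling MTP can look like, the offline-inference sketch below follows the speculative-decoding configuration shown in the vLLM recipe; the `speculative_config` keys, in particular the `qwen3_next_mtp` method name and the number of speculative tokens, should be treated as assumptions here and verified against the recipe.

```python
# Sketch: offline inference with Qwen3-Next's MTP used for speculative decoding.
# The speculative_config values mirror the vLLM recipe and may change between
# releases; treat them as assumptions and check the recipe for current syntax.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",
    tensor_parallel_size=4,
    speculative_config={
        "method": "qwen3_next_mtp",     # MTP-based drafting (assumed name)
        "num_speculative_tokens": 2,    # tokens proposed per decoding step
    },
)

outputs = llm.generate(
    ["Explain multi-token prediction in one paragraph."],
    SamplingParams(temperature=0.7, max_tokens=200),
)
print(outputs[0].outputs[0].text)
```
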
## **Looking Ahead**

Our Qwen3-Next integration is just the beginning. On the roadmap:

* Further kernel optimizations for Gated DeltaNet layers.
* Better memory management and prefix caching for hybrid models.
* Continued CPU-overhead reductions and throughput improvements.

## **Acknowledgements**

This effort was made possible thanks to close collaboration with many partners:

* **Qwen Team**, including Tao He and Jianwei Zhang, for open-sourcing the model.
* **Flash Linear Attention team**, including Yu Zhang and others, for reviewing the Gated DeltaNet attention kernels and improving the numerics.
* **NVIDIA**, including Vadim Gimpelson, for testing the models.
* **IBM Research**, including Thomas Parnell, for hybrid memory management and CUDA graph optimizations.
* **Red Hat**, including Tyler Michael Smith, Doug Smith, Tarun Kumar, and Elvir Crncevic, for testing the model and tuning MoE kernels.
* **Community partners**, including Roblox and Meta, for testing, feedback, and scaling insights.

vLLM team members who contributed to this effort are Jie Li, Kaichao You, Chen Zhang, and Simon Mo.

👉 Qwen3-Next is now available in **vLLM**. Try it out today and experience **ultra-efficient long-context inference** with the latest hybrid MoE architecture.
Review comment: Should we specifically call out automatic prefix caching (which is a prerequisite for P/D disaggregation)? We could reference this WIP PR: vllm-project/vllm#23941.
Reply: Added P/D, but I'd prefer not to reference in-progress PRs, since the author may close that PR and open another.