_posts/2025-09-11-qwen3-next.md (5 additions, 5 deletions)
@@ -50,11 +50,11 @@ In order to manage state for hybrid models like Qwen3-Next, vLLM automatically t
</p>
-In addition, Flash Linear Attention is based on Triton. Launching Triton kernels can incur significant CPU overheads that disproportionately affect decode-only batches. To overcome this, vLLM enables full CUDA graph mode by default, ensuring good performance in low-latency scenarios
+In addition, Flash Linear Attention is based on Triton. Launching Triton kernels can incur significant CPU overheads that disproportionately affect decode-only batches. To overcome this, vLLM enables full CUDA graph mode by default, ensuring good performance in low-latency scenarios.
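For readers who want to try this path, here is a minimal sketch of offline inference with vLLM's Python API. The Hugging Face repo name and the parallelism setting are assumptions; per the paragraph above, full CUDA graph mode is enabled by default for this model, so no extra compilation settings should be needed.

```python
# Minimal sketch (assumptions: HF repo name, tensor_parallel_size for your GPUs).
from vllm import LLM, SamplingParams

# Per the post, full CUDA graph mode is on by default for Qwen3-Next,
# so nothing extra needs to be configured for low-latency decode.
llm = LLM(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",  # assumed repo name
    tensor_parallel_size=4,                    # adjust to your hardware
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain linear attention in one paragraph."], params)
print(outputs[0].outputs[0].text)
```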
## **High-Sparsity MoE: Extreme Efficiency**
-Qwen3-Next pushes sparsity further with **MoE layers at 1:50 activation ratio**. In the flagship **80B-A3B model**, only **3B parameters are active per token**. vLLM can have great throughput and latency with the built-in efficient MoE implementation.
+Qwen3-Next pushes sparsity further with **MoE layers at a 1:50 activation ratio**. In the flagship **80B-A3B model**, only **3B parameters are active per token**. vLLM delivers strong throughput and latency with its built-in, efficient MoE implementation.
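As a rough, back-of-envelope illustration of the figures quoted above (not an official benchmark): with roughly 3B of 80B parameters active per token, only a few percent of the weights participate in each forward pass, which is why decode-time compute tracks the active rather than the total parameter count.

```python
# Back-of-envelope sketch using only the figures quoted in the post.
total_params = 80e9   # 80B total parameters in Qwen3-Next-80B-A3B
active_params = 3e9   # ~3B parameters active per token

active_fraction = active_params / total_params
print(f"Active fraction per token: {active_fraction:.2%}")  # ~3.75%

# Using the common ~2 FLOPs-per-active-parameter rule of thumb for matmuls,
# per-token decode cost scales with the ~3B active weights, not the full 80B.
flops_per_token = 2 * active_params
print(f"~{flops_per_token / 1e9:.0f} GFLOPs per decoded token (rough estimate)")
```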
## **Multi-Token Prediction (MTP)**
@@ -73,13 +73,13 @@ Our Qwen3-Next integration is just the beginning. On the roadmap:
This effort was made possible thanks to close collaboration with many partners:
-* **Qwen Team**, including Tao He, Jianwei Zhang for open-sourcing the model.
+* **Qwen Team**, including Tao He, Jianwei Zhang, for open-sourcing the model.
* **Flash Linear Attention team**, including Yu Zhang, etc. for reviewing the gated deltanet attention kernels and improving the numerics.
* **NVIDIA**, including Vadim Gimpelson for testing the models.
* **IBM Research**, including Thomas Parnell for hybrid memory management and CUDA graph optimizations.
* **Red Hat**, including Tyler Michael Smith, Doug Smith, Tarun Kumar, and Elvir Crncevic for testing the model and tuning MoE kernels.
-* **Community partners**: Roblox, Meta, — for testing, feedback, and scaling insights.
+* **Community partners**: Roblox, Meta, etc. for testing, feedback, and scaling insights.
-vLLM team members who contributed to this effort are: Jie Li, Kaichao You, Chen Zhang, Simon Mo.
+vLLM team members who contributed to this effort include: Jie Li, Kaichao You, Chen Zhang, Simon Mo.
👉 Qwen3-Next is now available in **vLLM**. Try it out today and experience **ultra-efficient long-context inference** with the latest hybrid MoE architecture.
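As a hedged usage sketch: assuming a vLLM OpenAI-compatible server has been started (for example with `vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct`, where the repo name and port are assumptions), it can be queried with the standard OpenAI Python client.

```python
# Sketch of querying a running vLLM OpenAI-compatible server (assumed to be
# started with `vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct` on port 8000).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",  # assumed repo name
    messages=[{"role": "user", "content": "Summarize this 100k-token report in five bullets."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```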