
Commit 439995b

fix grammar
Signed-off-by: heheda <[email protected]>
1 parent fc3e742 commit 439995b

File tree

1 file changed: +5 -5 lines changed


_posts/2025-09-11-qwen3-next.md

Lines changed: 5 additions & 5 deletions
@@ -50,11 +50,11 @@ In order to manage state for hybrid models like Qwen3-Next, vLLM automatically t
</p>


-In addition, Flash Linear Attention is based on Triton. Launching Triton kernels can incur significant CPU overheads that disproportionately affect decode-only batches. To overcome this, vLLM enables full CUDA graph mode by default, ensuring good performance in low-latency scenarios
+In addition, Flash Linear Attention is based on Triton. Launching Triton kernels can incur significant CPU overheads that disproportionately affect decode-only batches. To overcome this, vLLM enables full CUDA graph mode by default, ensuring good performance in low-latency scenarios.

## **High-Sparsity MoE: Extreme Efficiency**

-Qwen3-Next pushes sparsity further with **MoE layers at 1:50 activation ratio**. In the flagship **80B-A3B model**, only **3B parameters are active per token**. vLLM can have great throughput and latency with the built-in efficient MoE implementation.
+Qwen3-Next pushes sparsity further with **MoE layers at a 1:50 activation ratio**. In the flagship **80B-A3B model**, only **3B parameters are active per token**. vLLM can have great throughput and latency with the built-in efficient MoE implementation.


## **Multi-Token Prediction (MTP)**
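The changed line above describes why full CUDA graph capture matters for Qwen3-Next: the Triton-based Flash Linear Attention kernels carry CPU launch overhead that hits decode-only batches hardest. A minimal sketch of exercising (or disabling) that behaviour through vLLM's offline Python API is shown below; the checkpoint name, parallelism setting, and prompt are assumptions for illustration, not taken from this commit.

```python
# Hedged sketch: running Qwen3-Next through vLLM's offline API.
# The model name and tensor_parallel_size are illustrative assumptions.
from vllm import LLM, SamplingParams

# Per the paragraph above, vLLM defaults to full CUDA graph capture for this
# hybrid model, so decode-only steps avoid repeated Triton launch overhead.
llm = LLM(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",  # assumed checkpoint name
    tensor_parallel_size=4,                    # an 80B model typically spans several GPUs
)

# For debugging, CUDA graphs can be switched off entirely with eager mode,
# which brings back the per-step kernel launch cost discussed above:
# llm = LLM(model="Qwen/Qwen3-Next-80B-A3B-Instruct", enforce_eager=True)

outputs = llm.generate(
    ["Explain linear attention in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```

Finer-grained CUDA graph and compilation settings are exposed through vLLM's compilation config; the exact field names vary across releases, so check the documentation for the version you run.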
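The High-Sparsity MoE paragraph above attributes the low per-token cost to the 1:50 activation ratio: the router activates only a few experts per token, so most of the 80B parameters sit idle on any given forward pass. The toy top-k routing sketch below (plain PyTorch with scaled-down, made-up sizes, not vLLM's fused MoE kernels or Qwen3-Next's real configuration) illustrates the mechanism.

```python
# Toy sparse-MoE routing sketch; sizes are illustrative, not Qwen3-Next's.
import torch

num_experts, top_k, hidden = 128, 4, 64
router = torch.nn.Linear(hidden, num_experts, bias=False)
experts = torch.nn.ModuleList(
    torch.nn.Linear(hidden, hidden, bias=False) for _ in range(num_experts)
)

x = torch.randn(8, hidden)                       # 8 tokens
gate = torch.softmax(router(x), dim=-1)          # routing probabilities
weights, idx = torch.topk(gate, top_k, dim=-1)   # keep only the top-k experts
weights = weights / weights.sum(dim=-1, keepdim=True)

out = torch.zeros_like(x)
for t in range(x.size(0)):                       # per-token dispatch, written for clarity
    for w, e in zip(weights[t], idx[t]):
        out[t] += w * experts[e](x[t])

# Only top_k of num_experts expert matrices are touched per token, which is
# why a model with 80B total parameters can activate only a few billion of
# them for each token.
print(f"fraction of experts active per token: {top_k / num_experts:.3f}")
```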
@@ -73,13 +73,13 @@ Our Qwen3-Next integration is just the beginning. On the roadmap:

This effort was made possible thanks to close collaboration with many partners:

-* **Qwen Team**, including Tao He, Jianwei Zhang for open-sourcing the model.
+* **Qwen Team**, including Tao He, Jianwei Zhang, for open-sourcing the model.
* **Flash Linear Attention team**, including Yu Zhang, etc. for reviewing the gated deltanet attention kernels and improving the numerics.
* **NVIDIA**, including Vadim Gimpelson for testing the models.
* **IBM Research**, including Thomas Parnell for hybrid memory management and CUDA graph optimizations.
* **Red Hat**, including Tyler Michael Smith, Doug Smith, Tarun Kumar, and Elvir Crncevic for testing the model and tuning MoE kernels.
-* **Community partners**: Roblox, Meta, for testing, feedback, and scaling insights.
+* **Community partners**: Roblox, Meta, etc. for testing, feedback, and scaling insights.

-vLLM team members who contributed to this effort are: Jie Li, Kaichao You, Chen Zhang, Simon Mo.
+vLLM team members who contributed to this effort include: Jie Li, Kaichao You, Chen Zhang, Simon Mo.

👉 Qwen3-Next is now available in **vLLM**. Try it out today and experience **ultra-efficient long-context inference** with the latest hybrid MoE architecture.
