_posts/2025-09-11-qwen3-next.md (1 addition, 1 deletion)
@@ -39,7 +39,7 @@ At the core of Qwen3-Next is its **Hybrid Attention** design, replacing standard
The model interleaves these two forms of attention across layers, enabling efficient scaling to **65K context lengths** and beyond.
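For intuition, here is a minimal sketch of what that interleaving can look like at the layer level. It is not vLLM or Qwen3-Next code; the layer count, the ratio of linear to full attention, and the layer names are assumptions chosen purely for illustration:

```python
# Illustrative only: a toy layer schedule that interleaves linear attention
# with periodic full attention. The 3:1 ratio and layer count are assumed
# for illustration, not taken from the actual Qwen3-Next configuration.
def build_layer_schedule(num_layers: int = 48, full_attn_every: int = 4) -> list[str]:
    schedule = []
    for i in range(num_layers):
        # Most layers use linear attention, whose per-token state is constant
        # size; a periodic subset keeps standard full attention for global
        # token mixing across the whole context.
        if (i + 1) % full_attn_every == 0:
            schedule.append("full_attention")
        else:
            schedule.append("linear_attention")
    return schedule

print(build_layer_schedule(num_layers=8))
# ['linear_attention', 'linear_attention', 'linear_attention', 'full_attention',
#  'linear_attention', 'linear_attention', 'linear_attention', 'full_attention']
```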
- To support this, vLLM integrates Triton kernels from [Flash Linear Attention](https://github.com/fla-org/flash-linear-attention), and adopts a [**hybrid KV cache manager**](https://arxiv.org/abs/2503.18292) to manage both linear and full attention layers with the same memory pool, avoiding fragmentation and maximizing GPU utilization.
+ To support this, vLLM integrates Triton kernels from [Flash Linear Attention](https://github.com/fla-org/flash-linear-attention), and adopts a [hybrid KV cache manager](https://arxiv.org/abs/2503.18292) to manage both linear and full attention layers with the same memory pool, avoiding fragmentation and maximizing GPU utilization.
In order to manage state for hybrid models like Qwen3-Next, vLLM automatically tunes the “logical” block size of the full attention layers to ensure that the state for the full attention layers and linear attention layers occupy the same amount of “physical” GPU memory. This enables simple and efficient paged memory management for hybrid models, increasing throughput for heavy workloads when the GPU memory becomes fully utilized.
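As a rough sketch of that idea (the shapes, byte counts, and helper names below are illustrative assumptions, not vLLM's actual accounting), the "logical" block size can be grown until one full-attention KV block occupies at least as much "physical" memory as one linear-attention state entry, so both map onto equal-sized pages in the shared pool:

```python
# Illustrative sketch: pick the "logical" block size for full-attention layers
# so that one full-attention KV block occupies roughly the same "physical"
# memory as one linear-attention recurrent state. All shapes and byte counts
# here are made-up assumptions, not vLLM's real bookkeeping.

def full_attn_block_bytes(block_size: int, num_kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    # K and V tensors for `block_size` tokens in one layer.
    return 2 * block_size * num_kv_heads * head_dim * dtype_bytes

def linear_attn_state_bytes(num_heads: int, head_dim: int, state_dim: int, dtype_bytes: int = 2) -> int:
    # A fixed-size recurrent state per sequence, independent of context length.
    return num_heads * head_dim * state_dim * dtype_bytes

def pick_block_size(num_kv_heads: int, head_dim: int, num_heads: int, state_dim: int) -> int:
    target = linear_attn_state_bytes(num_heads, head_dim, state_dim)
    # Grow the block size (powers of two) until one full-attention KV block is
    # at least as large as one linear-attention state, so both layer types can
    # share equal-sized physical pages without fragmentation.
    block_size = 16
    while full_attn_block_bytes(block_size, num_kv_heads, head_dim) < target:
        block_size *= 2
    return block_size

print(pick_block_size(num_kv_heads=8, head_dim=128, num_heads=32, state_dim=128))  # -> 256
```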