Commit fc3e742

update
Signed-off-by: heheda <[email protected]>
1 parent f0d9ada commit fc3e742

File tree

1 file changed: +1 -1 lines changed

_posts/2025-09-11-qwen3-next.md

Lines changed: 1 addition & 1 deletion
@@ -39,7 +39,7 @@ At the core of Qwen3-Next is its **Hybrid Attention** design, replacing standard

The model interleaves these two forms of attention across layers, enabling efficient scaling to **65K context lengths** and beyond.

-To support this, vLLM integrates Triton kernels from [Flash Linear Attention](https://github.com/fla-org/flash-linear-attention), and adopts a [hybrid KV cache manager](https://arxiv.org/abs/2503.18292) to manage both linear and full attention layers with the same memory pool, avoiding fragmentation and maximizing GPU utilization.
+To support this, vLLM integrates Triton kernels from [Flash Linear Attention](https://github.com/fla-org/flash-linear-attention), and adopts a [hybrid KV cache manager](https://arxiv.org/abs/2503.18292) to manage both linear and full attention layers, avoiding fragmentation and maximizing GPU utilization.

In order to manage state for hybrid models like Qwen3-Next, vLLM automatically tunes the “logical” block size of the full attention layers to ensure that the state for the full attention layers and linear attention layers occupy the same amount of “physical” GPU memory. This enables simple and efficient paged memory management for hybrid models, increasing throughput for heavy workloads when the GPU memory becomes fully utilized.

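The paragraph kept as context in this hunk describes how vLLM sizes the “logical” blocks of the full attention layers so that a full-attention block and a linear-attention state slot occupy the same amount of physical GPU memory. Below is a minimal sketch of that sizing arithmetic only; it is not vLLM's implementation, and the function names, the 16-token physical block size, and all byte counts are hypothetical placeholders chosen for illustration.

```python
import math


def per_token_kv_bytes(num_kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache that one token occupies in one full-attention layer (K + V)."""
    return 2 * num_kv_heads * head_dim * dtype_bytes


def choose_logical_block_size(linear_state_bytes: int,
                              kv_bytes_per_token: int,
                              physical_block_tokens: int = 16) -> int:
    """Illustrative rule (not vLLM's exact algorithm): pick the smallest multiple
    of the physical block size whose KV footprint is at least as large as one
    linear-attention state slot, so both can be paged from one memory pool."""
    tokens_needed = math.ceil(linear_state_bytes / kv_bytes_per_token)
    blocks = math.ceil(tokens_needed / physical_block_tokens)
    return blocks * physical_block_tokens


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    kv_per_token = per_token_kv_bytes(num_kv_heads=2, head_dim=256, dtype_bytes=2)
    linear_state = 2 * 1024 * 1024  # assumed per-request recurrent-state size, in bytes
    logical_block = choose_logical_block_size(linear_state, kv_per_token)
    print(f"logical block size: {logical_block} tokens -> "
          f"{logical_block * kv_per_token} bytes of KV per block vs "
          f"{linear_state} bytes of linear-attention state")
```

Once a full-attention block and a linear-attention state slot end up the same size, both kinds of state can be allocated from a single paged memory pool, which is the property the post credits for avoiding fragmentation and maximizing GPU utilization.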
0 commit comments
