_posts/2025-09-11-qwen3-next.md (1 addition, 1 deletion)
@@ -39,7 +39,7 @@ At the core of Qwen3-Next is its **Hybrid Attention** design, replacing standard
The model interleaves these two forms of attention across layers, enabling efficient scaling to **65K context lengths** and beyond.
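For intuition, here is a minimal sketch of what that interleaving can look like at the layer level. It is not vLLM or Qwen3-Next code; the layer count, the ratio of linear to full attention, and the layer names are assumptions chosen purely for illustration:

```python
# Illustrative only: a toy layer schedule that interleaves linear attention
# with periodic full attention. The 3:1 ratio and layer count are assumed
# for illustration, not taken from the actual Qwen3-Next configuration.
def build_layer_schedule(num_layers: int = 48, full_attn_every: int = 4) -> list[str]:
    schedule = []
    for i in range(num_layers):
        # Most layers use linear attention, whose per-token state is constant
        # size; a periodic subset keeps standard full attention for global
        # token mixing across the whole context.
        if (i + 1) % full_attn_every == 0:
            schedule.append("full_attention")
        else:
            schedule.append("linear_attention")
    return schedule

print(build_layer_schedule(num_layers=8))
# ['linear_attention', 'linear_attention', 'linear_attention', 'full_attention',
#  'linear_attention', 'linear_attention', 'linear_attention', 'full_attention']
```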
- To support this, vLLM integrates Triton kernels from [Flash Linear Attention](https://github.com/fla-org/flash-linear-attention), and adopts a [**hybrid KV cache manager**](https://arxiv.org/abs/2503.18292) to manage both linear and full attention layers with the same memory pool, avoiding fragmentation and maximizing GPU utilization.
+ To support this, vLLM integrates Triton kernels from [Flash Linear Attention](https://github.com/fla-org/flash-linear-attention), and adopts a [hybrid KV cache manager](https://arxiv.org/abs/2503.18292) to manage both linear and full attention layers with the same memory pool, avoiding fragmentation and maximizing GPU utilization.
In order to manage state for hybrid models like Qwen3-Next, vLLM automatically tunes the “logical” block size of the full attention layers to ensure that the state for the full attention layers and linear attention layers occupy the same amount of “physical” GPU memory. This enables simple and efficient paged memory management for hybrid models, increasing throughput for heavy workloads when the GPU memory becomes fully utilized.
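As a rough sketch of that idea (the shapes, byte counts, and helper names below are illustrative assumptions, not vLLM's actual accounting), the "logical" block size can be grown until one full-attention KV block occupies at least as much "physical" memory as one linear-attention state entry, so both map onto equal-sized pages in the shared pool:

```python
# Illustrative sketch: pick the "logical" block size for full-attention layers
# so that one full-attention KV block occupies roughly the same "physical"
# memory as one linear-attention recurrent state. All shapes and byte counts
# here are made-up assumptions, not vLLM's real bookkeeping.

def full_attn_block_bytes(block_size: int, num_kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    # K and V tensors for `block_size` tokens in one layer.
    return 2 * block_size * num_kv_heads * head_dim * dtype_bytes

def linear_attn_state_bytes(num_heads: int, head_dim: int, state_dim: int, dtype_bytes: int = 2) -> int:
    # A fixed-size recurrent state per sequence, independent of context length.
    return num_heads * head_dim * state_dim * dtype_bytes

def pick_block_size(num_kv_heads: int, head_dim: int, num_heads: int, state_dim: int) -> int:
    target = linear_attn_state_bytes(num_heads, head_dim, state_dim)
    # Grow the block size (powers of two) until one full-attention KV block is
    # at least as large as one linear-attention state, so both layer types can
    # share equal-sized physical pages without fragmentation.
    block_size = 16
    while full_attn_block_bytes(block_size, num_kv_heads, head_dim) < target:
        block_size *= 2
    return block_size

print(pick_block_size(num_kv_heads=8, head_dim=128, num_heads=32, state_dim=128))  # -> 256
```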