_posts/2025-09-05-anatomy-of-vllm.md (12 additions, 12 deletions)
@@ -94,7 +94,7 @@ Engine core itself is made up of several sub components:
<ol type="a">
<li>policy setting - it can be either <b>FCFS</b> (first come first served) or <b>priority</b> (higher priority requests are served first)</li>
<li><code>waiting</code> and <code>running</code> queues</li>
- <li>KV cache manager - the heart of paged attention [3]</li>
+ <li>KV cache manager - the heart of paged attention [[3]](#ref-3)</li>
</ol>

The KV-cache manager maintains a <code>free_block_queue</code> - a pool of available KV-cache blocks (often on the order of hundreds of thousands, depending on VRAM size and block size). During paged attention, the blocks serve as the indexing structure that maps tokens to their computed KV-cache blocks.
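
To make that indexing concrete, here is a minimal sketch of the idea. The class and method names (`SimpleBlockPool`, `allocate`, `block_for_token`) are invented for illustration and are not the actual vLLM KV-cache-manager API: a `free_block_queue` hands out fixed-size blocks, and a per-request block table maps a token's logical position to the physical block holding its KV entries.

```python
from collections import deque

class KVBlock:
    """One fixed-size slab of KV-cache memory, identified by its physical index."""
    def __init__(self, block_id: int):
        self.block_id = block_id

class SimpleBlockPool:
    """Toy stand-in for a KV-cache manager: a free_block_queue plus per-request block tables."""
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size  # tokens stored per block (e.g. 16)
        self.free_block_queue = deque(KVBlock(i) for i in range(num_blocks))
        self.block_tables = {}  # request_id -> list of KVBlock owned by that request

    def allocate(self, request_id: str, num_tokens: int):
        """Reserve enough blocks to hold num_tokens and return their physical ids."""
        needed = -(-num_tokens // self.block_size)  # ceiling division
        if needed > len(self.free_block_queue):
            raise RuntimeError("out of KV-cache blocks; the request has to wait or be preempted")
        blocks = [self.free_block_queue.popleft() for _ in range(needed)]
        self.block_tables.setdefault(request_id, []).extend(blocks)
        return [b.block_id for b in blocks]

    def block_for_token(self, request_id: str, token_pos: int) -> int:
        """Paged-attention-style lookup: logical token position -> physical block id."""
        return self.block_tables[request_id][token_pos // self.block_size].block_id

    def free(self, request_id: str) -> None:
        """Return a finished request's blocks to the pool for reuse."""
        for block in self.block_tables.pop(request_id, []):
            self.free_block_queue.append(block)

pool = SimpleBlockPool(num_blocks=8, block_size=16)
pool.allocate("req-0", num_tokens=40)     # 40 tokens -> ceil(40 / 16) = 3 blocks
print(pool.block_for_token("req-0", 35))  # token 35 lives in the request's 3rd block
pool.free("req-0")                        # blocks go back onto free_block_queue
```

Freed blocks go straight back onto the queue, which is how the same fixed pool keeps getting reused across requests.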
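The first two items in the list above (the scheduling policy and the <code>waiting</code>/<code>running</code> queues) can be sketched in the same spirit. Again, this is an illustrative toy, not the vLLM scheduler class, and every name in it is made up:

```python
import heapq
from collections import deque

class ToyScheduler:
    """Illustrative only: how an FCFS vs. a priority policy picks the next
    request to move from the waiting queue into the running queue."""

    def __init__(self, policy: str = "fcfs"):
        self.policy = policy
        self.waiting = deque()  # FCFS: FIFO of (priority, arrival_index, request_id)
        self.heap = []          # priority: min-heap over the same tuples
        self.running = []
        self._next_arrival = 0

    def add_request(self, request_id: str, priority: int = 0) -> None:
        entry = (priority, self._next_arrival, request_id)
        self._next_arrival += 1
        if self.policy == "priority":
            heapq.heappush(self.heap, entry)  # lowest priority value is served first
        else:
            self.waiting.append(entry)        # served strictly in arrival order

    def admit_next(self):
        """Move one request from waiting to running; return its id, or None if idle."""
        if self.policy == "priority":
            if not self.heap:
                return None
            entry = heapq.heappop(self.heap)
        else:
            if not self.waiting:
                return None
            entry = self.waiting.popleft()
        self.running.append(entry)
        return entry[-1]

sched = ToyScheduler(policy="priority")
sched.add_request("req-a", priority=5)
sched.add_request("req-b", priority=1)
print(sched.admit_next())  # "req-b" jumps ahead under the priority policy
```
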
@@ -979,14 +979,14 @@ A huge thank you to [Hyperstack](https://www.hyperstack.cloud/) for providing me
Thanks to [Nick Hill](https://www.linkedin.com/in/nickhillprofile/) (core vLLM contributor, RedHat), [Mark Saroufim](https://x.com/marksaroufim) (PyTorch), [Kyle Krannen](https://www.linkedin.com/in/kyle-kranen/) (NVIDIA, Dynamo), and [Ashish Vaswani](https://www.linkedin.com/in/ashish-vaswani-99892181/) for reading a pre-release version of this blog post and providing feedback!
2. <div id="ref-2">"Attention Is All You Need"<a href="https://arxiv.org/abs/1706.03762">https://arxiv.org/abs/1706.03762</a></div>
984
-
3. <div id="ref-3">"Efficient Memory Management for Large Language Model Serving with PagedAttention"<a href="https://arxiv.org/abs/2309.06180">https://arxiv.org/abs/2309.06180</a></div>
985
-
4. <div id="ref-4">"DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model"<a href="https://arxiv.org/abs/2405.04434">https://arxiv.org/abs/2405.04434</a></div>
986
-
5. <div id="ref-5">"Jenga: Effective Memory Management for Serving LLM with Heterogeneity"<a href="https://arxiv.org/abs/2503.18292">https://arxiv.org/abs/2503.18292</a></div>
987
-
6. <div id="ref-6">"Orca: A Distributed Serving System for Transformer-Based Generative Models"<a href="https://www.usenix.org/conference/osdi22/presentation/yu">https://www.usenix.org/conference/osdi22/presentation/yu</a></div>
988
-
7. <div id="ref-7">"XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models"<a href="https://arxiv.org/abs/2411.15100">https://arxiv.org/abs/2411.15100</a></div>
989
-
8. <div id="ref-8">"Accelerating Large Language Model Decoding with Speculative Sampling"<a href="https://arxiv.org/abs/2302.01318">https://arxiv.org/abs/2302.01318</a></div>
2. <a id="ref-2"></a>"Attention Is All You Need"<a href="https://arxiv.org/abs/1706.03762">https://arxiv.org/abs/1706.03762</a>
984
+
3. <a id="ref-3"></a>"Efficient Memory Management for Large Language Model Serving with PagedAttention"<a href="https://arxiv.org/abs/2309.06180">https://arxiv.org/abs/2309.06180</a>
985
+
4. <a id="ref-4"></a>"DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model"<a href="https://arxiv.org/abs/2405.04434">https://arxiv.org/abs/2405.04434</a>
986
+
5. <a id="ref-5"></a>"Jenga: Effective Memory Management for Serving LLM with Heterogeneity"<a href="https://arxiv.org/abs/2503.18292">https://arxiv.org/abs/2503.18292</a>
987
+
6. <a id="ref-6"></a>"Orca: A Distributed Serving System for Transformer-Based Generative Models"<a href="https://www.usenix.org/conference/osdi22/presentation/yu">https://www.usenix.org/conference/osdi22/presentation/yu</a>
988
+
7. <a id="ref-7"></a>"XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models"<a href="https://arxiv.org/abs/2411.15100">https://arxiv.org/abs/2411.15100</a>
989
+
8. <a id="ref-8"></a>"Accelerating Large Language Model Decoding with Speculative Sampling"<a href="https://arxiv.org/abs/2302.01318">https://arxiv.org/abs/2302.01318</a>