
Commit b51423e

Few minor fixes
1 parent 01e9c05 commit b51423e

1 file changed: 2 additions & 2 deletions

_posts/2025-09-05-anatomy-of-vllm.md

Lines changed: 2 additions & 2 deletions
@@ -8,7 +8,7 @@ image: /assets/logos/vllm-logo-text-light.png
 > [!NOTE]
 > Originally posted on [Aleksa Gordic's website](https://www.aleksagordic.com/blog/vllm).
 
-## From paged attention, continuous batching, prefix caching, specdec, etc. to multi-GPU, multi-node dynamic serving at scale
+### From paged attention, continuous batching, prefix caching, specdec, etc. to multi-GPU, multi-node dynamic serving at scale
 
 In this post, I'll gradually introduce all of the core system components and advanced features that make up a modern high-throughput LLM inference system. In particular I'll be doing a breakdown of how vLLM [1] works.
 
@@ -107,7 +107,7 @@ The KV-cache manager maintains a <code>free_block_queue</code> - a pool of avail
 
 > [!NOTE]
 > Block size for a standard transformer layer (non-MLA [4]) is computed as follows:
-2 * <code>block_size</code> (default=16) * <code>num_kv_heads</code> * <code>head_size</code> * <code>dtype_num_bytes</code> (2 for bf16)
+> 2 * <code>block_size</code> (default=16) * <code>num_kv_heads</code> * <code>head_size</code> * <code>dtype_num_bytes</code> (2 for bf16)
 
 During model executor construction, a <code>Worker</code> object is created, and three key procedures are executed. (Later, with <code>MultiProcExecutor</code>, these same procedures run independently on each worker process across different GPUs.)
 
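The block-size formula in the note above is straightforward to sanity-check numerically. The sketch below is a minimal illustration only: the `num_kv_heads` and `head_size` values are assumed for the example (they are model-dependent) and are not taken from the post or the diff.

```python
# Per-layer KV-cache block size in bytes for a standard (non-MLA) transformer layer:
#   2 (K and V) * block_size * num_kv_heads * head_size * dtype_num_bytes
block_size = 16       # default: tokens per KV-cache block
num_kv_heads = 8      # assumed value for illustration; model-dependent
head_size = 128       # assumed value for illustration; model-dependent
dtype_num_bytes = 2   # bf16

bytes_per_block_per_layer = 2 * block_size * num_kv_heads * head_size * dtype_num_bytes
print(bytes_per_block_per_layer)  # 65536 bytes, i.e. 64 KiB per block per layer
```

With these assumed values, each 16-token block costs 64 KiB of KV cache per layer.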
