
Commit 154ff9c

Add more code formatting - up to FSM section
1 parent b0e1788 commit 154ff9c

File tree

1 file changed (+25, -25 lines)


_posts/2025-09-05-anatomy-of-vllm.md

Lines changed: 25 additions & 25 deletions
@@ -220,16 +220,16 @@ We call model executor's <code>execute_model</code>, which delegates to the <cod

Here are the main steps:

-1. Update states — prune finished requests from input_batch; update misc fwd pass related metadata (e.g., KV cache blocks per request that will be used to index into paged KV cache memory).
-2. Prepare inputs — copy buffers from CPU→GPU; compute positions; build slot_mapping (more on that in example); construct attention metadata.
-3. Forward pass — run the model with custom paged attn kernels. All sequences are flattened and concatenated into one long "super sequence". Position indices and attention masks ensure each sequence only attends to its own tokens, which enables continuous batching without right-padding.
-4. Gather last-token states — extract hidden states for each sequence's final position and compute logits.
-5. Sample — sample tokens from computed logits as dictated by the sampling config (greedy, temperature, top-p, top-k, etc.).
+1. <b>Update states</b> — prune finished requests from <code>input_batch</code>; update misc fwd pass related metadata (e.g., KV cache blocks per request that will be used to index into paged KV cache memory).
+2. <b>Prepare inputs</b> — copy buffers from CPU→GPU; compute positions; build <code>slot_mapping</code> (more on that in example); construct attention metadata.
+3. <b>Forward pass</b> — run the model with custom paged attn kernels. All sequences are flattened and concatenated into one long "super sequence". Position indices and attention masks ensure each sequence only attends to its own tokens, which enables continuous batching without right-padding.
+4. <b>Gather last-token states</b> — extract hidden states for each sequence's final position and compute logits.
+5. <b>Sample</b> — sample tokens from computed logits as dictated by the sampling config (greedy, temperature, top-p, top-k, etc.).

The forward-pass step itself has two execution modes:

-1. Eager mode — run the standard PyTorch forward pass when eager execution is enabled.
-2. "Captured" mode — execute/replay a pre-captured CUDA Graph when eager is not enforced (remember we captured these during engine construction in the initialize KV cache procedure).
+1. <b>Eager mode</b> — run the standard PyTorch forward pass when eager execution is enabled.
+2. <b>"Captured" mode</b> — execute/replay a pre-captured CUDA Graph when eager is not enforced (remember we captured these during engine construction in the initialize KV cache procedure).

Here is a concrete example that should make continuous batching and paged attention clear:

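As a mental model of the five steps and two execution modes above, here is a minimal, self-contained sketch. None of these names (<code>Request</code>, <code>InputBatch</code>, <code>prune_finished</code>, <code>execute_model_step</code>) are vLLM's real APIs; it only mirrors the flow described in the list.

```python
# Hypothetical sketch of one model-runner step: update -> prepare -> forward -> gather -> sample.
from dataclasses import dataclass, field

@dataclass
class Request:
    req_id: str
    token_ids: list[int]
    finished: bool = False

@dataclass
class InputBatch:
    requests: list[Request] = field(default_factory=list)

    def prune_finished(self) -> None:
        self.requests = [r for r in self.requests if not r.finished]

def execute_model_step(batch: InputBatch, use_cuda_graph: bool = False) -> dict[str, int]:
    # 1. Update states: drop finished requests (KV-block bookkeeping omitted here).
    batch.prune_finished()

    # 2. Prepare inputs: flatten all sequences into one long "super sequence",
    #    recording per-token positions (slot_mapping / attention metadata omitted).
    flat_tokens: list[int] = []
    positions: list[int] = []
    last_idx: dict[str, int] = {}
    for r in batch.requests:
        for pos, tok in enumerate(r.token_ids):
            flat_tokens.append(tok)
            positions.append(pos)
        last_idx[r.req_id] = len(flat_tokens) - 1

    # 3. Forward pass: eager PyTorch or replay of a pre-captured CUDA graph.
    #    Here the "hidden states" are faked as the token ids themselves.
    hidden = list(flat_tokens)

    # 4. Gather last-token states: one hidden state (-> logits) per sequence.
    logits = {req_id: hidden[i] for req_id, i in last_idx.items()}

    # 5. Sample: greedy stand-in (real sampling applies temperature/top-p/top-k).
    return {req_id: logit + 1 for req_id, logit in logits.items()}

batch = InputBatch([Request("a", [1, 2, 3]), Request("b", [7, 8])])
print(execute_model_step(batch))  # {'a': 4, 'b': 9}
```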
@@ -258,7 +258,7 @@ Next, we'll dive into:

Chunked prefill is a technique for handling long prompts by splitting their prefill step into smaller chunks. Without it, we could end up with a single very long request monopolizing one engine step, preventing other prefill requests from running. That would postpone all other requests and increase their latency.

-For example, let each chunk contain n (=8) tokens, labeled with lowercase letters separated by "-". A long prompt P could look like x-y-z, where z is an incomplete chunk (e.g. 2 toks). Executing the full prefill for P would then take ≥ 3 engine steps (> can happen if it's not scheduled for execution in one of the steps), and only in the last chunked prefill step would we sample one new token.
+For example, let each chunk contain <code>n</code> (=8) tokens, labeled with lowercase letters separated by "-". A long prompt <code>P</code> could look like <code>x-y-z</code>, where <code>z</code> is an incomplete chunk (e.g. 2 toks). Executing the full prefill for <code>P</code> would then take ≥ 3 engine steps (> can happen if it's not scheduled for execution in one of the steps), and only in the last chunked prefill step would we sample one new token.

Here is that same example visually:

@@ -269,9 +269,9 @@ Here is that same example visually:
<b>Figure 5</b>: Chunked prefill
</p>

-Implementation is straightforward: cap the number of new tokens per step. If the requested number exceeds long_prefill_token_threshold, reset it to exactly that value. The underlying indexing logic (described earlier) takes care of the rest.
+Implementation is straightforward: cap the number of new tokens per step. If the requested number exceeds <code>long_prefill_token_threshold</code>, reset it to exactly that value. The underlying indexing logic (described earlier) takes care of the rest.

-In vLLM V1, you enable chunked prefill by setting long_prefill_token_threshold to a positive integer. (Technically, it can happen irrespective of this: if the prompt length exceeds the token budget, we truncate it and run a chunked prefill.)
+In vLLM V1, you enable chunked prefill by setting <code>long_prefill_token_threshold</code> to a positive integer. (Technically, it can happen irrespective of this: if the prompt length exceeds the token budget, we truncate it and run a chunked prefill.)

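To make the cap concrete, here is a toy sketch of the scheduling arithmetic under the post's assumptions (8-token chunks). <code>long_prefill_token_threshold</code> is the knob named above; everything else is made up for illustration.

```python
# Toy illustration of chunked prefill: a long prompt is admitted into engine steps
# in chunks of at most long_prefill_token_threshold tokens.
def schedule_prefill_chunks(prompt_len: int, long_prefill_token_threshold: int) -> list[int]:
    """Return how many prompt tokens get prefilled in each engine step."""
    chunks: list[int] = []
    remaining = prompt_len
    while remaining > 0:
        new_tokens = min(remaining, long_prefill_token_threshold)  # the cap described above
        chunks.append(new_tokens)
        remaining -= new_tokens
    return chunks

# The x-y-z prompt from the example: two full 8-token chunks plus 2 leftover tokens.
print(schedule_prefill_chunks(prompt_len=18, long_prefill_token_threshold=8))  # [8, 8, 2]
# Only after the final chunk does the request produce its first sampled token.
```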

### Prefix Caching

@@ -299,29 +299,29 @@ if __name__ == "__main__":
    main()
```

-Prefix caching avoids recomputing tokens that multiple prompts share at the beginning - hence prefix.
+Prefix caching avoids recomputing tokens that multiple prompts share at the beginning - hence <b>prefix</b>.

-The crucial piece is the long_prefix: it's defined as any prefix longer than a KV-cache block (16 tokens by default). To simplify our example let's say long_prefix has exactly length n x block_size (where n ≥ 1).
+The crucial piece is the <code>long_prefix</code>: it's defined as any prefix longer than a KV-cache block (16 tokens by default). To simplify our example let's say <code>long_prefix</code> has exactly length <code>n x block_size</code> (where <code>n ≥ 1</code>).

> [!NOTE]
-> i.e. it perfectly aligns with block boundary - otherwise we'd have to recompute long_prefix_len % block_size tokens as we can't cache incomplete blocks.
+> i.e. it perfectly aligns with block boundary - otherwise we'd have to recompute <code>long_prefix_len % block_size</code> tokens as we can't cache incomplete blocks.

-Without prefix caching, each time we process a new request with the same long_prefix, we'd recompute all n x block_size tokens.
+Without prefix caching, each time we process a new request with the same <code>long_prefix</code>, we'd recompute all <code>n x block_size</code> tokens.

With prefix caching, those tokens are computed once (their KVs stored in KV cache paged memory) and then reused, so only the new prompt tokens need processing. This speeds up prefill requests (though it doesn't help with decode).

How does this work in vLLM?

-During the first generate call, in the scheduling stage, inside kv_cache_manager.get_computed_blocks, the engine invokes hash_request_tokens:
+During the first <code>generate</code> call, in the scheduling stage, inside <code>kv_cache_manager.get_computed_blocks</code>, the engine invokes <code>hash_request_tokens</code>:

-1. This function splits the long_prefix + prompts[0] into 16-token chunks.
+1. This function splits the <code>long_prefix + prompts[0]</code> into 16-token chunks.
2. For each complete chunk, it computes a hash (using either the built-in hash or SHA-256, which is slower but has fewer collisions). The hash combines the previous block's hash, the current tokens, and optional metadata.
> [!NOTE] optional metadata includes: MM hash, LoRA ID, cache salt (injected into the hash of the first block; ensures only requests with this cache salt can reuse blocks).
-3. Each result is stored as a BlockHash object containing both the hash and its token IDs. We return a list of block hashes.
+3. Each result is stored as a <code>BlockHash</code> object containing both the hash and its token IDs. We return a list of block hashes.

-The list is stored in self.req_to_block_hashes[request_id].
+The list is stored in <code>self.req_to_block_hashes[request_id]</code>.

-Next, the engine calls find_longest_cache_hit to check if any of these hashes already exist in cached_block_hash_to_block. On the first request, no hits are found.
+Next, the engine calls <code>find_longest_cache_hit</code> to check if any of these hashes already exist in <code>cached_block_hash_to_block</code>. On the first request, no hits are found.

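The chained hashing in steps 1-3 can be sketched roughly as follows. <code>BlockHash</code> here is a stand-in dataclass and the hashing details are simplified; real vLLM also mixes in the optional metadata from the note above.

```python
# Simplified sketch of hash_request_tokens: split tokens into 16-token blocks and
# hash each block together with the previous block's hash (prefix chaining).
import hashlib
from dataclasses import dataclass

BLOCK_SIZE = 16  # default KV-cache block size

@dataclass(frozen=True)
class BlockHash:  # stand-in for vLLM's BlockHash
    hash_value: bytes
    token_ids: tuple[int, ...]

def hash_request_tokens_sketch(token_ids: list[int]) -> list[BlockHash]:
    block_hashes: list[BlockHash] = []
    parent_hash = b""
    # Only complete blocks are hashed; a trailing partial block is skipped.
    for start in range(0, len(token_ids) - BLOCK_SIZE + 1, BLOCK_SIZE):
        chunk = tuple(token_ids[start:start + BLOCK_SIZE])
        digest = hashlib.sha256(parent_hash + repr(chunk).encode()).digest()
        block_hashes.append(BlockHash(digest, chunk))
        parent_hash = digest  # chain into the next block's hash
    return block_hashes

# Two prompts sharing a 32-token prefix get identical hashes for their first 2 blocks.
shared_prefix = list(range(32))
a = hash_request_tokens_sketch(shared_prefix + [100, 101])
b = hash_request_tokens_sketch(shared_prefix + [200, 201, 202])
assert [h.hash_value for h in a[:2]] == [h.hash_value for h in b[:2]]
```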
<p align="center">
<picture>
@@ -330,12 +330,12 @@ Next, the engine calls find_longest_cache_hit to check if any of these hashes al
<b>Figure 6</b>: Prefix caching - hash function
</p>

-Then we call allocate_slots which calls coordinator.cache_blocks, which associates the new BlockHash entries with allocated KV blocks and records them in cached_block_hash_to_block.
+Then we call <code>allocate_slots</code> which calls <code>coordinator.cache_blocks</code>, which associates the new <code>BlockHash</code> entries with allocated KV blocks and records them in <code>cached_block_hash_to_block</code>.

Afterwards, the forward pass will populate KVs in paged KV cache memory corresponding to KV cache blocks that we allocated above.

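In spirit, the <code>cache_blocks</code> step just records each newly written full block under its hash so later lookups can find it. A minimal sketch with made-up names (the real coordinator tracks far more state):

```python
# After allocate_slots picks physical blocks for a request, cache_blocks-style
# logic registers each full block under its prefix hash for future reuse.
cached_block_hash_to_block: dict[bytes, int] = {}  # block hash -> physical block id

def cache_blocks_sketch(block_hashes: list[bytes], allocated_block_ids: list[int]) -> None:
    for block_hash, block_id in zip(block_hashes, allocated_block_ids):
        cached_block_hash_to_block[block_hash] = block_id

cache_blocks_sketch([b"hash-0", b"hash-1"], allocated_block_ids=[7, 12])
print(cached_block_hash_to_block)  # {b'hash-0': 7, b'hash-1': 12}
```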
> [!NOTE]
-> After many engine steps it'll allocate more KV cache blocks but it doesn't matter for our example because the prefix has diverged immediately after long_prefix.
+> After many engine steps it'll allocate more KV cache blocks but it doesn't matter for our example because the prefix has diverged immediately after <code>long_prefix</code>.

<p align="center">
<picture>
@@ -344,7 +344,7 @@ Afterwards, the forward pass will populate KVs in paged KV cache memory correspo
<b>Figure 7</b>: Prefix caching - populate KVs in paged memory
</p>

-On a second generate call with the same prefix, steps 1-3 repeat, but now find_longest_cache_hit finds matches for all n blocks (via linear search). The engine can reuse those KV blocks directly.
+On a second <code>generate</code> call with the same prefix, steps 1-3 repeat, but now <code>find_longest_cache_hit</code> finds matches for all n blocks (via linear search). The engine can reuse those KV blocks directly.

<p align="center">
<picture>
@@ -353,16 +353,16 @@ On a second generate call with the same prefix, steps 1-3 repeat, but now find_l
<b>Figure 8</b>: Prefix caching - reuse KVs
</p>

-If the original request were still alive, the reference count for those blocks would increment (e.g. to 2). In this example, the first request has already completed, so the blocks were freed back to the pool and their reference counts set back to 0. Because we were able to retrieve them from cached_block_hash_to_block we know they're valid (the logic of the KV cache manager is set up in such a way), so we just remove them from free_block_queue again.
+If the original request were still alive, the reference count for those blocks would increment (e.g. to 2). In this example, the first request has already completed, so the blocks were freed back to the pool and their reference counts set back to 0. Because we were able to retrieve them from <code>cached_block_hash_to_block</code> we know they're valid (the logic of the KV cache manager is set up in such a way), so we just remove them from <code>free_block_queue</code> again.

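The reference-count and free-queue behaviour described here (and in the advanced note below) can be sketched with a toy block pool. All names are illustrative; vLLM's actual KV cache manager is considerably more involved.

```python
# Toy block pool: blocks are freed to a queue when a request finishes, but keep
# their hash so a later prefix hit can pull them back out before they get reused.
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class Block:
    block_id: int
    ref_cnt: int = 0
    block_hash: Optional[bytes] = None

class ToyBlockPool:
    def __init__(self, num_blocks: int) -> None:
        self.blocks = [Block(i) for i in range(num_blocks)]
        self.free_block_queue = deque(self.blocks)          # popleft() gives the next block to reuse
        self.cached_block_hash_to_block: dict[bytes, Block] = {}

    def touch_cached(self, block_hash: bytes) -> Optional[Block]:
        """Prefix-cache hit: revive a freed-but-still-valid block."""
        block = self.cached_block_hash_to_block.get(block_hash)
        if block is not None:
            if block.ref_cnt == 0:
                self.free_block_queue.remove(block)         # pull it back off the free queue
            block.ref_cnt += 1
        return block

    def free(self, block: Block) -> None:
        block.ref_cnt -= 1
        if block.ref_cnt == 0:
            self.free_block_queue.append(block)             # hash mapping is kept for now

    def allocate_fresh(self) -> Block:
        block = self.free_block_queue.popleft()
        if block.block_hash is not None:                    # reallocation invalidates the old prefix
            del self.cached_block_hash_to_block[block.block_hash]
            block.block_hash = None
        block.ref_cnt = 1
        return block

# Request 1: allocate a block, cache it under its prefix hash, then finish.
pool = ToyBlockPool(num_blocks=4)
blk = pool.allocate_fresh()
blk.block_hash = b"prefix-hash"
pool.cached_block_hash_to_block[blk.block_hash] = blk       # what cache_blocks would do
pool.free(blk)                                              # ref_cnt back to 0, block on free queue
# Request 2 with the same prefix: hit in the hash map, block is revived, not recomputed.
assert pool.touch_cached(b"prefix-hash") is blk
```

The key design point this sketch tries to capture: freeing a block does not erase its hash; only reallocation from the left of the free queue does, which is exactly the invalidation rule in the note below.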
> [!NOTE] Advanced note:
-> KV-cache blocks become invalid only when they're about to be reallocated from the free_block_queue (which pops from the left) and we discover the block still has an associated hash and is present in cached_block_hash_to_block. At that moment, we clear the block's hash and remove its entry from cached_block_hash_to_block, ensuring it can't be reused via prefix caching (at least not for that old prefix).
+> KV-cache blocks become invalid only when they're about to be reallocated from the <code>free_block_queue</code> (which pops from the left) and we discover the block still has an associated hash and is present in <code>cached_block_hash_to_block</code>. At that moment, we clear the block's hash and remove its entry from <code>cached_block_hash_to_block</code>, ensuring it can't be reused via prefix caching (at least not for that old prefix).

And that's the gist of prefix caching: don't recompute prefixes you've already seen — just reuse their KV cache!

If you understood this example, you also understood how paged attention works.

-Prefix caching is enabled by default. To disable it: enable_prefix_caching = False.
+Prefix caching is enabled by default. To disable it: <code>enable_prefix_caching = False</code>.

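For reference, the two switches discussed in this section are ordinary engine arguments. A hedged usage sketch (the model name is a placeholder, and exactly where <code>long_prefill_token_threshold</code> is accepted may vary between vLLM versions):

```python
from vllm import LLM

# Prefix caching is on by default; pass enable_prefix_caching=False to turn it off.
# long_prefill_token_threshold (vLLM V1) caps how many prompt tokens a single
# engine step may prefill, which is what enables chunked prefill.
llm = LLM(
    model="facebook/opt-125m",              # placeholder model
    enable_prefix_caching=False,
    long_prefill_token_threshold=2048,
)
```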
### Guided Decoding (FSM)
