Commit 01e9c05

Fix few typos
1 parent d379079 commit 01e9c05

_posts/2025-09-05-anatomy-of-vllm.md

Lines changed: 14 additions & 9 deletions
@@ -201,7 +201,7 @@ After that, it processes prefill requests from the <code>waiting</code> queue, i

Let's now look at what <code>allocate_slots</code> does, it:

-1. <b>Computes number of blocks</b> — determines how many new KV-cache blocks (n) must be allocated. Each block stores 16 tokens by default. For example, if a prefill request has 17 new tokens, we need <code>ceil(17/16) = 2</code> blocks.
+1. <b>Computes number of blocks</b> — determines how many new KV-cache blocks (<code>n</code>) must be allocated. Each block stores 16 tokens by default. For example, if a prefill request has 17 new tokens, we need <code>ceil(17/16) = 2</code> blocks.
2. <b>Checks availability</b> — if there aren't enough blocks in the manager's pool, exit early. Depending on whether it's a decode or prefill request, the engine may attempt recompute preemption (swap preemption was supported in V0) by evicting low-priority requests (calling <code>kv_cache_manager.free</code> which returns KV blocks to block pool), or it might skip scheduling and continue execution.
3. <b>Allocates blocks</b> — via the KV-cache manager's coordinator, fetches the first <code>n</code> blocks from the block pool (the <code>free_block_queue</code> doubly linked list mentioned earlier). Stores to <code>req_to_blocks</code>, the dictionary mapping each <code>request_id</code> to its list of KV-cache blocks.

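To make the three <code>allocate_slots</code> steps in this hunk concrete, here is a minimal Python sketch, not vLLM's actual implementation: the block pool is stood in for by a <code>deque</code>, <code>req_to_blocks</code> by a plain dict, the function signature is simplified, and the preemption path of step 2 is reduced to an early return.

```python
import math
from collections import deque

BLOCK_SIZE = 16  # default number of tokens stored per KV-cache block

# Hypothetical stand-ins for the block pool and the per-request mapping.
free_block_queue = deque(range(1000))        # ids of free KV-cache blocks
req_to_blocks: dict[str, list[int]] = {}     # request_id -> allocated block ids

def allocate_slots(request_id: str, num_new_tokens: int) -> bool:
    # 1. Compute number of blocks, e.g. 17 new tokens -> ceil(17/16) = 2.
    n = math.ceil(num_new_tokens / BLOCK_SIZE)
    # 2. Check availability; the real scheduler may preempt other requests here.
    if n > len(free_block_queue):
        return False
    # 3. Allocate: take the first n blocks from the free pool and record them.
    blocks = [free_block_queue.popleft() for _ in range(n)]
    req_to_blocks.setdefault(request_id, []).extend(blocks)
    return True

# A prefill request with 17 new tokens ends up with 2 blocks.
assert allocate_slots("req-0", 17) and len(req_to_blocks["req-0"]) == 2
```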
@@ -344,7 +344,7 @@ Afterwards, the forward pass will populate KVs in paged KV cache memory correspo
<b>Figure 7</b>: Prefix caching - populate KVs in paged memory
</p>

-On a second <code>generate</code> call with the same prefix, steps 1-3 repeat, but now <code>find_longest_cache_hit</code> finds matches for all n blocks (via linear search). The engine can reuse those KV blocks directly.
+On a second <code>generate</code> call with the same prefix, steps 1-3 repeat, but now <code>find_longest_cache_hit</code> finds matches for all <code>n</code> blocks (via linear search). The engine can reuse those KV blocks directly.

<p align="center">
<picture>
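As a rough illustration of the cache-hit path this hunk touches, the toy sketch below hashes full blocks of the prompt (chaining each block's hash with the previous one, so a block's identity depends on the whole prefix before it) and then performs the linear scan described above. The helpers <code>block_hashes</code>, <code>find_longest_cache_hit</code>, and the <code>cached_blocks</code> dict are illustrative simplifications, not vLLM's real data structures.

```python
from hashlib import sha256

BLOCK_SIZE = 16

def block_hashes(token_ids: list[int]) -> list[str]:
    """Hash each *full* block of tokens, chaining in the previous block's hash."""
    hashes, prev = [], ""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        prev = sha256((prev + ",".join(map(str, block))).encode()).hexdigest()
        hashes.append(prev)
    return hashes

def find_longest_cache_hit(hashes: list[str],
                           cached_blocks: dict[str, int]) -> list[int]:
    """Linear scan over block hashes; stop at the first miss."""
    hit = []
    for h in hashes:
        if h not in cached_blocks:
            break
        hit.append(cached_blocks[h])
    return hit

# Second call with the same prefix: every block hash is found in cached_blocks,
# so the returned block ids can be reused instead of recomputing their KVs.
```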
@@ -434,18 +434,23 @@ You can enable this in vLLM by passing in a desired <code>guided_decoding</code>

In autoregressive generation, each new token requires a forward pass of the large LM. This is expensive — every step reloads and applies all model weights just to compute a single token! (assuming batch size == 1, in general it's <code>B</code>)

-Speculative decoding [8] speeds this up by introducing a smaller draft LM. The draft proposes k tokens cheaply. But we don't ultimately want to sample from the smaller model — it's only there to guess candidate continuations. The large model still decides what's valid.
+Speculative decoding [8] speeds this up by introducing a smaller draft LM. The draft proposes <code>k</code> tokens cheaply. But we don't ultimately want to sample from the smaller model — it's only there to guess candidate continuations. The large model still decides what's valid.

Here are the steps:

1. <b>Draft</b>: run the small model on the current context and propose <code>k</code> tokens
2. <b>Verify</b>: run the large model once on context + <code>k</code> draft tokens. This produces probabilities for those <code>k</code> positions plus one extra (so we get <code>k+1</code> candidates)
3. <b>Accept/reject</b>: going from left to right over the <code>k</code> draft tokens:
-* If the large model's probability for the draft token ≥ the draft's probability, accept it
-* Otherwise, accept it with probability <code>p_large(token)/p_draft(token)</code>
-* Stop at the first rejection, or accept all <code>k</code> draft tokens.
-* If all <code>k</code> draft tokens are accepted, also sample the extra <code>(k+1)</code>-th token "for free" from the large model (we already computed that distribution).
-* If there was a rejection create a new rebalanced distribution at that position (<code>p_large - p_draft</code>, clamp min at 0, normalize to sum to 1) and sample the last token from it.
+<ul>
+<li>If the large model's probability for the draft token ≥ the draft's probability, accept it</li>
+<li>Otherwise, accept it with probability <code>p_large(token)/p_draft(token)</code></li>
+<li>Stop at the first rejection, or accept all <code>k</code> draft tokens</li>
+<ul>
+<li>If all <code>k</code> draft tokens are accepted, also sample the extra <code>(k+1)</code>-th token "for free" from the large model (we already computed that distribution)</li>
+<li>If there was a rejection create a new rebalanced distribution at that position (<code>p_large - p_draft</code>, clamp min at 0, normalize to sum to 1) and sample the last token from it</li>
+</ul>
+</ul>
+

<b>Why this works</b>: Although we use the small model to propose candidates, the accept/reject rule guarantees that in expectation the sequence is distributed exactly as if we had sampled token by token from the large model. This means speculative decoding is statistically equivalent to standard autoregressive decoding — but potentially much faster, since a single large-model pass can yield up to <code>k+1</code> tokens.

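The accept/reject rule from step 3 of this hunk is compact enough to sketch end to end. The snippet below is a toy rejection sampler, not vLLM's implementation: <code>p_draft</code> and <code>p_large</code> are assumed to be dense probability arrays over the vocabulary for the <code>k</code> draft positions (plus the extra position from the large model's verify pass), and <code>accept_reject</code> is a hypothetical helper name.

```python
import numpy as np

def accept_reject(draft_tokens, p_draft, p_large, rng=None):
    """Toy speculative-decoding accept/reject loop.

    draft_tokens: k proposed token ids from the draft model
    p_draft:  (k, vocab)   draft-model probabilities at each draft position
    p_large:  (k+1, vocab) large-model probabilities (one extra position)
    Returns the accepted token ids (between 1 and k+1 of them).
    """
    rng = rng or np.random.default_rng()
    accepted = []
    for i, tok in enumerate(draft_tokens):
        # Accept if p_large >= p_draft, otherwise with probability p_large/p_draft;
        # both cases collapse to accepting with probability min(1, ratio).
        if rng.random() < min(1.0, p_large[i, tok] / p_draft[i, tok]):
            accepted.append(tok)
        else:
            # Rejection: sample from the rebalanced residual distribution
            # max(p_large - p_draft, 0), renormalized to sum to 1, then stop.
            residual = np.maximum(p_large[i] - p_draft[i], 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            return accepted
    # All k drafts accepted: the (k+1)-th token comes "for free" from the
    # extra distribution the large model already computed.
    accepted.append(int(rng.choice(p_large.shape[1], p=p_large[-1])))
    return accepted
```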
@@ -955,7 +960,7 @@ We began with the basic engine core (<code>UniprocExecutor</code>), added advanc
vLLM also includes specialized handling that I've skipped. E.g.:

* <b>Custom hardware backends</b>: TPUs, AWS Neuron (Trainium/Inferentia), etc.
-* <b>Architectures/techniques</b>: <code>MLA</code>, <code>MoE</code>, encoder-decoder (e.g., Whisper), pooling/embedding models, <code>EPLB</code>, <code>m-RoPE</code>, <code>LoRA</coder>, <code>ALiBi</code>, attention-free variants, sliding-window attention, multimodal LMs, and state-space models (e.g., Mamba/Mamba-2, Jamba)
+* <b>Architectures/techniques</b>: <code>MLA</code>, <code>MoE</code>, encoder-decoder (e.g., Whisper), pooling/embedding models, <code>EPLB</code>, <code>m-RoPE</code>, <code>LoRA</code>, <code>ALiBi</code>, attention-free variants, sliding-window attention, multimodal LMs, and state-space models (e.g., Mamba/Mamba-2, Jamba)
* <b>TP/PP/SP</b>
* <b>Hybrid KV-cache logic</b> (Jenga), more complex sampling methods like beam sampling, and more
* <b>Experimental</b>: async scheduling
