The engine core itself is made up of several subcomponents:
* Model Executor (drives forward passes on the model, we're currently dealing with <code>UniProcExecutor</code> which has a single <code>Worker</code> process on a single GPU). We'll gradually build up to <code>MultiProcExecutor</code> which supports multiple GPUs
* Structured Output Manager (used for guided decoding - we'll cover this later)
* Scheduler (decides which requests go into the next engine step) - it further contains:
<ol type="a">
<li>policy setting - it can be either <b>FCFS</b> (first come first served) or <b>priority</b> (higher priority requests are served first)</li>
<li><code>waiting</code> and <code>running</code> queues</li>
<li>KV cache manager - the heart of paged attention [3]</li>
</ol>
The KV-cache manager maintains a <code>free_block_queue</code> - a pool of available KV-cache blocks (often on the order of hundreds of thousands, depending on VRAM size and block size). During paged attention, the blocks serve as the indexing structure that maps tokens to their computed KV cache blocks.
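To make this concrete, here's a minimal sketch of a block pool with a free-block queue (the class and attribute names are illustrative; vLLM's actual implementation uses its own doubly linked list and tracks much more per-block metadata):

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class KVCacheBlock:
    block_id: int        # index into the physical KV-cache tensor
    ref_cnt: int = 0     # how many requests currently use this block

class BlockPool:
    def __init__(self, num_blocks: int):
        # All blocks start out free; allocation just pops from the left.
        self.free_block_queue = deque(KVCacheBlock(i) for i in range(num_blocks))
        self.req_to_blocks: dict[str, list[KVCacheBlock]] = {}

    def allocate(self, request_id: str, n: int) -> list[KVCacheBlock]:
        if n > len(self.free_block_queue):
            raise RuntimeError("not enough free KV-cache blocks")
        blocks = [self.free_block_queue.popleft() for _ in range(n)]
        for b in blocks:
            b.ref_cnt += 1
        self.req_to_blocks.setdefault(request_id, []).extend(blocks)
        return blocks

    def free(self, request_id: str) -> None:
        # Return a finished request's blocks to the pool.
        for b in self.req_to_blocks.pop(request_id, []):
            b.ref_cnt -= 1
            if b.ref_cnt == 0:
                self.free_block_queue.append(b)
```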
> [!NOTE]
> Block size for a standard transformer layer (non-MLA [4]) is computed as follows:
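> A rough sketch of that computation, assuming a standard multi-head/GQA attention layer (the variable names are illustrative):
>
> ```python
> # bytes needed to cache K and V for one block of tokens, for a single layer
> kv_cache_block_bytes = (
>     2                  # one set of vectors for K, one for V
>     * block_size       # tokens per block (16 by default)
>     * num_kv_heads     # number of KV heads (can be fewer than query heads with GQA)
>     * head_size        # per-head hidden dimension
>     * dtype_num_bytes  # e.g. 2 for bf16
> )
> ```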
During model executor construction, a <code>Worker</code> object is created, and three key procedures are executed. (Later, with <code>MultiProcExecutor</code>, these same procedures run independently on each worker process across different GPUs.)
1. Init device:
* Assign a CUDA device (e.g. "cuda:0") to the worker and check that the model dtype is supported (e.g. bf16)
* Verify enough VRAM is available, given the requested <code>gpu_memory_utilization</code> (e.g. 0.8 → 80% of total VRAM)
* Set up distributed settings (DP / TP / PP / EP, etc.)
* Instantiate a <code>model_runner</code> (holds the sampler, KV cache, and forward-pass buffers such as <code>input_ids</code>, <code>positions</code>, etc.)
* Instantiate an <code>InputBatch</code> object (holds CPU-side forward-pass buffers, block tables for KV-cache indexing, sampling metadata, etc.)
2. Load model:
* Instantiate the model architecture
* Load the model weights
* Optional: call torch.compile() on the model
3. Initialize KV cache:
* Get per-layer KV-cache spec. Historically this was always <code>FullAttentionSpec</code> (homogeneous transformer), but with hybrid models (sliding window, Transformer/SSM like Jamba) it became more complex (see Jenga [5])
* Run a dummy/profiling forward pass and take a GPU memory snapshot to compute how many KV cache blocks fit in available VRAM
* Allocate, reshape and bind KV cache tensors to attention layers
* Prepare attention metadata (e.g. setting the backend to FlashAttention) that is later consumed by kernels during the forward pass
* Unless <code>--enforce-eager</code> is provided, do a dummy run for each of the warmup batch sizes and capture CUDA graphs. CUDA graphs record the whole sequence of GPU work into a DAG. Later, during the forward pass, we launch/replay the pre-baked graphs, cutting kernel launch overhead and thus improving latency.
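To give a feel for the capture/replay mechanism vLLM builds on, here's a minimal standalone PyTorch sketch (a toy model and a single fixed batch shape, not vLLM's actual capture code):

```python
import torch

model = torch.nn.Linear(512, 512).cuda().eval()
static_input = torch.zeros(8, 512, device="cuda")   # one static buffer per captured batch size

# Warm up on a side stream so lazy initialization isn't recorded into the graph
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture: record every kernel launch for this batch shape into a single graph
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = model(static_input)

# Replay: copy fresh data into the static buffer and relaunch the whole DAG at once,
# paying one launch overhead instead of one per kernel
static_input.copy_(torch.randn(8, 512, device="cuda"))
graph.replay()
torch.cuda.synchronize()
print(static_output[0, :4])
```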
I've abstracted away many low-level details here — but these are the core pieces I'll introduce now, since I'll reference them repeatedly in the following sections.
Now that we have the engine initialized, let's proceed to the <code>generate</code> function.
### Generate function
The first step is to validate and feed requests into the engine. For each prompt we:
1. Create a unique request ID and capture its arrival time
2. Call an input preprocessor that tokenizes the prompt and returns a dictionary containing <code>prompt</code>, <code>prompt_token_ids</code>, and a <code>type</code> (text, tokens, embeds, etc.)
3. Pack this info into an <code>EngineCoreRequest</code>, adding priority, sampling params, and other metadata
4. Pass the request into the engine core, which wraps it in a <code>Request</code> object and sets its status to <code>WAITING</code>. This request is then added to the scheduler's <code>waiting</code> queue (append if FCFS, or heap-push if priority)
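As a small illustration of that last step, here's roughly what the two queue policies mean for the <code>waiting</code> queue (a stripped-down stand-in for the request object; field and function names are illustrative):

```python
import heapq
import time
from collections import deque
from dataclasses import dataclass, field

@dataclass(order=True)
class Req:
    # Ordered by (priority, arrival_time); lower values are served first.
    priority: int
    arrival_time: float
    request_id: str = field(compare=False, default="")

fcfs_waiting: deque[Req] = deque()   # FCFS policy: plain FIFO, O(1) append/popleft
prio_waiting: list[Req] = []         # priority policy: min-heap

def add_request(req: Req, policy: str) -> None:
    if policy == "fcfs":
        fcfs_waiting.append(req)
    else:
        heapq.heappush(prio_waiting, req)

add_request(Req(priority=0, arrival_time=time.monotonic(), request_id="req-0"), "priority")
```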
At this point the engine has been fed and execution can begin. In the synchronous engine example, these initial prompts are the only ones we'll process — there's no mechanism to inject new requests mid-run. In contrast, the asynchronous engine supports this (aka <b>continuous batching</b> [6]): after each step, both new and old requests are considered.
> [!NOTE]
> Because the forward pass flattens the batch into a single sequence and custom kernels handle it efficiently, continuous batching is fundamentally supported even in the synchronous engine.
Next, as long as there are requests to process, the engine repeatedly calls its <code>step()</code> function. Each step has three stages:
1. Schedule: select which requests to run in this step (decode, and/or (chunked) prefill)
2. Forward pass: run the model and sample tokens
3. Postprocess: append sampled token IDs to each <code>Request</code>, detokenize, and check stop conditions. If a request is finished, clean up (e.g. return its KV-cache blocks to <code>free_block_queue</code>) and return the output early
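Tying the three stages together, the synchronous generation loop has roughly this shape (a schematic sketch with a toy engine, not vLLM's actual classes):

```python
from dataclasses import dataclass

@dataclass
class RequestOutput:          # illustrative stand-in for a per-request output object
    request_id: str
    text: str
    finished: bool

class ToyEngine:
    """Illustrative stand-in for the synchronous engine; step() fakes one iteration."""
    def __init__(self, prompts: list[str]):
        self.pending = {f"req-{i}": p for i, p in enumerate(prompts)}

    def has_unfinished_requests(self) -> bool:
        return bool(self.pending)

    def step(self) -> list[RequestOutput]:
        # Real engine: schedule -> forward pass -> postprocess; here we just finish one request.
        rid, prompt = self.pending.popitem()
        return [RequestOutput(rid, prompt + " ...", finished=True)]

engine = ToyEngine(["Hello", "What is the capital of France?"])
outputs = []
while engine.has_unfinished_requests():
    for out in engine.step():        # each step may finish zero or more requests
        if out.finished:
            outputs.append(out)
print([o.request_id for o in outputs])
```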
> [!NOTE]
> Stop conditions are:
> * The request exceeds its length limit (<code>max_model_length</code> or its own <code>max_tokens</code>)
> * The sampled token is the EOS ID (unless <code>ignore_eos</code> is enabled → useful for benchmarking when we want to force generation of a certain number of output tokens)
> * The sampled token matches any of the <code>stop_token_ids</code> specified in the sampling parameters
> * Stop strings are present in the output - we truncate the output at the first stop-string occurrence and abort the request in the engine (note that <code>stop_token_ids</code> will be present in the output but stop strings will not).
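Sketched as code, the stop check could look something like this (a hypothetical helper over a stripped-down request object, not vLLM's actual implementation):

```python
from dataclasses import dataclass, field

@dataclass
class Req:                      # illustrative stand-in carrying just the fields we need
    output_token_ids: list[int] = field(default_factory=list)
    num_tokens: int = 0
    max_tokens: int = 256
    max_model_len: int = 8192
    ignore_eos: bool = False
    eos_token_id: int = 2
    stop_token_ids: tuple[int, ...] = ()
    stop_strings: tuple[str, ...] = ()
    output_text: str = ""

def check_stop(req: Req, new_token_id: int) -> str | None:
    """Return a finish reason if the request should stop, else None."""
    if len(req.output_token_ids) >= req.max_tokens or req.num_tokens >= req.max_model_len:
        return "length"                       # 1. length limits
    if not req.ignore_eos and new_token_id == req.eos_token_id:
        return "stop"                         # 2. EOS token
    if new_token_id in req.stop_token_ids:
        return "stop"                         # 3. explicit stop token IDs
    for s in req.stop_strings:                # 4. stop strings: truncate and abort
        if s in req.output_text:
            req.output_text = req.output_text[: req.output_text.index(s)]
            return "stop"
    return None
```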
Next, we'll examine scheduling in more detail.
There are two main types of workloads an inference engine handles:
1. <b>Prefill</b> requests — a forward pass over all prompt tokens. These are usually <b>compute-bound</b> (threshold depends on hardware and prompt length). At the end, we sample a single token from the probability distribution of the final token's position.
2. <b>Decode</b> requests — a forward pass over just the most recent token. All earlier KV vectors are already cached. These are <b>memory-bandwidth-bound</b>, since we still need to load all LLM weights (and KV caches) just to compute one token.
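A quick back-of-the-envelope calculation shows why decode ends up memory-bandwidth-bound (the model size and GPU peak numbers below are rough, illustrative figures):

```python
params = 8e9                                  # model parameters
bytes_per_param = 2                           # bf16
weight_bytes = params * bytes_per_param       # ~16 GB streamed from HBM per decode step
flops_per_token = 2 * params                  # ~2 FLOPs per parameter per generated token

# Arithmetic intensity at batch size 1: FLOPs performed per byte moved
intensity = flops_per_token / weight_bytes    # ~1 FLOP/byte

# A modern data-center GPU sustains on the order of hundreds of FLOPs per byte of
# HBM bandwidth (e.g. ~1000 TFLOP/s bf16 vs ~3 TB/s HBM), so single-token decode sits
# far below that balance point: the GPU waits on memory, not on math.
machine_balance = 1000e12 / 3e12              # ~333 FLOPs/byte
print(intensity, machine_balance)
```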
> [!NOTE]
> In the [benchmarking section](#benchmarks-and-auto-tuning---latency-vs-throughput) we'll analyze the so-called roofline model of GPU performance, which explains these prefill/decode performance profiles in more detail.
The V1 scheduler can mix both types of requests in the same step, thanks to smarter design choices. In contrast, the V0 engine could only process either prefill or decode at once.
The scheduler prioritizes decode requests — i.e. those already in the <code>running</code> queue. For each such request it:
1. Computes the number of new tokens to generate (not always 1, due to speculative decoding and async scheduling — more on that later).
2. Calls the KV-cache manager's <code>allocate_slots</code> function (details below).
3. Updates the token budget by subtracting the number of tokens from step 1.
After that, it processes prefill requests from the <code>waiting</code> queue. For each request it:
1. Retrieves the number of computed blocks (returns 0 if prefix caching is disabled — we'll cover that later).
2. Calls the KV-cache manager's <code>allocate_slots</code> function.
3. Pops the request from <code>waiting</code> and moves it to <code>running</code>, setting its status to <code>RUNNING</code>.
4. Updates the token budget.
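Putting the decode and prefill handling together, one scheduling pass looks roughly like this (a heavily simplified sketch; <code>num_new_tokens</code>, <code>num_prompt_tokens_left</code> and the <code>kv_mgr.allocate_slots</code> signature are illustrative, and preemption, prefix caching, etc. are omitted):

```python
from collections import deque

def schedule_step(running: list, waiting: deque, kv_mgr, token_budget: int) -> list:
    """One scheduling pass: decode requests first, then prefills, within a token budget."""
    scheduled = []

    # 1) Decode requests: those already in the running queue
    for req in running:
        num_new = req.num_new_tokens()                    # usually 1; more with spec decoding
        if num_new > token_budget:
            break
        if kv_mgr.allocate_slots(req, num_new) is None:   # not enough KV blocks
            break                                         # (the real scheduler may preempt here)
        token_budget -= num_new
        scheduled.append(req)

    # 2) Prefill requests from the waiting queue
    while waiting and token_budget > 0:
        req = waiting[0]
        num_new = min(req.num_prompt_tokens_left(), token_budget)   # chunked prefill
        if kv_mgr.allocate_slots(req, num_new) is None:
            break
        waiting.popleft()
        req.status = "RUNNING"
        running.append(req)
        token_budget -= num_new
        scheduled.append(req)

    return scheduled
```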
Let's now look at what <code>allocate_slots</code> does. It:
1. <b>Computes number of blocks</b> — determines how many new KV-cache blocks (<code>n</code>) must be allocated. Each block stores 16 tokens by default. For example, if a prefill request has 17 new tokens, we need <code>ceil(17/16) = 2</code> blocks.
2. <b>Checks availability</b> — if there aren't enough blocks in the manager's pool, exit early. Depending on whether it's a decode or prefill request, the engine may attempt recompute preemption (swap preemption was supported in V0) by evicting low-priority requests (calling <code>kv_cache_manager.free</code>, which returns KV blocks to the block pool), or it might skip scheduling and continue execution.
3. <b>Allocates blocks</b> — via the KV-cache manager's coordinator, fetches the first <code>n</code> blocks from the block pool (the <code>free_block_queue</code> doubly linked list mentioned earlier). Stores to <code>req_to_blocks</code>, the dictionary mapping each <code>request_id</code> to its list of KV-cache blocks.
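And here's a correspondingly simplified, self-contained sketch of what <code>allocate_slots</code> does (illustrative, not vLLM's actual code):

```python
import math
from collections import deque

BLOCK_SIZE = 16          # tokens per KV-cache block (the default)

def allocate_slots(req_id: str, num_new_tokens: int, num_cached_tokens: int,
                   free_block_queue: deque, req_to_blocks: dict) -> list | None:
    """Return the newly allocated block IDs, or None if there aren't enough free blocks."""
    # 1. How many new blocks do we need? e.g. 17 new tokens -> ceil(17/16) = 2 blocks
    total_tokens = num_cached_tokens + num_new_tokens
    blocks_needed = math.ceil(total_tokens / BLOCK_SIZE) - len(req_to_blocks.get(req_id, []))

    # 2. Check availability; the real engine may preempt lower-priority requests here
    if blocks_needed > len(free_block_queue):
        return None

    # 3. Pop the first n blocks from the free queue and record them for this request
    new_blocks = [free_block_queue.popleft() for _ in range(blocks_needed)]
    req_to_blocks.setdefault(req_id, []).extend(new_blocks)
    return new_blocks

# Usage: a fresh prefill request with 17 prompt tokens takes 2 of the free blocks
free_q = deque(range(8))
mapping: dict[str, list[int]] = {}
print(allocate_slots("req-0", 17, 0, free_q, mapping))   # -> [0, 1]
```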
We're finally ready to do a forward pass!
### Run forward pass
We call the model executor's <code>execute_model</code>, which delegates to the <code>Worker</code>, which in turn delegates to the model runner.