3. [Scaling up](#from-uniprocexecutor-to-multiprocexecutor): from single-GPU to multi-GPU execution
24
+
4. [Serving layer](#distributed-system-serving-vllm): distributed / concurrent web scaffolding
25
+
5. [Benchmarks and auto-tuning](#benchmarks-and-auto-tuning---latency-vs-throughput): measuring latency and throughput
26
26
27
27
> [!NOTE]
28
28
> * Analysis is based on [commit 42172ad](https://github.com/vllm-project/vllm/tree/42172ad) (August 9th, 2025).
@@ -80,9 +80,9 @@ Let's start analyzing the constructor.
80
80
The main components of the engine are:
81
81
82
82
* vLLM config (contains all of the knobs for configuring model, cache, parallelism, etc.)
83
-
* processor (turns raw inputs → EngineCoreRequests via validation, tokenization, and processing)
84
-
* engine core client (in our running example we're using InprocClient which is basically == EngineCore; we'll gradually build up to DPLBAsyncMPClient which allows serving at scale)
85
-
* output processor (converts raw EngineCoreOutputs → RequestOutput that the user sees)
83
+
* processor (turns raw inputs → <code>EngineCoreRequests</code> via validation, tokenization, and processing)
84
+
* engine core client (in our running example we're using <code>InprocClient</code> which is basically == <code>EngineCore</code>; we'll gradually build up to <code>DPLBAsyncMPClient</code> which allows serving at scale)
85
+
* output processor (converts raw <code>EngineCoreOutputs</code> → <code>RequestOutput</code> that the user sees)
86
86
> [!NOTE]
87
87
> With the V0 engine being deprecated, class names and details may shift. I'll emphasize the core ideas rather than exact signatures. I'll abstract away some but not all of those details.
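
To make the data flow between the components listed above concrete, here is a minimal toy sketch in Python. It is not vLLM code: every class and function name (`ToyProcessor`, `ToyEngineCore`, `ToyOutputProcessor`, `run_pipeline`) is hypothetical, a whitespace split stands in for real tokenization, and the "engine" just reverses the prompt tokens instead of running a model. It only illustrates the shape raw input → engine core request → engine core output → user-facing output.

```python
from dataclasses import dataclass


@dataclass
class ToyEngineCoreRequest:
    # Hypothetical stand-in for the request object handed to the engine core.
    request_id: str
    prompt_token_ids: list[int]


@dataclass
class ToyEngineCoreOutput:
    # Hypothetical stand-in for the raw output produced by the engine core.
    request_id: str
    new_token_ids: list[int]


@dataclass
class ToyRequestOutput:
    # Hypothetical stand-in for the result the user sees.
    request_id: str
    text: str


class ToyProcessor:
    """Validates and tokenizes raw input (assumption: whitespace tokenization)."""

    def __init__(self, vocab: dict[str, int]):
        self.vocab = vocab

    def process(self, request_id: str, prompt: str) -> ToyEngineCoreRequest:
        if not prompt.strip():
            raise ValueError("prompt must not be empty")
        token_ids = [self.vocab[word] for word in prompt.split()]
        return ToyEngineCoreRequest(request_id, token_ids)


class ToyEngineCore:
    """Pretends to generate by returning the prompt tokens in reverse order."""

    def step(self, request: ToyEngineCoreRequest) -> ToyEngineCoreOutput:
        return ToyEngineCoreOutput(request.request_id, request.prompt_token_ids[::-1])


class ToyOutputProcessor:
    """Detokenizes raw engine output into a user-facing result."""

    def __init__(self, vocab: dict[str, int]):
        self.inverse_vocab = {token_id: word for word, token_id in vocab.items()}

    def process(self, output: ToyEngineCoreOutput) -> ToyRequestOutput:
        text = " ".join(self.inverse_vocab[t] for t in output.new_token_ids)
        return ToyRequestOutput(output.request_id, text)


def run_pipeline(prompt: str, vocab: dict[str, int]) -> ToyRequestOutput:
    # Wire the three stages together in process, like an in-process client would.
    processor = ToyProcessor(vocab)
    engine_core = ToyEngineCore()
    output_processor = ToyOutputProcessor(vocab)

    request = processor.process("req-0", prompt)
    raw_output = engine_core.step(request)
    return output_processor.process(raw_output)


if __name__ == "__main__":
    print(run_pipeline("hello world", {"hello": 0, "world": 1}))
```

Keeping the stages behind separate objects is what would let the middle one be swapped for a client that talks to a separate process, without touching input processing or output conversion.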