2 changes: 1 addition & 1 deletion docs/configuration/conserving_memory.md
@@ -53,7 +53,7 @@ llm = LLM(model="adept/fuyu-8b",
By default, we optimize model inference using CUDA graphs which take up extra memory in the GPU.

!!! warning
CUDA graph capture takes up more memory in V1 than in V0.
CUDA graph capture increases GPU memory usage. Adjust capture sizes if you need to conserve memory.

You can adjust `compilation_config` to achieve a better balance between inference speed and memory usage:
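
For instance, here is a minimal sketch; the `cudagraph_capture_sizes` field name and accepted values are assumptions to verify against your installed vLLM release:

```python
from vllm import LLM

# Sketch: capture CUDA graphs only for a few small batch sizes to trade some
# inference speed for lower GPU memory usage. The "cudagraph_capture_sizes"
# key is an assumption; check the CompilationConfig of your vLLM version.
llm = LLM(
    model="adept/fuyu-8b",
    compilation_config={"cudagraph_capture_sizes": [1, 2, 4, 8]},
)
```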

4 changes: 2 additions & 2 deletions docs/configuration/optimization.md
@@ -33,7 +33,7 @@ In vLLM V1, the default preemption mode is `RECOMPUTE` rather than `SWAP`, as re

Chunked prefill allows vLLM to process large prefills in smaller chunks and batch them together with decode requests. This feature helps improve both throughput and latency by better balancing compute-bound (prefill) and memory-bound (decode) operations.

In vLLM V1, **chunked prefill is always enabled by default**. This is different from vLLM V0, where it was conditionally enabled based on model characteristics.
In vLLM V1, **chunked prefill is always enabled by default** so that behavior is consistent across supported models.
Suggested change
In vLLM V1, **chunked prefill is always enabled by default** so that behavior is consistent across supported models.
In vLLM V1, **chunked prefill is always enabled by default**.


With chunked prefill enabled, the scheduling policy prioritizes decode requests. It batches all pending decode requests before scheduling any prefill operations. When there are available tokens in the `max_num_batched_tokens` budget, it schedules pending prefills. If a pending prefill request cannot fit into `max_num_batched_tokens`, it automatically chunks it.

@@ -49,7 +49,7 @@ You can tune the performance by adjusting `max_num_batched_tokens`:
- Smaller values (e.g., 2048) achieve better inter-token latency (ITL) because there are fewer prefills slowing down decodes.
- Higher values achieve better time to first token (TTFT) as you can process more prefill tokens in a batch.
- For optimal throughput, we recommend setting `max_num_batched_tokens > 8192` especially for smaller models on large GPUs.
- If `max_num_batched_tokens` is the same as `max_model_len`, that's almost the equivalent to the V0 default scheduling policy (except that it still prioritizes decodes).
- If `max_num_batched_tokens` is the same as `max_model_len`, the scheduler behaves similarly to the legacy policy where large prefills ran without chunking (while still prioritizing decodes).

```python
from vllm import LLM
3 changes: 1 addition & 2 deletions docs/contributing/model/basic.md
@@ -133,8 +133,7 @@ We consider 3 different scenarios:
For case (1), we recommend looking at the implementation of [`MambaForCausalLM`](gh-file:vllm/model_executor/models/mamba.py) (for Mamba-1) or [`Mamba2ForCausalLM`](gh-file:vllm/model_executor/models/mamba2.py) (for Mamba-2) as a reference.
The model should inherit protocol `IsAttentionFree` and also implement class methods `get_mamba_state_dtype_from_config` and `get_mamba_state_shape_from_config` to calculate the state shapes and data types from the config.
For the mamba layers themselves, please use the [`MambaMixer`](gh-file:vllm/model_executor/layers/mamba/mamba_mixer.py) (for Mamba-1) or [`MambaMixer2`](gh-file:vllm/model_executor/layers/mamba/mamba_mixer2.py) (for Mamba-2) classes.
Please *do not* use the `MambaCacheManager` (deprecated in V1) or replicate any of the V0-specific code paths in the existing model implementations.
V0-only classes and code will be removed in the very near future.
Please avoid reintroducing legacy cache managers such as `MambaCacheManager` or any previously removed code paths from older implementations.
The model should also be added to the `MODELS_CONFIG_MAP` dictionary in <gh-file:vllm/model_executor/models/config.py> to ensure that the runtime defaults are optimized.
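
As a rough orientation, the skeleton below sketches these pieces; the import path for `IsAttentionFree` and the method signatures are assumptions, so follow the referenced `MambaForCausalLM`/`Mamba2ForCausalLM` implementations for the real interfaces:

```python
# Minimal sketch only -- not a drop-in implementation.
import torch.nn as nn

# Assumed import path; verify against the vLLM source tree.
from vllm.model_executor.models.interfaces import IsAttentionFree


class MyMambaForCausalLM(nn.Module, IsAttentionFree):
    """Sketch of an attention-free Mamba-style model's state-config hooks."""

    @classmethod
    def get_mamba_state_dtype_from_config(cls, vllm_config):
        # Derive the dtypes of the conv/SSM state tensors from the model config.
        ...

    @classmethod
    def get_mamba_state_shape_from_config(cls, vllm_config):
        # Derive the per-layer state shapes (conv state, SSM state) from the config.
        ...
```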

For case (2), we recommend using as a reference the implementation of [`JambaForCausalLM`](gh-file:vllm/model_executor/models/jamba.py) (for an example of a model that uses Mamba-1 and attention together) or [`BambaForCausalLM`](gh-file:vllm/model_executor/models/bamba.py) (for an example of a model that uses Mamba-2 and attention together).
64 changes: 21 additions & 43 deletions docs/design/metrics.md
There are probably some mistakes here. @markmc PTAL

@@ -1,12 +1,12 @@
# Metrics

Ensure the v1 LLM Engine exposes a superset of the metrics available in v0.
vLLM exposes a rich set of metrics to support observability and capacity planning for the V1 engine.

## Objectives

- Achieve parity of metrics between v0 and v1.
- The priority use case is accessing these metrics via Prometheus, as this is what we expect to be used in production environments.
- Logging support (i.e. printing metrics to the info log) is provided for more ad-hoc testing, debugging, development, and exploratory use cases.
- Provide comprehensive coverage of engine and request level metrics to aid production monitoring.
- Prioritize Prometheus integrations, as this is what we expect to be used in production environments.
- Offer logging support (i.e. printing metrics to the info log) for ad-hoc testing, debugging, development, and exploratory use cases.

## Background

@@ -17,9 +17,9 @@ Metrics in vLLM can be categorized as follows:

The mental model is that server-level metrics help explain the values of request-level metrics.

### v0 Metrics
### Metrics Overview

In v0, the following metrics are exposed via a Prometheus-compatible `/metrics` endpoint using the `vllm:` prefix:
The following metrics are exposed via a Prometheus-compatible `/metrics` endpoint using the `vllm:` prefix and are documented under [Inferencing and Serving -> Production Metrics](../usage/metrics.md):

- `vllm:num_requests_running` (Gauge)
- `vllm:num_requests_swapped` (Gauge)
@@ -57,8 +57,6 @@ In v0, the following metrics are exposed via a Prometheus-compatible `/metrics`
- `vllm:spec_decode_num_draft_tokens_total` (Counter)
- `vllm:spec_decode_num_emitted_tokens_total` (Counter)

These are documented under [Inferencing and Serving -> Production Metrics](../usage/metrics.md).
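
A quick way to eyeball these from a running server is to scrape the endpoint directly; the sketch below assumes an OpenAI-compatible server listening on `localhost:8000`:

```python
import requests

# Fetch the Prometheus exposition text and keep only vLLM's own series.
metrics_text = requests.get("http://localhost:8000/metrics", timeout=10).text
for line in metrics_text.splitlines():
    if line.startswith("vllm:"):
        print(line)
```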

### Grafana Dashboard

vLLM also provides [a reference example](../examples/online_serving/prometheus_grafana.md) for how to collect and store these metrics using Prometheus and visualize them using a Grafana dashboard.
@@ -86,7 +84,7 @@ See [the PR which added this Dashboard](gh-pr:2316) for interesting and useful b

Prometheus support was initially added [using the aioprometheus library](gh-pr:1890), but a switch was made quickly to [prometheus_client](gh-pr:2730). The rationale is discussed in both linked PRs.

With the switch to `aioprometheus`, we lost a `MetricsMiddleware` to track HTTP metrics, but this was reinstated [using prometheus_fastapi_instrumentator](gh-pr:15657):
During those migrations we briefly lost a `MetricsMiddleware` to track HTTP metrics, but this was reinstated [using prometheus_fastapi_instrumentator](gh-pr:15657):

```bash
$ curl http://0.0.0.0:8000/metrics 2>/dev/null | grep -P '^http_(?!.*(_bucket|_created|_sum)).*'
@@ -97,10 +95,6 @@ http_request_duration_highr_seconds_count 201.0
http_request_duration_seconds_count{handler="/v1/completions",method="POST"} 201.0
```

### Multi-process Mode

In v0, metrics are collected in the engine core process and we use multiprocess mode to make them available in the API server process. See <gh-pr:7279>.

### Built in Python/Process Metrics

The following metrics are supported by default by `prometheus_client`, but they are not exposed when multiprocess mode is used:
@@ -116,22 +110,7 @@ The following metrics are supported by default by `prometheus_client`, but they
- `process_open_fds`
- `process_max_fds`

This is relevant because if we move away from multiprocess mode in v1,
we get these back. However, it's questionable how relevant these are
if they don't aggregate these stats for all processes that make up a
vLLM instance.

### v0 PRs and Issues

For background, these are some of the relevant PRs which added the v0 metrics:

- <gh-pr:1890>
- <gh-pr:2316>
- <gh-pr:2730>
- <gh-pr:4464>
- <gh-pr:7279>

Also note the ["Even Better Observability"](gh-issue:3616) feature where e.g. [a detailed roadmap was laid out](gh-issue:3616#issuecomment-2030858781).
This is relevant because if we move away from multiprocess mode we get these back. However, it's questionable how relevant these are if they don't aggregate these stats for all processes that make up a vLLM instance.
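
For reference, here is a minimal sketch of how `prometheus_client` multiprocess aggregation works; the directory path is an arbitrary example and must match whatever the worker processes were configured to use:

```python
import os

from prometheus_client import CollectorRegistry, generate_latest, multiprocess

# PROMETHEUS_MULTIPROC_DIR must point at the directory the worker processes
# wrote their metric files into; "/tmp/prom_metrics" is only an example path.
os.environ.setdefault("PROMETHEUS_MULTIPROC_DIR", "/tmp/prom_metrics")

registry = CollectorRegistry()
multiprocess.MultiProcessCollector(registry)  # aggregate series across processes
print(generate_latest(registry).decode())
```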

## v1 Design

@@ -396,9 +375,8 @@ recent metric is used, but only from currently running processes.

This was added in <gh-pr:9477> and there is
[at least one known user](https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/54).
If we revisit this design and deprecate the old metric, we should reduce
the need for a significant deprecation period by making the change in
v0 also and asking this project to move to the new metric.
If we revisit this design and deprecate the old metric, we should
coordinate with downstream users so they can migrate before the removal.

### Prefix Cache metrics

@@ -491,7 +469,7 @@ if seq_group.is_finished():

This seems duplicative, and one of them should be removed. The latter
is used by the Grafana dashboard, so we should deprecate or remove the
former from v0.
former.

### Prefix Cache Hit Rate

@@ -500,7 +478,7 @@ See above - we now expose 'queries' and 'hits' counters rather than a

### KV Cache Offloading

Two v0 metrics relate to a "swapped" preemption mode that is no
Two legacy metrics relate to a "swapped" preemption mode that is no
longer relevant in v1:

- `vllm:num_requests_swapped`
@@ -511,7 +489,7 @@ cache to complete other requests), we swap kv cache blocks out to CPU
memory. This is also known as "KV cache offloading" and is configured
with `--swap-space` and `--preemption-mode`.

In v0, [vLLM has long supported beam search](gh-issue:6226). The
Historically, [vLLM has long supported beam search](gh-issue:6226). The
SequenceGroup encapsulated the idea of N Sequences which
all shared the same prompt kv blocks. This enabled KV cache block
sharing between requests, and copy-on-write to do branching. CPU
@@ -524,7 +502,7 @@ and the part of the prompt that was evicted can be recomputed.

SequenceGroup was removed in V1, although a replacement will be
required for "parallel sampling" (`n>1`).
[Beam search was moved out of the core (in V0)](gh-issue:8306). There was a
[Beam search was moved out of the core](gh-issue:8306). There was a
lot of complex code for a very uncommon feature.

In V1, with prefix caching being better (zero overhead) and therefore
@@ -535,7 +513,7 @@ better.

### Parallel Sampling

Some v0 metrics are only relevant in the context of "parallel
Some legacy metrics are only relevant in the context of "parallel
sampling". This is where the `n` parameter in a request is used to
request multiple completions from the same prompt.

@@ -554,7 +532,7 @@ also add these metrics.

### Speculative Decoding

Some v0 metrics are specific to "speculative decoding". This is where
Some legacy metrics are specific to "speculative decoding". This is where
we generate candidate tokens using a faster, approximate method or
model and then validate those tokens with the larger model.

@@ -566,7 +544,7 @@ model and then validate those tokens with the larger model.

There is a PR under review (<gh-pr:12193>) to add "prompt lookup (ngram)"
speculative decoding to v1. Other techniques will follow. We should
revisit the v0 metrics in this context.
revisit these metrics in this context.

!!! note
We should probably expose acceptance rate as separate accepted
@@ -639,7 +617,7 @@ metrics are often relatively straightforward to add:
metrics are usually of very limited use unless they can be enabled
by default and in production.
3. They have an impact on development and maintenance of the
project. Every metric added to v0 has made this v1 effort more
project. Every metric added over time has made this effort more
time-consuming, and perhaps not all metrics justify this ongoing
investment in their maintenance.

@@ -650,7 +628,7 @@ performance and health. Tracing, on the other hand, tracks individual
requests as they move through different services and components. Both
fall under the more general heading of "Observability".

v0 has support for OpenTelemetry tracing:
vLLM has support for OpenTelemetry tracing:

- Added by <gh-pr:4687>
- Configured with `--otlp-traces-endpoint` and `--collect-detailed-traces`
@@ -663,11 +641,11 @@ OpenTelemetry has a
[Gen AI Working Group](https://github.com/open-telemetry/community/blob/main/projects/gen-ai.md).

Since metrics is a big enough topic on its own, we are going to tackle
the topic of tracing in v1 separately.
the topic of tracing separately.

### OpenTelemetry Model Forward vs Execute Time

In v0, we have the following two metrics:
The current implementation exposes the following two metrics:

- `vllm:model_forward_time_milliseconds` (Histogram) - The time spent
in the model forward pass when this request was in the batch.
24 changes: 0 additions & 24 deletions docs/design/multiprocessing.md
@njhill I guess this page can use a full clean up

@@ -60,30 +60,6 @@ Multiple vLLM dependencies indicate either a preference or requirement for using
It is perhaps more accurate to say that there are known problems with using
`fork` after initializing these dependencies.
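
As a generic, standard-library illustration (not vLLM-specific code), `spawn` starts workers in a fresh interpreter so they do not inherit such state:

```python
import multiprocessing as mp


def worker() -> None:
    print("worker started in a fresh interpreter")


if __name__ == "__main__":
    # "spawn" launches a brand-new Python process, so the child does not inherit
    # already-initialized state (e.g. a CUDA context) the way "fork" would.
    ctx = mp.get_context("spawn")
    proc = ctx.Process(target=worker)
    proc.start()
    proc.join()
```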

## Current State (v0)

The environment variable `VLLM_WORKER_MULTIPROC_METHOD` can be used to control which method is used by vLLM. The current default is `fork`.

- <https://github.com/vllm-project/vllm/blob/d05f88679bedd73939251a17c3d785a354b2946c/vllm/envs.py#L339-L342>

When we know we own the process because the `vllm` command was used, we use
`spawn` because it's the most widely compatible.

- <https://github.com/vllm-project/vllm/blob/d05f88679bedd73939251a17c3d785a354b2946c/vllm/scripts.py#L123-L140>

The `multiproc_xpu_executor` forces the use of `spawn`.

- <https://github.com/vllm-project/vllm/blob/d05f88679bedd73939251a17c3d785a354b2946c/vllm/executor/multiproc_xpu_executor.py#L14-L18>

There are other miscellaneous places hard-coding the use of `spawn`:

- <https://github.com/vllm-project/vllm/blob/d05f88679bedd73939251a17c3d785a354b2946c/vllm/distributed/device_communicators/all_reduce_utils.py#L135>
- <https://github.com/vllm-project/vllm/blob/d05f88679bedd73939251a17c3d785a354b2946c/vllm/entrypoints/openai/api_server.py#L184>

Related PRs:

- <gh-pr:8823>

## Prior State in v1

There was an environment variable to control whether multiprocessing is used in
5 changes: 1 addition & 4 deletions docs/design/prefix_caching.md
@@ -94,9 +94,6 @@ To improve privacy in shared environments, vLLM supports isolating prefix cache

With this setup, cache sharing is limited to users or requests that explicitly agree on a common salt, enabling cache reuse within a trust group while isolating others.
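
As an illustration, a request might opt into a shared cache namespace like this; the `cache_salt` field name is an assumption to verify against the prefix-caching documentation of your vLLM version:

```python
import requests

# Hypothetical sketch: requests that send the same salt can share cached prefix
# blocks with each other, while requests with a different (or no) salt cannot.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # example model name
    "prompt": "Shared system preamble ...",
    "max_tokens": 32,
    "cache_salt": "team-a-secret",  # assumed field name
}
resp = requests.post("http://localhost:8000/v1/completions", json=payload, timeout=60)
print(resp.json()["choices"][0]["text"])
```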

!!! note
Cache isolation is not supported in engine V0.

## Data Structure

The prefix caching in vLLM v1 is implemented in the KV cache manager. The basic building block is the “Block” data class (simplified):
@@ -189,7 +186,7 @@ Time 1:
Cache Blocks: 0, 1, 3
```

As can be seen, block 3 is a new full block and is cached. However, it is redundant as block 1, meaning that we cached the same block twice. In v0, when detecting block 3 is duplicated, we free block 3 and let Request 2 use block 1 instead, so its block table becomes `[0, 1]` in Time 1. However, the block table in vLLM v1 is append-only, meaning that changing the block table from `[0, 3]` to `[0, 1]` is not allowed. As a result, we will have duplicated blocks for the hash key E-H. This duplication will be eliminated when the request is freed.
As can be seen, block 3 is a new full block and is cached. However, it is redundant as block 1, meaning that we cached the same block twice. Because the block table in vLLM v1 is append-only, changing the block table from `[0, 3]` to `[0, 1]` is not allowed. As a result, we will have duplicated blocks for the hash key E-H. This duplication will be eliminated when the request is freed.

### Free

2 changes: 1 addition & 1 deletion docs/features/custom_logitsprocs.md
@@ -166,7 +166,7 @@ The `DummyLogitsProcessor.update_state()` implementation maintains a "sparse" re

### Wrapping an Existing Request-Level Logits Processor

Although the vLLM engine applies logits processors at batch granularity, some users may want to use vLLM with a "request-level" logits processor implementation - an implementation which operates on individual requests. This will be especially true if your logits processor was developed for vLLM version 0, which required it to be a `Callable` (as described [here](https://docs.vllm.ai/en/v0.10.1.1/api/vllm/logits_process.html)) conforming to the following type annotation:
Although the vLLM engine applies logits processors at batch granularity, some users may want to use vLLM with a "request-level" logits processor implementation - an implementation which operates on individual requests. Earlier request-level processors were implemented as `Callable` objects conforming to the following type annotation:

``` python
RequestLogitsProcessor = Union[
4 changes: 2 additions & 2 deletions docs/features/spec_decode.md
@@ -16,8 +16,8 @@ Speculative decoding is a technique which improves inter-token latency in memory
The following code configures vLLM in an offline mode to use speculative decoding with a draft model, speculating 5 tokens at a time.

!!! warning
In vllm v0.10.0, speculative decoding with a draft model is not supported.
If you use the following code, you will get a `NotImplementedError`.
Speculative decoding with a draft model requires the V1 engine.
Older releases that predate V1 (such as the 0.10.x series) raise a `NotImplementedError`.
Comment on lines +19 to +20
Suggested change
Speculative decoding with a draft model requires the V1 engine.
Older releases that predate V1 (such as the 0.10.x series) raise a `NotImplementedError`.
Speculative decoding with a draft model is not supported in vLLM V1.
You can use an older release, from before the 0.10.x series, to continue to leverage it.


??? code
