diff --git a/docs/configuration/conserving_memory.md b/docs/configuration/conserving_memory.md index efda9c8e019e..a12e1705f00a 100644 --- a/docs/configuration/conserving_memory.md +++ b/docs/configuration/conserving_memory.md @@ -53,7 +53,7 @@ llm = LLM(model="adept/fuyu-8b", By default, we optimize model inference using CUDA graphs which take up extra memory in the GPU. !!! warning - CUDA graph capture takes up more memory in V1 than in V0. + CUDA graph capture increases GPU memory usage. Adjust capture sizes if you need to conserve memory. You can adjust `compilation_config` to achieve a better balance between inference speed and memory usage: diff --git a/docs/configuration/optimization.md b/docs/configuration/optimization.md index 5c74610ebd29..ba7433d6a920 100644 --- a/docs/configuration/optimization.md +++ b/docs/configuration/optimization.md @@ -33,7 +33,7 @@ In vLLM V1, the default preemption mode is `RECOMPUTE` rather than `SWAP`, as re Chunked prefill allows vLLM to process large prefills in smaller chunks and batch them together with decode requests. This feature helps improve both throughput and latency by better balancing compute-bound (prefill) and memory-bound (decode) operations. -In vLLM V1, **chunked prefill is always enabled by default**. This is different from vLLM V0, where it was conditionally enabled based on model characteristics. +In vLLM V1, **chunked prefill is always enabled by default** so that behavior is consistent across supported models. With chunked prefill enabled, the scheduling policy prioritizes decode requests. It batches all pending decode requests before scheduling any prefill operations. When there are available tokens in the `max_num_batched_tokens` budget, it schedules pending prefills. If a pending prefill request cannot fit into `max_num_batched_tokens`, it automatically chunks it. @@ -49,7 +49,7 @@ You can tune the performance by adjusting `max_num_batched_tokens`: - Smaller values (e.g., 2048) achieve better inter-token latency (ITL) because there are fewer prefills slowing down decodes. - Higher values achieve better time to first token (TTFT) as you can process more prefill tokens in a batch. - For optimal throughput, we recommend setting `max_num_batched_tokens > 8192` especially for smaller models on large GPUs. -- If `max_num_batched_tokens` is the same as `max_model_len`, that's almost the equivalent to the V0 default scheduling policy (except that it still prioritizes decodes). +- If `max_num_batched_tokens` is the same as `max_model_len`, the scheduler behaves similarly to the legacy policy where large prefills ran without chunking (while still prioritizing decodes). ```python from vllm import LLM diff --git a/docs/contributing/model/basic.md b/docs/contributing/model/basic.md index aafdb1058e03..f318200c8f51 100644 --- a/docs/contributing/model/basic.md +++ b/docs/contributing/model/basic.md @@ -133,8 +133,7 @@ We consider 3 different scenarios: For case (1), we recommend looking at the implementation of [`MambaForCausalLM`](gh-file:vllm/model_executor/models/mamba.py) (for Mamba-1) or [`Mamba2ForCausalLM`](gh-file:vllm/model_executor/models/mamba2.py) (for Mamba-2) as a reference. The model should inherit protocol `IsAttentionFree` and also implement class methods `get_mamba_state_dtype_from_config` and `get_mamba_state_shape_from_config` to calculate the state shapes and data types from the config. 
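A minimal, illustrative skeleton of this interface is sketched below. The import path, method signatures, and return values here are assumptions for illustration only; take the real ones from the reference implementations linked above.

```python
# Illustrative sketch only (not a drop-in implementation): follow
# MambaForCausalLM / Mamba2ForCausalLM for the exact signatures and state math.
from torch import nn

from vllm.model_executor.models.interfaces import IsAttentionFree


class MyMambaForCausalLM(nn.Module, IsAttentionFree):
    """Skeleton of an attention-free, Mamba-style decoder."""

    @classmethod
    def get_mamba_state_dtype_from_config(cls, vllm_config):
        # Return the dtypes of the per-layer mamba state tensors
        # (e.g. conv state and SSM state), derived from the config.
        raise NotImplementedError

    @classmethod
    def get_mamba_state_shape_from_config(cls, vllm_config):
        # Return the per-layer state shapes, derived from the model config
        # and parallelism settings.
        raise NotImplementedError
```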
For the mamba layers themselves, please use the [`MambaMixer`](gh-file:vllm/model_executor/layers/mamba/mamba_mixer.py) (for Mamba-1) or [`MambaMixer2`](gh-file:vllm/model_executor/layers/mamba/mamba_mixer2.py) (for Mamba-2) classes. -Please *do not* use the `MambaCacheManager` (deprecated in V1) or replicate any of the V0-specific code paths in the existing model implementations. -V0-only classes and code will be removed in the very near future. +Please avoid reintroducing legacy cache managers such as `MambaCacheManager` or any previously removed code paths from older implementations. The model should also be added to the `MODELS_CONFIG_MAP` dictionary in to ensure that the runtime defaults are optimized. For case (2), we recommend using as a reference the implementation of [`JambaForCausalLM`](gh-file:vllm/model_executor/models/jamba.py) (for an example of a model that uses Mamba-1 and attention together) or [`BambaForCausalLM`](gh-file:vllm/model_executor/models/bamba.py) (for an example of a model that uses Mamba-2 and attention together). diff --git a/docs/design/metrics.md b/docs/design/metrics.md index 90b2fd32f297..592128ee053e 100644 --- a/docs/design/metrics.md +++ b/docs/design/metrics.md @@ -1,12 +1,12 @@ # Metrics -Ensure the v1 LLM Engine exposes a superset of the metrics available in v0. +vLLM exposes a rich set of metrics to support observability and capacity planning for the V1 engine. ## Objectives -- Achieve parity of metrics between v0 and v1. -- The priority use case is accessing these metrics via Prometheus, as this is what we expect to be used in production environments. -- Logging support (i.e. printing metrics to the info log) is provided for more ad-hoc testing, debugging, development, and exploratory use cases. +- Provide comprehensive coverage of engine and request level metrics to aid production monitoring. +- Prioritize Prometheus integrations, as this is what we expect to be used in production environments. +- Offer logging support (i.e. printing metrics to the info log) for ad-hoc testing, debugging, development, and exploratory use cases. ## Background @@ -17,9 +17,9 @@ Metrics in vLLM can be categorized as follows: The mental model is that server-level metrics help explain the values of request-level metrics. -### v0 Metrics +### Metrics Overview -In v0, the following metrics are exposed via a Prometheus-compatible `/metrics` endpoint using the `vllm:` prefix: +The following metrics are exposed via a Prometheus-compatible `/metrics` endpoint using the `vllm:` prefix and are documented under [Inferencing and Serving -> Production Metrics](../usage/metrics.md): - `vllm:num_requests_running` (Gauge) - `vllm:num_requests_swapped` (Gauge) @@ -57,8 +57,6 @@ In v0, the following metrics are exposed via a Prometheus-compatible `/metrics` - `vllm:spec_decode_num_draft_tokens_total` (Counter) - `vllm:spec_decode_num_emitted_tokens_total` (Counter) -These are documented under [Inferencing and Serving -> Production Metrics](../usage/metrics.md). - ### Grafana Dashboard vLLM also provides [a reference example](../examples/online_serving/prometheus_grafana.md) for how to collect and store these metrics using Prometheus and visualize them using a Grafana dashboard. @@ -86,7 +84,7 @@ See [the PR which added this Dashboard](gh-pr:2316) for interesting and useful b Prometheus support was initially added [using the aioprometheus library](gh-pr:1890), but a switch was made quickly to [prometheus_client](gh-pr:2730). The rationale is discussed in both linked PRs. 
-With the switch to `aioprometheus`, we lost a `MetricsMiddleware` to track HTTP metrics, but this was reinstated [using prometheus_fastapi_instrumentator](gh-pr:15657): +During those migrations we briefly lost a `MetricsMiddleware` to track HTTP metrics, but this was reinstated [using prometheus_fastapi_instrumentator](gh-pr:15657): ```bash $ curl http://0.0.0.0:8000/metrics 2>/dev/null | grep -P '^http_(?!.*(_bucket|_created|_sum)).*' @@ -97,10 +95,6 @@ http_request_duration_highr_seconds_count 201.0 http_request_duration_seconds_count{handler="/v1/completions",method="POST"} 201.0 ``` -### Multi-process Mode - -In v0, metrics are collected in the engine core process and we use multiprocess mode to make them available in the API server process. See . - ### Built in Python/Process Metrics The following metrics are supported by default by `prometheus_client`, but they are not exposed when multiprocess mode is used: @@ -116,22 +110,7 @@ The following metrics are supported by default by `prometheus_client`, but they - `process_open_fds` - `process_max_fds` -This is relevant because if we move away from multiprocess mode in v1, -we get these back. However, it's questionable how relevant these are -if they don't aggregate these stats for all processes that make up a -vLLM instance. - -### v0 PRs and Issues - -For background, these are some of the relevant PRs which added the v0 metrics: - -- -- -- -- -- - -Also note the ["Even Better Observability"](gh-issue:3616) feature where e.g. [a detailed roadmap was laid out](gh-issue:3616#issuecomment-2030858781). +This is relevant because if we move away from multiprocess mode we get these back. However, it's questionable how relevant these are if they don't aggregate these stats for all processes that make up a vLLM instance. ## v1 Design @@ -396,9 +375,8 @@ recent metric is used, but only from currently running processes. This was added in and there is [at least one known user](https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/54). -If we revisit this design and deprecate the old metric, we should reduce -the need for a significant deprecation period by making the change in -v0 also and asking this project to move to the new metric. +If we revisit this design and deprecate the old metric, we should +coordinate with downstream users so they can migrate before the removal. ### Prefix Cache metrics @@ -491,7 +469,7 @@ if seq_group.is_finished(): This seems duplicative, and one of them should be removed. The latter is used by the Grafana dashboard, so we should deprecate or remove the -former from v0. +former. ### Prefix Cache Hit Rate @@ -500,7 +478,7 @@ See above - we now expose 'queries' and 'hits' counters rather than a ### KV Cache Offloading -Two v0 metrics relate to a "swapped" preemption mode that is no +Two legacy metrics relate to a "swapped" preemption mode that is no longer relevant in v1: - `vllm:num_requests_swapped` @@ -511,7 +489,7 @@ cache to complete other requests), we swap kv cache blocks out to CPU memory. This is also known as "KV cache offloading" and is configured with `--swap-space` and `--preemption-mode`. -In v0, [vLLM has long supported beam search](gh-issue:6226). The +Historically, [vLLM has long supported beam search](gh-issue:6226). The SequenceGroup encapsulated the idea of N Sequences which all shared the same prompt kv blocks. This enabled KV cache block sharing between requests, and copy-on-write to do branching. 
CPU @@ -524,7 +502,7 @@ and the part of the prompt that was evicted can be recomputed. SequenceGroup was removed in V1, although a replacement will be required for "parallel sampling" (`n>1`). -[Beam search was moved out of the core (in V0)](gh-issue:8306). There was a +[Beam search was moved out of the core](gh-issue:8306). There was a lot of complex code for a very uncommon feature. In V1, with prefix caching being better (zero overhead) and therefore @@ -535,7 +513,7 @@ better. ### Parallel Sampling -Some v0 metrics are only relevant in the context of "parallel +Some legacy metrics are only relevant in the context of "parallel sampling". This is where the `n` parameter in a request is used to request multiple completions from the same prompt. @@ -554,7 +532,7 @@ also add these metrics. ### Speculative Decoding -Some v0 metrics are specific to "speculative decoding". This is where +Some legacy metrics are specific to "speculative decoding". This is where we generate candidate tokens using a faster, approximate method or model and then validate those tokens with the larger model. @@ -566,7 +544,7 @@ model and then validate those tokens with the larger model. There is a PR under review () to add "prompt lookup (ngram)" speculative decoding to v1. Other techniques will follow. We should -revisit the v0 metrics in this context. +revisit these metrics in this context. !!! note We should probably expose acceptance rate as separate accepted @@ -639,7 +617,7 @@ metrics are often relatively straightforward to add: metrics are usually of very limited use unless they can be enabled by default and in production. 3. They have an impact on development and maintenance of the - project. Every metric added to v0 has made this v1 effort more + project. Every metric added over time has made this effort more time-consuming, and perhaps not all metrics justify this ongoing investment in their maintenance. @@ -650,7 +628,7 @@ performance and health. Tracing, on the other hand, tracks individual requests as they move through different services and components. Both fall under the more general heading of "Observability". -v0 has support for OpenTelemetry tracing: +vLLM has support for OpenTelemetry tracing: - Added by - Configured with `--otlp-traces-endpoint` and `--collect-detailed-traces` @@ -663,11 +641,11 @@ OpenTelemetry has a [Gen AI Working Group](https://github.com/open-telemetry/community/blob/main/projects/gen-ai.md). Since metrics is a big enough topic on its own, we are going to tackle -the topic of tracing in v1 separately. +the topic of tracing separately. ### OpenTelemetry Model Forward vs Execute Time -In v0, we have the following two metrics: +The current implementation exposes the following two metrics: - `vllm:model_forward_time_milliseconds` (Histogram) - The time spent in the model forward pass when this request was in the batch. diff --git a/docs/design/multiprocessing.md b/docs/design/multiprocessing.md index 6e92b20d267b..4256d6dcf633 100644 --- a/docs/design/multiprocessing.md +++ b/docs/design/multiprocessing.md @@ -60,30 +60,6 @@ Multiple vLLM dependencies indicate either a preference or requirement for using It is perhaps more accurate to say that there are known problems with using `fork` after initializing these dependencies. -## Current State (v0) - -The environment variable `VLLM_WORKER_MULTIPROC_METHOD` can be used to control which method is used by vLLM. The current default is `fork`. 
- -- - -When we know we own the process because the `vllm` command was used, we use -`spawn` because it's the most widely compatible. - -- - -The `multiproc_xpu_executor` forces the use of `spawn`. - -- - -There are other miscellaneous places hard-coding the use of `spawn`: - -- -- - -Related PRs: - -- - ## Prior State in v1 There was an environment variable to control whether multiprocessing is used in diff --git a/docs/design/prefix_caching.md b/docs/design/prefix_caching.md index 9941837bf165..ca6278873441 100644 --- a/docs/design/prefix_caching.md +++ b/docs/design/prefix_caching.md @@ -94,9 +94,6 @@ To improve privacy in shared environments, vLLM supports isolating prefix cache With this setup, cache sharing is limited to users or requests that explicitly agree on a common salt, enabling cache reuse within a trust group while isolating others. -!!! note - Cache isolation is not supported in engine V0. - ## Data Structure The prefix caching in vLLM v1 is implemented in the KV cache manager. The basic building block is the “Block” data class (simplified): @@ -189,7 +186,7 @@ Time 1: Cache Blocks: 0, 1, 3 ``` -As can be seen, block 3 is a new full block and is cached. However, it is redundant as block 1, meaning that we cached the same block twice. In v0, when detecting block 3 is duplicated, we free block 3 and let Request 2 use block 1 instead, so its block table becomes `[0, 1]` in Time 1. However, the block table in vLLM v1 is append-only, meaning that changing the block table from `[0, 3]` to `[0, 1]` is not allowed. As a result, we will have duplicated blocks for the hash key E-H. This duplication will be eliminated when the request is freed. +As can be seen, block 3 is a new full block and is cached. However, it is redundant as block 1, meaning that we cached the same block twice. Because the block table in vLLM v1 is append-only, changing the block table from `[0, 3]` to `[0, 1]` is not allowed. As a result, we will have duplicated blocks for the hash key E-H. This duplication will be eliminated when the request is freed. ### Free diff --git a/docs/features/custom_logitsprocs.md b/docs/features/custom_logitsprocs.md index 201b340c5972..3e50d49e8497 100644 --- a/docs/features/custom_logitsprocs.md +++ b/docs/features/custom_logitsprocs.md @@ -166,7 +166,7 @@ The `DummyLogitsProcessor.update_state()` implementation maintains a "sparse" re ### Wrapping an Existing Request-Level Logits Processor -Although the vLLM engine applies logits processors at batch granularity, some users may want to use vLLM with a "request-level" logits processor implementation - an implementation which operates on individual requests. This will be especially true if your logits processor was developed for vLLM version 0, which required it to be a `Callable` (as described [here](https://docs.vllm.ai/en/v0.10.1.1/api/vllm/logits_process.html)) conforming to the following type annotation: +Although the vLLM engine applies logits processors at batch granularity, some users may want to use vLLM with a "request-level" logits processor implementation - an implementation which operates on individual requests. 
Earlier request-level processors were implemented as `Callable` objects conforming to the following type annotation: ``` python RequestLogitsProcessor = Union[ diff --git a/docs/features/spec_decode.md b/docs/features/spec_decode.md index 25c308a6ff20..3b446cb4b441 100644 --- a/docs/features/spec_decode.md +++ b/docs/features/spec_decode.md @@ -16,8 +16,8 @@ Speculative decoding is a technique which improves inter-token latency in memory The following code configures vLLM in an offline mode to use speculative decoding with a draft model, speculating 5 tokens at a time. !!! warning - In vllm v0.10.0, speculative decoding with a draft model is not supported. - If you use the following code, you will get a `NotImplementedError`. + Speculative decoding with a draft model is not currently supported. + Running the following code will raise a `NotImplementedError`. ??? code diff --git a/docs/models/supported_models.md b/docs/models/supported_models.md index 60fe5b887952..7a22564398bd 100644 --- a/docs/models/supported_models.md +++ b/docs/models/supported_models.md @@ -602,8 +602,9 @@ On the other hand, modalities separated by `/` are mutually exclusive. See [this page](../features/multimodal_inputs.md) on how to pass multi-modal inputs to the model. !!! important - **To enable multiple multi-modal items per text prompt in vLLM V0**, you have to set `limit_mm_per_prompt` (offline inference) - or `--limit-mm-per-prompt` (online serving). For example, to enable passing up to 4 images per text prompt: + You can control the maximum number of multimodal inputs per prompt by setting + `limit_mm_per_prompt` (offline inference) or `--limit-mm-per-prompt` (online + serving). For example, to enable passing up to 4 images per text prompt: Offline inference: @@ -622,8 +623,6 @@ See [this page](../features/multimodal_inputs.md) on how to pass multi-modal inp vllm serve Qwen/Qwen2-VL-7B-Instruct --limit-mm-per-prompt '{"image":4}' ``` - **This is no longer required if you are using vLLM V1.** - !!! tip For hybrid-only models such as Llama-4, Step3 and Mistral-3, a text-only mode can be enabled by setting all supported multimodal modalities to 0 (e.g., `--limit-mm-per-prompt '{"image":0}'`) so that their multimodal modules will not be loaded to free up more GPU memory for KV cache. @@ -731,16 +730,7 @@ Some models are supported only via the [Transformers backend](#transformers). Th + Multiple items can be inputted per text prompt for this modality. !!! warning - Both V0 and V1 support `Gemma3ForConditionalGeneration` for text-only inputs. - However, there are differences in how they handle text + image inputs: - - V0 correctly implements the model's attention pattern: - - Uses bidirectional attention between the image tokens corresponding to the same image - - Uses causal attention for other tokens - - Implemented via (naive) PyTorch SDPA with masking tensors - - Note: May use significant memory for long prompts with image - - V1 currently uses a simplified attention pattern: + `Gemma3ForConditionalGeneration` uses a simplified attention pattern for text + image inputs: - Uses causal attention for all tokens, including image tokens - Generates reasonable outputs but does not match the original model's attention for text + image inputs, especially when `{"do_pan_and_scan": true}` - Will be updated in the future to support the correct behavior @@ -798,11 +788,11 @@ Some models are supported only via the [Transformers backend](#transformers). Th For more details, please see: !!! 
warning - Our PaliGemma implementations have the same problem as Gemma 3 (see above) for both V0 and V1. + Our PaliGemma implementations currently share the same attention limitation as Gemma 3 (see above). !!! note For Qwen2.5-Omni, reading audio from video pre-processing (`--mm-processor-kwargs '{"use_audio_in_video": true}'`) - is currently supported on V0 (but not V1), because overlapping modalities is not yet supported in V1. + is not currently available because overlapping modalities are not yet supported. #### Transcription diff --git a/docs/usage/reproducibility.md b/docs/usage/reproducibility.md index a494dcf19191..df81cb48d17a 100644 --- a/docs/usage/reproducibility.md +++ b/docs/usage/reproducibility.md @@ -1,10 +1,9 @@ # Reproducibility -vLLM does not guarantee the reproducibility of the results by default, for the sake of performance. You need to do the following to achieve -reproducible results: +vLLM does not guarantee the reproducibility of the results by default, for the sake of performance. You need to do the following to achieve reproducible results: -- For V1: Turn off multiprocessing to make the scheduling deterministic by setting `VLLM_ENABLE_V1_MULTIPROCESSING=0`. -- For V0: Set the global seed (see below). +- Turn off multiprocessing to make the scheduling deterministic by setting `VLLM_ENABLE_V1_MULTIPROCESSING=0`. +- Optionally configure the global seed if you need to control random sampling (see below). Example: @@ -30,9 +29,7 @@ However, in some cases, setting the seed will also [change the random state in u ### Default Behavior -In V0, the `seed` parameter defaults to `None`. When the `seed` parameter is `None`, the random states for `random`, `np.random`, and `torch.manual_seed` are not set. This means that each run of vLLM will produce different results if `temperature > 0`, as expected. - -In V1, the `seed` parameter defaults to `0` which sets the random state for each worker, so the results will remain consistent for each vLLM run even if `temperature > 0`. +The `seed` parameter defaults to `0`, which sets the random state for each worker so the results remain consistent for each vLLM run even if `temperature > 0`. !!! note @@ -43,10 +40,6 @@ In V1, the `seed` parameter defaults to `0` which sets the random state for each ### Locality of random state -The random state in user code (i.e. the code that constructs [LLM][vllm.LLM] class) is updated by vLLM under the following conditions: - -- For V0: The seed is specified. -- For V1: The workers are run in the same process as user code, i.e.: `VLLM_ENABLE_V1_MULTIPROCESSING=0`. +The random state in user code (i.e. the code that constructs the [LLM][vllm.LLM] class) is updated by vLLM when the workers run in the same process as user code, i.e. when `VLLM_ENABLE_V1_MULTIPROCESSING=0` is set. -By default, these conditions are not active so you can use vLLM without having to worry about -accidentally making deterministic subsequent operations that rely on random state. +By default, this condition is not active, so you can use vLLM without worrying about accidentally making subsequent operations that rely on random state deterministic. diff --git a/docs/usage/v1_guide.md b/docs/usage/v1_guide.md index 340aaf54bb72..4980006eda10 100644 --- a/docs/usage/v1_guide.md +++ b/docs/usage/v1_guide.md @@ -1,22 +1,16 @@ # vLLM V1 -!!! announcement - - We have started the process of deprecating V0. Please read [RFC #18571](gh-issue:18571) for more details. 
- V1 is now enabled by default for all supported use cases, and we will gradually enable it for every use case we plan to support. Please share any feedback on [GitHub](https://github.com/vllm-project/vllm) or in the [vLLM Slack](https://inviter.co/vllm-slack). To disable V1, please set the environment variable as: `VLLM_USE_V1=0`, and send us a GitHub issue sharing the reason! ## Why vLLM V1? -vLLM V0 successfully supported a wide range of models and hardware, but as new features were developed independently, the system grew increasingly complex. This complexity made it harder to integrate new capabilities and introduced technical debt, revealing the need for a more streamlined and unified design. - -Building on V0’s success, vLLM V1 retains the stable and proven components from V0 -(such as the models, GPU kernels, and utilities). At the same time, it significantly -re-architects the core systems, covering the scheduler, KV cache manager, worker, -sampler, and API server, to provide a cohesive, maintainable framework that better -accommodates continued growth and innovation. +vLLM V1 re-architects the engine to reduce accumulated complexity while preserving +the stable, battle-tested components users rely on (such as models, GPU kernels, +and supporting utilities). The scheduler, KV cache manager, worker, sampler, and +API server now operate within a cohesive framework that is easier to extend and +maintain as new capabilities are added. Specifically, V1 aims to: @@ -88,8 +82,6 @@ based on assigned priority, with FCFS as a tie-breaker), configurable via the | **Mamba Models** | 🟢 (Mamba-2), 🟢 (Mamba-1) | | **Multimodal Models** | 🟢 Functional | -vLLM V1 currently excludes model architectures with the `SupportsV0Only` protocol. - !!! tip This corresponds to the V1 column in our [list of supported models](../models/supported_models.md). @@ -149,8 +141,8 @@ encoder and decoder (e.g., `BartForConditionalGeneration`, #### Semantic Changes to Logprobs -vLLM V1 supports logprobs and prompt logprobs. However, there are some important semantic -differences compared to V0: +vLLM V1 supports logprobs and prompt logprobs. However, there are some important semantics +to consider: ##### Logprobs Calculation @@ -175,7 +167,7 @@ As part of the major architectural rework in vLLM V1, several legacy features ha ##### Sampling features - **best_of**: This feature has been deprecated due to limited usage. See details at [RFC #13361](gh-issue:13361). -- **Per-Request Logits Processors**: In V0, users could pass custom +- **Per-Request Logits Processors**: Previously, users could pass custom processing functions to adjust logits on a per-request basis. In vLLM V1, this feature has been deprecated. Instead, the design is moving toward supporting **global logits processors**, a feature the team is actively working on for future releases. See details at [RFC #13360](gh-pr:13360).
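For illustration, a per-request logits processor in the deprecated style was roughly a callable over a single request's generated token IDs and its logits row. The sketch below is a made-up example (the function name and logic are not part of vLLM), showing only the general shape:

```python
# Sketch of the legacy per-request style: a plain callable that sees one
# request's generated token IDs and its logits, and returns adjusted logits.
import torch


def no_repeat_last_token(token_ids: list[int], logits: torch.Tensor) -> torch.Tensor:
    """Toy example: forbid immediately repeating the most recent token."""
    if token_ids:
        logits[token_ids[-1]] = float("-inf")
    return logits
```

In V1, such a callable can no longer be passed per request; the equivalent logic has to be implemented as a batch-level (global) logits processor instead, as described in the custom logits processors documentation.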