3 changes: 0 additions & 3 deletions docs/configuration/conserving_memory.md
@@ -49,9 +49,6 @@ llm = LLM(model="adept/fuyu-8b", max_model_len=2048, max_num_seqs=2)

By default, we optimize model inference using CUDA graphs which take up extra memory in the GPU.

!!! warning
CUDA graph capture takes up more memory in V1 than in V0.

You can adjust `compilation_config` to achieve a better balance between inference speed and memory usage:

??? code
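The `??? code` block above is collapsed in this diff view. As a rough, hedged sketch of the kind of adjustment the surrounding text describes (not the file's actual snippet; the `CompilationConfig` field name below is an assumption):

```python
from vllm import LLM
from vllm.config import CompilationConfig

# Capture CUDA graphs only for a few small batch sizes so that less GPU
# memory is reserved for graph capture, at some speed cost for larger batches.
llm = LLM(
    model="adept/fuyu-8b",
    max_model_len=2048,
    max_num_seqs=2,
    compilation_config=CompilationConfig(cudagraph_capture_sizes=[1, 2]),
)
```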
4 changes: 1 addition & 3 deletions docs/configuration/optimization.md
@@ -31,9 +31,7 @@ In vLLM V1, the default preemption mode is `RECOMPUTE` rather than `SWAP`, as re

Chunked prefill allows vLLM to process large prefills in smaller chunks and batch them together with decode requests. This feature helps improve both throughput and latency by better balancing compute-bound (prefill) and memory-bound (decode) operations.

In vLLM V1, **chunked prefill is always enabled by default**. This is different from vLLM V0, where it was conditionally enabled based on model characteristics.

With chunked prefill enabled, the scheduling policy prioritizes decode requests. It batches all pending decode requests before scheduling any prefill operations. When there are available tokens in the `max_num_batched_tokens` budget, it schedules pending prefills. If a pending prefill request cannot fit into `max_num_batched_tokens`, it automatically chunks it.
In V1, **chunked prefill is enabled by default whenever possible**. With chunked prefill enabled, the scheduling policy prioritizes decode requests. It batches all pending decode requests before scheduling any prefill operations. When there are available tokens in the `max_num_batched_tokens` budget, it schedules pending prefills. If a pending prefill request cannot fit into `max_num_batched_tokens`, it automatically chunks it.

This policy has two benefits:

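The rest of this hunk is truncated in the diff. As a hedged illustration of the token budget described above (parameter names as exposed through `LLM`/engine args; the model name and values are arbitrary):

```python
from vllm import LLM

# The per-step token budget is shared by decode requests (scheduled first)
# and prefill chunks; a prefill larger than the remaining budget is split
# into chunks automatically.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_chunked_prefill=True,   # already the default in V1
    max_num_batched_tokens=2048,   # smaller favors decode latency, larger favors prefill throughput
)
```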
32 changes: 9 additions & 23 deletions docs/usage/reproducibility.md
@@ -1,47 +1,33 @@
# Reproducibility

vLLM does not guarantee the reproducibility of the results by default, for the sake of performance. To achieve
reproducible results, you need to turn off multiprocessing to make the scheduling deterministic by setting `VLLM_ENABLE_V1_MULTIPROCESSING=0`.
reproducible results, consider enabling [batch invariance](../features/batch_invariance.md) as the scheduling
cannot be made deterministic without using offline mode and setting `VLLM_ENABLE_V1_MULTIPROCESSING=0`.
Collaborator:

can you make it more clear? IMO

  • for online serving, you need batch invariance
  • for offline serving, you need either batch invariance or VLLM_ENABLE_V1_MULTIPROCESSING=0


Example: [examples/offline_inference/reproducibility.py](../../examples/offline_inference/reproducibility.py)
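For quick reference, a minimal offline sketch of the two routes discussed in this thread (batch invariance, or disabling multiprocessing); the model name is illustrative and the canonical version is the example file updated at the bottom of this PR:

```python
import os

# Route 1 (offline only): keep the workers in the user process so that
# scheduling is deterministic.
os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"

# Route 2 (offline or online serving): enable batch invariance so results
# do not depend on how requests are batched together.
# os.environ["VLLM_BATCH_INVARIANT"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", seed=0)
print(llm.generate(["Hello, my name is"], SamplingParams(temperature=0.8))[0].outputs[0].text)
```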

!!! warning

Applying the above settings [changes the random state in user code](#locality-of-random-state).

!!! note

Even with the above settings, vLLM only provides reproducibility
when it runs on the same hardware and the same vLLM version.
Also, the online serving API (`vllm serve`) does not support reproducibility
because it is almost impossible to make the scheduling deterministic in the
online setting.

## Setting the global seed

The `seed` parameter in vLLM is used to control the random states for various random number generators.

If a specific seed value is provided, the random states for `random`, `np.random`, and `torch.manual_seed` will be set accordingly.

However, in some cases, setting the seed will also [change the random state in user code](#locality-of-random-state).

### Default Behavior

In V1, the `seed` parameter defaults to `0` which sets the random state for each worker, so the results will remain consistent for each vLLM run even if `temperature > 0`.

!!! note
It is impossible to un-specify a seed for V1 because different workers need to sample the same outputs
for workflows such as speculative decoding. For more information, see: <https://github.com/vllm-project/vllm/pull/17929>

It is impossible to un-specify a seed for V1 because different workers need to sample the same outputs
for workflows such as speculative decoding.

For more information, see: <https://github.com/vllm-project/vllm/pull/17929>
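For concreteness, a hedged sketch of passing an explicit global seed (the model name is illustrative):

```python
from vllm import LLM, SamplingParams

# An explicit seed fixes the `random`, `np.random`, and `torch` states used by
# the workers; with the V1 default of seed=0 this already happens even when
# no seed is passed.
llm = LLM(model="facebook/opt-125m", seed=42)

params = SamplingParams(temperature=0.8, top_p=0.95)
print(llm.generate(["The future of AI is"], params)[0].outputs[0].text)
```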

### Locality of random state

The random state in user code (i.e. the code that constructs [LLM][vllm.LLM] class) is updated by vLLM under the following conditions:
!!! note

- For V0: The seed is specified.
- For V1: The workers are run in the same process as user code, i.e.: `VLLM_ENABLE_V1_MULTIPROCESSING=0`.
The random state in user code (i.e. the code that constructs [LLM][vllm.LLM] class) is updated by vLLM
only if the workers are run in the same process as user code, i.e.: `VLLM_ENABLE_V1_MULTIPROCESSING=0`.

By default, these conditions are not active so you can use vLLM without having to worry about
accidentally making deterministic subsequent operations that rely on random state.
By default, `VLLM_ENABLE_V1_MULTIPROCESSING=1` so you can use vLLM without having to worry about
accidentally making deterministic subsequent operations that rely on random state.
135 changes: 69 additions & 66 deletions docs/usage/v1_guide.md
@@ -4,9 +4,7 @@

We have fully deprecated V0. Please read [RFC #18571](https://github.com/vllm-project/vllm/issues/18571) for more details.

V1 is now enabled by default for all supported use cases, and we will gradually enable it for every use case we plan to support. Please share any feedback on [GitHub](https://github.com/vllm-project/vllm) or in the [vLLM Slack](https://inviter.co/vllm-slack).

## Why vLLM V1?
If you have a use case that works on V0 Engine but not V1, please share it on [GitHub](https://github.com/vllm-project/vllm) or in the [vLLM Slack](https://inviter.co/vllm-slack).

vLLM V0 successfully supported a wide range of models and hardware, but as new features were developed independently, the system grew increasingly complex. This complexity made it harder to integrate new capabilities and introduced technical debt, revealing the need for a more streamlined and unified design.

@@ -32,16 +30,43 @@ Upgrade to vLLM’s Core Architecture](https://blog.vllm.ai/2025/01/27/v1-alpha-

This living user guide outlines a few known **important changes and limitations** introduced by vLLM V1. The team has been working actively to bring V1 as the default engine, therefore this guide will be updated constantly as more features get supported on vLLM V1.

## Current Status
## Differences from V0

This section lists some differences in behavior between V0 and V1.

### Chunked Prefill

Chunked prefill is enabled by default whenever possible, unlike in V0 where it was conditionally enabled based on model characteristics.

### CUDA Graphs

CUDA graph capture takes up more memory in V1 than in V0.

### Semantic Changes to Logprobs

#### Logprobs Calculation

By default, logprobs in V1 are now returned immediately once computed from the model’s raw output (i.e.
before applying any logits post-processing such as temperature scaling or penalty
adjustments). As a result, the returned logprobs do not reflect the final adjusted
probabilities used during sampling.

You can adjust this behavior by setting the `--logprobs-mode` flag.
Four modes are supported: `raw_logprobs` (default), `processed_logprobs`, `raw_logits`, `processed_logits`.
Raw means the values before applying any logit processors, like bad words.
Processed means the values after applying all processors, including temperature and top_k/top_p.
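A hedged sketch of switching the mode: the CLI flag and mode names are the ones listed above, while the matching Python engine argument spelling (`logprobs_mode`) and the model name are assumptions:

```python
from vllm import LLM, SamplingParams

# Request logprobs computed after all processors (temperature, top_k/top_p,
# penalties) instead of the default raw values.
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct", logprobs_mode="processed_logprobs")

params = SamplingParams(temperature=0.7, logprobs=5)
out = llm.generate(["The capital of France is"], params)
print(out[0].outputs[0].logprobs)  # one dict per generated token with the top-5 processed logprobs
```

For serving, the equivalent would presumably be `vllm serve Qwen/Qwen2.5-1.5B-Instruct --logprobs-mode processed_logprobs`.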

#### Prompt Logprobs with Prefix Caching

For each item, our progress towards V1 support falls into one of the following states:
Logprobs are not cached. For a request requiring prompt logprobs, the engine will ignore the prefix cache and recompute the prefill of the full prompt to generate the logprobs.
Comment on lines +59 to +61
Contributor (severity: high):

There is a contradiction in the documentation regarding 'Prompt Logprobs with Prefix Caching'. This section states that for requests with prompt logprobs, 'the engine will ignore the prefix cache'. However, the feature table on line 150 indicates that 'Prompt Logprobs with Prefix Caching' is '🟢 Functional'. These two statements are conflicting. Please clarify the correct behavior and update the documentation to be consistent.

Member Author:

cc @njhill

Collaborator:

The statement here is correct. I think it's OK to leave it as functional (not optimized). You can link the Prompt Logprobs with Prefix Caching to this section if you want.
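To make the behavior discussed in this thread concrete, a hedged sketch (model name and values are illustrative):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct", enable_prefix_caching=True)

# Requesting prompt logprobs forces a full prefill recompute for this request,
# even when a prefix of the prompt is already cached (functional, not optimized).
params = SamplingParams(max_tokens=16, prompt_logprobs=3)
out = llm.generate(["A long shared prefix followed by a question."], params)
print(out[0].prompt_logprobs)  # one entry per prompt token
```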


## Feature Support

For each item, its support in vLLM V1 falls into one of the following states:

- **🚀 Optimized**: Nearly fully optimized, with no further work currently planned.
- **🟢 Functional**: Fully operational, with ongoing optimizations.
- **🚧 WIP**: Under active development.
- **🟡 Planned**: Scheduled for future implementation (some may have open PRs/RFCs).
- **🟠 Delayed**: Temporarily dropped in V1 but planned to be re-introduced later.
- **🔴 Deprecated**: Not planned for V1 unless there is strong demand.
- **🟢 Functional**: Fully operational with optimizations comparable to or better than V0.
- **🟡 In Progress**: Planned to be in vLLM V1, with open PRs/RFCs.
- **🔴 Removed**: Dropped from vLLM V1. Will only consider re-introducing if there is strong demand.

!!! note
vLLM V1’s unified scheduler treats both prompt and output tokens the same
@@ -57,13 +82,13 @@ based on assigned priority, with FCFS as a tie-breaker), configurable via the

### Hardware

| Hardware | Status |
|------------|-----------------------------------------------|
| **NVIDIA** | <nobr>🚀</nobr> |
| **AMD** | <nobr>🟢</nobr> |
| Hardware | Status |
|------------------|-----------------------------------------------|
| **NVIDIA** | <nobr>🟢</nobr> |
| **AMD** | <nobr>🟢</nobr> |
| **INTEL GPU** | <nobr>🟢</nobr> |
| **TPU** | <nobr>🟢</nobr> |
| **CPU** | <nobr>🟢 (x86\_64/aarch64) 🟡 (MacOS) </nobr> |
| **TPU** | <nobr>🟢</nobr> |
| **CPU** | <nobr>🟢</nobr> |

!!! note

@@ -78,23 +103,21 @@ based on assigned priority, with FCFS as a tie-breaker), configurable via the

### Models

| Model Type | Status |
|-----------------------------|------------------------------------------------------------------------------------|
| **Decoder-only Models** | <nobr>🚀 Optimized</nobr> |
| **Encoder-Decoder Models** | <nobr>🟢 Whisper only</nobr> |
| **Embedding Models** | <nobr>🟢 Functional</nobr> |
| **Mamba Models** | <nobr>🟢 (Mamba-2), 🟢 (Mamba-1)</nobr> |
| **Multimodal Models** | <nobr>🟢 Functional</nobr> |
| Model Type | Status |
|-----------------------------|-------------------------------------------------------------------------|
| **Decoder-only Models** | <nobr>🟢</nobr> |
| **Encoder-Decoder Models** | <nobr>🟢 (Whisper), 🔴 (Others) </nobr> |
| **Pooling Models** | <nobr>🟢</nobr> |
| **Mamba Models** | <nobr>🟢</nobr> |
| **Multimodal Models** | <nobr>🟢</nobr> |

See below for the status of models that are not yet supported or have more features planned in V1.

#### Embedding Models
#### Pooling Models

The initial basic support is now functional.
Now fully supported, with prefix caching and chunked prefill newly available for last-pooling models.

Later, we will consider using [hidden states processor](https://github.com/vllm-project/vllm/issues/12249),
which is based on [global logits processor](https://github.com/vllm-project/vllm/pull/13360)
to enable simultaneous generation and embedding using the same engine instance in V1.
We are working on enabling prefix caching and chunked prefill for more categories of pooling models.
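A hedged usage sketch for a pooling model; the exact task/runner argument and helper method spellings vary across vLLM versions, and the model choice (assumed here to use last pooling) is illustrative:

```python
from vllm import LLM

# Decoder-based embedding model using last-token pooling, with prefix caching enabled.
llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embed", enable_prefix_caching=True)

outputs = llm.embed(["vLLM V1 pooling models support prefix caching."])
print(len(outputs[0].outputs.embedding))  # embedding dimensionality
```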

#### Mamba Models

@@ -112,24 +135,25 @@ Please note that prefix caching is not yet supported for any of the above models

Whisper is supported. Other models requiring cross-attention between separate
encoder and decoder (e.g., `BartForConditionalGeneration`,
`MllamaForConditionalGeneration`) are not supported.
`MllamaForConditionalGeneration`) are no longer supported.

### Features

| Feature | Status |
|---------------------------------------------|-----------------------------------------------------------------------------------|
| **Prefix Caching** | <nobr>🚀 Optimized</nobr> |
| **Chunked Prefill** | <nobr>🚀 Optimized</nobr> |
| **LoRA** | <nobr>🚀 Optimized</nobr> |
| **Prefix Caching** | <nobr>🟢 Functional</nobr> |
| **Chunked Prefill** | <nobr>🟢 Functional</nobr> |
| **LoRA** | <nobr>🟢 Functional</nobr> |
| **Logprobs Calculation** | <nobr>🟢 Functional</nobr> |
| **FP8 KV Cache** | <nobr>🟢 Functional on Hopper devices (<https://github.com/vllm-project/vllm/pull/15191>)</nobr>|
| **Spec Decode** | <nobr>🚀 Optimized</nobr> |
| **Prompt Logprobs with Prefix Caching** | <nobr>🟡 Planned ([RFC #13414](https://github.com/vllm-project/vllm/issues/13414))</nobr>|
| **FP8 KV Cache** | <nobr>🟢 Functional</nobr> |
| **Spec Decode** | <nobr>🟢 Functional</nobr> |
| **Prompt Logprobs with Prefix Caching** | <nobr>🟢 Functional</nobr> |
| **Structured Output Alternative Backends** | <nobr>🟢 Functional</nobr> |
| **Request-level Structured Output Backend** | <nobr>🔴 Deprecated</nobr> |
| **best_of** | <nobr>🔴 Deprecated ([RFC #13361](https://github.com/vllm-project/vllm/issues/13361))</nobr>|
| **Per-Request Logits Processors** | <nobr>🔴 Deprecated ([RFC #13360](https://github.com/vllm-project/vllm/pull/13360))</nobr> |
| **GPU <> CPU KV Cache Swapping** | <nobr>🔴 Deprecated</nobr> |
| **Concurrent Partial Prefills** | <nobr>🟡 [In Progress](https://github.com/vllm-project/vllm/issues/14003)</nobr> |
| **best_of** | <nobr>🔴 [Removed](https://github.com/vllm-project/vllm/issues/13361)</nobr> |
| **Per-Request Logits Processors** | <nobr>🔴 [Removed](https://github.com/vllm-project/vllm/pull/13360)</nobr> |
| **GPU <> CPU KV Cache Swapping** | <nobr>🔴 Removed</nobr> |
| **Request-level Structured Output Backend** | <nobr>🔴 Removed</nobr> |

!!! note

@@ -139,37 +163,16 @@ encoder and decoder (e.g., `BartForConditionalGeneration`,
prefix caching, and speculative decoding without a strict separation between prefill
and decode phases.

#### Semantic Changes to Logprobs

vLLM V1 supports logprobs and prompt logprobs. However, there are some important semantic
differences compared to V0:

##### Logprobs Calculation

By default, logprobs in V1 are now returned immediately once computed from the model’s raw output (i.e.
before applying any logits post-processing such as temperature scaling or penalty
adjustments). As a result, the returned logprobs do not reflect the final adjusted
probabilities used during sampling.

You can adjust this behavior by setting the `--logprobs-mode` flag.
Four modes are supported: `raw_logprobs` (default), `processed_logprobs`, `raw_logits`, `processed_logits`.
Raw means the values before applying any logit processors, like bad words.
Processed means the values after applying all processors, including temperature and top_k/top_p.

##### Prompt Logprobs with Prefix Caching

Logprobs are not cached. For a request requiring prompt logprobs, the engine will ignore the prefix cache and recompute the prefill of full prompt to generate the logprobs.

#### Deprecated Features
#### Removed Features

As part of the major architectural rework in vLLM V1, several legacy features have been deprecated.
As part of the major architectural rework in vLLM V1, several legacy features have been removed.

##### Sampling features

- **best_of**: This feature has been deprecated due to limited usage. See details at [RFC #13361](https://github.com/vllm-project/vllm/issues/13361).
- **best_of**: This feature has been removed due to limited usage. See details at [RFC #13361](https://github.com/vllm-project/vllm/issues/13361).
- **Per-Request Logits Processors**: In V0, users could pass custom
processing functions to adjust logits on a per-request basis. In vLLM V1, this
feature has been deprecated. Instead, we now support **global logits processors**
feature has been removed. Instead, we now support **global logits processors**
which are set at startup time, see [RFC #17799](https://github.com/vllm-project/vllm/issues/17799).

##### KV Cache features
Expand All @@ -179,4 +182,4 @@ to handle request preemptions.

##### Structured Output features

- **Request-level Structured Output Backend**: Deprecated, alternative backends (outlines, guidance) with fallbacks is supported now.
- **Request-level Structured Output Backend**: Removed; alternative backends (outlines, guidance) with fallbacks are supported now.
4 changes: 2 additions & 2 deletions examples/offline_inference/reproducibility.py
@@ -11,8 +11,8 @@

from vllm import LLM, SamplingParams

# Turn off multiprocessing to make the scheduling deterministic.
os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"
# Enable batch invariance to get consistent results regardless of scheduling.
os.environ["VLLM_BATCH_INVARIANT"] = "1"

prompts = [
"Hello, my name is",
7 changes: 0 additions & 7 deletions tests/models/language/generation/test_common.py
@@ -10,13 +10,6 @@
from ...registry import HF_EXAMPLE_MODELS
from ...utils import check_logprobs_close

# These have unsupported head_dim for FA. We do not
# have a clean way to fall back, so we fail with
# a clear msg when it happens.
# https://github.com/vllm-project/vllm/issues/14524
# NOTE(woosuk): Skipping these tests until V1 supports them.
# REQUIRES_V0 = ["microsoft/phi-2", "stabilityai/stablelm-3b-4e1t"]

# This list contains the model that are using AITER kernel.
# Skip model that are not using AITER tests.
# When more AITER kernels are added, this list will not be