2 changes: 1 addition & 1 deletion docs/configuration/conserving_memory.md
@@ -53,7 +53,7 @@ llm = LLM(model="adept/fuyu-8b",
By default, we optimize model inference using CUDA graphs which take up extra memory in the GPU.

!!! warning
CUDA graph capture takes up more memory in V1 than in V0.
CUDA graph capture increases GPU memory usage. Adjust capture sizes if you need to conserve memory.

You can adjust `compilation_config` to achieve a better balance between inference speed and memory usage:
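
For instance, here is a minimal sketch; the `cudagraph_capture_sizes` field name and accepted values are assumptions to verify against your installed vLLM release:

```python
from vllm import LLM

# Sketch: capture CUDA graphs only for a few small batch sizes to trade some
# inference speed for lower GPU memory usage. The "cudagraph_capture_sizes"
# key is an assumption; check the CompilationConfig of your vLLM version.
llm = LLM(
    model="adept/fuyu-8b",
    compilation_config={"cudagraph_capture_sizes": [1, 2, 4, 8]},
)
```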

4 changes: 2 additions & 2 deletions docs/configuration/optimization.md
@@ -33,7 +33,7 @@ In vLLM V1, the default preemption mode is `RECOMPUTE` rather than `SWAP`, as re

Chunked prefill allows vLLM to process large prefills in smaller chunks and batch them together with decode requests. This feature helps improve both throughput and latency by better balancing compute-bound (prefill) and memory-bound (decode) operations.

In vLLM V1, **chunked prefill is always enabled by default**. This is different from vLLM V0, where it was conditionally enabled based on model characteristics.
In vLLM V1, **chunked prefill is always enabled by default** so that behavior is consistent across supported models.
Suggested change
In vLLM V1, **chunked prefill is always enabled by default** so that behavior is consistent across supported models.
In vLLM V1, **chunked prefill is always enabled by default**.


With chunked prefill enabled, the scheduling policy prioritizes decode requests. It batches all pending decode requests before scheduling any prefill operations. When there are available tokens in the `max_num_batched_tokens` budget, it schedules pending prefills. If a pending prefill request cannot fit into `max_num_batched_tokens`, it automatically chunks it.

@@ -49,7 +49,7 @@ You can tune the performance by adjusting `max_num_batched_tokens`:
- Smaller values (e.g., 2048) achieve better inter-token latency (ITL) because there are fewer prefills slowing down decodes.
- Higher values achieve better time to first token (TTFT) as you can process more prefill tokens in a batch.
- For optimal throughput, we recommend setting `max_num_batched_tokens > 8192` especially for smaller models on large GPUs.
- If `max_num_batched_tokens` is the same as `max_model_len`, that's almost the equivalent to the V0 default scheduling policy (except that it still prioritizes decodes).
- If `max_num_batched_tokens` is the same as `max_model_len`, the scheduler behaves similarly to the legacy policy where large prefills ran without chunking (while still prioritizing decodes).

```python
from vllm import LLM
3 changes: 1 addition & 2 deletions docs/contributing/model/basic.md
@@ -133,8 +133,7 @@ We consider 3 different scenarios:
For case (1), we recommend looking at the implementation of [`MambaForCausalLM`](gh-file:vllm/model_executor/models/mamba.py) (for Mamba-1) or [`Mamba2ForCausalLM`](gh-file:vllm/model_executor/models/mamba2.py) (for Mamba-2) as a reference.
The model should inherit protocol `IsAttentionFree` and also implement class methods `get_mamba_state_dtype_from_config` and `get_mamba_state_shape_from_config` to calculate the state shapes and data types from the config.
For the mamba layers themselves, please use the [`MambaMixer`](gh-file:vllm/model_executor/layers/mamba/mamba_mixer.py) (for Mamba-1) or [`MambaMixer2`](gh-file:vllm/model_executor/layers/mamba/mamba_mixer2.py) (for Mamba-2) classes.
Please *do not* use the `MambaCacheManager` (deprecated in V1) or replicate any of the V0-specific code paths in the existing model implementations.
V0-only classes and code will be removed in the very near future.
Please avoid reintroducing legacy cache managers such as `MambaCacheManager` or any previously removed code paths from older implementations.
The model should also be added to the `MODELS_CONFIG_MAP` dictionary in <gh-file:vllm/model_executor/models/config.py> to ensure that the runtime defaults are optimized.
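
As a rough orientation, the skeleton below sketches these pieces; the import path for `IsAttentionFree` and the method signatures are assumptions, so follow the referenced `MambaForCausalLM`/`Mamba2ForCausalLM` implementations for the real interfaces:

```python
# Minimal sketch only -- not a drop-in implementation.
import torch.nn as nn

# Assumed import path; verify against the vLLM source tree.
from vllm.model_executor.models.interfaces import IsAttentionFree


class MyMambaForCausalLM(nn.Module, IsAttentionFree):
    """Sketch of an attention-free Mamba-style model's state-config hooks."""

    @classmethod
    def get_mamba_state_dtype_from_config(cls, vllm_config):
        # Derive the dtypes of the conv/SSM state tensors from the model config.
        ...

    @classmethod
    def get_mamba_state_shape_from_config(cls, vllm_config):
        # Derive the per-layer state shapes (conv state, SSM state) from the config.
        ...
```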

For case (2), we recommend using as a reference the implementation of [`JambaForCausalLM`](gh-file:vllm/model_executor/models/jamba.py) (for an example of a model that uses Mamba-1 and attention together) or [`BambaForCausalLM`](gh-file:vllm/model_executor/models/bamba.py) (for an example of a model that uses Mamba-2 and attention together).
64 changes: 21 additions & 43 deletions docs/design/metrics.md
There are probably some mistakes here. @markmc PTAL

@@ -1,12 +1,12 @@
# Metrics

Ensure the v1 LLM Engine exposes a superset of the metrics available in v0.
vLLM exposes a rich set of metrics to support observability and capacity planning for the V1 engine.

## Objectives

- Achieve parity of metrics between v0 and v1.
- The priority use case is accessing these metrics via Prometheus, as this is what we expect to be used in production environments.
- Logging support (i.e. printing metrics to the info log) is provided for more ad-hoc testing, debugging, development, and exploratory use cases.
- Provide comprehensive coverage of engine and request level metrics to aid production monitoring.
- Prioritize Prometheus integrations, as this is what we expect to be used in production environments.
- Offer logging support (i.e. printing metrics to the info log) for ad-hoc testing, debugging, development, and exploratory use cases.

## Background

@@ -17,9 +17,9 @@ Metrics in vLLM can be categorized as follows:

The mental model is that server-level metrics help explain the values of request-level metrics.

### v0 Metrics
### Metrics Overview

In v0, the following metrics are exposed via a Prometheus-compatible `/metrics` endpoint using the `vllm:` prefix:
The following metrics are exposed via a Prometheus-compatible `/metrics` endpoint using the `vllm:` prefix and are documented under [Inferencing and Serving -> Production Metrics](../usage/metrics.md):

- `vllm:num_requests_running` (Gauge)
- `vllm:num_requests_swapped` (Gauge)
@@ -57,8 +57,6 @@ In v0, the following metrics are exposed via a Prometheus-compatible `/metrics`
- `vllm:spec_decode_num_draft_tokens_total` (Counter)
- `vllm:spec_decode_num_emitted_tokens_total` (Counter)

These are documented under [Inferencing and Serving -> Production Metrics](../usage/metrics.md).
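
A quick way to eyeball these from a running server is to scrape the endpoint directly; the sketch below assumes an OpenAI-compatible server listening on `localhost:8000`:

```python
import requests

# Fetch the Prometheus exposition text and keep only vLLM's own series.
metrics_text = requests.get("http://localhost:8000/metrics", timeout=10).text
for line in metrics_text.splitlines():
    if line.startswith("vllm:"):
        print(line)
```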

### Grafana Dashboard

vLLM also provides [a reference example](../examples/online_serving/prometheus_grafana.md) for how to collect and store these metrics using Prometheus and visualize them using a Grafana dashboard.
@@ -86,7 +84,7 @@ See [the PR which added this Dashboard](gh-pr:2316) for interesting and useful b

Prometheus support was initially added [using the aioprometheus library](gh-pr:1890), but a switch was made quickly to [prometheus_client](gh-pr:2730). The rationale is discussed in both linked PRs.

With the switch to `aioprometheus`, we lost a `MetricsMiddleware` to track HTTP metrics, but this was reinstated [using prometheus_fastapi_instrumentator](gh-pr:15657):
During those migrations we briefly lost a `MetricsMiddleware` to track HTTP metrics, but this was reinstated [using prometheus_fastapi_instrumentator](gh-pr:15657):

```bash
$ curl http://0.0.0.0:8000/metrics 2>/dev/null | grep -P '^http_(?!.*(_bucket|_created|_sum)).*'
@@ -97,10 +95,6 @@ http_request_duration_highr_seconds_count 201.0
http_request_duration_seconds_count{handler="/v1/completions",method="POST"} 201.0
```

### Multi-process Mode

In v0, metrics are collected in the engine core process and we use multiprocess mode to make them available in the API server process. See <gh-pr:7279>.

### Built in Python/Process Metrics

The following metrics are supported by default by `prometheus_client`, but they are not exposed when multiprocess mode is used:
@@ -116,22 +110,7 @@ The following metrics are supported by default by `prometheus_client`, but they
- `process_open_fds`
- `process_max_fds`

This is relevant because if we move away from multiprocess mode in v1,
we get these back. However, it's questionable how relevant these are
if they don't aggregate these stats for all processes that make up a
vLLM instance.

### v0 PRs and Issues

For background, these are some of the relevant PRs which added the v0 metrics:

- <gh-pr:1890>
- <gh-pr:2316>
- <gh-pr:2730>
- <gh-pr:4464>
- <gh-pr:7279>

Also note the ["Even Better Observability"](gh-issue:3616) feature where e.g. [a detailed roadmap was laid out](gh-issue:3616#issuecomment-2030858781).
This is relevant because if we move away from multiprocess mode we get these back. However, it's questionable how relevant these are if they don't aggregate these stats for all processes that make up a vLLM instance.
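
For reference, here is a minimal sketch of how `prometheus_client` multiprocess aggregation works; the directory path is an arbitrary example and must match whatever the worker processes were configured to use:

```python
import os

from prometheus_client import CollectorRegistry, generate_latest, multiprocess

# PROMETHEUS_MULTIPROC_DIR must point at the directory the worker processes
# wrote their metric files into; "/tmp/prom_metrics" is only an example path.
os.environ.setdefault("PROMETHEUS_MULTIPROC_DIR", "/tmp/prom_metrics")

registry = CollectorRegistry()
multiprocess.MultiProcessCollector(registry)  # aggregate series across processes
print(generate_latest(registry).decode())
```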

## v1 Design

@@ -396,9 +375,8 @@ recent metric is used, but only from currently running processes.

This was added in <gh-pr:9477> and there is
[at least one known user](https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/54).
If we revisit this design and deprecate the old metric, we should reduce
the need for a significant deprecation period by making the change in
v0 also and asking this project to move to the new metric.
If we revisit this design and deprecate the old metric, we should
coordinate with downstream users so they can migrate before the removal.

### Prefix Cache metrics

@@ -491,7 +469,7 @@ if seq_group.is_finished():

This seems duplicative, and one of them should be removed. The latter
is used by the Grafana dashboard, so we should deprecate or remove the
former from v0.
former.

### Prefix Cache Hit Rate

@@ -500,7 +478,7 @@ See above - we now expose 'queries' and 'hits' counters rather than a

### KV Cache Offloading

Two v0 metrics relate to a "swapped" preemption mode that is no
Two legacy metrics relate to a "swapped" preemption mode that is no
longer relevant in v1:

- `vllm:num_requests_swapped`
@@ -511,7 +489,7 @@ cache to complete other requests), we swap kv cache blocks out to CPU
memory. This is also known as "KV cache offloading" and is configured
with `--swap-space` and `--preemption-mode`.

In v0, [vLLM has long supported beam search](gh-issue:6226). The
Historically, [vLLM has long supported beam search](gh-issue:6226). The
SequenceGroup encapsulated the idea of N Sequences which
all shared the same prompt kv blocks. This enabled KV cache block
sharing between requests, and copy-on-write to do branching. CPU
@@ -524,7 +502,7 @@ and the part of the prompt that was evicted can be recomputed.

SequenceGroup was removed in V1, although a replacement will be
required for "parallel sampling" (`n>1`).
[Beam search was moved out of the core (in V0)](gh-issue:8306). There was a
[Beam search was moved out of the core](gh-issue:8306). There was a
lot of complex code for a very uncommon feature.

In V1, with prefix caching being better (zero overhead) and therefore
@@ -535,7 +513,7 @@ better.

### Parallel Sampling

Some v0 metrics are only relevant in the context of "parallel
Some legacy metrics are only relevant in the context of "parallel
sampling". This is where the `n` parameter in a request is used to
request multiple completions from the same prompt.

@@ -554,7 +532,7 @@ also add these metrics.

### Speculative Decoding

Some v0 metrics are specific to "speculative decoding". This is where
Some legacy metrics are specific to "speculative decoding". This is where
we generate candidate tokens using a faster, approximate method or
model and then validate those tokens with the larger model.

@@ -566,7 +544,7 @@ model and then validate those tokens with the larger model.

There is a PR under review (<gh-pr:12193>) to add "prompt lookup (ngram)"
speculative decoding to v1. Other techniques will follow. We should
revisit the v0 metrics in this context.
revisit these metrics in this context.

!!! note
We should probably expose acceptance rate as separate accepted
@@ -639,7 +617,7 @@ metrics are often relatively straightforward to add:
metrics are usually of very limited use unless they can be enabled
by default and in production.
3. They have an impact on development and maintenance of the
project. Every metric added to v0 has made this v1 effort more
project. Every metric added over time has made this effort more
time-consuming, and perhaps not all metrics justify this ongoing
investment in their maintenance.

@@ -650,7 +628,7 @@ performance and health. Tracing, on the other hand, tracks individual
requests as they move through different services and components. Both
fall under the more general heading of "Observability".

v0 has support for OpenTelemetry tracing:
vLLM has support for OpenTelemetry tracing:

- Added by <gh-pr:4687>
- Configured with `--otlp-traces-endpoint` and `--collect-detailed-traces`
@@ -663,11 +641,11 @@ OpenTelemetry has a
[Gen AI Working Group](https://github.com/open-telemetry/community/blob/main/projects/gen-ai.md).

Since metrics is a big enough topic on its own, we are going to tackle
the topic of tracing in v1 separately.
the topic of tracing separately.

### OpenTelemetry Model Forward vs Execute Time

In v0, we have the following two metrics:
The current implementation exposes the following two metrics:

- `vllm:model_forward_time_milliseconds` (Histogram) - The time spent
in the model forward pass when this request was in the batch.
24 changes: 0 additions & 24 deletions docs/design/multiprocessing.md
@njhill I guess this page can use a full clean up

@@ -60,30 +60,6 @@ Multiple vLLM dependencies indicate either a preference or requirement for using
It is perhaps more accurate to say that there are known problems with using
`fork` after initializing these dependencies.
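
As a generic, standard-library illustration (not vLLM-specific code), `spawn` starts workers in a fresh interpreter so they do not inherit such state:

```python
import multiprocessing as mp


def worker() -> None:
    print("worker started in a fresh interpreter")


if __name__ == "__main__":
    # "spawn" launches a brand-new Python process, so the child does not inherit
    # already-initialized state (e.g. a CUDA context) the way "fork" would.
    ctx = mp.get_context("spawn")
    proc = ctx.Process(target=worker)
    proc.start()
    proc.join()
```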

## Current State (v0)

The environment variable `VLLM_WORKER_MULTIPROC_METHOD` can be used to control which method is used by vLLM. The current default is `fork`.

- <https://github.com/vllm-project/vllm/blob/d05f88679bedd73939251a17c3d785a354b2946c/vllm/envs.py#L339-L342>

When we know we own the process because the `vllm` command was used, we use
`spawn` because it's the most widely compatible.

- <https://github.com/vllm-project/vllm/blob/d05f88679bedd73939251a17c3d785a354b2946c/vllm/scripts.py#L123-L140>

The `multiproc_xpu_executor` forces the use of `spawn`.

- <https://github.com/vllm-project/vllm/blob/d05f88679bedd73939251a17c3d785a354b2946c/vllm/executor/multiproc_xpu_executor.py#L14-L18>

There are other miscellaneous places hard-coding the use of `spawn`:

- <https://github.com/vllm-project/vllm/blob/d05f88679bedd73939251a17c3d785a354b2946c/vllm/distributed/device_communicators/all_reduce_utils.py#L135>
- <https://github.com/vllm-project/vllm/blob/d05f88679bedd73939251a17c3d785a354b2946c/vllm/entrypoints/openai/api_server.py#L184>

Related PRs:

- <gh-pr:8823>

## Prior State in v1

There was an environment variable to control whether multiprocessing is used in
5 changes: 1 addition & 4 deletions docs/design/prefix_caching.md
@@ -94,9 +94,6 @@ To improve privacy in shared environments, vLLM supports isolating prefix cache

With this setup, cache sharing is limited to users or requests that explicitly agree on a common salt, enabling cache reuse within a trust group while isolating others.
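
As an illustration, a request might opt into a shared cache namespace like this; the `cache_salt` field name is an assumption to verify against the prefix-caching documentation of your vLLM version:

```python
import requests

# Hypothetical sketch: requests that send the same salt can share cached prefix
# blocks with each other, while requests with a different (or no) salt cannot.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # example model name
    "prompt": "Shared system preamble ...",
    "max_tokens": 32,
    "cache_salt": "team-a-secret",  # assumed field name
}
resp = requests.post("http://localhost:8000/v1/completions", json=payload, timeout=60)
print(resp.json()["choices"][0]["text"])
```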

!!! note
Cache isolation is not supported in engine V0.

## Data Structure

The prefix caching in vLLM v1 is implemented in the KV cache manager. The basic building block is the “Block” data class (simplified):
@@ -189,7 +186,7 @@ Time 1:
Cache Blocks: 0, 1, 3
```

As can be seen, block 3 is a new full block and is cached. However, it is redundant as block 1, meaning that we cached the same block twice. In v0, when detecting block 3 is duplicated, we free block 3 and let Request 2 use block 1 instead, so its block table becomes `[0, 1]` in Time 1. However, the block table in vLLM v1 is append-only, meaning that changing the block table from `[0, 3]` to `[0, 1]` is not allowed. As a result, we will have duplicated blocks for the hash key E-H. This duplication will be eliminated when the request is freed.
As can be seen, block 3 is a new full block and is cached. However, it is redundant as block 1, meaning that we cached the same block twice. Because the block table in vLLM v1 is append-only, changing the block table from `[0, 3]` to `[0, 1]` is not allowed. As a result, we will have duplicated blocks for the hash key E-H. This duplication will be eliminated when the request is freed.

### Free

2 changes: 1 addition & 1 deletion docs/features/custom_logitsprocs.md
@@ -166,7 +166,7 @@ The `DummyLogitsProcessor.update_state()` implementation maintains a "sparse" re

### Wrapping an Existing Request-Level Logits Processor

Although the vLLM engine applies logits processors at batch granularity, some users may want to use vLLM with a "request-level" logits processor implementation - an implementation which operates on individual requests. This will be especially true if your logits processor was developed for vLLM version 0, which required it to be a `Callable` (as described [here](https://docs.vllm.ai/en/v0.10.1.1/api/vllm/logits_process.html)) conforming to the following type annotation:
Although the vLLM engine applies logits processors at batch granularity, some users may want to use vLLM with a "request-level" logits processor implementation - an implementation which operates on individual requests. Earlier request-level processors were implemented as `Callable` objects conforming to the following type annotation:

``` python
RequestLogitsProcessor = Union[
4 changes: 2 additions & 2 deletions docs/features/spec_decode.md
@@ -16,8 +16,8 @@ Speculative decoding is a technique which improves inter-token latency in memory
The following code configures vLLM in an offline mode to use speculative decoding with a draft model, speculating 5 tokens at a time.

!!! warning
In vllm v0.10.0, speculative decoding with a draft model is not supported.
If you use the following code, you will get a `NotImplementedError`.
Speculative decoding with a draft model requires the V1 engine.
Older releases that predate V1 (such as the 0.10.x series) raise a `NotImplementedError`.
Comment on lines +19 to +20
Suggested change
Speculative decoding with a draft model requires the V1 engine.
Older releases that predate V1 (such as the 0.10.x series) raise a `NotImplementedError`.
Speculative decoding with a draft model is not supported in vLLM V1.
You can use an older release, from before the 0.10.x series, to continue to leverage it.


??? code
