Commit 2cc5016

[Docs] Clean up v1/metrics.md (#21449)
Signed-off-by: windsonsea <[email protected]>
1 parent 6929f8b commit 2cc5016

1 file changed: docs/design/v1/metrics.md (73 additions, 92 deletions)
@@ -5,17 +5,17 @@ Ensure the v1 LLM Engine exposes a superset of the metrics available in v0.
 ## Objectives

 - Achieve parity of metrics between v0 and v1.
-- The priority use case is accessing these metrics via Prometheus as this is what we expect to be used in production environments.
-- Logging support - i.e. printing metrics to the info log - is provided for more ad-hoc testing, debugging, development, and exploratory use cases.
+- The priority use case is accessing these metrics via Prometheus, as this is what we expect to be used in production environments.
+- Logging support (i.e. printing metrics to the info log) is provided for more ad-hoc testing, debugging, development, and exploratory use cases.

 ## Background

 Metrics in vLLM can be categorized as follows:

-1. Server-level metrics: these are global metrics that track the state and performance of the LLM engine. These are typically exposed as Gauges or Counters in Prometheus.
-2. Request-level metrics: these are metrics that track the characteristics - e.g. size and timing - of individual requests. These are typically exposed as Histograms in Prometheus, and are often the SLO that an SRE monitoring vLLM will be tracking.
+1. Server-level metrics: Global metrics that track the state and performance of the LLM engine. These are typically exposed as Gauges or Counters in Prometheus.
+2. Request-level metrics: Metrics that track the characteristics (e.g. size and timing) of individual requests. These are typically exposed as Histograms in Prometheus and are often the SLOs that an SRE monitoring vLLM will be tracking.

-The mental model is that the "Server-level Metrics" explain why the "Request-level Metrics" are what they are.
+The mental model is that server-level metrics help explain the values of request-level metrics.

 ### v0 Metrics

@@ -65,20 +65,20 @@ vLLM also provides [a reference example](../../examples/online_serving/prometheu

 The subset of metrics exposed in the Grafana dashboard gives us an indication of which metrics are especially important:

-- `vllm:e2e_request_latency_seconds_bucket` - End to end request latency measured in seconds
-- `vllm:prompt_tokens_total` - Prompt Tokens
-- `vllm:generation_tokens_total` - Generation Tokens
-- `vllm:time_per_output_token_seconds` - Inter token latency (Time Per Output Token, TPOT) in second.
+- `vllm:e2e_request_latency_seconds_bucket` - End to end request latency measured in seconds.
+- `vllm:prompt_tokens_total` - Prompt tokens.
+- `vllm:generation_tokens_total` - Generation tokens.
+- `vllm:time_per_output_token_seconds` - Inter-token latency (Time Per Output Token, TPOT) in seconds.
 - `vllm:time_to_first_token_seconds` - Time to First Token (TTFT) latency in seconds.
-- `vllm:num_requests_running` (also, `_swapped` and `_waiting`) - Number of requests in RUNNING, WAITING, and SWAPPED state
+- `vllm:num_requests_running` (also, `_swapped` and `_waiting`) - Number of requests in the RUNNING, WAITING, and SWAPPED states.
 - `vllm:gpu_cache_usage_perc` - Percentage of used cache blocks by vLLM.
-- `vllm:request_prompt_tokens` - Request prompt length
-- `vllm:request_generation_tokens` - request generation length
-- `vllm:request_success_total` - Number of finished requests by their finish reason: either an EOS token was generated or the max sequence length was reached
-- `vllm:request_queue_time_seconds` - Queue Time
-- `vllm:request_prefill_time_seconds` - Requests Prefill Time
-- `vllm:request_decode_time_seconds` - Requests Decode Time
-- `vllm:request_max_num_generation_tokens` - Max Generation Token in Sequence Group
+- `vllm:request_prompt_tokens` - Request prompt length.
+- `vllm:request_generation_tokens` - Request generation length.
+- `vllm:request_success_total` - Number of finished requests by their finish reason: either an EOS token was generated or the max sequence length was reached.
+- `vllm:request_queue_time_seconds` - Queue time.
+- `vllm:request_prefill_time_seconds` - Requests prefill time.
+- `vllm:request_decode_time_seconds` - Requests decode time.
+- `vllm:request_max_num_generation_tokens` - Max generation tokens in a sequence group.

 See [the PR which added this Dashboard](gh-pr:2316) for interesting and useful background on the choices made here.

@@ -103,7 +103,7 @@ In v0, metrics are collected in the engine core process and we use multi-process

 ### Built in Python/Process Metrics

-The following metrics are supported by default by `prometheus_client`, but the are not exposed with multiprocess mode is used:
+The following metrics are supported by default by `prometheus_client`, but they are not exposed when multi-process mode is used:

 - `python_gc_objects_collected_total`
 - `python_gc_objects_uncollectable_total`
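
For readers unfamiliar with `prometheus_client` multi-process mode (the reason the metrics above are not exposed), here is a minimal, generic sketch; the directory path is an assumption and this is not vLLM's actual wiring:

```python
import os

# The env var must be set before metrics are created in any worker process;
# the exposed registry then only aggregates the per-process files in this
# directory, so the built-in python_gc_* / process_* collectors never appear.
os.environ.setdefault("PROMETHEUS_MULTIPROC_DIR", "/tmp/prom_multiproc")  # assumed path
os.makedirs(os.environ["PROMETHEUS_MULTIPROC_DIR"], exist_ok=True)

from prometheus_client import CollectorRegistry, generate_latest, multiprocess

registry = CollectorRegistry()
multiprocess.MultiProcessCollector(registry)  # reads the per-process .db files

# Output contains metrics recorded by worker processes, but no python_gc_* or
# process_* series.
print(generate_latest(registry).decode())
```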
@@ -158,6 +158,7 @@ In v1, we wish to move computation and overhead out of the engine core
 process to minimize the time between each forward pass.

 The overall idea of V1 EngineCore design is:
+
 - EngineCore is the inner loop. Performance is most critical here
 - AsyncLLM is the outer loop. This is overlapped with GPU execution
   (ideally), so this is where any "overheads" should be if
@@ -178,7 +179,7 @@ time" (`time.time()`) to calculate intervals as the former is
 unaffected by system clock changes (e.g. from NTP).

 It's also important to note that monotonic clocks differ between
-processes - each process has its own reference. point. So it is
+processes - each process has its own reference point. So it is
 meaningless to compare monotonic timestamps from different processes.

 Therefore, in order to calculate an interval, we must compare two
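
A tiny generic illustration of the interval rule described above (plain Python, not vLLM code):

```python
import time

# Wall-clock time (time.time()) can jump, e.g. due to NTP adjustments; the
# monotonic clock cannot, so it is the right choice for measuring an interval
# within a single process.
start = time.monotonic()
time.sleep(0.1)  # stand-in for the work being timed
interval_seconds = time.monotonic() - start
print(f"interval: {interval_seconds:.3f}s")

# Comparing time.monotonic() values taken in *different* processes is
# meaningless, because each process has its own reference point.
```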
@@ -343,14 +344,15 @@ vllm:time_to_first_token_seconds_bucket{le="0.1",model_name="meta-llama/Llama-3.
 vllm:time_to_first_token_seconds_count{model_name="meta-llama/Llama-3.1-8B-Instruct"} 140.0
 ```

-Note - the choice of histogram buckets to be most useful to users
-across a broad set of use cases is not straightforward and will
-require refinement over time.
+!!! note
+    The choice of histogram buckets to be most useful to users
+    across a broad set of use cases is not straightforward and will
+    require refinement over time.

 ### Cache Config Info

-`prometheus_client` has support for [Info
-metrics](https://prometheus.github.io/client_python/instrumenting/info/)
+`prometheus_client` has support for
+[Info metrics](https://prometheus.github.io/client_python/instrumenting/info/)
 which are equivalent to a `Gauge` whose value is permanently set to 1,
 but exposes interesting key/value pair information via labels. This is
 used for information about an instance that does not change - so it
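
For the bucket-choice note above, a generic `prometheus_client` sketch; the metric name, label, and bucket boundaries are illustrative assumptions rather than vLLM's actual definitions:

```python
from prometheus_client import Histogram

# Explicit bucket boundaries are what the note is about: too coarse and the
# histogram hides interesting latency differences, too fine and it bloats the
# /metrics output. These values are purely illustrative.
ttft_seconds = Histogram(
    "demo_time_to_first_token_seconds",
    "Time to first token in seconds (illustrative).",
    labelnames=["model_name"],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0],
)

ttft_seconds.labels(model_name="meta-llama/Llama-3.1-8B-Instruct").observe(0.042)
```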
@@ -363,14 +365,11 @@ We use this concept for the `vllm:cache_config_info` metric:
 # HELP vllm:cache_config_info Information of the LLMEngine CacheConfig
 # TYPE vllm:cache_config_info gauge
 vllm:cache_config_info{block_size="16",cache_dtype="auto",calculate_kv_scales="False",cpu_offload_gb="0",enable_prefix_caching="False",gpu_memory_utilization="0.9",...} 1.0
-
 ```

-However, `prometheus_client` has [never supported Info metrics in
-multiprocessing
-mode](https://github.com/prometheus/client_python/pull/300) - for
-[unclear
-reasons](gh-pr:7279#discussion_r1710417152). We
+However, `prometheus_client` has
+[never supported Info metrics in multiprocessing mode](https://github.com/prometheus/client_python/pull/300) -
+for [unclear reasons](gh-pr:7279#discussion_r1710417152). We
 simply use a `Gauge` metric set to 1 and
 `multiprocess_mode="mostrecent"` instead.

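A minimal sketch of the `Gauge`-as-info workaround described above; the metric and label names are illustrative, and `multiprocess_mode="mostrecent"` assumes a `prometheus_client` release that supports that mode:

```python
from prometheus_client import Gauge

# Info-style metric emulated with a Gauge permanently set to 1; the interesting
# data lives in the labels. "mostrecent" keeps only the latest sample when the
# multiprocess collector aggregates values from several processes.
cache_config_info = Gauge(
    "demo_cache_config_info",
    "Information about the cache configuration (illustrative).",
    labelnames=["block_size", "cache_dtype", "gpu_memory_utilization"],
    multiprocess_mode="mostrecent",
)

cache_config_info.labels(
    block_size="16",
    cache_dtype="auto",
    gpu_memory_utilization="0.9",
).set(1)
```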
@@ -395,11 +394,9 @@ distinguish between per-adapter counts. This should be revisited.
 Note that `multiprocess_mode="livemostrecent"` is used - the most
 recent metric is used, but only from currently running processes.

-This was added in
-<gh-pr:9477> and there is
-[at least one known
-user](https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/54). If
-we revisit this design and deprecate the old metric, we should reduce
+This was added in <gh-pr:9477> and there is
+[at least one known user](https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/54).
+If we revisit this design and deprecate the old metric, we should reduce
 the need for a significant deprecation period by making the change in
 v0 also and asking this project to move to the new metric.

@@ -442,23 +439,20 @@ suddenly (from their perspective) when it is removed, even if there is
 an equivalent metric for them to use.

 As an example, see how `vllm:avg_prompt_throughput_toks_per_s` was
-[deprecated](gh-pr:2764) (with a
-comment in the code),
-[removed](gh-pr:12383), and then
-[noticed by a
-user](gh-issue:13218).
+[deprecated](gh-pr:2764) (with a comment in the code),
+[removed](gh-pr:12383), and then [noticed by a user](gh-issue:13218).

 In general:

-1) We should be cautious about deprecating metrics, especially since
+1. We should be cautious about deprecating metrics, especially since
    it can be hard to predict the user impact.
-2) We should include a prominent deprecation notice in the help string
+2. We should include a prominent deprecation notice in the help string
    that is included in the `/metrics' output.
-3) We should list deprecated metrics in user-facing documentation and
+3. We should list deprecated metrics in user-facing documentation and
    release notes.
-4) We should consider hiding deprecated metrics behind a CLI argument
-   in order to give administrators [an escape
-   hatch](https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/#show-hidden-metrics)
+4. We should consider hiding deprecated metrics behind a CLI argument
+   in order to give administrators
+   [an escape hatch](https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/#show-hidden-metrics)
    for some time before deleting them.

 See the [deprecation policy](../../contributing/deprecation_policy.md) for
@@ -474,15 +468,15 @@ removed.
 The `vllm:time_in_queue_requests` Histogram metric was added by
 <gh-pr:9659> and its calculation is:

-```
+```python
 self.metrics.first_scheduled_time = now
 self.metrics.time_in_queue = now - self.metrics.arrival_time
 ```

 Two weeks later, <gh-pr:4464> added `vllm:request_queue_time_seconds` leaving
 us with:

-```
+```python
 if seq_group.is_finished():
     if (seq_group.metrics.first_scheduled_time is not None and
             seq_group.metrics.first_token_time is not None):
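
A hedged sketch of how a queue interval like this typically lands in a Prometheus Histogram; the metric objects below are hypothetical and only mirror the snippets above:

```python
import time
from prometheus_client import Histogram

# Both v0 metrics discussed here effectively observe the same interval:
# (first_scheduled_time - arrival_time).
time_in_queue_requests = Histogram(
    "demo_time_in_queue_requests", "Time spent queued, in seconds (illustrative)."
)
request_queue_time_seconds = Histogram(
    "demo_request_queue_time_seconds", "Time spent queued, in seconds (illustrative)."
)

arrival_time = time.monotonic()
time.sleep(0.05)  # stand-in for the request waiting in the scheduler queue
first_scheduled_time = time.monotonic()

queued_interval = first_scheduled_time - arrival_time
time_in_queue_requests.observe(queued_interval)       # added by one PR
request_queue_time_seconds.observe(queued_interval)   # added later; duplicates it
```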
@@ -517,8 +511,7 @@ cache to complete other requests), we swap kv cache blocks out to CPU
 memory. This is also known as "KV cache offloading" and is configured
 with `--swap-space` and `--preemption-mode`.

-In v0, [vLLM has long supported beam
-search](gh-issue:6226). The
+In v0, [vLLM has long supported beam search](gh-issue:6226). The
 SequenceGroup encapsulated the idea of N Sequences which
 all shared the same prompt kv blocks. This enabled KV cache block
 sharing between requests, and copy-on-write to do branching. CPU
@@ -530,9 +523,8 @@ option than CPU swapping since blocks can be evicted slowly on demand
 and the part of the prompt that was evicted can be recomputed.

 SequenceGroup was removed in V1, although a replacement will be
-required for "parallel sampling" (`n>1`). [Beam search was moved out of
-the core (in
-V0)](gh-issue:8306). There was a
+required for "parallel sampling" (`n>1`).
+[Beam search was moved out of the core (in V0)](gh-issue:8306). There was a
 lot of complex code for a very uncommon feature.

 In V1, with prefix caching being better (zero over head) and therefore
@@ -547,18 +539,18 @@ Some v0 metrics are only relevant in the context of "parallel
 sampling". This is where the `n` parameter in a request is used to
 request multiple completions from the same prompt.

-As part of adding parallel sampling support in <gh-pr:10980> we should
+As part of adding parallel sampling support in <gh-pr:10980>, we should
 also add these metrics.

 - `vllm:request_params_n` (Histogram)

-  Observes the value of the 'n' parameter of every finished request.
+    Observes the value of the 'n' parameter of every finished request.

 - `vllm:request_max_num_generation_tokens` (Histogram)

-  Observes the maximum output length of all sequences in every finished
-  sequence group. In the absence of parallel sampling, this is
-  equivalent to `vllm:request_generation_tokens`.
+    Observes the maximum output length of all sequences in every finished
+    sequence group. In the absence of parallel sampling, this is
+    equivalent to `vllm:request_generation_tokens`.

 ### Speculative Decoding

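A hypothetical sketch of observing these two per-request values with `prometheus_client`; the names and buckets are illustrative, not vLLM's definitions:

```python
from prometheus_client import Histogram

request_params_n = Histogram(
    "demo_request_params_n",
    "Value of the n parameter per finished request (illustrative).",
    buckets=[1, 2, 4, 8, 16, 32],
)
request_max_num_generation_tokens = Histogram(
    "demo_request_max_num_generation_tokens",
    "Max output length across all sequences of a finished request (illustrative).",
    buckets=[16, 64, 256, 1024, 4096],
)

# On request completion: n parallel completions, each with its own output length.
n = 4
output_lengths = [120, 87, 240, 95]
request_params_n.observe(n)
request_max_num_generation_tokens.observe(max(output_lengths))
```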
@@ -576,26 +568,23 @@ There is a PR under review (<gh-pr:12193>) to add "prompt lookup (ngram)"
 seculative decoding to v1. Other techniques will follow. We should
 revisit the v0 metrics in this context.

-Note - we should probably expose acceptance rate as separate accepted
-and draft counters, like we do for prefix caching hit rate. Efficiency
-likely also needs similar treatment.
+!!! note
+    We should probably expose acceptance rate as separate accepted
+    and draft counters, like we do for prefix caching hit rate. Efficiency
+    likely also needs similar treatment.

 ### Autoscaling and Load-balancing

 A common use case for our metrics is to support automated scaling of
 vLLM instances.

-For related discussion from the [Kubernetes Serving Working
-Group](https://github.com/kubernetes/community/tree/master/wg-serving),
+For related discussion from the
+[Kubernetes Serving Working Group](https://github.com/kubernetes/community/tree/master/wg-serving),
 see:

-- [Standardizing Large Model Server Metrics in
-  Kubernetes](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk)
-- [Benchmarking LLM Workloads for Performance Evaluation and
-  Autoscaling in
-  Kubernetes](https://docs.google.com/document/d/1k4Q4X14hW4vftElIuYGDu5KDe2LtV1XammoG-Xi3bbQ)
-- [Inference
-  Perf](https://github.com/kubernetes-sigs/wg-serving/tree/main/proposals/013-inference-perf)
+- [Standardizing Large Model Server Metrics in Kubernetes](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk)
+- [Benchmarking LLM Workloads for Performance Evaluation and Autoscaling in Kubernetes](https://docs.google.com/document/d/1k4Q4X14hW4vftElIuYGDu5KDe2LtV1XammoG-Xi3bbQ)
+- [Inference Perf](https://github.com/kubernetes-sigs/wg-serving/tree/main/proposals/013-inference-perf)
 - <gh-issue:5041> and <gh-pr:12726>.

 This is a non-trivial topic. Consider this comment from Rob:
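
To make the accepted/draft counter suggestion in the speculative decoding note above concrete, a hypothetical sketch; the metric names are illustrative and the acceptance rate is derived at query time rather than pre-computed in the server:

```python
from prometheus_client import Counter

# Expose raw counts; a rate is then a PromQL-level derivation, e.g.:
#   increase(demo_spec_decode_num_accepted_tokens_total[5m])
#     / increase(demo_spec_decode_num_draft_tokens_total[5m])
accepted_tokens = Counter(
    "demo_spec_decode_num_accepted_tokens", "Accepted speculative tokens (illustrative)."
)
draft_tokens = Counter(
    "demo_spec_decode_num_draft_tokens", "Proposed draft tokens (illustrative)."
)

# After verifying a batch of draft tokens:
num_draft, num_accepted = 8, 5
draft_tokens.inc(num_draft)
accepted_tokens.inc(num_accepted)
```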
@@ -619,19 +608,16 @@ should judge an instance as approaching saturation:

 Our approach to naming metrics probably deserves to be revisited:

-1. The use of colons in metric names seems contrary to ["colons are
-   reserved for user defined recording
-   rules"](https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels)
+1. The use of colons in metric names seems contrary to
+   ["colons are reserved for user defined recording rules"](https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels).
 2. Most of our metrics follow the convention of ending with units, but
    not all do.
 3. Some of our metric names end with `_total`:

-    ```
-    If there is a suffix of `_total` on the metric name, it will be removed. When
-    exposing the time series for counter, a `_total` suffix will be added. This is
-    for compatibility between OpenMetrics and the Prometheus text format, as OpenMetrics
-    requires the `_total` suffix.
-    ```
+    If there is a suffix of `_total` on the metric name, it will be removed. When
+    exposing the time series for counter, a `_total` suffix will be added. This is
+    for compatibility between OpenMetrics and the Prometheus text format, as OpenMetrics
+    requires the `_total` suffix.

 ### Adding More Metrics

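The `_total` behaviour quoted in the hunk above can be seen with a plain `prometheus_client` Counter; this sketch uses an illustrative metric name:

```python
from prometheus_client import Counter, generate_latest

# prometheus_client strips a trailing `_total` from the name passed in and then
# re-appends `_total` in the text exposition, so `demo_request_success` and
# `demo_request_success_total` refer to the same time series name on /metrics.
requests_finished = Counter("demo_request_success", "Finished requests (illustrative).")
requests_finished.inc()

print(generate_latest().decode())  # exposes: demo_request_success_total 1.0
```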
@@ -642,8 +628,7 @@ There is no shortage of ideas for new metrics:
 - Proposals arising from specific use cases, like the Kubernetes
   auto-scaling topic above
 - Proposals that might arise out of standardisation efforts like
-  [OpenTelemetry Semantic Conventions for Gen
-  AI](https://github.com/open-telemetry/semantic-conventions/tree/main/docs/gen-ai).
+  [OpenTelemetry Semantic Conventions for Gen AI](https://github.com/open-telemetry/semantic-conventions/tree/main/docs/gen-ai).

 We should be cautious in our approach to adding new metrics. While
 metrics are often relatively straightforward to add:
@@ -668,18 +653,14 @@ fall under the more general heading of "Observability".
 v0 has support for OpenTelemetry tracing:

 - Added by <gh-pr:4687>
-- Configured with `--oltp-traces-endpoint` and
-  `--collect-detailed-traces`
-- [OpenTelemetry blog
-  post](https://opentelemetry.io/blog/2024/llm-observability/)
+- Configured with `--oltp-traces-endpoint` and `--collect-detailed-traces`
+- [OpenTelemetry blog post](https://opentelemetry.io/blog/2024/llm-observability/)
 - [User-facing docs](../../examples/online_serving/opentelemetry.md)
-- [Blog
-  post](https://medium.com/@ronen.schaffer/follow-the-trail-supercharging-vllm-with-opentelemetry-distributed-tracing-aa655229b46f)
-- [IBM product
-  docs](https://www.ibm.com/docs/en/instana-observability/current?topic=mgaa-monitoring-large-language-models-llms-vllm-public-preview)
+- [Blog post](https://medium.com/@ronen.schaffer/follow-the-trail-supercharging-vllm-with-opentelemetry-distributed-tracing-aa655229b46f)
+- [IBM product docs](https://www.ibm.com/docs/en/instana-observability/current?topic=mgaa-monitoring-large-language-models-llms-vllm-public-preview)

-OpenTelemetry has a [Gen AI Working
-Group](https://github.com/open-telemetry/community/blob/main/projects/gen-ai.md).
+OpenTelemetry has a
+[Gen AI Working Group](https://github.com/open-telemetry/community/blob/main/projects/gen-ai.md).

 Since metrics is a big enough topic on its own, we are going to tackle
 the topic of tracing in v1 separately.
@@ -698,7 +679,7 @@ These metrics are only enabled when OpenTelemetry tracing is enabled
 and if `--collect-detailed-traces=all/model/worker` is used. The
 documentation for this option states:

-> collect detailed traces for the specified "modules. This involves
+> collect detailed traces for the specified modules. This involves
 > use of possibly costly and or blocking operations and hence might
 > have a performance impact.
