@@ -5,17 +5,17 @@ Ensure the v1 LLM Engine exposes a superset of the metrics available in v0.
## Objectives

- Achieve parity of metrics between v0 and v1.
- - The priority use case is accessing these metrics via Prometheus as this is what we expect to be used in production environments.
- - Logging support - i.e. printing metrics to the info log - is provided for more ad-hoc testing, debugging, development, and exploratory use cases.
+ - The priority use case is accessing these metrics via Prometheus, as this is what we expect to be used in production environments.
+ - Logging support (i.e. printing metrics to the info log) is provided for more ad-hoc testing, debugging, development, and exploratory use cases.

## Background

Metrics in vLLM can be categorized as follows:

- 1. Server-level metrics: these are global metrics that track the state and performance of the LLM engine. These are typically exposed as Gauges or Counters in Prometheus.
- 2. Request-level metrics: these are metrics that track the characteristics - e.g. size and timing - of individual requests. These are typically exposed as Histograms in Prometheus, and are often the SLO that an SRE monitoring vLLM will be tracking.
+ 1. Server-level metrics: Global metrics that track the state and performance of the LLM engine. These are typically exposed as Gauges or Counters in Prometheus.
+ 2. Request-level metrics: Metrics that track the characteristics (e.g. size and timing) of individual requests. These are typically exposed as Histograms in Prometheus and are often the SLOs that an SRE monitoring vLLM will be tracking.

- The mental model is that the "Server-level Metrics" explain why the "Request-level Metrics" are what they are.
+ The mental model is that server-level metrics help explain the values of request-level metrics.
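
To make the two categories concrete, here is a hedged sketch using `prometheus_client` (the metric names come from this doc, but the label sets, help strings, and values are illustrative, not vLLM's actual definitions):

```python
from prometheus_client import Gauge, Histogram

# Sketch only: a server-level metric is typically a Gauge or Counter sampled
# from engine state; a request-level metric is typically a Histogram observed
# once per finished request.
num_requests_running = Gauge(
    "vllm:num_requests_running",
    "Number of requests currently running.",
    labelnames=["model_name"],
)
e2e_request_latency = Histogram(
    "vllm:e2e_request_latency_seconds",
    "Histogram of end-to-end request latency in seconds.",
    labelnames=["model_name"],
)

num_requests_running.labels(model_name="meta-llama/Llama-3.1-8B-Instruct").set(3)
e2e_request_latency.labels(model_name="meta-llama/Llama-3.1-8B-Instruct").observe(0.42)
```
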
### v0 Metrics
@@ -65,20 +65,20 @@ vLLM also provides [a reference example](../../examples/online_serving/prometheu
The subset of metrics exposed in the Grafana dashboard gives us an indication of which metrics are especially important:

- `vllm:e2e_request_latency_seconds_bucket` - End to end request latency measured in seconds
- `vllm:prompt_tokens_total` - Prompt Tokens
- `vllm:generation_tokens_total` - Generation Tokens
- `vllm:time_per_output_token_seconds` - Inter token latency (Time Per Output Token, TPOT) in second.
+ - `vllm:e2e_request_latency_seconds_bucket` - End to end request latency measured in seconds.
+ - `vllm:prompt_tokens_total` - Prompt tokens.
+ - `vllm:generation_tokens_total` - Generation tokens.
+ - `vllm:time_per_output_token_seconds` - Inter-token latency (Time Per Output Token, TPOT) in seconds.
- `vllm:time_to_first_token_seconds` - Time to First Token (TTFT) latency in seconds.
- - `vllm:num_requests_running` (also, `_swapped` and `_waiting`) - Number of requests in RUNNING, WAITING, and SWAPPED state
+ - `vllm:num_requests_running` (also, `_swapped` and `_waiting`) - Number of requests in the RUNNING, WAITING, and SWAPPED states.
- `vllm:gpu_cache_usage_perc` - Percentage of used cache blocks by vLLM.
- - `vllm:request_prompt_tokens` - Request prompt length
- - `vllm:request_generation_tokens` - request generation length
- - `vllm:request_success_total` - Number of finished requests by their finish reason: either an EOS token was generated or the max sequence length was reached
- - `vllm:request_queue_time_seconds` - Queue Time
- - `vllm:request_prefill_time_seconds` - Requests Prefill Time
- - `vllm:request_decode_time_seconds` - Requests Decode Time
- - `vllm:request_max_num_generation_tokens` - Max Generation Token in Sequence Group
+ - `vllm:request_prompt_tokens` - Request prompt length.
+ - `vllm:request_generation_tokens` - Request generation length.
+ - `vllm:request_success_total` - Number of finished requests by their finish reason: either an EOS token was generated or the max sequence length was reached.
+ - `vllm:request_queue_time_seconds` - Queue time.
+ - `vllm:request_prefill_time_seconds` - Requests prefill time.
+ - `vllm:request_decode_time_seconds` - Requests decode time.
+ - `vllm:request_max_num_generation_tokens` - Max generation tokens in a sequence group.

See [the PR which added this Dashboard](gh-pr:2316) for interesting and useful background on the choices made here.
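
For illustration, a hedged sketch of how one of these dashboard panels could be reproduced programmatically against the Prometheus HTTP API, using the `requests` package; the Prometheus address and the exact PromQL are assumptions, not taken from the dashboard itself:

```python
import requests  # assumes a Prometheus server scraping vLLM is reachable locally

# P99 time-to-first-token over the last 5 minutes, aggregated across label values.
query = (
    "histogram_quantile(0.99, "
    "sum by (le) (rate(vllm:time_to_first_token_seconds_bucket[5m])))"
)
resp = requests.get("http://localhost:9090/api/v1/query", params={"query": query})
for result in resp.json()["data"]["result"]:
    print(result["metric"], result["value"])
```
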
@@ -103,7 +103,7 @@ In v0, metrics are collected in the engine core process and we use multi-process
### Built in Python/Process Metrics

- The following metrics are supported by default by `prometheus_client`, but the are not exposed with multiprocess mode is used:
+ The following metrics are supported by default by `prometheus_client`, but they are not exposed when multi-process mode is used:

- `python_gc_objects_collected_total`
- `python_gc_objects_uncollectable_total`
@@ -158,6 +158,7 @@ In v1, we wish to move computation and overhead out of the engine core
process to minimize the time between each forward pass.

The overall idea of V1 EngineCore design is:
+
- EngineCore is the inner loop. Performance is most critical here
- AsyncLLM is the outer loop. This is overlapped with GPU execution
(ideally), so this is where any "overheads" should be if
@@ -178,7 +179,7 @@ time" (`time.time()`) to calculate intervals as the former is
unaffected by system clock changes (e.g. from NTP).
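
A minimal sketch of the within-process pattern (the request-handling function is a hypothetical placeholder, not vLLM code):

```python
import time


def handle_request() -> None:
    """Hypothetical placeholder for the work being timed."""
    time.sleep(0.01)


start = time.monotonic()  # arbitrary reference point, but stable within this process
handle_request()
interval = time.monotonic() - start  # immune to NTP/wall-clock adjustments
print(f"interval: {interval:.3f}s")
```
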
It's also important to note that monotonic clocks differ between
- processes - each process has its own reference. point. So it is
+ processes - each process has its own reference point. So it is
meaningless to compare monotonic timestamps from different processes.

Therefore, in order to calculate an interval, we must compare two
@@ -343,14 +344,15 @@ vllm:time_to_first_token_seconds_bucket{le="0.1",model_name="meta-llama/Llama-3.
vllm:time_to_first_token_seconds_count{model_name="meta-llama/Llama-3.1-8B-Instruct"} 140.0
```

- Note - the choice of histogram buckets to be most useful to users
- across a broad set of use cases is not straightforward and will
- require refinement over time.
+ !!! note
+     The choice of histogram buckets to be most useful to users
+     across a broad set of use cases is not straightforward and will
+     require refinement over time.
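
To make the trade-off concrete, a hedged sketch of how such a histogram is declared with `prometheus_client`; the bucket boundaries here are illustrative, not vLLM's actual choices:

```python
from prometheus_client import Histogram

# Sketch only: bucket edges are fixed at declaration time, which is why choosing
# values that work for both small and large models/workloads is hard.
time_to_first_token = Histogram(
    "vllm:time_to_first_token_seconds",
    "Histogram of time to first token in seconds.",
    labelnames=["model_name"],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0],
)
```
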
### Cache Config Info

- `prometheus_client` has support for [Info
- metrics](https://prometheus.github.io/client_python/instrumenting/info/)
+ `prometheus_client` has support for
+ [Info metrics](https://prometheus.github.io/client_python/instrumenting/info/)
which are equivalent to a `Gauge` whose value is permanently set to 1,
but exposes interesting key/value pair information via labels. This is
used for information about an instance that does not change - so it
@@ -363,14 +365,11 @@ We use this concept for the `vllm:cache_config_info` metric:
# HELP vllm:cache_config_info Information of the LLMEngine CacheConfig
# TYPE vllm:cache_config_info gauge
vllm:cache_config_info{block_size="16",cache_dtype="auto",calculate_kv_scales="False",cpu_offload_gb="0",enable_prefix_caching="False",gpu_memory_utilization="0.9",...} 1.0
-
```

- However, `prometheus_client` has [never supported Info metrics in
- multiprocessing
- mode](https://github.com/prometheus/client_python/pull/300) - for
- [unclear
- reasons](gh-pr:7279#discussion_r1710417152). We
+ However, `prometheus_client` has
+ [never supported Info metrics in multiprocessing mode](https://github.com/prometheus/client_python/pull/300) -
+ for [unclear reasons](gh-pr:7279#discussion_r1710417152). We
simply use a `Gauge` metric set to 1 and
`multiprocess_mode="mostrecent"` instead.
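
A hedged sketch of that workaround (the label set is abbreviated and the declaration is illustrative, not vLLM's actual code):

```python
from prometheus_client import Gauge

# Sketch: emulate an Info metric with a Gauge pinned to 1 whose labels carry the
# key/value pairs; "mostrecent" keeps a single sample under multiprocess mode.
cache_config_info = Gauge(
    "vllm:cache_config_info",
    "Information of the LLMEngine CacheConfig",
    labelnames=["block_size", "cache_dtype", "gpu_memory_utilization"],
    multiprocess_mode="mostrecent",
)
cache_config_info.labels(
    block_size="16", cache_dtype="auto", gpu_memory_utilization="0.9"
).set(1)
```
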
@@ -395,11 +394,9 @@ distinguish between per-adapter counts. This should be revisited.
Note that `multiprocess_mode="livemostrecent"` is used - the most
recent metric is used, but only from currently running processes.

- This was added in
- <gh-pr:9477> and there is
- [at least one known
- user](https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/54). If
- we revisit this design and deprecate the old metric, we should reduce
+ This was added in <gh-pr:9477> and there is
+ [at least one known user](https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/54).
+ If we revisit this design and deprecate the old metric, we should reduce
the need for a significant deprecation period by making the change in
v0 also and asking this project to move to the new metric.
@@ -442,23 +439,20 @@ suddenly (from their perspective) when it is removed, even if there is
an equivalent metric for them to use.

As an example, see how `vllm:avg_prompt_throughput_toks_per_s` was
- [deprecated](gh-pr:2764) (with a
- comment in the code),
- [removed](gh-pr:12383), and then
- [noticed by a
- user](gh-issue:13218).
+ [deprecated](gh-pr:2764) (with a comment in the code),
+ [removed](gh-pr:12383), and then [noticed by a user](gh-issue:13218).

In general:

- 1) We should be cautious about deprecating metrics, especially since
+ 1. We should be cautious about deprecating metrics, especially since
it can be hard to predict the user impact.
- 2) We should include a prominent deprecation notice in the help string
+ 2. We should include a prominent deprecation notice in the help string
that is included in the `/metrics` output.
- 3) We should list deprecated metrics in user-facing documentation and
+ 3. We should list deprecated metrics in user-facing documentation and
release notes.
- 4) We should consider hiding deprecated metrics behind a CLI argument
- in order to give administrators [an escape
- hatch](https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/#show-hidden-metrics)
+ 4. We should consider hiding deprecated metrics behind a CLI argument
+ in order to give administrators
+ [an escape hatch](https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/#show-hidden-metrics)
for some time before deleting them.

See the [deprecation policy](../../contributing/deprecation_policy.md) for
@@ -474,15 +468,15 @@ removed.
The `vllm:time_in_queue_requests` Histogram metric was added by
<gh-pr:9659> and its calculation is:

- ```
+ ```python
self.metrics.first_scheduled_time = now
self.metrics.time_in_queue = now - self.metrics.arrival_time
```

Two weeks later, <gh-pr:4464> added `vllm:request_queue_time_seconds` leaving
us with:

- ```
+ ```python
if seq_group.is_finished():
    if (seq_group.metrics.first_scheduled_time is not None and
            seq_group.metrics.first_token_time is not None):
@@ -517,8 +511,7 @@ cache to complete other requests), we swap kv cache blocks out to CPU
memory. This is also known as "KV cache offloading" and is configured
with `--swap-space` and `--preemption-mode`.

- In v0, [vLLM has long supported beam
- search](gh-issue:6226). The
+ In v0, [vLLM has long supported beam search](gh-issue:6226). The
SequenceGroup encapsulated the idea of N Sequences which
all shared the same prompt kv blocks. This enabled KV cache block
sharing between requests, and copy-on-write to do branching. CPU
@@ -530,9 +523,8 @@ option than CPU swapping since blocks can be evicted slowly on demand
and the part of the prompt that was evicted can be recomputed.

SequenceGroup was removed in V1, although a replacement will be
- required for "parallel sampling" (`n>1`). [Beam search was moved out of
- the core (in
- V0)](gh-issue:8306). There was a
+ required for "parallel sampling" (`n>1`).
+ [Beam search was moved out of the core (in V0)](gh-issue:8306). There was a
lot of complex code for a very uncommon feature.

In V1, with prefix caching being better (zero overhead) and therefore
@@ -547,18 +539,18 @@ Some v0 metrics are only relevant in the context of "parallel
sampling". This is where the `n` parameter in a request is used to
request multiple completions from the same prompt.

- As part of adding parallel sampling support in <gh-pr:10980> we should
+ As part of adding parallel sampling support in <gh-pr:10980>, we should
also add these metrics.

- `vllm:request_params_n` (Histogram)

- Observes the value of the 'n' parameter of every finished request.
+ Observes the value of the 'n' parameter of every finished request.

- `vllm:request_max_num_generation_tokens` (Histogram)

- Observes the maximum output length of all sequences in every finished
- sequence group. In the absence of parallel sampling, this is
- equivalent to `vllm:request_generation_tokens`.
+ Observes the maximum output length of all sequences in every finished
+ sequence group. In the absence of parallel sampling, this is
+ equivalent to `vllm:request_generation_tokens`.
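
A hypothetical sketch of what observing these per finished request could look like (the helper, its arguments, and the label set are illustrative, not vLLM's actual code):

```python
from prometheus_client import Histogram

request_params_n = Histogram(
    "vllm:request_params_n",
    "Histogram of the n request parameter.",
    labelnames=["model_name"],
)
request_max_num_generation_tokens = Histogram(
    "vllm:request_max_num_generation_tokens",
    "Histogram of maximum number of requested generation tokens.",
    labelnames=["model_name"],
)


def observe_finished_request(model_name: str, n: int, output_lengths: list[int]) -> None:
    # One observation per finished request: its n, and the longest of its n outputs.
    request_params_n.labels(model_name=model_name).observe(n)
    request_max_num_generation_tokens.labels(model_name=model_name).observe(max(output_lengths))
```
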
### Speculative Decoding
@@ -576,26 +568,23 @@ There is a PR under review (<gh-pr:12193>) to add "prompt lookup (ngram)"
speculative decoding to v1. Other techniques will follow. We should
revisit the v0 metrics in this context.

- Note - we should probably expose acceptance rate as separate accepted
- and draft counters, like we do for prefix caching hit rate. Efficiency
- likely also needs similar treatment.
+ !!! note
+     We should probably expose acceptance rate as separate accepted
+     and draft counters, like we do for prefix caching hit rate. Efficiency
+     likely also needs similar treatment.
### Autoscaling and Load-balancing
A common use case for our metrics is to support automated scaling of
vLLM instances.

- For related discussion from the [Kubernetes Serving Working
- Group](https://github.com/kubernetes/community/tree/master/wg-serving),
+ For related discussion from the
+ [Kubernetes Serving Working Group](https://github.com/kubernetes/community/tree/master/wg-serving),
see:

- - [Standardizing Large Model Server Metrics in
- Kubernetes](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk)
- - [Benchmarking LLM Workloads for Performance Evaluation and
- Autoscaling in
- Kubernetes](https://docs.google.com/document/d/1k4Q4X14hW4vftElIuYGDu5KDe2LtV1XammoG-Xi3bbQ)
- - [Inference
- Perf](https://github.com/kubernetes-sigs/wg-serving/tree/main/proposals/013-inference-perf)
+ - [Standardizing Large Model Server Metrics in Kubernetes](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk)
+ - [Benchmarking LLM Workloads for Performance Evaluation and Autoscaling in Kubernetes](https://docs.google.com/document/d/1k4Q4X14hW4vftElIuYGDu5KDe2LtV1XammoG-Xi3bbQ)
+ - [Inference Perf](https://github.com/kubernetes-sigs/wg-serving/tree/main/proposals/013-inference-perf)
- <gh-issue:5041> and <gh-pr:12726>.

This is a non-trivial topic. Consider this comment from Rob:
@@ -619,19 +608,16 @@ should judge an instance as approaching saturation:
Our approach to naming metrics probably deserves to be revisited:

- 1. The use of colons in metric names seems contrary to ["colons are
- reserved for user defined recording
- rules"](https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels)
+ 1. The use of colons in metric names seems contrary to
+ ["colons are reserved for user defined recording rules"](https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels).
2. Most of our metrics follow the convention of ending with units, but
not all do.
3. Some of our metric names end with `_total`:

- ```
- If there is a suffix of `_total` on the metric name, it will be removed. When
- exposing the time series for counter, a `_total` suffix will be added. This is
- for compatibility between OpenMetrics and the Prometheus text format, as OpenMetrics
- requires the `_total` suffix.
- ```
+ If there is a suffix of `_total` on the metric name, it will be removed. When
+ exposing the time series for counter, a `_total` suffix will be added. This is
+ for compatibility between OpenMetrics and the Prometheus text format, as OpenMetrics
+ requires the `_total` suffix.
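
A hedged sketch of that behaviour (the metric name and help string are illustrative): whether a Counter is declared with or without the `_total` suffix, the exposed sample name carries `_total`.

```python
from prometheus_client import REGISTRY, Counter, generate_latest

# Sketch: prometheus_client strips a trailing `_total` from the declared name and
# re-adds it when exposing the counter sample in the text format.
requests_ok = Counter("vllm:request_success", "Count of successfully processed requests.")
requests_ok.inc()

exposition = generate_latest(REGISTRY).decode()
assert "vllm:request_success_total 1.0" in exposition
```
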
### Adding More Metrics
@@ -642,8 +628,7 @@ There is no shortage of ideas for new metrics:
- Proposals arising from specific use cases, like the Kubernetes
auto-scaling topic above
- Proposals that might arise out of standardisation efforts like
- [OpenTelemetry Semantic Conventions for Gen
- AI](https://github.com/open-telemetry/semantic-conventions/tree/main/docs/gen-ai).
+ [OpenTelemetry Semantic Conventions for Gen AI](https://github.com/open-telemetry/semantic-conventions/tree/main/docs/gen-ai).

We should be cautious in our approach to adding new metrics. While
metrics are often relatively straightforward to add:
@@ -668,18 +653,14 @@ fall under the more general heading of "Observability".
v0 has support for OpenTelemetry tracing:

- Added by <gh-pr:4687>
- - Configured with `--oltp-traces-endpoint` and
- `--collect-detailed-traces`
- - [OpenTelemetry blog
- post](https://opentelemetry.io/blog/2024/llm-observability/)
+ - Configured with `--oltp-traces-endpoint` and `--collect-detailed-traces`
+ - [OpenTelemetry blog post](https://opentelemetry.io/blog/2024/llm-observability/)
- [User-facing docs](../../examples/online_serving/opentelemetry.md)
- - [Blog
- post](https://medium.com/@ronen.schaffer/follow-the-trail-supercharging-vllm-with-opentelemetry-distributed-tracing-aa655229b46f)
- - [IBM product
- docs](https://www.ibm.com/docs/en/instana-observability/current?topic=mgaa-monitoring-large-language-models-llms-vllm-public-preview)
+ - [Blog post](https://medium.com/@ronen.schaffer/follow-the-trail-supercharging-vllm-with-opentelemetry-distributed-tracing-aa655229b46f)
+ - [IBM product docs](https://www.ibm.com/docs/en/instana-observability/current?topic=mgaa-monitoring-large-language-models-llms-vllm-public-preview)

- OpenTelemetry has a [Gen AI Working
- Group](https://github.com/open-telemetry/community/blob/main/projects/gen-ai.md).
+ OpenTelemetry has a
+ [Gen AI Working Group](https://github.com/open-telemetry/community/blob/main/projects/gen-ai.md).

Since metrics is a big enough topic on its own, we are going to tackle
the topic of tracing in v1 separately.
@@ -698,7 +679,7 @@ These metrics are only enabled when OpenTelemetry tracing is enabled
and if `--collect-detailed-traces=all/model/worker` is used. The
documentation for this option states:

- > collect detailed traces for the specified " modules. This involves
+ > collect detailed traces for the specified modules. This involves
> use of possibly costly and or blocking operations and hence might
> have a performance impact.