Support vllm:max_num_generation_tokens metrics (#250)
* Add initial support for the vllm:max_num_generation_tokens metric. Since the simulator never returns responses with more than one choice, the implementation is basic: the metric currently reports the same values as vllm:request_generation_tokens. Once the 'n' request property is supported, this needs to change to report the real maximum across choices. Support was also added to the fake metrics, along with tests.
Signed-off-by: Maya Barnea <[email protected]>
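A minimal sketch of how such a histogram could be registered and observed with prometheus/client_golang; the variable names, label set, and bucket boundaries are illustrative assumptions, not the simulator's actual definitions:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical histogram for the new metric; labels and bucket
// boundaries are illustrative, not the simulator's real ones.
var maxNumGenerationTokens = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "vllm:max_num_generation_tokens",
		Help:    "Histogram of maximum number of requested generation tokens.",
		Buckets: []float64{1, 2, 5, 10, 20, 50, 100, 200, 500, 1000},
	},
	[]string{"model_name"},
)

func init() {
	prometheus.MustRegister(maxNumGenerationTokens)
}

// reportGenerationTokens records the generation-token count for one request.
// With a single choice per response (n == 1), the maximum across choices
// equals the count itself, which is why both metrics currently observe
// the same value.
func reportGenerationTokens(model string, generated int) {
	maxNumGenerationTokens.WithLabelValues(model).Observe(float64(generated))
}
```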
* update readme
Signed-off-by: Maya Barnea <[email protected]>
* Fix and extend the 'Simulator configuration' test: add the expected error messages to the assertions, and fix the arguments in the invalid-configuration test cases.
Fix validation of the ttft and tpot fake-metric definitions.
Signed-off-by: Maya Barnea <[email protected]>
* Change zmq-max-connect-attempts to an int to get a nicer error message on invalid input, fix the invalid-LoRA test in the configuration suite, and add missing comments
Signed-off-by: Maya Barnea <[email protected]>
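Parsing the value as an integer lets the flag library itself reject non-numeric input with a clear message, which is also the kind of message the extended configuration tests can assert on. A sketch using Go's standard flag package; the flag name comes from the commit, while the default value and surrounding code are assumptions:

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

func main() {
	// Declared as an int, so `flag` rejects bad input up front with:
	//   invalid value "abc" for flag -zmq-max-connect-attempts: parse error
	// (the default of 0 here is an assumption, not the simulator's default)
	attempts := flag.Int("zmq-max-connect-attempts", 0,
		"maximum number of ZMQ connection attempts")
	flag.Parse()

	if *attempts < 0 {
		fmt.Fprintln(os.Stderr, "zmq-max-connect-attempts must be non-negative")
		os.Exit(1)
	}
	fmt.Printf("will retry ZMQ connect up to %d times\n", *attempts)
}
```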
* fix typo
Signed-off-by: Maya Barnea <[email protected]>
* fix typo
Signed-off-by: Maya Barnea <[email protected]>
* fix compilation error
Signed-off-by: Maya Barnea <[email protected]>
---------
Signed-off-by: Maya Barnea <[email protected]>
README.md (2 additions, 0 deletions)

@@ -34,6 +34,7 @@ In addition, it supports a subset of vLLM's Prometheus metrics. These metrics ar
 | vllm:time_to_first_token_seconds| Histogram of time to first token in seconds |
 | vllm:time_per_output_token_seconds| Histogram of time per output token in seconds |
 | vllm:request_generation_tokens| Number of generation tokens processed |
+| vllm:max_num_generation_tokens| Maximum number of requested generation tokens. Currently the same as `vllm:request_generation_tokens`, since only one choice is ever returned |
 | vllm:request_params_max_tokens| Histogram of the max_tokens request parameter |
 | vllm:request_prompt_tokens| Number of prefill tokens processed |
 | vllm:request_success_total| Count of successfully processed requests |
@@ -235,6 +236,7 @@ For more details see the <a href="https://docs.vllm.ai/en/stable/getting_started
 960.0, 1920.0, 7680.0, +Inf.
 - `request-prompt-tokens` - array of values for prompt-length buckets
 - `request-generation-tokens` - array of values for generation-length buckets
+- `request-max-generation-tokens` - array of values for max_num_generation_tokens buckets
 - `request-params-max-tokens` - array of values for max_tokens parameter buckets
 - `request-success-total` - number of successful requests per finish reason, key: finish-reason (stop, length, etc.).
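To illustrate how the new key fits alongside the existing ones, here is a sketch of a fake-metrics configuration fragment; only the key names come from the README excerpt above, while the `fake-metrics` wrapper, the file structure, and all values are assumptions for illustration:

```yaml
# Hypothetical fake-metrics fragment; keys from the README, values invented.
fake-metrics:
  request-prompt-tokens: [10, 20, 30, 15]
  request-generation-tokens: [50, 60, 40]
  request-max-generation-tokens: [60, 70, 80]
  request-params-max-tokens: [128, 256, 512]
  request-success-total:
    stop: 20
    length: 2
```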