- Create constants for all metrics
- Define all latency-related fake metrics in the config
- Add validation for the new fake metrics in the config
Signed-off-by: Maya Barnea <[email protected]>
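The validation step for the new fake metrics could look like the following minimal sketch. The function name, the `vllm:` prefix check, and the bucket/count schema are illustrative assumptions, not the simulator's actual code:

```python
# Hypothetical validation of one fake histogram metric from the config.
# The schema (bucket bounds + per-bucket counts) is an assumption for
# illustration; it is not the simulator's actual config format.

def validate_fake_histogram(name, buckets, counts):
    """Check that bucket bounds are strictly increasing and that there is
    one non-negative integer count per bucket, plus one for +Inf."""
    if not name.startswith("vllm:"):
        raise ValueError(f"unexpected metric name: {name}")
    if any(hi <= lo for lo, hi in zip(buckets, buckets[1:])):
        raise ValueError(f"{name}: bucket bounds must be strictly increasing")
    if len(counts) != len(buckets) + 1:  # the extra count is the +Inf bucket
        raise ValueError(f"{name}: expected len(buckets)+1 counts")
    if any(not isinstance(c, int) or c < 0 for c in counts):
        raise ValueError(f"{name}: counts must be non-negative integers")
    return True
```

Validating at config-load time means a malformed fake metric fails fast instead of producing a silently inconsistent `/metrics` endpoint.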
README.md: 12 additions, 1 deletion
```diff
@@ -26,7 +26,18 @@ In addition, it supports a subset of vLLM's Prometheus metrics. These metrics ar
 | vllm:lora_requests_info| Running stats on LoRA requests |
 | vllm:num_requests_running| Number of requests currently running on GPU |
 | vllm:num_requests_waiting| Prometheus metric for the number of queued requests |
-
+| vllm:e2e_request_latency_seconds| Histogram of end to end request latency in seconds |
+| vllm:request_inference_time_seconds| Histogram of time spent in RUNNING phase for request |
+| vllm:request_queue_time_seconds| Histogram of time spent in WAITING phase for request |
+| vllm:request_prefill_time_seconds| Histogram of time spent in PREFILL phase for request |
+| vllm:request_decode_time_seconds| Histogram of time spent in DECODE phase for request |
+| vllm:time_to_first_token_seconds| Histogram of time to first token in seconds |
+| vllm:time_per_output_token_seconds| Histogram of time per output token in seconds |
+| vllm:request_generation_tokens| Number of generation tokens processed |
+| vllm:request_params_max_tokens| Histogram of the max_tokens request parameter |
+| vllm:request_prompt_tokens| Number of prefill tokens processed |
+| vllm:request_success_total| Count of successfully processed requests |
+
 The simulated inference has no connection with the model and LoRA adapters specified in the command line parameters or via the /v1/load_lora_adapter HTTP REST endpoint. The /v1/models endpoint returns simulated results based on those same command line parameters and those loaded via the /v1/load_lora_adapter HTTP REST endpoint.
```
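The histogram metrics added above use Prometheus's cumulative-bucket exposition format: each `_bucket{le="..."}` sample counts all observations less than or equal to that bound. A minimal stdlib-only Python sketch of rendering one such histogram (illustrative only, not the simulator's implementation; the function name is made up):

```python
import bisect

def expose_histogram(name, buckets, observations):
    """Render a Prometheus-style cumulative histogram exposition for a list
    of observed values (e.g. per-request latencies in seconds)."""
    obs = sorted(observations)
    lines = []
    for le in buckets:
        # Cumulative count: observations <= this upper bound.
        lines.append(f'{name}_bucket{{le="{le}"}} {bisect.bisect_right(obs, le)}')
    # The +Inf bucket always equals the total observation count.
    lines.append(f'{name}_bucket{{le="+Inf"}} {len(obs)}')
    lines.append(f'{name}_sum {sum(observations)}')
    lines.append(f'{name}_count {len(observations)}')
    return "\n".join(lines)
```

Because buckets are cumulative, each bucket's value is monotonically non-decreasing and the `+Inf` bucket always matches `_count`, which is what Prometheus's `histogram_quantile` relies on.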