@@ -203,7 +203,7 @@ you need to specify a different `shm-region-prefix-name` for each server. See
for more information.

## Triton Metrics
-Starting with the 24.08 release of Triton, users can now obtain partial
+Starting with the 24.08 release of Triton, users can now obtain specific
vLLM metrics by querying the Triton metrics endpoint (see complete vLLM metrics
[here](https://docs.vllm.ai/en/latest/serving/metrics.html)). This can be
accomplished by launching a Triton server in any of the ways described above
@@ -213,16 +213,42 @@ the following:
```bash
curl localhost:8002/metrics
```
-VLLM stats are reported by the metrics endpoint in fields that
-are prefixed with `vllm:`. Your output for these fields should look
-similar to the following:
+vLLM stats are reported by the metrics endpoint in fields that are prefixed with
+`vllm:`. Triton currently supports reporting the following metrics from vLLM:
+```bash
+# Number of prefill tokens processed.
+counter_prompt_tokens
+# Number of generation tokens processed.
+counter_generation_tokens
+# Histogram of time to first token in seconds.
+histogram_time_to_first_token
+# Histogram of time per output token in seconds.
+histogram_time_per_output_token
+```
+Your output for these fields should look similar to the following:
```bash
# HELP vllm:prompt_tokens_total Number of prefill tokens processed.
# TYPE vllm:prompt_tokens_total counter
vllm:prompt_tokens_total{model="vllm_model",version="1"} 10
# HELP vllm:generation_tokens_total Number of generation tokens processed.
# TYPE vllm:generation_tokens_total counter
vllm:generation_tokens_total{model="vllm_model",version="1"} 16
+# HELP vllm:time_to_first_token_seconds Histogram of time to first token in seconds.
+# TYPE vllm:time_to_first_token_seconds histogram
+vllm:time_to_first_token_seconds_count{model="vllm_model",version="1"} 1
+vllm:time_to_first_token_seconds_sum{model="vllm_model",version="1"} 0.03233122825622559
+vllm:time_to_first_token_seconds_bucket{model="vllm_model",version="1",le="0.001"} 0
+vllm:time_to_first_token_seconds_bucket{model="vllm_model",version="1",le="0.005"} 0
+...
+vllm:time_to_first_token_seconds_bucket{model="vllm_model",version="1",le="+Inf"} 1
+# HELP vllm:time_per_output_token_seconds Histogram of time per output token in seconds.
+# TYPE vllm:time_per_output_token_seconds histogram
+vllm:time_per_output_token_seconds_count{model="vllm_model",version="1"} 15
+vllm:time_per_output_token_seconds_sum{model="vllm_model",version="1"} 0.04501533508300781
+vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="0.01"} 14
+vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="0.025"} 15
+...
+vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="+Inf"} 15
```
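To pull out just the vLLM fields from the full report, a simple filter along these
lines works (a minimal sketch, assuming the server is running locally with the
default metrics port shown above):
```bash
# Fetch the metrics report and keep only the samples carrying the vllm: prefix;
# the "# HELP" and "# TYPE" comment lines are dropped by the anchored match.
curl -s localhost:8002/metrics | grep "^vllm:"
```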
To enable metrics collection in the vLLM engine, the "disable_log_stats" option needs
to be either set to false or left unset (false by default) in
[model.json](https://github.com/triton-inference-server/vllm_backend/blob/main/samples/model_repository/vllm_model/1/model.json).
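For reference, one way such a model.json could look (a sketch only: the model name,
the file path, and the gpu_memory_utilization value are illustrative assumptions,
not values prescribed by this document):
```bash
# Write a hypothetical minimal model.json; every engine argument except
# "disable_log_stats" is an illustrative placeholder. Leaving
# "disable_log_stats" false (or omitting it) keeps vLLM stats flowing
# to the Triton metrics endpoint.
cat > model_repository/vllm_model/1/model.json <<'EOF'
{
    "model": "facebook/opt-125m",
    "disable_log_stats": false,
    "gpu_memory_utilization": 0.5
}
EOF
```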