@@ -223,6 +223,20 @@ curl localhost:8002/metrics
 VLLM stats are reported by the metrics endpoint in fields that are prefixed with
 `vllm:`. Triton currently supports reporting of the following metrics from vLLM.
 ```bash
+# Number of requests currently running on GPU.
+gauge_scheduler_running
+# Number of requests waiting to be processed.
+gauge_scheduler_waiting
+# Number of requests swapped to CPU.
+gauge_scheduler_swapped
+# GPU KV-cache usage. 1 means 100 percent usage.
+gauge_gpu_cache_usage
+# CPU KV-cache usage. 1 means 100 percent usage.
+gauge_cpu_cache_usage
+# CPU prefix cache block hit rate.
+gauge_cpu_prefix_cache_hit_rate
+# GPU prefix cache block hit rate.
+gauge_gpu_prefix_cache_hit_rate
 # Number of prefill tokens processed.
 counter_prompt_tokens
 # Number of generation tokens processed.
@@ -253,20 +267,6 @@ histogram_num_generation_tokens_request
 histogram_best_of_request
 # Histogram of the n request parameter.
 histogram_n_request
-# Number of requests currently running on GPU.
-gauge_scheduler_running
-# Number of requests waiting to be processed.
-gauge_scheduler_waiting
-# Number of requests swapped to CPU.
-gauge_scheduler_swapped
-# GPU KV-cache usage. 1 means 100 percent usage.
-gauge_gpu_cache_usage
-# CPU KV-cache usage. 1 means 100 percent usage.
-gauge_cpu_cache_usage
-# CPU prefix cache block hit rate.
-gauge_cpu_prefix_cache_hit_rate
-# GPU prefix cache block hit rate.
-gauge_gpu_prefix_cache_hit_rate
 ```
 Your output for these fields should look similar to the following:
 ```bash