Conversation

@googs1025
Contributor

@googs1025 googs1025 commented Sep 15, 2025

Add several new Prometheus metrics:

  • vllm:request_success_total
  • vllm:request_params_max_tokens
  • vllm:request_prompt_tokens
  • vllm:request_generation_tokens

part of: #191
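
For reference, below is a rough sketch (not the actual code in this PR) of how these metrics could be declared and recorded with prometheus/client_golang. The metric and label names are taken from the /metrics output shown later in this thread; the variable names, bucket boundaries, and the recordRequest helper are illustrative assumptions only.

// Sketch only: metric and label names come from the PR's /metrics output;
// everything else (buckets, helper names, call sites) is an assumption.
package main

import "github.com/prometheus/client_golang/prometheus"

var (
	tokenBuckets = []float64{1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000}

	requestSuccessTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "vllm:request_success_total",
		Help: "Total number of successful inference requests.",
	}, []string{"finish_reason", "model_name"})

	requestPromptTokens = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "vllm:request_prompt_tokens",
		Help:    "Number of input prompt tokens in the request.",
		Buckets: tokenBuckets,
	}, []string{"model_name"})

	requestGenerationTokens = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "vllm:request_generation_tokens",
		Help:    "Number of generated tokens so far in the request.",
		Buckets: tokenBuckets,
	}, []string{"model_name"})

	requestParamsMaxTokens = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "vllm:request_params_max_tokens",
		Help:    "The 'max_tokens' parameter from the request.",
		Buckets: tokenBuckets,
	}, []string{"model_name"})
)

// recordRequest is a hypothetical helper showing where the metrics would be
// updated once a request finishes.
func recordRequest(model, finishReason string, promptTokens, genTokens, maxTokens int) {
	requestSuccessTotal.WithLabelValues(finishReason, model).Inc()
	requestPromptTokens.WithLabelValues(model).Observe(float64(promptTokens))
	requestGenerationTokens.WithLabelValues(model).Observe(float64(genTokens))
	if maxTokens > 0 {
		requestParamsMaxTokens.WithLabelValues(model).Observe(float64(maxTokens))
	}
}

func main() {
	prometheus.MustRegister(requestSuccessTotal, requestPromptTokens, requestGenerationTokens, requestParamsMaxTokens)
	recordRequest("meta-llama/Llama-3.1-8B-Instruct", "stop", 2, 21, 0)
}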

@googs1025
Contributor Author

Local test:

~ curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
{"id":"chatcmpl-ca971b3c-3613-40a2-8b6d-68c3f926d9d9","created":1757937648,"model":"meta-llama/Llama-3.1-8B-Instruct","usage":{"prompt_tokens":2,"completion_tokens":21,"total_tokens":23},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"Today it is partially cloudy and raining. I am your AI assistant, how can I help you today?"}}]}%
➜  ~ curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
{"id":"chatcmpl-ddd78d23-401d-4268-b73f-b8610e126e38","created":1757937648,"model":"meta-llama/Llama-3.1-8B-Instruct","usage":{"prompt_tokens":2,"completion_tokens":49,"total_tokens":51},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"Testing@, #testing 1$ ,2%,3^, [4\u0026*5], 6~, 7-_ + (8 : 9) / \\ \u003c \u003e . I am your AI assistant, how can I help you "}}]}%
➜  ~ curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
{"id":"chatcmpl-5b7d2524-4bf4-4d3c-a262-6cd45233f6a5","created":1757937649,"model":"meta-llama/Llama-3.1-8B-Instruct","usage":{"prompt_tokens":2,"completion_tokens":29,"total_tokens":31},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"Testing@, #testing 1$ ,2%,3^, [4\u0026*5], 6~, 7-_ + ("}}]}%
➜  ~ curl http://localhost:8000/metrics
# HELP vllm:gpu_cache_usage_perc Prometheus metric for the fraction of KV-cache blocks currently in use (from 0 to 1).
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0
# HELP vllm:lora_requests_info Running stats on lora requests.
# TYPE vllm:lora_requests_info gauge
vllm:lora_requests_info{max_lora="2",running_lora_adapters="",waiting_lora_adapters=""} 1.757937635e+09
# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0
# HELP vllm:num_requests_waiting Prometheus metric for the number of queued requests.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0
# HELP vllm:request_generation_tokens Number of generated tokens so far in the request.
# TYPE vllm:request_generation_tokens histogram
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="1"} 0
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="2"} 0
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="5"} 0
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="10"} 0
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="20"} 0
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="50"} 3
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="100"} 3
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="200"} 3
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="500"} 3
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="1000"} 3
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="2000"} 3
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="5000"} 3
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="10000"} 3
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="+Inf"} 3
vllm:request_generation_tokens_sum{model_name="meta-llama/Llama-3.1-8B-Instruct"} 99
vllm:request_generation_tokens_count{model_name="meta-llama/Llama-3.1-8B-Instruct"} 3
# HELP vllm:request_prompt_tokens Number of input prompt tokens in the request.
# TYPE vllm:request_prompt_tokens histogram
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="1"} 0
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="2"} 3
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="5"} 3
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="10"} 3
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="20"} 3
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="50"} 3
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="100"} 3
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="200"} 3
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="500"} 3
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="1000"} 3
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="2000"} 3
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="5000"} 3
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="10000"} 3
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="+Inf"} 3
vllm:request_prompt_tokens_sum{model_name="meta-llama/Llama-3.1-8B-Instruct"} 6
vllm:request_prompt_tokens_count{model_name="meta-llama/Llama-3.1-8B-Instruct"} 3
# HELP vllm:request_success_total Total number of successful inference requests.
# TYPE vllm:request_success_total counter
vllm:request_success_total{finish_reason="stop",model_name="meta-llama/Llama-3.1-8B-Instruct"} 3

@googs1025 googs1025 marked this pull request as draft September 15, 2025 12:09
@googs1025
Contributor Author

~ curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello!ssssssssssssssssssssssssssssssssssssssssssssssssssss"}
    ],
    "max_completion_tokens": 1000
  }'
{"id":"chatcmpl-fc538625-ece7-46c0-8705-51f945f88d2d","created":1757940432,"model":"meta-llama/Llama-3.1-8B-Instruct","usage":{"prompt_tokens":3,"completion_tokens":24,"total_tokens":27},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"Testing@, #testing 1$ ,2%,3^, [4\u0026*5], 6~, "}}]}%
➜  ~ curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello!ssssssssssssssssssssssssssssssssssssssssssssssssssss"}
    ],
    "max_completion_tokens": 1000
  }'
{"id":"chatcmpl-44910f93-7272-495f-b42e-3e55ac637bc4","created":1757940433,"model":"meta-llama/Llama-3.1-8B-Instruct","usage":{"prompt_tokens":3,"completion_tokens":987,"total_tokens":990},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"I am your AI assistant, how can I help you today? Alas, poor Yorick! I knew him, Horatio: A fellow of infinite jest Today it is partially cloudy and raining. Today is a nice sunny day. The rest is silence.  To be or not to be that is the question. Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime Today is a nice sunny day. Testing, testing 1,2,3. The rest is silence.  The rest is silence.  Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime To be or not to be that is the question. I am your AI assistant, how can I help you today? I am fine, how are you today? The rest is silence.  To be or not to be that is the question. The temperature here is twenty-five degrees centigrade. I am your AI assistant, how can I help you today? To be or not to be that is the question. I am fine, how are you today? I am fine, how are you today? Testing, testing 1,2,3. Testing@, #testing 1$ ,2%,3^, [4\u0026*5], 6~, 7-_ + (8 : 9) / \\ \u003c \u003e . Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime Today it is partially cloudy and raining. The temperature here is twenty-five degrees centigrade. I am fine, how are you today? Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime Testing@, #testing 1$ ,2%,3^, [4\u0026*5], 6~, 7-_ + (8 : 9) / \\ \u003c \u003e . I am fine, how are you today? To be or not to be that is the question. Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime Today it is partially cloudy and raining. Testing@, #testing 1$ ,2%,3^, [4\u0026*5], 6~, 7-_ + (8 : 9) / \\ \u003c \u003e . Alas, poor Yorick! I knew him, Horatio: A fellow of infinite jest I am your AI assistant, how can I help you today? To be or not to be that is the question. The rest is silence.  Alas, poor Yorick! I knew him, Horatio: A fellow of infinite jest Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime I am fine, how are you today? The rest is silence.  To be or not to be that is the question. Testing, testing 1,2,3. Testing, testing 1,2,3. I am fine, how are you today? The rest is silence.  The rest is silence.  Today is a nice sunny day. Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime The temperature here is twenty-five degrees centigrade. I am your AI assistant, how can I help you today? I am fine, how are you today? To be or not to be that is the question. Testing, testing 1,2,3. Testing, testing 1,2,3. Today is a nice sunny day. Alas, poor Yorick! I knew him, Horatio: A fellow of infinite jest I am fine, how are you today? I am your AI assistant, how can I help you today? Alas, poor Yorick! I knew him, Horatio: A fellow of infinite jest Alas, poor Yorick! I knew him, Horatio: A fellow of infinite jest Today it is partially cloudy and raining. Testing@, #testing 1$ ,2%,3^, [4\u0026*5], 6~, 7-_ + (8 : 9) / \\ \u003c \u003e . 
Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime Alas, poor Yorick! I knew him, Horatio: A fellow of infinite jest Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime Today it is partially cloudy and raining. Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime Testing, testing 1,2,3."}}]}%
➜  ~ curl http://localhost:8000/metrics
# HELP vllm:gpu_cache_usage_perc Prometheus metric for the fraction of KV-cache blocks currently in use (from 0 to 1).
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0
# HELP vllm:lora_requests_info Running stats on lora requests.
# TYPE vllm:lora_requests_info gauge
vllm:lora_requests_info{max_lora="2",running_lora_adapters="",waiting_lora_adapters=""} 1.757937635e+09
# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0
# HELP vllm:num_requests_waiting Prometheus metric for the number of queued requests.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0
# HELP vllm:request_generation_tokens Number of generated tokens so far in the request.
# TYPE vllm:request_generation_tokens histogram
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="1"} 0
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="2"} 0
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="5"} 1
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="10"} 1
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="20"} 3
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="50"} 10
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="100"} 12
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="200"} 12
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="500"} 12
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="1000"} 20
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="2000"} 20
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="5000"} 20
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="10000"} 20
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="+Inf"} 20
vllm:request_generation_tokens_sum{model_name="meta-llama/Llama-3.1-8B-Instruct"} 8310
vllm:request_generation_tokens_count{model_name="meta-llama/Llama-3.1-8B-Instruct"} 20
# HELP vllm:request_params_max_tokens The 'max_tokens' parameter from the request.
# TYPE vllm:request_params_max_tokens histogram
vllm:request_params_max_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="1"} 0
vllm:request_params_max_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="2"} 0
vllm:request_params_max_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="5"} 1
vllm:request_params_max_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="10"} 1
vllm:request_params_max_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="20"} 1
vllm:request_params_max_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="50"} 1
vllm:request_params_max_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="100"} 1
vllm:request_params_max_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="200"} 1
vllm:request_params_max_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="500"} 1
vllm:request_params_max_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="1000"} 13
vllm:request_params_max_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="2000"} 13
vllm:request_params_max_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="5000"} 13
vllm:request_params_max_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="10000"} 13
vllm:request_params_max_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="+Inf"} 13
vllm:request_params_max_tokens_sum{model_name="meta-llama/Llama-3.1-8B-Instruct"} 12003
vllm:request_params_max_tokens_count{model_name="meta-llama/Llama-3.1-8B-Instruct"} 13
# HELP vllm:request_prompt_tokens Number of input prompt tokens in the request.
# TYPE vllm:request_prompt_tokens histogram
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="1"} 0
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="2"} 3
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="5"} 20
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="10"} 20
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="20"} 20
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="50"} 20
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="100"} 20
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="200"} 20
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="500"} 20
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="1000"} 20
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="2000"} 20
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="5000"} 20
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="10000"} 20
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="+Inf"} 20
vllm:request_prompt_tokens_sum{model_name="meta-llama/Llama-3.1-8B-Instruct"} 57
vllm:request_prompt_tokens_count{model_name="meta-llama/Llama-3.1-8B-Instruct"} 20
# HELP vllm:request_success_total Total number of successful inference requests.
# TYPE vllm:request_success_total counter
vllm:request_success_total{finish_reason="length",model_name="meta-llama/Llama-3.1-8B-Instruct"} 4
vllm:request_success_total{finish_reason="stop",model_name="meta-llama/Llama-3.1-8B-Instruct"} 16
➜  ~ curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello!ssssssssssssssssssssssssssssssssssssssssssssssssssss"}
    ],
    "max_token": 1000
  }'
{"id":"chatcmpl-287a5371-aa6e-4613-9d40-c529e4c1988f","created":1757940456,"model":"meta-llama/Llama-3.1-8B-Instruct","usage":{"prompt_tokens":3,"completion_tokens":52,"total_tokens":55},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"The temperature here is twenty-five degrees centigrade. The rest is silence.  I am your AI assistant, how can I help you today? The temperature here is twenty-five degrees centigrade. Today it is partially cloudy and raining. Alas, poor Yorick! I "}}]}%
➜  ~ curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello!ssssssssssssssssssssssssssssssssssssssssssssssssssss"}
    ],
    "max_token": 1000
  }'
{"id":"chatcmpl-7928a5be-d9ef-4b00-a70c-6f8c5a5cc0ce","created":1757940457,"model":"meta-llama/Llama-3.1-8B-Instruct","usage":{"prompt_tokens":3,"completion_tokens":25,"total_tokens":28},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"Today it is partially cloudy and raining. Alas, poor Yorick! I knew him, Horatio: A fellow of infinite jest To "}}]}%
➜  ~ curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello!ssssssssssssssssssssssssssssssssssssssssssssssssssss"}
    ],
    "max_token": 1000
  }'
{"id":"chatcmpl-5b4fe134-c142-47cd-afdd-c0091c2d5a2d","created":1757940458,"model":"meta-llama/Llama-3.1-8B-Instruct","usage":{"prompt_tokens":3,"completion_tokens":12,"total_tokens":15},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"The rest is silence.  Alas, poor Yorick! I knew "}}]}%
➜  ~ curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello!ssssssssssssssssssssssssssssssssssssssssssssssssssss"}
    ],
    "max_token": 1000
  }'
{"id":"chatcmpl-962297e6-cf4c-481f-9775-44d111555445","created":1757940459,"model":"meta-llama/Llama-3.1-8B-Instruct","usage":{"prompt_tokens":3,"completion_tokens":41,"total_tokens":44},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"To be or not to be that is the question. The temperature here is twenty-five degrees centigrade. Testing@, #testing 1$ ,2%,3^, [4\u0026*5]"}}]}%
➜  ~ curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello!ssssssssssssssssssssssssssssssssssssssssssssssssssss"}
    ],
    "max_token": 1000
  }'
{"id":"chatcmpl-81ddaf0a-119d-4771-8995-3558274c3a74","created":1757940459,"model":"meta-llama/Llama-3.1-8B-Instruct","usage":{"prompt_tokens":3,"completion_tokens":28,"total_tokens":31},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"Testing, testing 1,2,3. The temperature here is twenty-five degrees centigrade. The temperature here is twenty-five degrees centigrade"}}]}%
➜  ~ curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello!ssssssssssssssssssssssssssssssssssssssssssssssssssss"}
    ],
    "max_token": 1000
  }'
{"id":"chatcmpl-202ec739-1304-4cdb-aeb4-eea2b81559bb","created":1757940460,"model":"meta-llama/Llama-3.1-8B-Instruct","usage":{"prompt_tokens":3,"completion_tokens":41,"total_tokens":44},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"To be or not to be that is the question. I am fine, how are you today? Give a man a fish and you feed him for a day; teach a man to fish and you feed "}}]}%
➜  ~ curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello!ssssssssssssssssssssssssssssssssssssssssssssssssssss"}
    ],
    "max_token": 1000
  }'
{"id":"chatcmpl-cef4a697-a00c-4e8c-bfdd-66bc4325c3bc","created":1757940460,"model":"meta-llama/Llama-3.1-8B-Instruct","usage":{"prompt_tokens":3,"completion_tokens":40,"total_tokens":43},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"The temperature here is twenty-five degrees centigrade. Today it is partially cloudy and raining. Testing, testing 1,2,3. Testing@, #testing 1$ ,2%,3^"}}]}%

@googs1025 googs1025 marked this pull request as ready for review September 15, 2025 12:52
Collaborator

@mayabar mayabar left a comment

@googs1025 thank you for your contribution, I’ve left some comments for your review.

@irar2
Collaborator

irar2 commented Sep 16, 2025

@googs1025 Please also add the new metrics to the fake metrics, and set both fake and real initial values in setInitialPrometheusMetrics().
Also, please add tests for the new metrics.

@googs1025 googs1025 force-pushed the add_metrics branch 2 times, most recently from f562dff to d2d61c7 Compare September 16, 2025 11:06
@googs1025
Contributor Author

@googs1025 Please also add the new metrics to the fake metrics, and set both fake and real initial values in setInitialPrometheusMetrics(). Also, please add tests for the new metrics.

thanks 😄
I'm using a Mac and I'm running llm-d-inference-sim locally. There seem to be some issues with installing zmq. I'll add some test cases after resolving these issues.

@mayabar
Collaborator

mayabar commented Sep 17, 2025

I'm using a Mac and I'm running llm-d-inference-sim locally. There seem to be some issues with installing zmq. I'll add some test cases after resolving these issues.

We also work on Macs; we installed zmq using Homebrew.

@googs1025 googs1025 force-pushed the add_metrics branch 10 times, most recently from bc7fbe1 to a6edf2d Compare September 20, 2025 14:20
@irar2
Collaborator

irar2 commented Sep 21, 2025

@googs1025

The CI failure is due to a failing unit test:
--- FAIL: TestBuild125Buckets (0.00s)
--- FAIL: TestBuild125Buckets/max_value_zero (0.00s)
metrics_test.go:680: build125Buckets(0) = [], want []
--- FAIL: TestBuild125Buckets/max_value_4096 (0.00s)
metrics_test.go:680: build125Buckets(4096) = [1 2 5 10 20 50 100 200 500 1000 2000], want [1 2 5 10 20 50 100 200 500 1000 2000 4000]
--- FAIL: TestBuild125Buckets/max_value_32768 (0.00s)
metrics_test.go:680: build125Buckets(32768) = [1 2 5 10 20 50 100 200 500 1000 2000 5000 10000 20000], want [1 2 5 10 20 50 100 200 500 1000 2000 5000 10000 20000 30000]
--- FAIL: TestBuild125Buckets/max_value_negative (0.00s)
metrics_test.go:680: build125Buckets(-1) = [], want []
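
For context, here is a minimal sketch of a 1-2-5 bucket builder that is consistent with the expected values in these test cases. The rule for the final bucket (append the max value rounded down to one significant digit, e.g. 4000 for 4096 or 30000 for 32768) is inferred from the test output above, not taken from the repository's actual implementation.

// Sketch inferred from the TestBuild125Buckets expectations above; the final-bucket
// rule and the function name are assumptions, not the repository's code.
package main

import "fmt"

func build125Buckets(maxValue int) []float64 {
	buckets := []float64{}
	if maxValue <= 0 {
		return buckets
	}
	// Emit 1, 2, 5, 10, 20, 50, ... while the boundary does not exceed maxValue.
	mantissas := []int{1, 2, 5}
	for exp := 1; ; exp *= 10 {
		done := false
		for _, m := range mantissas {
			b := m * exp
			if b > maxValue {
				done = true
				break
			}
			buckets = append(buckets, float64(b))
		}
		if done {
			break
		}
	}
	// Round maxValue down to one significant digit (4096 -> 4000, 32768 -> 30000)
	// and append it as the last boundary if it is larger than what we already have.
	scale := 1
	for maxValue/scale >= 10 {
		scale *= 10
	}
	rounded := (maxValue / scale) * scale
	if len(buckets) == 0 || float64(rounded) > buckets[len(buckets)-1] {
		buckets = append(buckets, float64(rounded))
	}
	return buckets
}

func main() {
	fmt.Println(build125Buckets(4096))  // [1 2 5 10 20 50 100 200 500 1000 2000 4000]
	fmt.Println(build125Buckets(32768)) // [1 2 5 ... 10000 20000 30000]
	fmt.Println(build125Buckets(0))     // []
}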

@googs1025 googs1025 force-pushed the add_metrics branch 2 times, most recently from 0d7f03a to 597ed25 Compare September 21, 2025 09:53
@googs1025 googs1025 requested a review from mayabar September 23, 2025 15:10
length: 0
tool_calls: 0
remote_decode: 0
request-prompt-tokens: [ 10, 20, 30, 15 ]
Collaborator

As I understand it, this is an array of samples; if so, this solution is not scalable when we want to mimic a situation with thousands of requests.

Let's define the array in the configuration as the number of samples per bucket. So the array [10, 20, 30, 15] will mean that we have 10 samples in the range (-Inf, 1], 20 samples in (1, 2], 30 samples in (2, 5], and 15 samples in (5, 10].

Note that histograms in Prometheus are cumulative, so for the example above /metrics will report:
(-Inf, 1] - 10
(1, 2] - 30
(2, 5] - 60
(5, 10] - 75
(10, 20] - 75
(20, 50] - 75
(50, 100] - 75
(100, 200] - 75
(200, 500] - 75
(500, 1000] - 75
(1000, +Inf) - 75
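
A small sketch of the cumulative view described above: given the per-bucket counts from the proposed configuration format, it prints the running totals that /metrics would report for each "le" bucket. The variable names are illustrative only.

// Illustrative sketch of the cumulative bucket math described in the comment above.
package main

import "fmt"

func main() {
	// 1-2-5 boundaries up to a model length of 1024, as in the example above.
	boundaries := []float64{1, 2, 5, 10, 20, 50, 100, 200, 500, 1000}
	// Per-bucket counts: 10 in (-Inf,1], 20 in (1,2], 30 in (2,5], 15 in (5,10].
	counts := []int{10, 20, 30, 15}

	cumulative := 0
	for i, b := range boundaries {
		if i < len(counts) {
			cumulative += counts[i]
		}
		fmt.Printf("le=%g: %d\n", b, cumulative) // prints 10, 30, 60, 75, 75, ...
	}
	fmt.Printf("le=+Inf: %d\n", cumulative) // 75
}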

Contributor Author

sgtm. 😄

prev := 0.0
for i, count := range counts {
	boundary := boundaries[i]
	// Pick a midpoint within (prev, boundary] as the representative sample value
Collaborator

please translate to english ;)

var samples []float64
prev := 0.0
for i, count := range counts {
	boundary := boundaries[i]
Collaborator

This assumes that the boundaries and counts arrays are of the same size.
Our situation is a little more complicated: in general, when we look at a histogram, the number of boundaries is one less than the number of buckets, e.g. for boundaries [1, 2, 5] there are 4 buckets: (-Inf, 1], (1, 2], (2, 5], (5, +Inf).
In our case we have well-known 1-2-5 buckets up to the model length; for example, for model length = 1024, the boundaries will be [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]. But in the input fake values we do not force the user to define the full array of counts; if there are zeros at the end of the array, the user can skip them.
For example, if we want to define a histogram with 5 requests of 10 tokens and 5 requests of 40 tokens, the full counts array could be [0, 0, 0, 5, 0, 5, 0, 0, 0, 0, 0], but we allow the user to define counts as [0, 0, 0, 5, 0, 5].

  • Please update the implementation to support a counts array longer than the boundaries array.
  • Please add support for the shorter counts array version, or create an issue for shorter-array support so it can be implemented in a separate PR (see the sketch below for both cases).
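
A rough sketch of the sample-generation approach quoted above, extended to the two requested cases: a counts array one element longer than the boundaries array (the extra entry belongs to the (lastBoundary, +Inf) bucket) and a counts array shorter than the boundaries array (the missing tail is treated as zeros). The function name and the midpoint choice for the representative sample are assumptions, not the PR's actual code.

// Sketch only: fakeSamples and the midpoint rule are illustrative assumptions.
package main

import "fmt"

func fakeSamples(boundaries []float64, counts []int) []float64 {
	var samples []float64
	prev := 0.0
	for i, count := range counts {
		var value float64
		if i < len(boundaries) {
			// Representative value inside (prev, boundaries[i]].
			value = (prev + boundaries[i]) / 2
			prev = boundaries[i]
		} else {
			// One extra count is allowed: it belongs to (lastBoundary, +Inf),
			// so pick a value just above the last boundary.
			value = prev + 1
		}
		for j := 0; j < count; j++ {
			samples = append(samples, value)
		}
	}
	// A counts array shorter than the boundaries array simply leaves the
	// remaining buckets empty, which matches treating the missing tail as zeros.
	return samples
}

func main() {
	boundaries := []float64{1, 2, 5, 10, 20, 50, 100, 200, 500, 1000}
	// Shorter counts: 5 requests around 10 tokens and 5 around 40 tokens.
	fmt.Println(fakeSamples(boundaries, []int{0, 0, 0, 5, 0, 5}))
}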

@googs1025
Contributor Author

Sorry for the delay.
Will address the comments this week.

@mayabar
Collaborator

mayabar commented Oct 16, 2025

@googs1025 this feature is needed for issue #211. Do you think you can wrap up the pending changes this week? I’ll take care of the other time-related metrics.

@googs1025
Contributor Author

@googs1025 this feature is needed for issue #211. Do you think you can wrap up the pending changes this week? I’ll take care of the other time-related metrics.

will handle today

Collaborator

@mayabar mayabar left a comment

@googs1025
Thanks a lot for your updates, I added some small comments.

@googs1025 googs1025 force-pushed the add_metrics branch 6 times, most recently from 3449fa9 to 08e6177 Compare October 20, 2025 07:12
@googs1025 googs1025 requested a review from mayabar October 20, 2025 07:31
@mayabar
Collaborator

mayabar commented Oct 20, 2025

@googs1025 thanks for the updates, it looks great. Please note that a test is failing.

@mayabar
Collaborator

mayabar commented Oct 21, 2025

@googs1025 Hi, we want to release a new simulator version with this feature. Would you have time to fix the test today? Otherwise we will continue with the fix ourselves.

@googs1025
Contributor Author

@googs1025 Hi, we want to release a new simulator version with this feature. Would you have time to fix the test today? Otherwise we will continue with the fix ourselves.

will fix

Collaborator

@mayabar mayabar left a comment

@googs1025 Added some comments on your latest changes

Signed-off-by: googs1025 <[email protected]>
Collaborator

@mayabar mayabar left a comment

/lgtm
/approve

@mayabar mayabar merged commit 1c3d559 into llm-d:main Oct 22, 2025
4 checks passed