Conversation

@googs1025
Contributor

@googs1025 googs1025 commented Sep 15, 2025

Add several new Prometheus metrics:

  • vllm:request_success_total
  • vllm:request_params_max_tokens
  • vllm:request_prompt_tokens
  • vllm:request_generation_tokens

part of: #191
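
For reference, below is a rough sketch (not the actual code in this PR) of how these metrics could be declared and recorded with prometheus/client_golang. The metric and label names are taken from the /metrics output shown later in this thread; the variable names, bucket boundaries, and the recordRequest helper are illustrative assumptions only.

// Sketch only: metric and label names come from the PR's /metrics output;
// everything else (buckets, helper names, call sites) is an assumption.
package main

import "github.com/prometheus/client_golang/prometheus"

var (
	tokenBuckets = []float64{1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000}

	requestSuccessTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "vllm:request_success_total",
		Help: "Total number of successful inference requests.",
	}, []string{"finish_reason", "model_name"})

	requestPromptTokens = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "vllm:request_prompt_tokens",
		Help:    "Number of input prompt tokens in the request.",
		Buckets: tokenBuckets,
	}, []string{"model_name"})

	requestGenerationTokens = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "vllm:request_generation_tokens",
		Help:    "Number of generated tokens so far in the request.",
		Buckets: tokenBuckets,
	}, []string{"model_name"})

	requestParamsMaxTokens = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "vllm:request_params_max_tokens",
		Help:    "The 'max_tokens' parameter from the request.",
		Buckets: tokenBuckets,
	}, []string{"model_name"})
)

// recordRequest is a hypothetical helper showing where the metrics would be
// updated once a request finishes.
func recordRequest(model, finishReason string, promptTokens, genTokens, maxTokens int) {
	requestSuccessTotal.WithLabelValues(finishReason, model).Inc()
	requestPromptTokens.WithLabelValues(model).Observe(float64(promptTokens))
	requestGenerationTokens.WithLabelValues(model).Observe(float64(genTokens))
	if maxTokens > 0 {
		requestParamsMaxTokens.WithLabelValues(model).Observe(float64(maxTokens))
	}
}

func main() {
	prometheus.MustRegister(requestSuccessTotal, requestPromptTokens, requestGenerationTokens, requestParamsMaxTokens)
	recordRequest("meta-llama/Llama-3.1-8B-Instruct", "stop", 2, 21, 0)
}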

@googs1025
Contributor Author

Local test:

~ curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
{"id":"chatcmpl-ca971b3c-3613-40a2-8b6d-68c3f926d9d9","created":1757937648,"model":"meta-llama/Llama-3.1-8B-Instruct","usage":{"prompt_tokens":2,"completion_tokens":21,"total_tokens":23},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"Today it is partially cloudy and raining. I am your AI assistant, how can I help you today?"}}]}%
➜  ~ curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
{"id":"chatcmpl-ddd78d23-401d-4268-b73f-b8610e126e38","created":1757937648,"model":"meta-llama/Llama-3.1-8B-Instruct","usage":{"prompt_tokens":2,"completion_tokens":49,"total_tokens":51},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"Testing@, #testing 1$ ,2%,3^, [4\u0026*5], 6~, 7-_ + (8 : 9) / \\ \u003c \u003e . I am your AI assistant, how can I help you "}}]}%
➜  ~ curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
{"id":"chatcmpl-5b7d2524-4bf4-4d3c-a262-6cd45233f6a5","created":1757937649,"model":"meta-llama/Llama-3.1-8B-Instruct","usage":{"prompt_tokens":2,"completion_tokens":29,"total_tokens":31},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"Testing@, #testing 1$ ,2%,3^, [4\u0026*5], 6~, 7-_ + ("}}]}%
➜  ~ curl http://localhost:8000/metrics
# HELP vllm:gpu_cache_usage_perc Prometheus metric for the fraction of KV-cache blocks currently in use (from 0 to 1).
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0
# HELP vllm:lora_requests_info Running stats on lora requests.
# TYPE vllm:lora_requests_info gauge
vllm:lora_requests_info{max_lora="2",running_lora_adapters="",waiting_lora_adapters=""} 1.757937635e+09
# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0
# HELP vllm:num_requests_waiting Prometheus metric for the number of queued requests.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0
# HELP vllm:request_generation_tokens Number of generated tokens so far in the request.
# TYPE vllm:request_generation_tokens histogram
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="1"} 0
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="2"} 0
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="5"} 0
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="10"} 0
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="20"} 0
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="50"} 3
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="100"} 3
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="200"} 3
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="500"} 3
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="1000"} 3
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="2000"} 3
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="5000"} 3
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="10000"} 3
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="+Inf"} 3
vllm:request_generation_tokens_sum{model_name="meta-llama/Llama-3.1-8B-Instruct"} 99
vllm:request_generation_tokens_count{model_name="meta-llama/Llama-3.1-8B-Instruct"} 3
# HELP vllm:request_prompt_tokens Number of input prompt tokens in the request.
# TYPE vllm:request_prompt_tokens histogram
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="1"} 0
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="2"} 3
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="5"} 3
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="10"} 3
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="20"} 3
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="50"} 3
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="100"} 3
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="200"} 3
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="500"} 3
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="1000"} 3
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="2000"} 3
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="5000"} 3
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="10000"} 3
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="+Inf"} 3
vllm:request_prompt_tokens_sum{model_name="meta-llama/Llama-3.1-8B-Instruct"} 6
vllm:request_prompt_tokens_count{model_name="meta-llama/Llama-3.1-8B-Instruct"} 3
# HELP vllm:request_success_total Total number of successful inference requests.
# TYPE vllm:request_success_total counter
vllm:request_success_total{finish_reason="stop",model_name="meta-llama/Llama-3.1-8B-Instruct"} 3

@googs1025 googs1025 marked this pull request as draft September 15, 2025 12:09
@googs1025
Contributor Author

~ curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello!ssssssssssssssssssssssssssssssssssssssssssssssssssss"}
    ],
    "max_completion_tokens": 1000
  }'
{"id":"chatcmpl-fc538625-ece7-46c0-8705-51f945f88d2d","created":1757940432,"model":"meta-llama/Llama-3.1-8B-Instruct","usage":{"prompt_tokens":3,"completion_tokens":24,"total_tokens":27},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"Testing@, #testing 1$ ,2%,3^, [4\u0026*5], 6~, "}}]}%
➜  ~ curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello!ssssssssssssssssssssssssssssssssssssssssssssssssssss"}
    ],
    "max_completion_tokens": 1000
  }'
{"id":"chatcmpl-44910f93-7272-495f-b42e-3e55ac637bc4","created":1757940433,"model":"meta-llama/Llama-3.1-8B-Instruct","usage":{"prompt_tokens":3,"completion_tokens":987,"total_tokens":990},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"I am your AI assistant, how can I help you today? Alas, poor Yorick! I knew him, Horatio: A fellow of infinite jest Today it is partially cloudy and raining. Today is a nice sunny day. The rest is silence.  To be or not to be that is the question. Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime Today is a nice sunny day. Testing, testing 1,2,3. The rest is silence.  The rest is silence.  Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime To be or not to be that is the question. I am your AI assistant, how can I help you today? I am fine, how are you today? The rest is silence.  To be or not to be that is the question. The temperature here is twenty-five degrees centigrade. I am your AI assistant, how can I help you today? To be or not to be that is the question. I am fine, how are you today? I am fine, how are you today? Testing, testing 1,2,3. Testing@, #testing 1$ ,2%,3^, [4\u0026*5], 6~, 7-_ + (8 : 9) / \\ \u003c \u003e . Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime Today it is partially cloudy and raining. The temperature here is twenty-five degrees centigrade. I am fine, how are you today? Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime Testing@, #testing 1$ ,2%,3^, [4\u0026*5], 6~, 7-_ + (8 : 9) / \\ \u003c \u003e . I am fine, how are you today? To be or not to be that is the question. Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime Today it is partially cloudy and raining. Testing@, #testing 1$ ,2%,3^, [4\u0026*5], 6~, 7-_ + (8 : 9) / \\ \u003c \u003e . Alas, poor Yorick! I knew him, Horatio: A fellow of infinite jest I am your AI assistant, how can I help you today? To be or not to be that is the question. The rest is silence.  Alas, poor Yorick! I knew him, Horatio: A fellow of infinite jest Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime I am fine, how are you today? The rest is silence.  To be or not to be that is the question. Testing, testing 1,2,3. Testing, testing 1,2,3. I am fine, how are you today? The rest is silence.  The rest is silence.  Today is a nice sunny day. Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime The temperature here is twenty-five degrees centigrade. I am your AI assistant, how can I help you today? I am fine, how are you today? To be or not to be that is the question. Testing, testing 1,2,3. Testing, testing 1,2,3. Today is a nice sunny day. Alas, poor Yorick! I knew him, Horatio: A fellow of infinite jest I am fine, how are you today? I am your AI assistant, how can I help you today? Alas, poor Yorick! I knew him, Horatio: A fellow of infinite jest Alas, poor Yorick! I knew him, Horatio: A fellow of infinite jest Today it is partially cloudy and raining. Testing@, #testing 1$ ,2%,3^, [4\u0026*5], 6~, 7-_ + (8 : 9) / \\ \u003c \u003e . 
Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime Alas, poor Yorick! I knew him, Horatio: A fellow of infinite jest Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime Today it is partially cloudy and raining. Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime Testing, testing 1,2,3."}}]}%
➜  ~ curl http://localhost:8000/metrics
# HELP vllm:gpu_cache_usage_perc Prometheus metric for the fraction of KV-cache blocks currently in use (from 0 to 1).
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0
# HELP vllm:lora_requests_info Running stats on lora requests.
# TYPE vllm:lora_requests_info gauge
vllm:lora_requests_info{max_lora="2",running_lora_adapters="",waiting_lora_adapters=""} 1.757937635e+09
# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0
# HELP vllm:num_requests_waiting Prometheus metric for the number of queued requests.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0
# HELP vllm:request_generation_tokens Number of generated tokens so far in the request.
# TYPE vllm:request_generation_tokens histogram
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="1"} 0
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="2"} 0
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="5"} 1
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="10"} 1
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="20"} 3
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="50"} 10
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="100"} 12
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="200"} 12
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="500"} 12
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="1000"} 20
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="2000"} 20
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="5000"} 20
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="10000"} 20
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="+Inf"} 20
vllm:request_generation_tokens_sum{model_name="meta-llama/Llama-3.1-8B-Instruct"} 8310
vllm:request_generation_tokens_count{model_name="meta-llama/Llama-3.1-8B-Instruct"} 20
# HELP vllm:request_params_max_tokens The 'max_tokens' parameter from the request.
# TYPE vllm:request_params_max_tokens histogram
vllm:request_params_max_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="1"} 0
vllm:request_params_max_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="2"} 0
vllm:request_params_max_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="5"} 1
vllm:request_params_max_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="10"} 1
vllm:request_params_max_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="20"} 1
vllm:request_params_max_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="50"} 1
vllm:request_params_max_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="100"} 1
vllm:request_params_max_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="200"} 1
vllm:request_params_max_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="500"} 1
vllm:request_params_max_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="1000"} 13
vllm:request_params_max_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="2000"} 13
vllm:request_params_max_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="5000"} 13
vllm:request_params_max_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="10000"} 13
vllm:request_params_max_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="+Inf"} 13
vllm:request_params_max_tokens_sum{model_name="meta-llama/Llama-3.1-8B-Instruct"} 12003
vllm:request_params_max_tokens_count{model_name="meta-llama/Llama-3.1-8B-Instruct"} 13
# HELP vllm:request_prompt_tokens Number of input prompt tokens in the request.
# TYPE vllm:request_prompt_tokens histogram
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="1"} 0
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="2"} 3
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="5"} 20
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="10"} 20
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="20"} 20
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="50"} 20
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="100"} 20
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="200"} 20
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="500"} 20
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="1000"} 20
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="2000"} 20
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="5000"} 20
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="10000"} 20
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="+Inf"} 20
vllm:request_prompt_tokens_sum{model_name="meta-llama/Llama-3.1-8B-Instruct"} 57
vllm:request_prompt_tokens_count{model_name="meta-llama/Llama-3.1-8B-Instruct"} 20
# HELP vllm:request_success_total Total number of successful inference requests.
# TYPE vllm:request_success_total counter
vllm:request_success_total{finish_reason="length",model_name="meta-llama/Llama-3.1-8B-Instruct"} 4
vllm:request_success_total{finish_reason="stop",model_name="meta-llama/Llama-3.1-8B-Instruct"} 16
➜  ~ curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello!ssssssssssssssssssssssssssssssssssssssssssssssssssss"}
    ],
    "max_token": 1000
  }'
{"id":"chatcmpl-287a5371-aa6e-4613-9d40-c529e4c1988f","created":1757940456,"model":"meta-llama/Llama-3.1-8B-Instruct","usage":{"prompt_tokens":3,"completion_tokens":52,"total_tokens":55},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"The temperature here is twenty-five degrees centigrade. The rest is silence.  I am your AI assistant, how can I help you today? The temperature here is twenty-five degrees centigrade. Today it is partially cloudy and raining. Alas, poor Yorick! I "}}]}%
➜  ~ curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello!ssssssssssssssssssssssssssssssssssssssssssssssssssss"}
    ],
    "max_token": 1000
  }'
{"id":"chatcmpl-7928a5be-d9ef-4b00-a70c-6f8c5a5cc0ce","created":1757940457,"model":"meta-llama/Llama-3.1-8B-Instruct","usage":{"prompt_tokens":3,"completion_tokens":25,"total_tokens":28},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"Today it is partially cloudy and raining. Alas, poor Yorick! I knew him, Horatio: A fellow of infinite jest To "}}]}%
➜  ~ curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello!ssssssssssssssssssssssssssssssssssssssssssssssssssss"}
    ],
    "max_token": 1000
  }'
{"id":"chatcmpl-5b4fe134-c142-47cd-afdd-c0091c2d5a2d","created":1757940458,"model":"meta-llama/Llama-3.1-8B-Instruct","usage":{"prompt_tokens":3,"completion_tokens":12,"total_tokens":15},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"The rest is silence.  Alas, poor Yorick! I knew "}}]}%
➜  ~ curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello!ssssssssssssssssssssssssssssssssssssssssssssssssssss"}
    ],
    "max_token": 1000
  }'
{"id":"chatcmpl-962297e6-cf4c-481f-9775-44d111555445","created":1757940459,"model":"meta-llama/Llama-3.1-8B-Instruct","usage":{"prompt_tokens":3,"completion_tokens":41,"total_tokens":44},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"To be or not to be that is the question. The temperature here is twenty-five degrees centigrade. Testing@, #testing 1$ ,2%,3^, [4\u0026*5]"}}]}%
➜  ~ curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello!ssssssssssssssssssssssssssssssssssssssssssssssssssss"}
    ],
    "max_token": 1000
  }'
{"id":"chatcmpl-81ddaf0a-119d-4771-8995-3558274c3a74","created":1757940459,"model":"meta-llama/Llama-3.1-8B-Instruct","usage":{"prompt_tokens":3,"completion_tokens":28,"total_tokens":31},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"Testing, testing 1,2,3. The temperature here is twenty-five degrees centigrade. The temperature here is twenty-five degrees centigrade"}}]}%
➜  ~ curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello!ssssssssssssssssssssssssssssssssssssssssssssssssssss"}
    ],
    "max_token": 1000
  }'
{"id":"chatcmpl-202ec739-1304-4cdb-aeb4-eea2b81559bb","created":1757940460,"model":"meta-llama/Llama-3.1-8B-Instruct","usage":{"prompt_tokens":3,"completion_tokens":41,"total_tokens":44},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"To be or not to be that is the question. I am fine, how are you today? Give a man a fish and you feed him for a day; teach a man to fish and you feed "}}]}%
➜  ~ curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello!ssssssssssssssssssssssssssssssssssssssssssssssssssss"}
    ],
    "max_token": 1000
  }'
{"id":"chatcmpl-cef4a697-a00c-4e8c-bfdd-66bc4325c3bc","created":1757940460,"model":"meta-llama/Llama-3.1-8B-Instruct","usage":{"prompt_tokens":3,"completion_tokens":40,"total_tokens":43},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"The temperature here is twenty-five degrees centigrade. Today it is partially cloudy and raining. Testing, testing 1,2,3. Testing@, #testing 1$ ,2%,3^"}}]}%

@googs1025 googs1025 marked this pull request as ready for review September 15, 2025 12:52
Collaborator

@mayabar mayabar left a comment

@googs1025 thank you for your contribution, I’ve left some comments for your review.

@irar2
Collaborator

irar2 commented Sep 16, 2025

@googs1025 Please also add the new metrics to the fake metrics, and set both fake and real initial values in setInitialPrometheusMetrics().
Also, please add tests for the new metrics.

@googs1025 googs1025 force-pushed the add_metrics branch 2 times, most recently from f562dff to d2d61c7 Compare September 16, 2025 11:06
@googs1025
Contributor Author

@googs1025 Please also add the new metrics to the fake metrics, and set both fake and real initial values in setInitialPrometheusMetrics(). Also, please add tests for the new metrics.

thanks 😄
I'm using a Mac and I'm running llm-d-inference-sim locally. There seem to be some issues with installing zmq. I'll add some test cases after resolving these issues.

@mayabar
Collaborator

mayabar commented Sep 17, 2025

I'm using a Mac and I'm running llm-d-inference-sim locally. There seem to be some issues with installing zmq. I'll add some test cases after resolving these issues.

We also work on Macs; we installed zmq using Homebrew.

@googs1025 googs1025 force-pushed the add_metrics branch 10 times, most recently from bc7fbe1 to a6edf2d Compare September 20, 2025 14:20
@irar2
Collaborator

irar2 commented Sep 21, 2025

@googs1025

The CI failure is due to a failing unit test:
--- FAIL: TestBuild125Buckets (0.00s)
--- FAIL: TestBuild125Buckets/max_value_zero (0.00s)
metrics_test.go:680: build125Buckets(0) = [], want []
--- FAIL: TestBuild125Buckets/max_value_4096 (0.00s)
metrics_test.go:680: build125Buckets(4096) = [1 2 5 10 20 50 100 200 500 1000 2000], want [1 2 5 10 20 50 100 200 500 1000 2000 4000]
--- FAIL: TestBuild125Buckets/max_value_32768 (0.00s)
metrics_test.go:680: build125Buckets(32768) = [1 2 5 10 20 50 100 200 500 1000 2000 5000 10000 20000], want [1 2 5 10 20 50 100 200 500 1000 2000 5000 10000 20000 30000]
--- FAIL: TestBuild125Buckets/max_value_negative (0.00s)
metrics_test.go:680: build125Buckets(-1) = [], want []
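
For context, here is a minimal sketch of a 1-2-5 bucket builder that is consistent with the expected values in these test cases. The rule for the final bucket (append the max value rounded down to one significant digit, e.g. 4000 for 4096 or 30000 for 32768) is inferred from the test output above, not taken from the repository's actual implementation.

// Sketch inferred from the TestBuild125Buckets expectations above; the final-bucket
// rule and the function name are assumptions, not the repository's code.
package main

import "fmt"

func build125Buckets(maxValue int) []float64 {
	buckets := []float64{}
	if maxValue <= 0 {
		return buckets
	}
	// Emit 1, 2, 5, 10, 20, 50, ... while the boundary does not exceed maxValue.
	mantissas := []int{1, 2, 5}
	for exp := 1; ; exp *= 10 {
		done := false
		for _, m := range mantissas {
			b := m * exp
			if b > maxValue {
				done = true
				break
			}
			buckets = append(buckets, float64(b))
		}
		if done {
			break
		}
	}
	// Round maxValue down to one significant digit (4096 -> 4000, 32768 -> 30000)
	// and append it as the last boundary if it is larger than what we already have.
	scale := 1
	for maxValue/scale >= 10 {
		scale *= 10
	}
	rounded := (maxValue / scale) * scale
	if len(buckets) == 0 || float64(rounded) > buckets[len(buckets)-1] {
		buckets = append(buckets, float64(rounded))
	}
	return buckets
}

func main() {
	fmt.Println(build125Buckets(4096))  // [1 2 5 10 20 50 100 200 500 1000 2000 4000]
	fmt.Println(build125Buckets(32768)) // [1 2 5 ... 10000 20000 30000]
	fmt.Println(build125Buckets(0))     // []
}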

@googs1025 googs1025 force-pushed the add_metrics branch 2 times, most recently from 0d7f03a to 597ed25 Compare September 21, 2025 09:53
@googs1025 googs1025 requested a review from mayabar September 23, 2025 15:10
length: 0
tool_calls: 0
remote_decode: 0
request-prompt-tokens: [ 10, 20, 30, 15 ]
Collaborator

As I understand it, this is an array of samples; if so, this solution is not scalable when we want to mimic a situation with thousands of requests.

Let's define the array in the configuration as the number of samples per bucket. So the array [10, 20, 30, 15] will mean that we have 10 samples in the range (-Inf, 1], 20 samples in (1, 2], 30 samples in (2, 5], and 15 samples in (5, 10].

Note that histograms in Prometheus are cumulative, so for the example above /metrics will report:
(-Inf, 1] - 10
(1, 2] - 30
(2, 5] - 60
(5, 10] - 75
(10, 20] - 75
(20, 50] - 75
(50, 100] - 75
(100, 200] - 75
(200, 500] - 75
(500, 1000] - 75
(1000, +Inf) - 75
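
A small sketch of the cumulative view described above: given the per-bucket counts from the proposed configuration format, it prints the running totals that /metrics would report for each "le" bucket. The variable names are illustrative only.

// Illustrative sketch of the cumulative bucket math described in the comment above.
package main

import "fmt"

func main() {
	// 1-2-5 boundaries up to a model length of 1024, as in the example above.
	boundaries := []float64{1, 2, 5, 10, 20, 50, 100, 200, 500, 1000}
	// Per-bucket counts: 10 in (-Inf,1], 20 in (1,2], 30 in (2,5], 15 in (5,10].
	counts := []int{10, 20, 30, 15}

	cumulative := 0
	for i, b := range boundaries {
		if i < len(counts) {
			cumulative += counts[i]
		}
		fmt.Printf("le=%g: %d\n", b, cumulative) // prints 10, 30, 60, 75, 75, ...
	}
	fmt.Printf("le=+Inf: %d\n", cumulative) // 75
}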

Contributor Author

sgtm. 😄

prev := 0.0
for i, count := range counts {
	boundary := boundaries[i]
	// Pick a midpoint within (prev, boundary] as the representative sample value
Collaborator

please translate to english ;)

var samples []float64
prev := 0.0
for i, count := range counts {
	boundary := boundaries[i]
Collaborator

This assumes that the boundaries and counts arrays are of the same size.
Our situation is a little more complicated: in general, when we look at a histogram, the number of boundaries is one less than the number of buckets, e.g. for boundaries [1, 2, 5] there are 4 buckets: (-Inf, 1], (1, 2], (2, 5], (5, +Inf).
In our case we have well-known 1-2-5 buckets up to the model length; for example, for model length = 1024, the boundaries will be [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]. But in the input fake values we do not force the user to define the full array of counts; if there are zeros at the end of the array, the user can skip them.
For example, if we want to define a histogram with 5 requests of 10 tokens and 5 requests of 40 tokens, the full counts array could be [0, 0, 0, 5, 0, 5, 0, 0, 0, 0, 0], but we allow the user to define counts as [0, 0, 0, 5, 0, 5].

  • Please update the implementation to support a counts array longer than the boundaries array.
  • Please add support for the shorter counts array version, or create an issue for shorter-array support so it can be implemented in a separate PR (see the sketch below for both cases).
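
A rough sketch of the sample-generation approach quoted above, extended to the two requested cases: a counts array one element longer than the boundaries array (the extra entry belongs to the (lastBoundary, +Inf) bucket) and a counts array shorter than the boundaries array (the missing tail is treated as zeros). The function name and the midpoint choice for the representative sample are assumptions, not the PR's actual code.

// Sketch only: fakeSamples and the midpoint rule are illustrative assumptions.
package main

import "fmt"

func fakeSamples(boundaries []float64, counts []int) []float64 {
	var samples []float64
	prev := 0.0
	for i, count := range counts {
		var value float64
		if i < len(boundaries) {
			// Representative value inside (prev, boundaries[i]].
			value = (prev + boundaries[i]) / 2
			prev = boundaries[i]
		} else {
			// One extra count is allowed: it belongs to (lastBoundary, +Inf),
			// so pick a value just above the last boundary.
			value = prev + 1
		}
		for j := 0; j < count; j++ {
			samples = append(samples, value)
		}
	}
	// A counts array shorter than the boundaries array simply leaves the
	// remaining buckets empty, which matches treating the missing tail as zeros.
	return samples
}

func main() {
	boundaries := []float64{1, 2, 5, 10, 20, 50, 100, 200, 500, 1000}
	// Shorter counts: 5 requests around 10 tokens and 5 around 40 tokens.
	fmt.Println(fakeSamples(boundaries, []int{0, 0, 0, 5, 0, 5}))
}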

@googs1025
Contributor Author

Sorry for the delay.
Will address the comments this week.

@mayabar
Collaborator

mayabar commented Oct 16, 2025

@googs1025 this feature is needed for issue #211. Do you think you can wrap up the pending changes this week? I’ll take care of the other time-related metrics.

@googs1025
Contributor Author

@googs1025 this feature is needed for issue #211. Do you think you can wrap up the pending changes this week? I’ll take care of the other time-related metrics.

will handle today

Collaborator

@mayabar mayabar left a comment

@googs1025
Thanks a lot for your updates, I added some small comments.

@googs1025 googs1025 force-pushed the add_metrics branch 6 times, most recently from 3449fa9 to 08e6177 Compare October 20, 2025 07:12
@googs1025 googs1025 requested a review from mayabar October 20, 2025 07:31
@mayabar
Collaborator

mayabar commented Oct 20, 2025

@googs1025 thanks for the updates, it looks great. Please note that a test is failing.

@mayabar
Collaborator

mayabar commented Oct 21, 2025

@googs1025 Hi, we want to release a new simulator version with this feature. Would you have time to fix the test today? Otherwise we will continue with the fix ourselves.

@googs1025
Contributor Author

@googs1025 Hi, we want to release a new simulator version with this feature. Would you have time to fix the test today? Otherwise we will continue with the fix ourselves.

will fix

Collaborator

@mayabar mayabar left a comment

@googs1025 Added some comments on your latest changes

Signed-off-by: googs1025 <[email protected]>
Collaborator

@mayabar mayabar left a comment

/lgtm
/approve

@mayabar mayabar merged commit 1c3d559 into llm-d:main Oct 22, 2025
4 checks passed