feat(metrics): add request prompt, generation, max_tokens and success metrics #202

Conversation
|
Local test:

➜ ~ curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "user", "content": "Hello!"}
]
}'
{"id":"chatcmpl-ca971b3c-3613-40a2-8b6d-68c3f926d9d9","created":1757937648,"model":"meta-llama/Llama-3.1-8B-Instruct","usage":{"prompt_tokens":2,"completion_tokens":21,"total_tokens":23},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"Today it is partially cloudy and raining. I am your AI assistant, how can I help you today?"}}]}%
➜ ~ curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "user", "content": "Hello!"}
]
}'
{"id":"chatcmpl-ddd78d23-401d-4268-b73f-b8610e126e38","created":1757937648,"model":"meta-llama/Llama-3.1-8B-Instruct","usage":{"prompt_tokens":2,"completion_tokens":49,"total_tokens":51},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"Testing@, #testing 1$ ,2%,3^, [4\u0026*5], 6~, 7-_ + (8 : 9) / \\ \u003c \u003e . I am your AI assistant, how can I help you "}}]}%
➜ ~ curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "user", "content": "Hello!"}
]
}'
{"id":"chatcmpl-5b7d2524-4bf4-4d3c-a262-6cd45233f6a5","created":1757937649,"model":"meta-llama/Llama-3.1-8B-Instruct","usage":{"prompt_tokens":2,"completion_tokens":29,"total_tokens":31},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"Testing@, #testing 1$ ,2%,3^, [4\u0026*5], 6~, 7-_ + ("}}]}%
➜ ~ curl http://localhost:8000/metrics
# HELP vllm:gpu_cache_usage_perc Prometheus metric for the fraction of KV-cache blocks currently in use (from 0 to 1).
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0
# HELP vllm:lora_requests_info Running stats on lora requests.
# TYPE vllm:lora_requests_info gauge
vllm:lora_requests_info{max_lora="2",running_lora_adapters="",waiting_lora_adapters=""} 1.757937635e+09
# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0
# HELP vllm:num_requests_waiting Prometheus metric for the number of queued requests.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0
# HELP vllm:request_generation_tokens Number of generated tokens so far in the request.
# TYPE vllm:request_generation_tokens histogram
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="1"} 0
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="2"} 0
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="5"} 0
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="10"} 0
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="20"} 0
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="50"} 3
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="100"} 3
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="200"} 3
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="500"} 3
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="1000"} 3
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="2000"} 3
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="5000"} 3
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="10000"} 3
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="+Inf"} 3
vllm:request_generation_tokens_sum{model_name="meta-llama/Llama-3.1-8B-Instruct"} 99
vllm:request_generation_tokens_count{model_name="meta-llama/Llama-3.1-8B-Instruct"} 3
# HELP vllm:request_prompt_tokens Number of input prompt tokens in the request.
# TYPE vllm:request_prompt_tokens histogram
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="1"} 0
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="2"} 3
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="5"} 3
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="10"} 3
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="20"} 3
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="50"} 3
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="100"} 3
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="200"} 3
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="500"} 3
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="1000"} 3
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="2000"} 3
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="5000"} 3
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="10000"} 3
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="+Inf"} 3
vllm:request_prompt_tokens_sum{model_name="meta-llama/Llama-3.1-8B-Instruct"} 6
vllm:request_prompt_tokens_count{model_name="meta-llama/Llama-3.1-8B-Instruct"} 3
# HELP vllm:request_success_total Total number of successful inference requests.
# TYPE vllm:request_success_total counter
vllm:request_success_total{finish_reason="stop",model_name="meta-llama/Llama-3.1-8B-Instruct"} 3
|
➜ ~ curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "user", "content": "Hello!ssssssssssssssssssssssssssssssssssssssssssssssssssss"}
],
"max_completion_tokens": 1000
}'
{"id":"chatcmpl-fc538625-ece7-46c0-8705-51f945f88d2d","created":1757940432,"model":"meta-llama/Llama-3.1-8B-Instruct","usage":{"prompt_tokens":3,"completion_tokens":24,"total_tokens":27},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"Testing@, #testing 1$ ,2%,3^, [4\u0026*5], 6~, "}}]}%
➜ ~ curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "user", "content": "Hello!ssssssssssssssssssssssssssssssssssssssssssssssssssss"}
],
"max_completion_tokens": 1000
}'
{"id":"chatcmpl-44910f93-7272-495f-b42e-3e55ac637bc4","created":1757940433,"model":"meta-llama/Llama-3.1-8B-Instruct","usage":{"prompt_tokens":3,"completion_tokens":987,"total_tokens":990},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"I am your AI assistant, how can I help you today? Alas, poor Yorick! I knew him, Horatio: A fellow of infinite jest Today it is partially cloudy and raining. Today is a nice sunny day. The rest is silence. To be or not to be that is the question. Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime Today is a nice sunny day. Testing, testing 1,2,3. The rest is silence. The rest is silence. Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime To be or not to be that is the question. I am your AI assistant, how can I help you today? I am fine, how are you today? The rest is silence. To be or not to be that is the question. The temperature here is twenty-five degrees centigrade. I am your AI assistant, how can I help you today? To be or not to be that is the question. I am fine, how are you today? I am fine, how are you today? Testing, testing 1,2,3. Testing@, #testing 1$ ,2%,3^, [4\u0026*5], 6~, 7-_ + (8 : 9) / \\ \u003c \u003e . Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime Today it is partially cloudy and raining. The temperature here is twenty-five degrees centigrade. I am fine, how are you today? Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime Testing@, #testing 1$ ,2%,3^, [4\u0026*5], 6~, 7-_ + (8 : 9) / \\ \u003c \u003e . I am fine, how are you today? To be or not to be that is the question. Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime Today it is partially cloudy and raining. Testing@, #testing 1$ ,2%,3^, [4\u0026*5], 6~, 7-_ + (8 : 9) / \\ \u003c \u003e . Alas, poor Yorick! I knew him, Horatio: A fellow of infinite jest I am your AI assistant, how can I help you today? To be or not to be that is the question. The rest is silence. Alas, poor Yorick! I knew him, Horatio: A fellow of infinite jest Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime I am fine, how are you today? The rest is silence. To be or not to be that is the question. Testing, testing 1,2,3. Testing, testing 1,2,3. I am fine, how are you today? The rest is silence. The rest is silence. Today is a nice sunny day. Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime The temperature here is twenty-five degrees centigrade. I am your AI assistant, how can I help you today? I am fine, how are you today? To be or not to be that is the question. Testing, testing 1,2,3. Testing, testing 1,2,3. Today is a nice sunny day. Alas, poor Yorick! I knew him, Horatio: A fellow of infinite jest I am fine, how are you today? I am your AI assistant, how can I help you today? Alas, poor Yorick! I knew him, Horatio: A fellow of infinite jest Alas, poor Yorick! I knew him, Horatio: A fellow of infinite jest Today it is partially cloudy and raining. Testing@, #testing 1$ ,2%,3^, [4\u0026*5], 6~, 7-_ + (8 : 9) / \\ \u003c \u003e . 
Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime Alas, poor Yorick! I knew him, Horatio: A fellow of infinite jest Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime Today it is partially cloudy and raining. Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime Testing, testing 1,2,3."}}]}%
➜ ~ curl http://localhost:8000/metrics
# HELP vllm:gpu_cache_usage_perc Prometheus metric for the fraction of KV-cache blocks currently in use (from 0 to 1).
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0
# HELP vllm:lora_requests_info Running stats on lora requests.
# TYPE vllm:lora_requests_info gauge
vllm:lora_requests_info{max_lora="2",running_lora_adapters="",waiting_lora_adapters=""} 1.757937635e+09
# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0
# HELP vllm:num_requests_waiting Prometheus metric for the number of queued requests.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0
# HELP vllm:request_generation_tokens Number of generated tokens so far in the request.
# TYPE vllm:request_generation_tokens histogram
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="1"} 0
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="2"} 0
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="5"} 1
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="10"} 1
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="20"} 3
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="50"} 10
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="100"} 12
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="200"} 12
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="500"} 12
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="1000"} 20
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="2000"} 20
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="5000"} 20
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="10000"} 20
vllm:request_generation_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="+Inf"} 20
vllm:request_generation_tokens_sum{model_name="meta-llama/Llama-3.1-8B-Instruct"} 8310
vllm:request_generation_tokens_count{model_name="meta-llama/Llama-3.1-8B-Instruct"} 20
# HELP vllm:request_params_max_tokens The 'max_tokens' parameter from the request.
# TYPE vllm:request_params_max_tokens histogram
vllm:request_params_max_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="1"} 0
vllm:request_params_max_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="2"} 0
vllm:request_params_max_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="5"} 1
vllm:request_params_max_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="10"} 1
vllm:request_params_max_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="20"} 1
vllm:request_params_max_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="50"} 1
vllm:request_params_max_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="100"} 1
vllm:request_params_max_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="200"} 1
vllm:request_params_max_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="500"} 1
vllm:request_params_max_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="1000"} 13
vllm:request_params_max_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="2000"} 13
vllm:request_params_max_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="5000"} 13
vllm:request_params_max_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="10000"} 13
vllm:request_params_max_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="+Inf"} 13
vllm:request_params_max_tokens_sum{model_name="meta-llama/Llama-3.1-8B-Instruct"} 12003
vllm:request_params_max_tokens_count{model_name="meta-llama/Llama-3.1-8B-Instruct"} 13
# HELP vllm:request_prompt_tokens Number of input prompt tokens in the request.
# TYPE vllm:request_prompt_tokens histogram
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="1"} 0
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="2"} 3
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="5"} 20
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="10"} 20
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="20"} 20
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="50"} 20
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="100"} 20
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="200"} 20
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="500"} 20
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="1000"} 20
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="2000"} 20
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="5000"} 20
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="10000"} 20
vllm:request_prompt_tokens_bucket{model_name="meta-llama/Llama-3.1-8B-Instruct",le="+Inf"} 20
vllm:request_prompt_tokens_sum{model_name="meta-llama/Llama-3.1-8B-Instruct"} 57
vllm:request_prompt_tokens_count{model_name="meta-llama/Llama-3.1-8B-Instruct"} 20
# HELP vllm:request_success_total Total number of successful inference requests.
# TYPE vllm:request_success_total counter
vllm:request_success_total{finish_reason="length",model_name="meta-llama/Llama-3.1-8B-Instruct"} 4
vllm:request_success_total{finish_reason="stop",model_name="meta-llama/Llama-3.1-8B-Instruct"} 16
➜ ~ curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "user", "content": "Hello!ssssssssssssssssssssssssssssssssssssssssssssssssssss"}
],
"max_token": 1000
}'
{"id":"chatcmpl-287a5371-aa6e-4613-9d40-c529e4c1988f","created":1757940456,"model":"meta-llama/Llama-3.1-8B-Instruct","usage":{"prompt_tokens":3,"completion_tokens":52,"total_tokens":55},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"The temperature here is twenty-five degrees centigrade. The rest is silence. I am your AI assistant, how can I help you today? The temperature here is twenty-five degrees centigrade. Today it is partially cloudy and raining. Alas, poor Yorick! I "}}]}%
➜ ~ curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "user", "content": "Hello!ssssssssssssssssssssssssssssssssssssssssssssssssssss"}
],
"max_token": 1000
}'
{"id":"chatcmpl-7928a5be-d9ef-4b00-a70c-6f8c5a5cc0ce","created":1757940457,"model":"meta-llama/Llama-3.1-8B-Instruct","usage":{"prompt_tokens":3,"completion_tokens":25,"total_tokens":28},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"Today it is partially cloudy and raining. Alas, poor Yorick! I knew him, Horatio: A fellow of infinite jest To "}}]}%
➜ ~ curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "user", "content": "Hello!ssssssssssssssssssssssssssssssssssssssssssssssssssss"}
],
"max_token": 1000
}'
{"id":"chatcmpl-5b4fe134-c142-47cd-afdd-c0091c2d5a2d","created":1757940458,"model":"meta-llama/Llama-3.1-8B-Instruct","usage":{"prompt_tokens":3,"completion_tokens":12,"total_tokens":15},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"The rest is silence. Alas, poor Yorick! I knew "}}]}%
➜ ~ curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "user", "content": "Hello!ssssssssssssssssssssssssssssssssssssssssssssssssssss"}
],
"max_token": 1000
}'
{"id":"chatcmpl-962297e6-cf4c-481f-9775-44d111555445","created":1757940459,"model":"meta-llama/Llama-3.1-8B-Instruct","usage":{"prompt_tokens":3,"completion_tokens":41,"total_tokens":44},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"To be or not to be that is the question. The temperature here is twenty-five degrees centigrade. Testing@, #testing 1$ ,2%,3^, [4\u0026*5]"}}]}%
➜ ~ curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "user", "content": "Hello!ssssssssssssssssssssssssssssssssssssssssssssssssssss"}
],
"max_token": 1000
}'
{"id":"chatcmpl-81ddaf0a-119d-4771-8995-3558274c3a74","created":1757940459,"model":"meta-llama/Llama-3.1-8B-Instruct","usage":{"prompt_tokens":3,"completion_tokens":28,"total_tokens":31},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"Testing, testing 1,2,3. The temperature here is twenty-five degrees centigrade. The temperature here is twenty-five degrees centigrade"}}]}%
➜ ~ curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "user", "content": "Hello!ssssssssssssssssssssssssssssssssssssssssssssssssssss"}
],
"max_token": 1000
}'
{"id":"chatcmpl-202ec739-1304-4cdb-aeb4-eea2b81559bb","created":1757940460,"model":"meta-llama/Llama-3.1-8B-Instruct","usage":{"prompt_tokens":3,"completion_tokens":41,"total_tokens":44},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"To be or not to be that is the question. I am fine, how are you today? Give a man a fish and you feed him for a day; teach a man to fish and you feed "}}]}%
➜ ~ curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "user", "content": "Hello!ssssssssssssssssssssssssssssssssssssssssssssssssssss"}
],
"max_token": 1000
}'
{"id":"chatcmpl-cef4a697-a00c-4e8c-bfdd-66bc4325c3bc","created":1757940460,"model":"meta-llama/Llama-3.1-8B-Instruct","usage":{"prompt_tokens":3,"completion_tokens":40,"total_tokens":43},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"The temperature here is twenty-five degrees centigrade. Today it is partially cloudy and raining. Testing, testing 1,2,3. Testing@, #testing 1$ ,2%,3^"}}]}%
|
Force-pushed 92e1735 to 85e9525
@googs1025 thank you for your contribution, I’ve left some comments for your review.
|
@googs1025 Please also add the new metrics to the fake metrics, and set both fake and real initial values in setInitialPrometheusMetrics().
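(For context, a minimal sketch of how such a histogram could be registered and seeded with initial values using client_golang; the metric variable, `registerAndSeed`, and the fake-values parameter are illustrative assumptions, not the simulator's actual `setInitialPrometheusMetrics()` implementation.)

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// requestPromptTokens is an illustrative histogram using the 1-2-5 bucket
// scheme discussed in this PR.
var requestPromptTokens = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "vllm:request_prompt_tokens",
		Help:    "Number of input prompt tokens in the request.",
		Buckets: []float64{1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000},
	},
	[]string{"model_name"},
)

// registerAndSeed registers the histogram and records fake initial
// observations so /metrics exposes non-empty series right after startup.
func registerAndSeed(reg prometheus.Registerer, modelName string, fakePromptTokens []int) error {
	if err := reg.Register(requestPromptTokens); err != nil {
		return err
	}
	for _, tokens := range fakePromptTokens {
		requestPromptTokens.WithLabelValues(modelName).Observe(float64(tokens))
	}
	return nil
}
```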
Force-pushed f562dff to d2d61c7
thanks 😄
We also work on macOS; we installed zmq using Homebrew.
Force-pushed bc7fbe1 to a6edf2d
|
The CI failure is due to a failing unit test:
Force-pushed 0d7f03a to 597ed25
length: 0
tool_calls: 0
remote_decode: 0
request-prompt-tokens: [ 10, 20, 30, 15 ]
As I understand it, this is an array of samples; if so, this solution does not scale when we want to mimic thousands of requests.
Let's instead define the array in the configuration as the number of samples per bucket. So the array [10, 20, 30, 15] will mean that we have 10 samples in the range (-Inf, 1], 20 samples in (1, 2], 30 samples in (2, 5], and 15 samples in (5, 10].
Note that Prometheus histograms are cumulative, so for the test above /metrics will report:
(-Inf, 1] - 10
(1, 2] - 30
(2, 5] - 60
(5, 10] - 75
(10, 20] - 75
(20, 50] - 75
(50, 100] - 75
(100, 200] - 75
(200, 500] - 75
(500, 1000] - 75
(1000, +Inf) - 75
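For reference, a minimal sketch of the per-bucket-to-cumulative conversion described above; the `cumulativeBuckets` helper is illustrative, not part of the simulator's actual code:

```go
package main

import "fmt"

// cumulativeBuckets turns per-bucket fake counts into the cumulative values
// that a Prometheus histogram exposes for its le="..." buckets.
func cumulativeBuckets(counts []int) []int {
	cumulative := make([]int, len(counts))
	total := 0
	for i, c := range counts {
		total += c
		cumulative[i] = total
	}
	return cumulative
}

func main() {
	// [10, 20, 30, 15] -> [10, 30, 60, 75], matching the listing above.
	fmt.Println(cumulativeBuckets([]int{10, 20, 30, 15}))
}
```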
sgtm. 😄
pkg/llm-d-inference-sim/metrics.go
Outdated
prev := 0.0
for i, count := range counts {
    boundary := boundaries[i]
    // 在 (prev, boundary] 区间内取一个中间值作为样本代表
    // (take a midpoint within (prev, boundary] as the representative sample)
please translate to english ;)
pkg/llm-d-inference-sim/metrics.go
Outdated
var samples []float64
prev := 0.0
for i, count := range counts {
    boundary := boundaries[i]
This assumes that the boundaries and counts arrays are of the same size.
Our situation is a little more complicated: in general, when looking at a histogram, the number of boundaries is one less than the number of buckets, e.g. for boundaries [1, 2, 5] there are 4 buckets: (-Inf, 1], (1, 2], (2, 5], (5, +Inf).
In our case we have the well-known 1-2-5 buckets up to the model length; for example, for model len = 1024 the boundaries are [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]. But in the input fake values we do not force the user to define the full counts array: if there are zeros at the end of the array, the user can omit them.
For example, if we want to define a histogram with 5 requests of 10 tokens and 5 requests of 40 tokens, the full counts array would be [0, 0, 0, 5, 0, 5, 0, 0, 0, 0, 0], but we allow the user to define counts as [0, 0, 0, 5, 0, 5].
- Please update the implementation to support a counts array that is longer than the boundaries array (the full array has one entry per bucket, i.e. len(boundaries)+1).
- Please add support for the shorter counts array version, or create an issue so that shorter-array support can be implemented in a separate PR (see the sketch after this list).
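A possible sketch of this bucket handling (padding a short counts array and generating a representative sample per bucket); the names such as `samplesFromCounts` are illustrative assumptions, not the PR's actual code:

```go
package main

import "fmt"

// samplesFromCounts expands fake per-bucket counts into representative sample
// values. boundaries follow the 1-2-5 scheme; the full counts array has
// len(boundaries)+1 entries (one per bucket, the last bucket being
// (lastBoundary, +Inf)). A shorter counts array is treated as having
// trailing zeros.
func samplesFromCounts(boundaries []float64, counts []int) ([]float64, error) {
	if len(counts) > len(boundaries)+1 {
		return nil, fmt.Errorf("counts has %d entries, expected at most %d", len(counts), len(boundaries)+1)
	}
	var samples []float64
	prev := 0.0
	for i, count := range counts {
		// Representative value: midpoint of (prev, boundary]; for the final
		// (+Inf) bucket, fall back to just above the last boundary.
		var value float64
		if i < len(boundaries) {
			value = (prev + boundaries[i]) / 2
			prev = boundaries[i]
		} else {
			value = prev + 1
		}
		for j := 0; j < count; j++ {
			samples = append(samples, value)
		}
	}
	return samples, nil
}

func main() {
	boundaries := []float64{1, 2, 5, 10, 20, 50, 100, 200, 500, 1000}
	// 5 requests with ~10 tokens and 5 requests with ~40 tokens,
	// written in the shortened form [0, 0, 0, 5, 0, 5].
	samples, err := samplesFromCounts(boundaries, []int{0, 0, 0, 5, 0, 5})
	if err != nil {
		panic(err)
	}
	fmt.Println(samples)
}
```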
|
Sorry for the delay.
|
@googs1025 this feature is needed for issue #211. Do you think you can wrap up the pending changes this week? I’ll take care of the other time-related metrics.
will handle today |
Force-pushed 3541a37 to 02bc5c3
@googs1025
Thanks a lot for your updates, I added some small comments.
… metrics Signed-off-by: googs1025 <[email protected]>
Signed-off-by: googs1025 <[email protected]>
Force-pushed 3449fa9 to 08e6177
|
@googs1025 thanks for the updates, looks great. Please note that a test is failing.
Force-pushed 08e6177 to e9d2eca
|
@googs1025 Hi, we want to release a new simulator version with this feature. Would you have time to fix the test today? Otherwise we will continue with the fix ourselves.
Force-pushed e9d2eca to 44251ad
will fix |
@googs1025 Added some comments on your latest changes
Force-pushed 44251ad to 6705df6
Signed-off-by: googs1025 <[email protected]>
Force-pushed 6705df6 to f8adf26
/lgtm
/approve
Add Prometheus metrics for request prompt tokens, generation tokens, the max_tokens request parameter, and successful requests.
part of: #191