nv_inference_request_duration_us metrics have seemingly wrong values if vllm_backend is used #8463

@ahakanbaba

Description

Using the 25.03 Triton container with the other applicable versions from the NVIDIA Framework Support Matrix, for example vllm_backend (r25.03).

When I query Triton's /metrics endpoint, the nv_inference_request_duration_us values look unreasonably small.

I am hosting a Llama-3-8B model on 2 A100 GPUs with tensor parallelism = 2, using the vllm_backend.
When I repeatedly hit the /metrics endpoint I get the following numbers:

 $ date -Is && curl http://localhost:7101/metrics 2> /dev/null | grep -P "nv_inference_request_duration_us|nv_inference_count"
2025-10-18T04:12:40+00:00
# HELP nv_inference_count Number of inferences performed (does not include cached requests)
# TYPE nv_inference_count counter
nv_inference_count{model="hbvllm",version="1"} 3075905
# HELP nv_inference_request_duration_us Cumulative inference request duration in microseconds (includes cached requests)
# TYPE nv_inference_request_duration_us counter
nv_inference_request_duration_us{model="hbvllm",version="1"} 858150828

 $ date -Is && curl http://localhost:7101/metrics 2> /dev/null | grep -P "nv_inference_request_duration_us|nv_inference_count"
2025-10-18T04:12:41+00:00
# HELP nv_inference_count Number of inferences performed (does not include cached requests)
# TYPE nv_inference_count counter
nv_inference_count{model="hbvllm",version="1"} 3075927
# HELP nv_inference_request_duration_us Cumulative inference request duration in microseconds (includes cached requests)
# TYPE nv_inference_request_duration_us counter
nv_inference_request_duration_us{model="hbvllm",version="1"} 858156398

Doing the math, the average request duration per inference works out to an unreasonably small number:

( 858156398 - 858150828 ) / (3075927 - 3075905) / 1000000 = 0.00025318181 s

That is roughly 253 µs per request, which cannot be true for this workload, so I wanted to open this ticket to get some discussion going.
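
For reference, the computation above can be reproduced directly from two scrapes. A minimal sketch, assuming the same metrics port (7101) and model name (hbvllm) as above and an arbitrary 60 s interval:

#!/usr/bin/env bash
# Scrape /metrics twice and compute the average request duration per
# inference from the counter deltas (both metrics are cumulative counters).
metric() {
  curl -s http://localhost:7101/metrics | grep "^$1" | grep 'model="hbvllm"' | awk '{print $2}'
}
c1=$(metric nv_inference_count); d1=$(metric nv_inference_request_duration_us)
sleep 60
c2=$(metric nv_inference_count); d2=$(metric nv_inference_request_duration_us)
echo "average request duration: $(( (d2 - d1) / (c2 - c1) )) us"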

Triton Information

Using the 25.03 Triton container with the other applicable versions from the NVIDIA Framework Support Matrix, for example vllm_backend (r25.03).

Are you using the Triton container or did you build it yourself?
I built it myself using the build.py from the server.
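
For illustration, a build along the following lines produces a vLLM-enabled container; this is a sketch rather than my exact command, and the available flags may differ between releases (check ./build.py --help for r25.03):

# Illustrative build.py invocation (flags are an approximation, not the
# exact command used for this deployment).
python3 build.py \
    --enable-logging --enable-stats --enable-metrics --enable-gpu \
    --endpoint=http --endpoint=grpc \
    --backend=vllm:r25.03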

To Reproduce
Steps to reproduce the behavior.

Load a Llama-3-8B model; the same weights from Hugging Face, and the base model is enough. The model configuration looks like this:

 $ cat config.pbtxt

name: "hbvllm"
backend: "vllm"
instance_group [
    {
        count: 1
        kind: KIND_MODEL
    }
]

And the model.json:

 $ cat model.json
{
    "model": "<....>/hbvllm",
    "dtype": "auto",
    "tokenizer_mode": "auto",
    "load_format": "auto",
    "kv_cache_dtype": "auto",
    "disable_custom_all_reduce": false,
    "tensor_parallel_size": 2,
    "gpu_memory_utilization": 0.9,
    "enforce_eager": false
}

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).

The traffic is a sustained 30 RPS with roughly 6K prompt tokens and about 30 generation tokens per request.
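
Any client that sustains about 30 requests per second works for load generation. As an illustration only (the HTTP port 8000 and the max_tokens parameter are assumptions, not my exact load generator), a single request through Triton's generate extension with the default vllm_backend input names looks like:

# Hypothetical single request; adjust port, prompt, and parameters
# to match your deployment.
curl -s -X POST http://localhost:8000/v2/models/hbvllm/generate \
    -d '{"text_input": "<~6K-token prompt>", "parameters": {"stream": false, "max_tokens": 30}}'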

Expected behavior
A clear and concise description of what you expected to happen.

To understand the expected behavior, I hosted the exact same model using the trtllm-backend, version 0.18.0 (again from the NVIDIA compatibility matrix).

The output from the /metrics endpoint is much more reasonable there.

 $ date -Is && curl http://localhost:8002/metrics 2> /dev/null | grep -P  "nv_inference_request_duration_us|nv_inference_count"
2025-10-18T04:12:07+00:00
# HELP nv_inference_count Number of inferences performed (does not include cached requests)
# TYPE nv_inference_count counter
nv_inference_count{model="ensemble",version="1"} 206722
nv_inference_count{model="tensorrt_llm",version="1"} 207452
nv_inference_count{model="tensorrt_llm_bls",version="1"} 0
nv_inference_count{model="preprocessing",version="1"} 207470
nv_inference_count{model="postprocessing",version="1"} 206722
# HELP nv_inference_request_duration_us Cumulative inference request duration in microseconds (includes cached requests)
# TYPE nv_inference_request_duration_us counter
nv_inference_request_duration_us{model="ensemble",version="1"} 239908674095
nv_inference_request_duration_us{model="tensorrt_llm",version="1"} 267688550623
nv_inference_request_duration_us{model="tensorrt_llm_bls",version="1"} 0
nv_inference_request_duration_us{model="preprocessing",version="1"} 8688350161
nv_inference_request_duration_us{model="postprocessing",version="1"} 301745336

$ date -Is && curl http://localhost:8002/metrics 2> /dev/null | grep -P  "nv_inference_request_duration_us|nv_inference_count"
2025-10-18T04:12:08+00:00
# HELP nv_inference_count Number of inferences performed (does not include cached requests)
# TYPE nv_inference_count counter
nv_inference_count{model="ensemble",version="1"} 206740
nv_inference_count{model="tensorrt_llm",version="1"} 207471
nv_inference_count{model="tensorrt_llm_bls",version="1"} 0
nv_inference_count{model="preprocessing",version="1"} 207491
nv_inference_count{model="postprocessing",version="1"} 206740
# HELP nv_inference_request_duration_us Cumulative inference request duration in microseconds (includes cached requests)
# TYPE nv_inference_request_duration_us counter
nv_inference_request_duration_us{model="ensemble",version="1"} 239930277668
nv_inference_request_duration_us{model="tensorrt_llm",version="1"} 267710130852
nv_inference_request_duration_us{model="tensorrt_llm_bls",version="1"} 0
nv_inference_request_duration_us{model="preprocessing",version="1"} 8689193487
nv_inference_request_duration_us{model="postprocessing",version="1"} 301771877

Doing the math, the average request duration per inference looks much more reasonable here:

(239930277668 - 239908674095) / (206740 - 206722) / 1000000 ≈ 1.2 seconds
