nv_inference_request_duration_us metrics have seemingly wrong values if vllm_backend is used #8463

@ahakanbaba

Description

Using the 25.03 Triton container with the other applicable versions from the NVIDIA Framework Support Matrix, for example vllm_backend (r25.03).

When I query Triton's /metrics endpoint, the nv_inference_request_duration_us values look unreasonably small.

I am hosting a Llama-3-8B model on 2 A100 GPUs with tensor parallelism = 2, using the vllm_backend.
When I repeatedly hit the /metrics endpoint I get the following numbers:

 $ date -Is && curl http://localhost:7101/metrics 2> /dev/null | grep -P "nv_inference_request_duration_us|nv_inference_count"
2025-10-18T04:12:40+00:00
# HELP nv_inference_count Number of inferences performed (does not include cached requests)
# TYPE nv_inference_count counter
nv_inference_count{model="hbvllm",version="1"} 3075905
# HELP nv_inference_request_duration_us Cumulative inference request duration in microseconds (includes cached requests)
# TYPE nv_inference_request_duration_us counter
nv_inference_request_duration_us{model="hbvllm",version="1"} 858150828

 $ date -Is && curl http://localhost:7101/metrics 2> /dev/null | grep -P "nv_inference_request_duration_us|nv_inference_count"
2025-10-18T04:12:41+00:00
# HELP nv_inference_count Number of inferences performed (does not include cached requests)
# TYPE nv_inference_count counter
nv_inference_count{model="hbvllm",version="1"} 3075927
# HELP nv_inference_request_duration_us Cumulative inference request duration in microseconds (includes cached requests)
# TYPE nv_inference_request_duration_us counter
nv_inference_request_duration_us{model="hbvllm",version="1"} 858156398

Doing the math, the average request duration per inference works out to an unreasonably small number:

( 858156398 - 858150828 ) / (3075927 - 3075905) / 1000000 = 0.00025318181 s

That is roughly 253 µs per request, which cannot be true for this workload, so I wanted to open this ticket to get some discussion going.
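
For reference, the computation above can be reproduced directly from two scrapes. A minimal sketch, assuming the same metrics port (7101) and model name (hbvllm) as above and an arbitrary 60 s interval:

#!/usr/bin/env bash
# Scrape /metrics twice and compute the average request duration per
# inference from the counter deltas (both metrics are cumulative counters).
metric() {
  curl -s http://localhost:7101/metrics | grep "^$1" | grep 'model="hbvllm"' | awk '{print $2}'
}
c1=$(metric nv_inference_count); d1=$(metric nv_inference_request_duration_us)
sleep 60
c2=$(metric nv_inference_count); d2=$(metric nv_inference_request_duration_us)
echo "average request duration: $(( (d2 - d1) / (c2 - c1) )) us"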

Triton Information

Using the 25.03 Triton container with the other applicable versions from the NVIDIA Framework Support Matrix, for example vllm_backend (r25.03).

Are you using the Triton container or did you build it yourself?
I built it myself using the build.py from the server.
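
For illustration, a build along the following lines produces a vLLM-enabled container; this is a sketch rather than my exact command, and the available flags may differ between releases (check ./build.py --help for r25.03):

# Illustrative build.py invocation (flags are an approximation, not the
# exact command used for this deployment).
python3 build.py \
    --enable-logging --enable-stats --enable-metrics --enable-gpu \
    --endpoint=http --endpoint=grpc \
    --backend=vllm:r25.03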

To Reproduce
Steps to reproduce the behavior.

Load a Llama-3-8B model; the same weights from Hugging Face, and the base model is enough. The model configuration looks like this:

 $ cat config.pbtxt

name: "hbvllm"
backend: "vllm"
instance_group [
    {
        count: 1
        kind: KIND_MODEL
    }
]

And the model.json:

 $ cat model.json
{
    "model": "<....>/hbvllm",
    "dtype": "auto",
    "tokenizer_mode": "auto",
    "load_format": "auto",
    "kv_cache_dtype": "auto",
    "disable_custom_all_reduce": false,
    "tensor_parallel_size": 2,
    "gpu_memory_utilization": 0.9,
    "enforce_eager": false
}

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).

The traffic is a sustained 30 RPS with roughly 6K prompt tokens and about 30 generation tokens per request.
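
Any client that sustains about 30 requests per second works for load generation. As an illustration only (the HTTP port 8000 and the max_tokens parameter are assumptions, not my exact load generator), a single request through Triton's generate extension with the default vllm_backend input names looks like:

# Hypothetical single request; adjust port, prompt, and parameters
# to match your deployment.
curl -s -X POST http://localhost:8000/v2/models/hbvllm/generate \
    -d '{"text_input": "<~6K-token prompt>", "parameters": {"stream": false, "max_tokens": 30}}'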

Expected behavior
A clear and concise description of what you expected to happen.

To understand the expected behavior, I hosted the exact same model using the trtllm-backend, version 0.18.0 (again from the NVIDIA compatibility matrix).

The output from the /metrics endpoint is much more reasonable there.

 $ date -Is && curl http://localhost:8002/metrics 2> /dev/null | grep -P  "nv_inference_request_duration_us|nv_inference_count"
2025-10-18T04:12:07+00:00
# HELP nv_inference_count Number of inferences performed (does not include cached requests)
# TYPE nv_inference_count counter
nv_inference_count{model="ensemble",version="1"} 206722
nv_inference_count{model="tensorrt_llm",version="1"} 207452
nv_inference_count{model="tensorrt_llm_bls",version="1"} 0
nv_inference_count{model="preprocessing",version="1"} 207470
nv_inference_count{model="postprocessing",version="1"} 206722
# HELP nv_inference_request_duration_us Cumulative inference request duration in microseconds (includes cached requests)
# TYPE nv_inference_request_duration_us counter
nv_inference_request_duration_us{model="ensemble",version="1"} 239908674095
nv_inference_request_duration_us{model="tensorrt_llm",version="1"} 267688550623
nv_inference_request_duration_us{model="tensorrt_llm_bls",version="1"} 0
nv_inference_request_duration_us{model="preprocessing",version="1"} 8688350161
nv_inference_request_duration_us{model="postprocessing",version="1"} 301745336

$ date -Is && curl http://localhost:8002/metrics 2> /dev/null | grep -P  "nv_inference_request_duration_us|nv_inference_count"
2025-10-18T04:12:08+00:00
# HELP nv_inference_count Number of inferences performed (does not include cached requests)
# TYPE nv_inference_count counter
nv_inference_count{model="ensemble",version="1"} 206740
nv_inference_count{model="tensorrt_llm",version="1"} 207471
nv_inference_count{model="tensorrt_llm_bls",version="1"} 0
nv_inference_count{model="preprocessing",version="1"} 207491
nv_inference_count{model="postprocessing",version="1"} 206740
# HELP nv_inference_request_duration_us Cumulative inference request duration in microseconds (includes cached requests)
# TYPE nv_inference_request_duration_us counter
nv_inference_request_duration_us{model="ensemble",version="1"} 239930277668
nv_inference_request_duration_us{model="tensorrt_llm",version="1"} 267710130852
nv_inference_request_duration_us{model="tensorrt_llm_bls",version="1"} 0
nv_inference_request_duration_us{model="preprocessing",version="1"} 8689193487
nv_inference_request_duration_us{model="postprocessing",version="1"} 301771877

Doing the math, the average request duration per inference looks much more reasonable here:

(239930277668 - 239908674095) / (206740 - 206722) / 1000000 ≈ 1.2 seconds
