Description
I am analyzing the performance of the Qwen2-VL model on NVIDIA Thor using the TREx (TensorRT Engine Explorer) tool.
According to the README, when using trtexec to time individual layers, the sum of per-layer average latencies is expected to be higher than the end-to-end engine latency, due to measurement overhead.
This matches what I observe on ViT and LLM Prefill workloads.
However, when analyzing the LLM Generation phase, I observe the opposite behavior:
for FP8 and INT4 quantized engines, the sum of layer latencies reported by TREx is consistently lower than the end-to-end latency.
I manually re-computed latency statistics from the JSON file generated by trtexec and confirmed that TREx is accurately reflecting the JSON contents.
Therefore, I would like to confirm whether this behavior is expected, or whether there may be an issue with how I invoked trtexec.
Environment
TensorRT Version: 10.13.1
NVIDIA GPU: NVIDIA Thor
NVIDIA Driver Version:
CUDA Version: 12.8
CUDNN Version:
Operating System:
Python Version (if applicable): 3.12.3
Tensorflow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if so, version):
Relevant Files
Model link:
Steps To Reproduce
Convert the fine-tuned Qwen2-VL model into a TensorRT engine
I used the TensorRT-LLM workflow to build the engine from a fine-tuned Qwen2-VL checkpoint.
During engine building, I enabled detailed profiling with:
config->setProfilingVerbosity(nvinfer1::ProfilingVerbosity::kDETAILED);
Generate profiling JSON outputs using trtexec
After obtaining the engine, I executed trtexec with the following command (Python-style argument construction shown here):
trtexec_path,
"--verbose",
"--useCudaGraph",
"--separateProfileRun",
"--useSpinWait",
f"--useProfile={profile}",
f"--loadEngine={engine_path}",
f"--exportTimes={timing_json}",
f"--exportProfile={profiling_json}",
f"--exportLayerInfo={graph_json}",
f"--timingCacheFile={timing_cache}",
"--profilingVerbosity=detailed"
Using --noDataTransfers results in:
sampleInference.cpp:1017: an illegal memory access was encountered
so this flag was removed.
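For reference, the argument list above can be assembled into a complete invocation as follows. This is a minimal sketch: the trtexec path and output filenames are placeholders standing in for my actual paths, and `build_trtexec_cmd` is just a hypothetical helper wrapping the flags shown above.

```python
import subprocess

def build_trtexec_cmd(trtexec_path, engine_path, timing_json,
                      profiling_json, graph_json, timing_cache,
                      profile=True):
    """Assemble the trtexec command line used for per-layer profiling."""
    return [
        trtexec_path,
        "--verbose",
        "--useCudaGraph",
        "--separateProfileRun",
        "--useSpinWait",
        f"--useProfile={profile}",
        f"--loadEngine={engine_path}",
        f"--exportTimes={timing_json}",
        f"--exportProfile={profiling_json}",
        f"--exportLayerInfo={graph_json}",
        f"--timingCacheFile={timing_cache}",
        "--profilingVerbosity=detailed",
    ]

# Placeholder paths; substitute your own engine and output locations.
cmd = build_trtexec_cmd(
    "/usr/src/tensorrt/bin/trtexec",
    "llm_int4_noneagle.engine",
    "times.json", "profile.json", "graph.json", "timing.cache",
)
# subprocess.run(cmd, check=True)  # the actual run requires the engine on-device
```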
Compare TREx results with raw JSON output
For the INT4 quantization, the value of "mean" under "GPU Compute Time" in profile.metadata.json is:
7.42306 ms
After summing all "averageMs" values in profile.json across layers, the result is:
7.09662919 ms
which is lower than the end-to-end "GPU Compute Time" value.
TREx reports the same cumulative layer-time result as the JSON, confirming that its statistics match the raw trtexec output.
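The comparison in this step can be scripted. Below is a minimal sketch of how I recomputed the numbers; it assumes profile.json is a list of records carrying "averageMs" (skipping non-layer entries such as a leading {"count": ...}) and that profile.metadata.json nests the mean under "GPU Compute Time", matching the files attached below.

```python
import json

def layer_sum_vs_e2e(profile_path, metadata_path):
    """Sum per-layer "averageMs" values and compare against the
    end-to-end "GPU Compute Time" mean from the metadata file."""
    with open(profile_path) as f:
        layers = json.load(f)
    with open(metadata_path) as f:
        metadata = json.load(f)

    # Skip entries without "averageMs" (e.g. the leading {"count": ...}).
    layer_sum = sum(rec["averageMs"] for rec in layers if "averageMs" in rec)
    e2e_mean = metadata["GPU Compute Time"]["mean"]
    return layer_sum, e2e_mean, layer_sum - e2e_mean
```

For the INT4 engine this reproduces the gap reported above: roughly 7.097 ms summed across layers versus 7.423 ms end-to-end, i.e. the layer sum is lower, not higher.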
Here is profile.json and profile.metadata.json
llm_int4_noneagle.engine.1.profile.json
llm_int4_noneagle.engine.1.profile.metadata.json