Description
I am analyzing the performance of the Qwen2-VL model on NVIDIA Thor using the TREx (TensorRT Engine Explorer) tool.
According to the README, when using trtexec to time individual layers, the sum of per-layer average latencies is expected to be higher than the end-to-end engine latency, due to measurement overhead.
This matches what I observe on ViT and LLM Prefill workloads.
However, when analyzing the LLM Generation phase, I observe the opposite behavior:
for FP8 and INT4 quantized engines, the sum of layer latencies reported by TREx is consistently lower than the end-to-end latency.
I manually re-computed latency statistics from the JSON file generated by trtexec and confirmed that TREx is accurately reflecting the JSON contents.
Therefore, I would like to confirm whether this behavior is expected, or whether there may be an issue with how I invoked trtexec.
Environment
TensorRT Version: 10.13.1
NVIDIA GPU: NVIDIA Thor
NVIDIA Driver Version:
CUDA Version: 12.8
CUDNN Version:
Operating System:
Python Version (if applicable): 3.12.3
Tensorflow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if so, version):
Relevant Files
Model link:
Steps To Reproduce
Convert the fine-tuned Qwen2-VL model into a TensorRT engine
I used the TensorRT-LLM workflow to build the engine from a fine-tuned Qwen2-VL checkpoint.
During engine building, I enabled detailed profiling with:
config->setProfilingVerbosity(nvinfer1::ProfilingVerbosity::kDETAILED);
Generate profiling JSON outputs using trtexec
After obtaining the engine, I executed trtexec with the following command (Python-style argument construction shown here):
trtexec_path,
"--verbose",
"--useCudaGraph",
"--separateProfileRun",
"--useSpinWait",
f"--useProfile={profile}",
f"--loadEngine={engine_path}",
f"--exportTimes={timing_json}",
f"--exportProfile={profiling_json}",
f"--exportLayerInfo={graph_json}",
f"--timingCacheFile={timing_cache}",
"--profilingVerbosity=detailed"
Using --noDataTransfers results in:
sampleInference.cpp:1017: an illegal memory access was encountered
so this flag was removed.
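For reference, the argument list above can be assembled into a complete invocation as follows. This is a minimal sketch: the trtexec path and output filenames are placeholders standing in for my actual paths, and `build_trtexec_cmd` is just a hypothetical helper wrapping the flags shown above.

```python
import subprocess

def build_trtexec_cmd(trtexec_path, engine_path, timing_json,
                      profiling_json, graph_json, timing_cache,
                      profile=True):
    """Assemble the trtexec command line used for per-layer profiling."""
    return [
        trtexec_path,
        "--verbose",
        "--useCudaGraph",
        "--separateProfileRun",
        "--useSpinWait",
        f"--useProfile={profile}",
        f"--loadEngine={engine_path}",
        f"--exportTimes={timing_json}",
        f"--exportProfile={profiling_json}",
        f"--exportLayerInfo={graph_json}",
        f"--timingCacheFile={timing_cache}",
        "--profilingVerbosity=detailed",
    ]

# Placeholder paths; substitute your own engine and output locations.
cmd = build_trtexec_cmd(
    "/usr/src/tensorrt/bin/trtexec",
    "llm_int4_noneagle.engine",
    "times.json", "profile.json", "graph.json", "timing.cache",
)
# subprocess.run(cmd, check=True)  # the actual run requires the engine on-device
```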
Compare TREx results with raw JSON output
For the INT4 quantization, the value of "mean" under "GPU Compute Time" in profile.metadata.json is:
7.42306 ms
After summing all "averageMs" values in profile.json across layers, the result is:
7.09662919 ms
which is lower than the end-to-end "GPU Compute Time" value.
TREx reports the same cumulative layer-time result as the JSON, confirming that its statistics match the raw trtexec output.
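The comparison in this step can be scripted. Below is a minimal sketch of how I recomputed the numbers; it assumes profile.json is a list of records carrying "averageMs" (skipping non-layer entries such as a leading {"count": ...}) and that profile.metadata.json nests the mean under "GPU Compute Time", matching the files attached below.

```python
import json

def layer_sum_vs_e2e(profile_path, metadata_path):
    """Sum per-layer "averageMs" values and compare against the
    end-to-end "GPU Compute Time" mean from the metadata file."""
    with open(profile_path) as f:
        layers = json.load(f)
    with open(metadata_path) as f:
        metadata = json.load(f)

    # Skip entries without "averageMs" (e.g. the leading {"count": ...}).
    layer_sum = sum(rec["averageMs"] for rec in layers if "averageMs" in rec)
    e2e_mean = metadata["GPU Compute Time"]["mean"]
    return layer_sum, e2e_mean, layer_sum - e2e_mean
```

For the INT4 engine this reproduces the gap reported above: roughly 7.097 ms summed across layers versus 7.423 ms end-to-end, i.e. the layer sum is lower, not higher.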
Here is profile.json and profile.metadata.json
llm_int4_noneagle.engine.1.profile.json
llm_int4_noneagle.engine.1.profile.metadata.json