Description
Hi NVIDIA team,
I encountered a severe performance issue when converting a GPT-like model (with attention_mask input) from ONNX to TensorRT.
After a successful conversion and deployment on DriveOS (AGX Orin, TensorRT 8.6.15), I used Nsight Systems (nsys) to analyze runtime performance. Surprisingly, more than 90% of total inference time is spent on attention_mask-related operations, while the actual matrix multiplications and other Transformer layers take very little time.
This behavior seems abnormal; I suspect the mask computation is not being efficiently fused or optimized by TensorRT.
Environment
- Platform: NVIDIA Drive AGX Orin
- DriveOS Version: 6.12.1
- TensorRT Version: 8.6.15
- CUDA Version: 11.4
- Model type: GPT-like transformer (causal decoder)
- Input: token embeddings + attention_mask + past_kv_cache
- Precision: FP16
- Batch size: 1
- ONNX opset: 16
Steps to Reproduce
- Export the model from PyTorch → ONNX (with attention_mask input); a minimal export sketch is included after this list.
- Convert the ONNX model to a TensorRT engine:
  /usr/src/tensorrt/bin/trtexec --onnx=test.onnx --saveEngine=test.engine --exportLayerInfo=test_layerinfo.log --profilingVerbosity=detailed --exportProfile=test_profile.log --separateProfileRun --duration=5 --streams=1 --useCudaGraph --fp16 --verbose
- Run inference and profile using Nsight Systems:
  nsys profile --force-overwrite true -o test --trace=cuda,nvtx,cublas,cudnn --stats=true \
    /usr/src/tensorrt/bin/trtexec \
    --loadEngine=test.engine \
    --iterations=10 --idleTime=500 --duration=0 --useSpinWait
- Observe the runtime results: attention_mask-related kernels dominate total time (>90%).
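For completeness, here is a minimal, self-contained export sketch in the spirit of step 1. The toy module, vocabulary size, shapes, and input names are illustrative assumptions only; the real test.onnx comes from our full GPT-like decoder.

import torch
import torch.nn as nn

# Tiny stand-in decoder (illustrative assumption only, not the real model):
# it just embeds token ids and zeroes out padded positions via attention_mask.
class ToyDecoder(nn.Module):
    def __init__(self, vocab=32000, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.proj = nn.Linear(hidden, vocab)

    def forward(self, input_ids, attention_mask):
        h = self.embed(input_ids) * attention_mask.unsqueeze(-1).float()
        return self.proj(h)

model = ToyDecoder().eval()
input_ids = torch.randint(0, 32000, (1, 128))
attention_mask = torch.ones(1, 128, dtype=torch.int64)

torch.onnx.export(
    model,
    (input_ids, attention_mask),
    "test.onnx",
    opset_version=16,
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {1: "seq"}, "attention_mask": {1: "seq"}},
)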
Expected Behavior
attention_mask operations should be lightweight (simple broadcast or add ops) and not dominate runtime.
Because the attention_mask handling is expanded into many intermediate (glue) operators, it appears to be the main bottleneck during TensorRT inference.
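To make the expectation concrete, this is the kind of op pattern I would expect the mask to lower to (a rough sketch with made-up shapes, not taken from the actual graph): one subtract/multiply to build an additive bias, one broadcast Add into the attention scores, then Softmax.

import torch

batch, heads, seq = 1, 8, 128
scores = torch.randn(batch, heads, seq, seq)   # Q·K^T / sqrt(d), placeholder values
attention_mask = torch.ones(batch, seq)        # 1 = attend, 0 = padded

bias = (1.0 - attention_mask) * -1e4           # additive bias: 0 or large negative
scores = scores + bias[:, None, None, :]       # single broadcast Add
probs = torch.softmax(scores, dim=-1)          # Softmax over the key dimension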
Attachments
I’ve attached:
- ONNX model (test.onnx)
- TensorRT conversion log (test.txt)
- Nsight Systems trace (test.nsys-rep)
→ Upload as: attention_mask_perf_issue.zip
Questions
- Is this a known issue or regression in TRT 8.6.15 on DriveOS?
- Are there recommended practices to optimize or fuse attention_mask handling (e.g. using a plugin or a model rewrite)? One possible rewrite is sketched below.
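As one concrete idea for the "model rewrite" option, the sketch below is purely a hypothesis on my side (not something I have validated on Orin): precompute the combined causal + padding bias on the host and feed it to the engine as a single FP16 input, e.g. a hypothetical input named attn_bias, so the Cast/Expand/Where glue around attention_mask never enters the TensorRT graph.

import numpy as np

def build_attn_bias(attention_mask: np.ndarray, neg: float = -1e4) -> np.ndarray:
    # attention_mask: (batch, seq) of 0/1; returns a (batch, 1, seq, seq) FP16 additive bias.
    batch, seq = attention_mask.shape
    causal = np.triu(np.full((seq, seq), neg, dtype=np.float16), k=1)            # mask future positions
    padding = ((1 - attention_mask)[:, None, None, :] * neg).astype(np.float16)  # mask padded keys
    return causal[None, None, :, :] + padding

mask = np.ones((1, 128), dtype=np.int64)
bias = build_attn_bias(mask)   # would be bound to the hypothetical engine input "attn_bias"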
Additional Notes
The attached test.onnx is only part of the full model. I also tried removing attention_mask and instead exposing past_key_values_input as a dynamic input; however, after conversion the latency was still not ideal because of the dynamic input.
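If it helps to reproduce the dynamic-input variant, a trtexec invocation with explicit optimization profiles would look roughly like the following. The input name past_key_values and the 24x2x1x16xSx64 layout are placeholders only and would need to match the actual ONNX input; pinning min/opt/max shapes lets TensorRT tune kernels for the common cache length instead of the most general one.

/usr/src/tensorrt/bin/trtexec --onnx=test.onnx --saveEngine=test_dyn.engine --fp16 \
  --minShapes=past_key_values:24x2x1x16x1x64 \
  --optShapes=past_key_values:24x2x1x16x512x64 \
  --maxShapes=past_key_values:24x2x1x16x1024x64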
I would really appreciate any feedback or workaround, as this issue currently blocks our deployment. Thanks a lot for your time and support!