
Excessive inference time spent on attention_mask operations after ONNX→TensorRT conversion (DriveOS AGX Orin + TRT 8.6.15) #4610

@JieShare

Description


Hi NVIDIA team,

I encountered a severe performance issue after converting a GPT-like model (with an attention_mask input) from ONNX to TensorRT.

After successful conversion and deployment on DriveOS (AGX Orin, TensorRT 8.6.15), I used Nsight Systems (nsys) to analyze runtime performance. Surprisingly, more than 90% of the total inference time is spent on attention_mask-related operations, while the actual matrix multiplications and other Transformer layers take very little time.

This behavior seems abnormal; I suspect the mask computation is not being efficiently fused or optimized by TensorRT.
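For reference, by "attention_mask operations" I mean the usual GPT-style expansion of a [batch, seq] padding mask into a large-negative additive bias. The following is only an illustrative sketch of that pattern (the exact ops in my model may differ); each line typically shows up as separate Unsqueeze/Cast/Sub/Mul nodes in the exported ONNX graph, plus a broadcast Add inside every layer:

import torch

# Illustrative sketch (assumed pattern, not copied from the attached model):
# a [batch, seq] 0/1 mask becomes a [batch, 1, 1, seq] additive bias with
# large negative values at padded positions.
def expand_attention_mask(attention_mask: torch.Tensor, dtype=torch.float16):
    mask = attention_mask[:, None, None, :].to(dtype)   # Unsqueeze + Cast
    return (1.0 - mask) * torch.finfo(dtype).min        # Sub + Mul

# Inside every attention layer the bias is then broadcast-added to the scores:
#   scores = scores + expand_attention_mask(attention_mask)
# The Add itself is cheap; the concern is the chain of glue ops above when the
# builder does not fold or fuse them.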


Environment

  • Platform: NVIDIA Drive AGX Orin
  • DriveOS Version: 6.12.1
  • TensorRT Version: 8.6.15
  • CUDA Version: 11.4
  • Model type: GPT-like transformer (causal decoder)
  • Inputs: token embeddings + attention_mask + past_kv_cache
  • Precision: FP16
  • Batch size: 1
  • ONNX opset: 16

Steps to Reproduce

  1. Export the model from PyTorch → ONNX (with the attention_mask input); a minimal export sketch is shown after this list.
  2. Convert the ONNX model to a TensorRT engine:
/usr/src/tensorrt/bin/trtexec --onnx=test.onnx --saveEngine=test.engine --exportLayerInfo=test_layerinfo.log --profilingVerbosity=detailed --exportProfile=test_profile.log --separateProfileRun --duration=5 --streams=1 --useCudaGraph --fp16 --verbose
  3. Run inference and profile using Nsight Systems:
nsys profile --force-overwrite true -o test --trace=cuda,nvtx,cublas,cudnn --stats=true \
        /usr/src/tensorrt/bin/trtexec \
        --loadEngine=test.engine \
        --iterations=10 --idleTime=500 --duration=0 --useSpinWait
  4. Observe the runtime results: attention_mask-related kernels dominate total time (>90%).
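A minimal sketch of the export in step 1. The stand-in module and the input names below are placeholders for illustration only; the real model also takes the past_kv_cache inputs:

import torch
import torch.nn as nn

class TinyDecoder(nn.Module):
    """Stand-in for the real GPT-like decoder, just to make the sketch runnable."""
    def __init__(self, vocab=32000, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.proj = nn.Linear(dim, vocab)

    def forward(self, input_ids, attention_mask):
        h = self.emb(input_ids)
        # The real model applies masked self-attention here; the stand-in only
        # touches the mask so that it is kept as a graph input.
        h = h * attention_mask[..., None].to(h.dtype)
        return self.proj(h)

model = TinyDecoder().eval()
ids = torch.zeros(1, 128, dtype=torch.int64)
mask = torch.ones(1, 128, dtype=torch.int64)

torch.onnx.export(
    model, (ids, mask), "test.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    opset_version=16,
)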

Expected Behavior

attention_mask operations should be lightweight (simple broadcast or add ops) and should not dominate runtime.

In practice, however, the attention_mask handling generates many intermediate (glue) operators, which may be one of the main bottlenecks during TensorRT inference.
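For comparison, the lightweight path I would expect for a batch-size-1 causal decoder looks roughly like the sketch below (a simplified illustration that ignores padding and KV-cache offsets): the mask reduces to one precomputed additive bias, so the per-layer cost is a single broadcast add.

import torch

def causal_bias(seq_len: int, dtype=torch.float16):
    # Precomputed once (or baked into the graph as a constant): the upper
    # triangle gets a large negative value, everything else stays 0.
    bias = torch.full((seq_len, seq_len), torch.finfo(dtype).min, dtype=dtype)
    return torch.triu(bias, diagonal=1)[None, None]  # [1, 1, seq, seq]

# Per attention layer:
#   scores = scores + causal_bias(seq_len)   # one broadcast Add, no Cast/Sub/Mul chain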


Attachments

I’ve attached:

  • test.onnx
  • TensorRT conversion log (test.txt)
  • Nsight Systems trace (test.nsys-rep)

Uploaded as: attention_mask_perf_issue.zip


Questions

  1. Is this a known issue or regression in TRT 8.6.15 on DriveOS?
  2. Are there recommended practices to optimize or fuse the attention_mask handling (e.g. via a plugin or a model rewrite)? One possible rewrite is sketched after this list.
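For question 2, one rewrite direction I am considering, shown only as a hedged sketch under the assumption that attention_mask is effectively constant for our deployment (e.g. all ones at the profiled sequence length; the input name and shape below are placeholders): freeze the mask input with onnx-graphsurgeon so the downstream glue ops become constant-foldable before building the engine.

import numpy as np
import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("test.onnx"))

# Freeze the attention_mask input to a constant (assumed name and [1, 128] shape)
# so the downstream Unsqueeze/Cast/Sub/Mul glue ops can be constant-folded.
for inp in list(graph.inputs):
    if inp.name == "attention_mask":
        inp.to_constant(np.ones((1, 128), dtype=np.int64))
        graph.inputs.remove(inp)

graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "test_mask_frozen.onnx")

# Constant folding can then be done with Polygraphy before trtexec, e.g.:
#   polygraphy surgeon sanitize test_mask_frozen.onnx --fold-constants -o test_folded.onnx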

Additional Notes

The attached test.onnx is only part of the full model. I also tried removing the attention_mask input and exposing past_key_values as a dynamic input instead; however, after conversion the latency was still not ideal because of the dynamic input shapes. A shape-profile command sketch is included below.
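For that dynamic past_key_values variant, I assume the engine also needs explicit optimization profiles for the cache dimension so kernels are tuned for realistic shapes rather than an unbounded range. A hedged command sketch, where the input name and shapes are placeholders and not the real ones from test.onnx:

/usr/src/tensorrt/bin/trtexec --onnx=test.onnx --saveEngine=test.engine --fp16 \
        --minShapes=past_key_values:1x8x1x64 \
        --optShapes=past_key_values:1x8x128x64 \
        --maxShapes=past_key_values:1x8x1024x64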

I would really appreciate any feedback or workaround, as this issue currently blocks our deployment. Thanks a lot for your time and support!

Labels: Module:Embedded (issues when using TensorRT on embedded platforms), Module:ONNX (issues relating to ONNX usage and import), Module:Performance (general performance issues)
