
Excessive inference time spent on attention_mask operations after ONNX→TensorRT conversion (DriveOS AGX Orin + TRT 8.6.15) #4610

@JieShare

Description


Hi NVIDIA team,

I encountered a severe performance issue after converting a GPT-like model (with an attention_mask input) from ONNX to TensorRT.

After successful conversion and deployment on DriveOS (AGX Orin, TensorRT 8.6.15), I used Nsight Systems (nsys) to analyze runtime performance. Surprisingly, more than 90% of the total inference time is spent on attention_mask-related operations, while the actual matrix multiplications and other Transformer layers take very little time.

This behavior seems abnormal; I suspect the mask computation is not being efficiently fused or optimized by TensorRT.
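For reference, by "attention_mask operations" I mean the usual GPT-style expansion of a [batch, seq] padding mask into a large-negative additive bias. The following is only an illustrative sketch of that pattern (the exact ops in my model may differ); each line typically shows up as separate Unsqueeze/Cast/Sub/Mul nodes in the exported ONNX graph, plus a broadcast Add inside every layer:

import torch

# Illustrative sketch (assumed pattern, not copied from the attached model):
# a [batch, seq] 0/1 mask becomes a [batch, 1, 1, seq] additive bias with
# large negative values at padded positions.
def expand_attention_mask(attention_mask: torch.Tensor, dtype=torch.float16):
    mask = attention_mask[:, None, None, :].to(dtype)   # Unsqueeze + Cast
    return (1.0 - mask) * torch.finfo(dtype).min        # Sub + Mul

# Inside every attention layer the bias is then broadcast-added to the scores:
#   scores = scores + expand_attention_mask(attention_mask)
# The Add itself is cheap; the concern is the chain of glue ops above when the
# builder does not fold or fuse them.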


Environment

  • Platform: NVIDIA Drive AGX Orin
  • DriveOS Version: 6.12.1
  • TensorRT Version: 8.6.15
  • CUDA Version: 11.4
  • Model type: GPT-like transformer (causal decoder)
  • Inputs: token embeddings + attention_mask + past_kv_cache
  • Precision: FP16
  • Batch size: 1
  • ONNX opset: 16

Steps to Reproduce

  1. Export the model from PyTorch → ONNX (with the attention_mask input); a minimal export sketch is shown after this list.
  2. Convert the ONNX model to a TensorRT engine:
/usr/src/tensorrt/bin/trtexec --onnx=test.onnx --saveEngine=test.engine --exportLayerInfo=test_layerinfo.log --profilingVerbosity=detailed --exportProfile=test_profile.log --separateProfileRun --duration=5 --streams=1 --useCudaGraph --fp16 --verbose
  3. Run inference and profile using Nsight Systems:
nsys profile --force-overwrite true -o test --trace=cuda,nvtx,cublas,cudnn --stats=true \
        /usr/src/tensorrt/bin/trtexec \
        --loadEngine=test.engine \
        --iterations=10 --idleTime=500 --duration=0 --useSpinWait
  4. Observe the runtime results: attention_mask-related kernels dominate total time (>90%).
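A minimal sketch of the export in step 1. The stand-in module and the input names below are placeholders for illustration only; the real model also takes the past_kv_cache inputs:

import torch
import torch.nn as nn

class TinyDecoder(nn.Module):
    """Stand-in for the real GPT-like decoder, just to make the sketch runnable."""
    def __init__(self, vocab=32000, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.proj = nn.Linear(dim, vocab)

    def forward(self, input_ids, attention_mask):
        h = self.emb(input_ids)
        # The real model applies masked self-attention here; the stand-in only
        # touches the mask so that it is kept as a graph input.
        h = h * attention_mask[..., None].to(h.dtype)
        return self.proj(h)

model = TinyDecoder().eval()
ids = torch.zeros(1, 128, dtype=torch.int64)
mask = torch.ones(1, 128, dtype=torch.int64)

torch.onnx.export(
    model, (ids, mask), "test.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    opset_version=16,
)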

Expected Behavior

attention_mask operations should be lightweight (simple broadcast or add ops) and should not dominate runtime.

In practice, however, the attention_mask handling generates many intermediate (glue) operators, which may be one of the main bottlenecks during TensorRT inference.
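For comparison, the lightweight path I would expect for a batch-size-1 causal decoder looks roughly like the sketch below (a simplified illustration that ignores padding and KV-cache offsets): the mask reduces to one precomputed additive bias, so the per-layer cost is a single broadcast add.

import torch

def causal_bias(seq_len: int, dtype=torch.float16):
    # Precomputed once (or baked into the graph as a constant): the upper
    # triangle gets a large negative value, everything else stays 0.
    bias = torch.full((seq_len, seq_len), torch.finfo(dtype).min, dtype=dtype)
    return torch.triu(bias, diagonal=1)[None, None]  # [1, 1, seq, seq]

# Per attention layer:
#   scores = scores + causal_bias(seq_len)   # one broadcast Add, no Cast/Sub/Mul chain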


Attachments

I’ve attached:

  • test.onnx
  • TensorRT conversion log (test.txt)
  • Nsight Systems trace (test.nsys-rep)

Uploaded as: attention_mask_perf_issue.zip


Questions

  1. Is this a known issue or regression in TRT 8.6.15 on DriveOS?
  2. Are there recommended practices to optimize or fuse the attention_mask handling (e.g. via a plugin or a model rewrite)? One possible rewrite is sketched after this list.
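For question 2, one rewrite direction I am considering, shown only as a hedged sketch under the assumption that attention_mask is effectively constant for our deployment (e.g. all ones at the profiled sequence length; the input name and shape below are placeholders): freeze the mask input with onnx-graphsurgeon so the downstream glue ops become constant-foldable before building the engine.

import numpy as np
import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("test.onnx"))

# Freeze the attention_mask input to a constant (assumed name and [1, 128] shape)
# so the downstream Unsqueeze/Cast/Sub/Mul glue ops can be constant-folded.
for inp in list(graph.inputs):
    if inp.name == "attention_mask":
        inp.to_constant(np.ones((1, 128), dtype=np.int64))
        graph.inputs.remove(inp)

graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "test_mask_frozen.onnx")

# Constant folding can then be done with Polygraphy before trtexec, e.g.:
#   polygraphy surgeon sanitize test_mask_frozen.onnx --fold-constants -o test_folded.onnx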

Additional Notes

The attached test.onnx is only part of the full model. I also tried removing the attention_mask input and exposing past_key_values as a dynamic input instead; however, after conversion the latency was still not ideal because of the dynamic input shapes. A shape-profile command sketch is included below.
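For that dynamic past_key_values variant, I assume the engine also needs explicit optimization profiles for the cache dimension so kernels are tuned for realistic shapes rather than an unbounded range. A hedged command sketch, where the input name and shapes are placeholders and not the real ones from test.onnx:

/usr/src/tensorrt/bin/trtexec --onnx=test.onnx --saveEngine=test.engine --fp16 \
        --minShapes=past_key_values:1x8x1x64 \
        --optShapes=past_key_values:1x8x128x64 \
        --maxShapes=past_key_values:1x8x1024x64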

I would really appreciate any feedback or workaround, as this issue currently blocks our deployment. Thanks a lot for your time and support!

Labels: Module:Embedded (issues when using TensorRT on embedded platforms), Module:ONNX (issues relating to ONNX usage and import), Module:Performance (general performance issues)
