Description
NMS layers are much slower on TensorRT than on PyTorch (roughly 44% of PyTorch's throughput), and I'm looking for any possible workaround. This appears to be acknowledged as a known issue in the TensorRT release notes here:
> A performance regression is expected for TensorRT 10.x with respect to TensorRT 8.6 for networks with operations that involve data-dependent shapes, such as non-max suppression or non-zero operations
Is there any possible workaround, or a fix planned for a specific future version? I am specifically using these layers inside a FasterRCNN network (as implemented in torchvision here). I observe this network to be much slower when running with either a single image or a batch of 4 images:
- Single image inference latency: 7.8ms on PyTorch, 13.3ms on TensorRT
- 4 image inference latency: 22.8ms on PyTorch, 53.5ms on TensorRT
When I run this network with per-layer profiling, I see that the NonMaxSuppression layers account for 75%+ of the overall inference time. I have verified this on TensorRT 10.0 and 10.6. I have tested using ONNX opset 11 and opset 17.
Environment
TensorRT Version: 10.0, 10.6
NVIDIA GPU: GeForce RTX 4090
NVIDIA Driver Version: 550.54.15
CUDA Version: 12.4
CUDNN Version: unsure
Operating System:
Python Version (if applicable): 3.9
Tensorflow Version (if applicable):
PyTorch Version (if applicable): 2.2
Baremetal or Container (if so, version):
Relevant Files
Model link: https://pytorch.org/vision/main/models/faster_rcnn.html
Steps To Reproduce
- Export FasterRCNN to ONNX
- Pass the ONNX model into trtexec
- Compare the trtexec output to the equivalent PyTorch benchmark
Commands or scripts:
Have you tried the latest release?: Yes, I have tried TensorRT 10.0 and 10.6.
Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt): Yes, it runs with ONNX Runtime.