
trtexec performance drop between bs=1 and bs=N #4563

@mw-vince

Description

I converted my PyTorch model to a TensorRT engine using torch.onnx.export and trtexec, but I observed a severe performance drop between bs=1 and bs=2. Specifically, throughput (QPS, queries per second) at bs=2 is almost 50% lower than at bs=1.

This issue appears very similar to #976.

I ran the following tests:

Inference with static batch size = 1

  • Throughput: 70.0427 QPS
  • Latency: min = 14.6602 ms, max = 19.5797 ms, mean = 14.7998 ms
  • Enqueue Time: min = 0.8222 ms, max = 1.1145 ms, mean = 0.9361 ms
  • H2D Latency: min = 0.2893 ms, max = 0.5101 ms, mean = 0.2997 ms
  • GPU Compute Time: min = 14.1738 ms, max = 18.7857 ms, mean = 14.2679 ms
  • D2H Latency: min = 0.1887 ms, max = 0.2986 ms, mean = 0.2322 ms
  • Total Host Walltime: 3.01245 s
  • Total GPU Compute Time: 3.01053 s

Inference with dynamic batch size = 1

  • Throughput: 70.1527 QPS
  • Latency: min = 14.6797 ms, max = 19.5344 ms, mean = 14.7761 ms
  • Enqueue Time: min = 0.6501 ms, max = 1.4023 ms, mean = 0.9709 ms
  • H2D Latency: min = 0.2877 ms, max = 0.4823 ms, mean = 0.2998 ms
  • GPU Compute Time: min = 14.1780 ms, max = 18.7908 ms, mean = 14.2463 ms
  • D2H Latency: min = 0.1880 ms, max = 0.2996 ms, mean = 0.2300 ms
  • Total Host Walltime: 3.02198 s
  • Total GPU Compute Time: 3.02021 s

Inference with static batch size = 2

  • Throughput: 36.9994 QPS
  • Latency: min = 27.7139 ms, max = 35.2312 ms, mean = 27.9782 ms
  • Enqueue Time: min = 0.6567 ms, max = 1.3977 ms, mean = 0.9839 ms
  • H2D Latency: min = 0.5500 ms, max = 0.9541 ms, mean = 0.5673 ms
  • GPU Compute Time: min = 26.7891 ms, max = 33.8774 ms, mean = 27.0105 ms
  • D2H Latency: min = 0.3606 ms, max = 0.4104 ms, mean = 0.4004 ms
  • Total Host Walltime: 3.0541 s
  • Total GPU Compute Time: 3.05219 s

Inference with dynamic batch size = 2

  • Throughput: 36.9886 QPS
  • Latency: min = 27.7246 ms, max = 35.6218 ms, mean = 27.9854 ms
  • Enqueue Time: min = 0.6264 ms, max = 1.2593 ms, mean = 0.9366 ms
  • H2D Latency: min = 0.5500 ms, max = 0.9653 ms, mean = 0.5660 ms
  • GPU Compute Time: min = 26.7764 ms, max = 34.2528 ms, mean = 27.0181 ms
  • D2H Latency: min = 0.3611 ms, max = 0.4097 ms, mean = 0.4013 ms
  • Total Host Walltime: 3.0550 s
  • Total GPU Compute Time: 3.05305 s

Environment

  • TensorRT Version: 8.6.2.3
  • Device: NVIDIA Jetson Orin NX (16 GB RAM)
  • CUDA Version: 12.2.140
  • cuDNN Version: 8.9.4.25
  • Operating System: Ubuntu 22.04 (Jammy Jellyfish)
  • ONNX: 1.19.0

Python Script to Export PyTorch -> ONNX (static batch size)

import torch

# model and weights_pth are assumed to be defined earlier (not shown)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
checkpoint = torch.load(f=weights_pth, map_location="cpu", weights_only=True)
model.load_state_dict(state_dict=checkpoint["model_state_dict"])
model.to(device)  # keep the model on the same device as the dummy input
model.eval()
dummy_input = torch.randn(size=(1, 3, 416, 608), device=device)
save_path = weights_pth.parents[0] / (weights_pth.stem + ".onnx")

torch.onnx.export(
    model=model,
    args=(dummy_input,),
    f=save_path,
    export_params=True,
    verbose=False,
    input_names=["input"],
    output_names=["output"],
    opset_version=17
)
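
To rule out a problem in the export itself, the exported graph can be checked and compared against the PyTorch model on the same input. This is a minimal sanity-check sketch, assuming onnx and onnxruntime are installed, that model, dummy_input, and save_path are the objects from the script above, and that the model returns a single tensor:

import numpy as np
import onnx
import onnxruntime as ort

# Structural validation of the exported graph
onnx.checker.check_model(str(save_path))

# Compare ONNX Runtime output with the PyTorch output on the same input
with torch.no_grad():
    torch_out = model(dummy_input).cpu().numpy()  # assumes a single-tensor output

sess = ort.InferenceSession(str(save_path), providers=["CPUExecutionProvider"])
onnx_out = sess.run(None, {"input": dummy_input.cpu().numpy()})[0]
print("max abs diff:", np.abs(torch_out - onnx_out).max())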

Python Script to Export PyTorch -> ONNX (dynamic batch size)

import torch

# model and weights_pth are assumed to be defined earlier (not shown)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
checkpoint = torch.load(f=weights_pth, map_location="cpu", weights_only=True)
model.load_state_dict(state_dict=checkpoint["model_state_dict"])
model.to(device)  # keep the model on the same device as the dummy input
model.eval()
dummy_input = torch.randn(size=(2, 3, 416, 608), device=device)
save_path = weights_pth.parents[0] / (weights_pth.stem + ".onnx")

torch.onnx.export(
    model=model,
    args=(dummy_input,),
    f=save_path,
    export_params=True,
    verbose=False,
    input_names=["input"],
    output_names=["output"],
    opset_version=17,
    dynamic_axes={
        "input": {0: "batch_size"},
        "output": {0: "batch_size"}
    },
)
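
To confirm that the batch dimension really was exported as dynamic, the input shape stored in the ONNX graph can be inspected. A minimal sketch, assuming onnx is installed and save_path is the file written above:

import onnx

m = onnx.load(str(save_path))
for inp in m.graph.input:
    dims = [d.dim_param if d.dim_param else d.dim_value
            for d in inp.type.tensor_type.shape.dim]
    print(inp.name, dims)
# Expected for the dynamic export: input ['batch_size', 3, 416, 608]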

TRT Commands

Static batch size

trtexec --onnx=test_static_bs_1.onnx --saveEngine=test.engine --fp16
trtexec --onnx=test_static_bs_2.onnx --saveEngine=test.engine --fp16

Dynamic batch size

trtexec --onnx=model.onnx \
        --saveEngine=model.engine \
        --minShapes=input:1x3x416x608 \
        --optShapes=input:2x3x416x608 \
        --maxShapes=input:16x3x416x608 \
        --fp16

Inference (N is the runtime batch size, e.g. 1 or 2)

trtexec --loadEngine=file.engine --shapes=input:Nx3x416x608 --fp16
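
For debugging outside of trtexec, the same engine can also be exercised from Python. Below is a minimal sketch using Polygraphy (which ships with TensorRT); the engine path is a placeholder, and for a dynamic-shape engine the runner derives the input shape from the feed_dict:

import numpy as np
from polygraphy.backend.common import BytesFromPath
from polygraphy.backend.trt import EngineFromBytes, TrtRunner

load_engine = EngineFromBytes(BytesFromPath("model.engine"))  # placeholder path

with TrtRunner(load_engine) as runner:
    # Batch of 2; the input shape is taken from the feed_dict for dynamic engines
    feed = {"input": np.random.rand(2, 3, 416, 608).astype(np.float32)}
    outputs = runner.infer(feed_dict=feed)
    print({name: arr.shape for name, arr in outputs.items()})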

Observations

  • Throughput drops almost 2× when moving from batch 1 (~70 QPS) to batch 2 (~37 QPS).
  • Latency roughly doubles with batch 2 (~28 ms vs ~14.8 ms).
  • GPU compute time scales roughly linearly with batch size (~14 ms → ~27 ms), indicating poor batch efficiency (see the per-image calculation after this list).
  • H2D and D2H transfer times increase with batch size but remain small relative to GPU compute time, so they are not the bottleneck.
  • Enqueue time is similar across batch sizes.
  • Static vs dynamic batching shows almost identical performance, so it’s not the source of the issue.
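
For reference, trtexec reports throughput in queries (i.e. batches) per second, so per-image throughput can be derived from the numbers above; it stays roughly flat instead of improving with batching, which is what "poor batch efficiency" means here:

# Values taken from the trtexec runs reported above
qps_bs1 = 70.04   # queries/s at batch size 1
qps_bs2 = 37.00   # queries/s at batch size 2

print("images/s at bs=1:", qps_bs1 * 1)   # ~70 images/s
print("images/s at bs=2:", qps_bs2 * 2)   # ~74 images/s -> almost no gain from batching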

Questions

  • Is it normal to observe such a severe per-query throughput drop when moving from bs=1 to bs=2 (or larger batch sizes)?
  • Could this be caused by an issue in the ONNX export or TensorRT engine conversion?
  • Are there any recommended steps or best practices to improve performance for dynamic batch sizes in TensorRT?
