
trtexec performance drop between bs=1 and bs=N #4563

@mw-vince

Description

I converted my PyTorch model to a TensorRT engine using torch.onnx.export and trtexec, but I observed a severe performance drop between bs=1 and bs=2. Specifically, throughput (QPS, queries per second) at bs=2 is almost 50% lower than at bs=1.

This issue appears very similar to #976.

I ran the following tests:

Inference with static batch size = 1

  • Throughput: 70.0427 QPS
  • Latency: min = 14.6602 ms, max = 19.5797 ms, mean = 14.7998 ms
  • Enqueue Time: min = 0.8222 ms, max = 1.1145 ms, mean = 0.9361 ms
  • H2D Latency: min = 0.2893 ms, max = 0.5101 ms, mean = 0.2997 ms
  • GPU Compute Time: min = 14.1738 ms, max = 18.7857 ms, mean = 14.2679 ms
  • D2H Latency: min = 0.1887 ms, max = 0.2986 ms, mean = 0.2322 ms
  • Total Host Walltime: 3.01245 s
  • Total GPU Compute Time: 3.01053 s

Inference with dynamic batch size = 1

  • Throughput: 70.1527 QPS
  • Latency: min = 14.6797 ms, max = 19.5344 ms, mean = 14.7761 ms
  • Enqueue Time: min = 0.6501 ms, max = 1.4023 ms, mean = 0.9709 ms
  • H2D Latency: min = 0.2877 ms, max = 0.4823 ms, mean = 0.2998 ms
  • GPU Compute Time: min = 14.1780 ms, max = 18.7908 ms, mean = 14.2463 ms
  • D2H Latency: min = 0.1880 ms, max = 0.2996 ms, mean = 0.2300 ms
  • Total Host Walltime: 3.02198 s
  • Total GPU Compute Time: 3.02021 s

Inference with static batch size = 2

  • Throughput: 36.9994 QPS
  • Latency: min = 27.7139 ms, max = 35.2312 ms, mean = 27.9782 ms
  • Enqueue Time: min = 0.6567 ms, max = 1.3977 ms, mean = 0.9839 ms
  • H2D Latency: min = 0.5500 ms, max = 0.9541 ms, mean = 0.5673 ms
  • GPU Compute Time: min = 26.7891 ms, max = 33.8774 ms, mean = 27.0105 ms
  • D2H Latency: min = 0.3606 ms, max = 0.4104 ms, mean = 0.4004 ms
  • Total Host Walltime: 3.0541 s
  • Total GPU Compute Time: 3.05219 s

Inference with dynamic batch size = 2

  • Throughput: 36.9886 QPS
  • Latency: min = 27.7246 ms, max = 35.6218 ms, mean = 27.9854 ms
  • Enqueue Time: min = 0.6264 ms, max = 1.2593 ms, mean = 0.9366 ms
  • H2D Latency: min = 0.5500 ms, max = 0.9653 ms, mean = 0.5660 ms
  • GPU Compute Time: min = 26.7764 ms, max = 34.2528 ms, mean = 27.0181 ms
  • D2H Latency: min = 0.3611 ms, max = 0.4097 ms, mean = 0.4013 ms
  • Total Host Walltime: 3.0550 s
  • Total GPU Compute Time: 3.05305 s

Environment

  • TensorRT Version: 8.6.2.3
  • Device: NVIDIA Jetson Orin NX (16 GB RAM)
  • CUDA Version: 12.2.140
  • cuDNN Version: 8.9.4.25
  • Operating System: Ubuntu 22.04 (Jammy Jellyfish)
  • ONNX: 1.19.0

Python Script to Export PyTorch -> ONNX (static batch size)

import torch

# model and weights_pth are assumed to be defined earlier (not shown)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
checkpoint = torch.load(f=weights_pth, map_location="cpu", weights_only=True)
model.load_state_dict(state_dict=checkpoint["model_state_dict"])
model.to(device)  # keep the model on the same device as the dummy input
model.eval()
dummy_input = torch.randn(size=(1, 3, 416, 608), device=device)
save_path = weights_pth.parents[0] / (weights_pth.stem + ".onnx")

torch.onnx.export(
    model=model,
    args=(dummy_input,),
    f=save_path,
    export_params=True,
    verbose=False,
    input_names=["input"],
    output_names=["output"],
    opset_version=17
)
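
To rule out a problem in the export itself, the exported graph can be checked and compared against the PyTorch model on the same input. This is a minimal sanity-check sketch, assuming onnx and onnxruntime are installed, that model, dummy_input, and save_path are the objects from the script above, and that the model returns a single tensor:

import numpy as np
import onnx
import onnxruntime as ort

# Structural validation of the exported graph
onnx.checker.check_model(str(save_path))

# Compare ONNX Runtime output with the PyTorch output on the same input
with torch.no_grad():
    torch_out = model(dummy_input).cpu().numpy()  # assumes a single-tensor output

sess = ort.InferenceSession(str(save_path), providers=["CPUExecutionProvider"])
onnx_out = sess.run(None, {"input": dummy_input.cpu().numpy()})[0]
print("max abs diff:", np.abs(torch_out - onnx_out).max())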

Python Script to Export PyTorch -> ONNX (dynamic batch size)

import torch

# model and weights_pth are assumed to be defined earlier (not shown)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
checkpoint = torch.load(f=weights_pth, map_location="cpu", weights_only=True)
model.load_state_dict(state_dict=checkpoint["model_state_dict"])
model.to(device)  # keep the model on the same device as the dummy input
model.eval()
dummy_input = torch.randn(size=(2, 3, 416, 608), device=device)
save_path = weights_pth.parents[0] / (weights_pth.stem + ".onnx")

torch.onnx.export(
    model=model,
    args=(dummy_input,),
    f=save_path,
    export_params=True,
    verbose=False,
    input_names=["input"],
    output_names=["output"],
    opset_version=17,
    dynamic_axes={
        "input": {0: "batch_size"},
        "output": {0: "batch_size"}
    },
)
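
To confirm that the batch dimension really was exported as dynamic, the input shape stored in the ONNX graph can be inspected. A minimal sketch, assuming onnx is installed and save_path is the file written above:

import onnx

m = onnx.load(str(save_path))
for inp in m.graph.input:
    dims = [d.dim_param if d.dim_param else d.dim_value
            for d in inp.type.tensor_type.shape.dim]
    print(inp.name, dims)
# Expected for the dynamic export: input ['batch_size', 3, 416, 608]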

TRT Commands

Static batch size

trtexec --onnx=test_static_bs_1.onnx --saveEngine=test.engine --fp16
trtexec --onnx=test_static_bs_2.onnx --saveEngine=test.engine --fp16

Dynamic batch size

trtexec --onnx=model.onnx \
        --saveEngine=model.engine \
        --minShapes=input:1x3x416x608 \
        --optShapes=input:2x3x416x608 \
        --maxShapes=input:16x3x416x608 \
        --fp16

Inference (N is the runtime batch size, e.g. 1 or 2)

trtexec --loadEngine=file.engine --shapes=input:Nx3x416x608 --fp16
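
For debugging outside of trtexec, the same engine can also be exercised from Python. Below is a minimal sketch using Polygraphy (which ships with TensorRT); the engine path is a placeholder, and for a dynamic-shape engine the runner derives the input shape from the feed_dict:

import numpy as np
from polygraphy.backend.common import BytesFromPath
from polygraphy.backend.trt import EngineFromBytes, TrtRunner

load_engine = EngineFromBytes(BytesFromPath("model.engine"))  # placeholder path

with TrtRunner(load_engine) as runner:
    # Batch of 2; the input shape is taken from the feed_dict for dynamic engines
    feed = {"input": np.random.rand(2, 3, 416, 608).astype(np.float32)}
    outputs = runner.infer(feed_dict=feed)
    print({name: arr.shape for name, arr in outputs.items()})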

Observations

  • Throughput drops almost 2× when moving from batch 1 (~70 QPS) to batch 2 (~37 QPS).
  • Latency roughly doubles with batch 2 (~28 ms vs ~14.8 ms).
  • GPU compute time scales roughly linearly with batch size (~14 ms → ~27 ms), indicating poor batch efficiency (see the per-image calculation after this list).
  • H2D and D2H transfer times increase with batch size but remain small relative to GPU compute time, so they are not the bottleneck.
  • Enqueue time is similar across batch sizes.
  • Static vs dynamic batching shows almost identical performance, so it’s not the source of the issue.
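
For reference, trtexec reports throughput in queries (i.e. batches) per second, so per-image throughput can be derived from the numbers above; it stays roughly flat instead of improving with batching, which is what "poor batch efficiency" means here:

# Values taken from the trtexec runs reported above
qps_bs1 = 70.04   # queries/s at batch size 1
qps_bs2 = 37.00   # queries/s at batch size 2

print("images/s at bs=1:", qps_bs1 * 1)   # ~70 images/s
print("images/s at bs=2:", qps_bs2 * 2)   # ~74 images/s -> almost no gain from batching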

Questions

  • Is it normal to observe such a severe per-query throughput drop when moving from bs=1 to bs=2 (or larger batch sizes)?
  • Could this be caused by an issue in the ONNX export or TensorRT engine conversion?
  • Are there any recommended steps or best practices to improve performance for dynamic batch sizes in TensorRT?
