Open
Labels: Module:Performance (General performance issues)
Description
I converted my PyTorch model to a TensorRT engine using torch.onnx.export and trtexec, but I observed a severe performance drop between bs=1 and bs=2: throughput (QPS, queries per second) at bs=2 is almost 50% lower than at bs=1.
This issue appears very similar to #976.
I ran the following tests:
Inference with static batch size = 1
- Throughput: 70.0427 QPS
- Latency: min = 14.6602 ms, max = 19.5797 ms, mean = 14.7998 ms
- Enqueue Time: min = 0.8222 ms, max = 1.1145 ms, mean = 0.9361 ms
- H2D Latency: min = 0.2893 ms, max = 0.5101 ms, mean = 0.2997 ms
- GPU Compute Time: min = 14.1738 ms, max = 18.7857 ms, mean = 14.2679 ms
- D2H Latency: min = 0.1887 ms, max = 0.2986 ms, mean = 0.2322 ms
- Total Host Walltime: 3.01245 s
- Total GPU Compute Time: 3.01053 s
Inference with dynamic batch size = 1
- Throughput: 70.1527 QPS
- Latency: min = 14.6797 ms, max = 19.5344 ms, mean = 14.7761 ms
- Enqueue Time: min = 0.6501 ms, max = 1.4023 ms, mean = 0.9709 ms
- H2D Latency: min = 0.2877 ms, max = 0.4823 ms, mean = 0.2998 ms
- GPU Compute Time: min = 14.1780 ms, max = 18.7908 ms, mean = 14.2463 ms
- D2H Latency: min = 0.1880 ms, max = 0.2996 ms, mean = 0.2300 ms
- Total Host Walltime: 3.02198 s
- Total GPU Compute Time: 3.02021 s
Inference with static batch size = 2
- Throughput: 36.9994 QPS
- Latency: min = 27.7139 ms, max = 35.2312 ms, mean = 27.9782 ms
- Enqueue Time: min = 0.6567 ms, max = 1.3977 ms, mean = 0.9839 ms
- H2D Latency: min = 0.5500 ms, max = 0.9541 ms, mean = 0.5673 ms
- GPU Compute Time: min = 26.7891 ms, max = 33.8774 ms, mean = 27.0105 ms
- D2H Latency: min = 0.3606 ms, max = 0.4104 ms, mean = 0.4004 ms
- Total Host Walltime: 3.0541 s
- Total GPU Compute Time: 3.05219 s
Inference with dynamic batch size = 2
- Throughput: 36.9886 QPS
- Latency: min = 27.7246 ms, max = 35.6218 ms, mean = 27.9854 ms
- Enqueue Time: min = 0.6264 ms, max = 1.2593 ms, mean = 0.9366 ms
- H2D Latency: min = 0.5500 ms, max = 0.9653 ms, mean = 0.5660 ms
- GPU Compute Time: min = 26.7764 ms, max = 34.2528 ms, mean = 27.0181 ms
- D2H Latency: min = 0.3611 ms, max = 0.4097 ms, mean = 0.4013 ms
- Total Host Walltime: 3.0550 s
- Total GPU Compute Time: 3.05305 s
Environment
- TensorRT Version: 8.6.2.3
- Device: NVIDIA Jetson Orin NX (16 GB RAM)
- CUDA Version: 12.2.140
- CUDNN Version: 8.9.4.25
- Operating System: Ubuntu 22.04 (Jammy Jellyfish)
- ONNX: 1.19.0
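For reference, the version numbers above can be collected with a short snippet like the one below (a minimal sketch; it assumes the torch, onnx, and tensorrt Python packages are all importable on the Jetson):
import torch
import onnx
import tensorrt as trt

# Print the library versions relevant to the export / build pipeline.
print("PyTorch:", torch.__version__)
print("CUDA (torch build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("ONNX:", onnx.__version__)
print("TensorRT:", trt.__version__)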
Python Script to Export PyTorch -> ONNX (static batch size)
import torch

# `model` and `weights_pth` (a pathlib.Path to the .pth checkpoint) are defined earlier in the script.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
checkpoint = torch.load(f=weights_pth, map_location="cpu", weights_only=True)
model.load_state_dict(state_dict=checkpoint["model_state_dict"])
model.to(device)  # keep the model on the same device as the dummy input
model.eval()

dummy_input = torch.randn(size=(1, 3, 416, 608), device=device)
save_path = weights_pth.parents[0] / (weights_pth.stem + ".onnx")

torch.onnx.export(
    model=model,
    args=(dummy_input,),
    f=save_path,
    export_params=True,
    verbose=False,
    input_names=["input"],
    output_names=["output"],
    opset_version=17,
)
Python Script to Export PyTorch -> ONNX (dynamic batch size)
# Same as above, except the dummy input has batch size 2 and dim 0 is exported as dynamic.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
checkpoint = torch.load(f=weights_pth, map_location="cpu", weights_only=True)
model.load_state_dict(state_dict=checkpoint["model_state_dict"])
model.to(device)
model.eval()

dummy_input = torch.randn(size=(2, 3, 416, 608), device=device)
save_path = weights_pth.parents[0] / (weights_pth.stem + ".onnx")

torch.onnx.export(
    model=model,
    args=(dummy_input,),
    f=save_path,
    export_params=True,
    verbose=False,
    input_names=["input"],
    output_names=["output"],
    opset_version=17,
    dynamic_axes={
        "input": {0: "batch_size"},
        "output": {0: "batch_size"},
    },
)
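To help rule out an export problem, the exported model can be sanity-checked before building the engine. Below is a minimal sketch (not part of my original scripts) that runs the ONNX checker and compares ONNX Runtime output against the PyTorch model on a random batch; it assumes onnxruntime is installed, that the model returns a single tensor, and it reuses model, device, and save_path from the export script above:
import numpy as np
import onnx
import onnxruntime as ort
import torch

# Structural / schema check of the exported graph.
onnx.checker.check_model(str(save_path))

# Compare ONNX Runtime output with the PyTorch model on a random bs=2 input.
x = torch.randn(2, 3, 416, 608, device=device)
with torch.no_grad():
    torch_out = model(x).cpu().numpy()

sess = ort.InferenceSession(str(save_path), providers=["CPUExecutionProvider"])
ort_out = sess.run(["output"], {"input": x.cpu().numpy()})[0]

print("max abs diff:", np.abs(torch_out - ort_out).max())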
TRT Commands
Static batch size
trtexec --onnx=test_static_bs_1.onnx --saveEngine=test.engine --fp16
trtexec --onnx=test_static_bs_2.onnx --saveEngine=test.engine --fp16
Dynamic batch size
trtexec --onnx=model.onnx \
    --saveEngine=model.engine \
    --minShapes=input:1x3x416x608 \
    --optShapes=input:2x3x416x608 \
    --maxShapes=input:16x3x416x608 \
    --fp16
Inference
trtexec --loadEngine=file.engine --shapes=input:Nx3x416x608 --fp16
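One way to dig further is to let trtexec report per-layer timings for each batch size and compare them. These are standard trtexec profiling flags; the JSON file names are just examples:
trtexec --loadEngine=file.engine --shapes=input:1x3x416x608 --separateProfileRun --dumpProfile --exportProfile=profile_bs1.json
trtexec --loadEngine=file.engine --shapes=input:2x3x416x608 --separateProfileRun --dumpProfile --exportProfile=profile_bs2.json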
Observations
- Throughput drops almost 2× when moving from batch 1 (~70 QPS) to batch 2 (~37 QPS); see the quick numeric check after this list.
- Latency roughly doubles with batch 2 (~28 ms vs ~14.8 ms).
- GPU compute time scales linearly with batch size (~14 ms → ~27 ms), indicating poor batch efficiency.
- H2D and D2H transfer times slightly increase with batch but are not the main bottleneck.
- Enqueue time is similar across batch sizes.
- Static vs dynamic batching shows almost identical performance, so it’s not the source of the issue.
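A quick back-of-the-envelope check using only the mean values reported above (illustrative arithmetic, not additional measurements):
# Mean values copied from the static-shape trtexec runs above.
qps_bs1, qps_bs2 = 70.0427, 36.9994
gpu_ms_bs1, gpu_ms_bs2 = 14.2679, 27.0105

print("QPS ratio (bs2 / bs1):", qps_bs2 / qps_bs1)             # ~0.53, i.e. queries/s roughly halves
print("GPU time ratio (bs2 / bs1):", gpu_ms_bs2 / gpu_ms_bs1)  # ~1.89, i.e. compute time roughly doubles
print("GPU time per image at bs=2 (ms):", gpu_ms_bs2 / 2)      # ~13.5 ms, vs ~14.3 ms per image at bs=1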
Questions
- Is it normal to observe such a severe performance drop between bs=1 and bs=2 when using batch sizes > 1?
- Could this be caused by an issue in the ONNX export or TensorRT engine conversion?
- Are there any recommended steps or best practices to improve performance for dynamic batch sizes in TensorRT?