Description
System Information
- OS: Ubuntu 22.04
- GPU: NVIDIA RTX 4090
- TensorRT Version: 10.11.0.33
- PyTorch Version: 2.7.0
- ONNX Opset: 14
🧠 Problem Summary
I converted a very basic bidirectional LSTM model from PyTorch to ONNX, and then to TensorRT using trtexec. However, inference with the TensorRT engine is slower than PyTorch, which is unexpected.
- PyTorch: ~0.5ms per forward pass
- TensorRT: ~1ms per forward pass
📦 Model Description
import torch
import torch.nn as nn

# Model config
INPUT_SIZE = 293
HIDDEN_SIZE = 128
NUM_LAYERS = 2
BIDIRECTION = True
BATCH_FIRST = True
DROPOUT = 0.0

# Model instantiation
lstm = nn.LSTM(INPUT_SIZE, HIDDEN_SIZE, NUM_LAYERS,
               bidirectional=BIDIRECTION,
               batch_first=BATCH_FIRST,
               dropout=DROPOUT)
🔄 Conversion Steps
- Export to ONNX:
dummy = torch.randn(32, 60, INPUT_SIZE)
torch.onnx.export(
    lstm, dummy, "pyannet_lstm.onnx",
    opset_version=14,
    input_names=["input"],
    output_names=["output", "h_out", "c_out"],
    dynamic_axes={
        "input": {0: "batch", 1: "time"},
        "output": {0: "batch", 1: "time"},
        "h_out": {1: "batch"},
        "c_out": {1: "batch"},
    },
)
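As a sanity check on the export itself (not part of the original repro), a minimal sketch with onnxruntime on CPU, comparing only the sequence output, can be used to confirm the ONNX graph matches PyTorch before timing anything:
import numpy as np
import onnxruntime as ort

# Assumes `torch`, `lstm`, and INPUT_SIZE from the blocks above
x = torch.randn(32, 60, INPUT_SIZE)
with torch.no_grad():
    ref, _ = lstm(x)

sess = ort.InferenceSession("pyannet_lstm.onnx", providers=["CPUExecutionProvider"])
onnx_out = sess.run(["output"], {"input": x.numpy()})[0]

# fp32 LSTM kernels can differ slightly between backends, so only a loose check
print("max abs diff:", np.abs(ref.numpy() - onnx_out).max())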
- Build TensorRT engine:
trtexec \
    --onnx=pyannet_lstm.onnx \
    --minShapes=input:1x60x293 \
    --optShapes=input:32x60x293 \
    --maxShapes=input:32x60x293 \
    --saveEngine=lstm.engine
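To time the saved engine on its own at the fixed batch size, something like the following works (the warm-up and iteration counts here are arbitrary choices, not the exact values behind the number reported below):
trtexec \
    --loadEngine=lstm.engine \
    --shapes=input:32x60x293 \
    --warmUp=500 \
    --iterations=1000 \
    --useSpinWait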
🧪 Performance Benchmark
PyTorch benchmark code:
x = torch.randn(32, 60, 293).cuda()
lstm.to("cuda").eval()
with torch.no_grad():
    for _ in range(1000):
        output, (h, c) = lstm(x)
- Average PyTorch time per batch: ~0.5ms
- Average TensorRT time per batch: ~1.0ms
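For the PyTorch side, a minimal sketch of a timing harness with a warm-up phase and explicit CUDA synchronization (the warm-up count is an arbitrary choice), so asynchronous kernel launches don't distort the average:
import time

iters = 1000
x = torch.randn(32, 60, 293).cuda()
lstm.to("cuda").eval()

with torch.no_grad():
    # Warm-up so cuDNN algorithm selection is not part of the measurement
    for _ in range(50):
        lstm(x)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        output, (h, c) = lstm(x)
    torch.cuda.synchronize()

print(f"avg ms per batch: {(time.perf_counter() - start) * 1e3 / iters:.3f}")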
🔍 Profiling Observation
Using NVIDIA Nsight Systems, I observed:
- PyTorch uses a fused kernel: RNN_blockPersist_fp_LSTM
- TensorRT seems to decompose the model into many small ops instead of using a fused LSTM kernel
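A kernel-level profile like this can be captured with roughly the following (the Python benchmark script name is a placeholder):
# Trace CUDA kernels for the PyTorch loop
nsys profile --trace=cuda,nvtx -o pytorch_lstm python bench_pytorch.py
# Same trace for the TensorRT engine via trtexec
nsys profile --trace=cuda,nvtx -o trt_lstm trtexec --loadEngine=lstm.engine --shapes=input:32x60x293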
❓ Questions
- Is it expected that TensorRT does not fuse the LSTM into a single kernel like RNN_blockPersist_fp_LSTM?
- Are there flags or version requirements to enable such fusion?
- Is this a known limitation with ONNX -> TensorRT conversion for LSTM?