Description
System Information
- OS: Ubuntu 22.04
- GPU: NVIDIA RTX 4090
- TensorRT Version: 10.11.0.33
- PyTorch Version: 2.7.0
- ONNX Opset: 14
🧠 Problem Summary
I converted a very basic bidirectional LSTM model from PyTorch to ONNX, and then to TensorRT using trtexec. However, inference with the TensorRT engine is slower than PyTorch, which is unexpected.
- PyTorch: ~0.5ms per forward pass
- TensorRT: ~1ms per forward pass
📦 Model Description
import torch
import torch.nn as nn

# Model config
INPUT_SIZE = 293
HIDDEN_SIZE = 128
NUM_LAYERS = 2
BIDIRECTION = True
BATCH_FIRST = True
DROPOUT = 0.0

# Model instantiation
lstm = nn.LSTM(INPUT_SIZE, HIDDEN_SIZE, NUM_LAYERS,
               bidirectional=BIDIRECTION,
               batch_first=BATCH_FIRST,
               dropout=DROPOUT)
🔄 Conversion Steps
- Export to ONNX:
dummy = torch.randn(32, 60, INPUT_SIZE)
torch.onnx.export(
    lstm, dummy, "pyannet_lstm.onnx",
    opset_version=14,
    input_names=["input"],
    output_names=["output", "h_out", "c_out"],
    dynamic_axes={
        "input": {0: "batch", 1: "time"},
        "output": {0: "batch", 1: "time"},
        "h_out": {1: "batch"},
        "c_out": {1: "batch"},
    },
)
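As a sanity check on the export itself (not part of the original repro), a minimal sketch with onnxruntime on CPU, comparing only the sequence output, can be used to confirm the ONNX graph matches PyTorch before timing anything:
import numpy as np
import onnxruntime as ort

# Assumes `torch`, `lstm`, and INPUT_SIZE from the blocks above
x = torch.randn(32, 60, INPUT_SIZE)
with torch.no_grad():
    ref, _ = lstm(x)

sess = ort.InferenceSession("pyannet_lstm.onnx", providers=["CPUExecutionProvider"])
onnx_out = sess.run(["output"], {"input": x.numpy()})[0]

# fp32 LSTM kernels can differ slightly between backends, so only a loose check
print("max abs diff:", np.abs(ref.numpy() - onnx_out).max())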
- Build TensorRT engine:
trtexec \
    --onnx=pyannet_lstm.onnx \
    --minShapes=input:1x60x293 \
    --optShapes=input:32x60x293 \
    --maxShapes=input:32x60x293 \
    --saveEngine=lstm.engine
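To time the saved engine on its own at the fixed batch size, something like the following works (the warm-up and iteration counts here are arbitrary choices, not the exact values behind the number reported below):
trtexec \
    --loadEngine=lstm.engine \
    --shapes=input:32x60x293 \
    --warmUp=500 \
    --iterations=1000 \
    --useSpinWait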
🧪 Performance Benchmark
PyTorch benchmark code:
x = torch.randn(32, 60, 293).cuda()
lstm.to("cuda").eval()
with torch.no_grad():
    for _ in range(1000):
        output, (h, c) = lstm(x)
- Average PyTorch time per batch: ~0.5ms
- Average TensorRT time per batch: ~1.0ms
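For the PyTorch side, a minimal sketch of a timing harness with a warm-up phase and explicit CUDA synchronization (the warm-up count is an arbitrary choice), so asynchronous kernel launches don't distort the average:
import time

iters = 1000
x = torch.randn(32, 60, 293).cuda()
lstm.to("cuda").eval()

with torch.no_grad():
    # Warm-up so cuDNN algorithm selection is not part of the measurement
    for _ in range(50):
        lstm(x)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        output, (h, c) = lstm(x)
    torch.cuda.synchronize()

print(f"avg ms per batch: {(time.perf_counter() - start) * 1e3 / iters:.3f}")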
🔍 Profiling Observation
Using NVIDIA Nsight Systems, I observed:
- PyTorch uses a fused kernel: RNN_blockPersist_fp_LSTM
- TensorRT seems to decompose the model into many small ops instead of using a fused LSTM kernel
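A kernel-level profile like this can be captured with roughly the following (the Python benchmark script name is a placeholder):
# Trace CUDA kernels for the PyTorch loop
nsys profile --trace=cuda,nvtx -o pytorch_lstm python bench_pytorch.py
# Same trace for the TensorRT engine via trtexec
nsys profile --trace=cuda,nvtx -o trt_lstm trtexec --loadEngine=lstm.engine --shapes=input:32x60x293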
❓ Questions
- Is it expected that TensorRT does not fuse the LSTM into a single kernel like RNN_blockPersist_fp_LSTM?
- Are there flags or version requirements to enable such fusion?
- Is this a known limitation with ONNX -> TensorRT conversion for LSTM?