Problem
Hello, thank you for your work on this project.
I encountered a performance issue while deploying the Transformer model and would like to ask whether this is a known limitation or whether there is a recommended deployment approach.
I tested inference performance with a fixed input shape of (4, 3, 256, 256) and observed that ONNX Runtime GPU is significantly slower than PyTorch:
- PyTorch: 0.17 ~ 0.19 s / batch
- Python ONNX Runtime GPU: 0.35 ~ 0.40 s / batch
- C++ ONNX Runtime GPU: 0.35 ~ 0.40 s / batch
In this case, ONNX Runtime GPU is about 2x slower than PyTorch.
Also, Python ORT and C++ ORT show very similar latency, so this does not appear to be caused by Python wrapper overhead.
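For reference, the numbers above come from a simple wall-clock harness of roughly the following shape. The timing helper is generic; the model path and input tensor name in the comment are placeholders, not values confirmed in this report:

```python
import time
import statistics

import numpy as np

def measure_latency(run_fn, batch, warmup=10, iters=50):
    """Time a single-batch inference callable; return median seconds per batch."""
    for _ in range(warmup):  # warm-up excludes lazy init / cuDNN autotuning
        run_fn(batch)
    times = []
    for _ in range(iters):
        start = time.perf_counter()
        run_fn(batch)
        times.append(time.perf_counter() - start)
    return statistics.median(times)

# The fixed input shape from this report: (4, 3, 256, 256).
batch = np.random.rand(4, 3, 256, 256).astype(np.float32)

# With ONNX Runtime GPU this callable would be something like:
#   sess = onnxruntime.InferenceSession("model.onnx",
#                                       providers=["CUDAExecutionProvider"])
#   run = lambda x: sess.run(None, {"input": x})
# A trivial stand-in keeps the sketch self-contained here:
run = lambda x: x.sum()
print(f"{measure_latency(run, batch):.6f} s / batch")
```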
What I have checked
I have already tried the following:
- Removed the `style` output - this made almost no difference.
- Exported the model with static batch and static input shape
- This also made almost no difference.
- Verified through profiling that the main computation is running on CUDAExecutionProvider.
- Checked the main hotspots in profiling:
  - the first `Conv`
  - later `Gemm`/`MatMul`
  - some `Reshape`/`Transpose` ops
- When setting `cudnn_conv_algo_search=DEFAULT`, the log shows that `Conv` runs in fallback mode, and performance becomes even worse.
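For completeness, this is roughly how the CUDA execution provider options were passed. `cudnn_conv_algo_search` accepts `EXHAUSTIVE`, `HEURISTIC`, or `DEFAULT`; the model path is a placeholder, and `cudnn_conv_use_max_workspace` is shown as a related option, not one confirmed above:

```python
# Provider options for CUDAExecutionProvider. EXHAUSTIVE asks cuDNN to
# benchmark all conv algorithms on first run rather than using a fallback.
cuda_options = {
    "cudnn_conv_algo_search": "EXHAUSTIVE",  # alternatives: "HEURISTIC", "DEFAULT"
    "cudnn_conv_use_max_workspace": "1",     # allow larger cuDNN workspaces
}
providers = [("CUDAExecutionProvider", cuda_options), "CPUExecutionProvider"]

# Session creation (requires onnxruntime-gpu and a CUDA device):
# import onnxruntime as ort
# sess = ort.InferenceSession("model.onnx", providers=providers)
print(providers[0][0])
```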
Questions
I would like to ask:
- Have you tested this Transformer model on ONNX Runtime GPU and compared its performance against PyTorch?
- Is there a recommended ONNX export method or deployment configuration for this model?
- In your experience, is this model better suited for TensorRT than for ONNX Runtime CUDAExecutionProvider?
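For context on the last question: TensorRT can also be used through ONNX Runtime itself by putting `TensorrtExecutionProvider` first in the provider list, so a separate TensorRT deployment is not strictly required to test it. A minimal sketch, with standard TensorRT EP options that have not been verified on this particular model:

```python
# Provider priority list: ORT tries TensorRT first, then CUDA, then CPU.
trt_options = {
    "trt_fp16_enable": True,          # allow FP16 kernels if the GPU supports them
    "trt_engine_cache_enable": True,  # cache built engines to avoid rebuild cost
}
providers = [
    ("TensorrtExecutionProvider", trt_options),
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]

# Requires an onnxruntime-gpu build with TensorRT support:
# import onnxruntime as ort
# sess = ort.InferenceSession("model.onnx", providers=providers)
print(len(providers))
```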
Additional information
If needed, I can also provide:
- ONNX export code
- ONNX Runtime profiling results
- a minimal reproducible script
Thank you.