Description
The performance of TRT-fp32 and OnnxRuntime matches that of the original PyTorch model, but TRT-fp16 shows obvious performance degradation. What is the reason, and how can it be solved?

Environment
PyTorch: 2.0.0
CUDA: 11.4
cuDNN: 8.6.0
TensorRT: 8.5-GA
Graphics Card: NVIDIA A100
GPU Driver version: 515.86.01
Operating System: Ubuntu 20.04
Python: 3.10

If some layers or operations of the model need to be modified to improve the performance of TRT-fp16, how can these layers or operations be located?
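
One common way to locate such layers is to dump the intermediate outputs of the FP32 and FP16 runs and find the first layer where the two diverge. The sketch below illustrates only that comparison step on a toy three-layer "network" in pure Python. Everything in it is an assumption for illustration: the layer names, the helper functions (`to_fp16`, `max_rel_error`, `first_divergent_layer`), and the tolerance are hypothetical, and FP16 is emulated with the standard library's half-precision `struct` format rather than with TensorRT itself.

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # Magnitude exceeds the FP16 maximum (65504): saturate to infinity.
        return math.copysign(math.inf, x)


def max_rel_error(ref, test):
    """Largest element-wise relative error between two equal-length lists."""
    worst = 0.0
    for r, t in zip(ref, test):
        if not math.isfinite(t):
            return math.inf
        worst = max(worst, abs(r - t) / max(abs(r), 1e-6))
    return worst


def first_divergent_layer(ref_outputs, test_outputs, tol=1e-2):
    """Return (name, error) for the first layer whose error exceeds tol.

    Both arguments map layer names to flat lists of floats, inserted in
    execution order. Returns None if every layer is within tolerance.
    """
    for name, ref in ref_outputs.items():
        err = max_rel_error(ref, test_outputs[name])
        if err > tol:
            return name, err
    return None


# Hypothetical toy network: scale -> square -> normalize.
x = [1.0, 20.0, 300.0]

# Reference run in full precision.
fp32 = {}
fp32["scale"] = [v * 2.0 for v in x]
fp32["square"] = [v * v for v in fp32["scale"]]
fp32["normalize"] = [v / 1e6 for v in fp32["square"]]

# Emulated FP16 run: every layer output is rounded to half precision.
fp16 = {}
fp16["scale"] = [to_fp16(to_fp16(v) * 2.0) for v in x]
fp16["square"] = [to_fp16(v * v) for v in fp16["scale"]]
fp16["normalize"] = [to_fp16(v / 1e6) for v in fp16["square"]]

result = first_divergent_layer(fp32, fp16)
print(result)
```

In this toy case the squaring layer is reported, because 600 squared exceeds the FP16 maximum of 65504 and overflows, even though the final normalized value would fit comfortably in FP16. With a real model, the per-layer outputs would come from the FP32 and FP16 engines instead (for example, by marking intermediate tensors as outputs), and the layers identified this way are candidates for being kept in FP32 while the rest of the network runs in FP16.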