Description
When I used TensorRT 8.6.11 to build an engine for a standard Transformer model, the FP32 results matched the ONNX results and the inference accuracy was very good, but FP16 inference lost a lot of accuracy. Note that I exported the ONNX model with opset 17, and the engine was generated without errors.
Through Polygraphy debugging, I found that LayerNorm causes the precision loss, so I tried forcing only the LayerNorm layers to FP32, but graph fusion turned the entire Transformer into FP32. I also tried constraining some other ops inside the Transformer, such as GEMM, and that likewise caused the whole Transformer to run in FP32. I want to keep LayerNorm in FP32 while leaving the other operators in FP16, so that I get better accuracy while still keeping inference efficient. How can I achieve this?
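For reference, this is roughly how I tried to pin LayerNorm to FP32 with the TensorRT Python API (a minimal sketch; `model.onnx` is a placeholder, and the LayerNorm layers are matched by layer type or by a name pattern, which may not correspond to the fused layer names in an actual build):

```python
import tensorrt as trt

# Minimal sketch: build an FP16 engine but pin LayerNorm layers to FP32.
# "model.onnx" and the name/type matching below are placeholders.
logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
# Make the builder respect the per-layer precisions set below
# instead of overriding them during tactic selection.
config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)

for i in range(network.num_layers):
    layer = network.get_layer(i)
    # Match LayerNorm either by the 8.6 normalization layer type
    # or by the ONNX node name (placeholder pattern).
    if layer.type == trt.LayerType.NORMALIZATION or "LayerNorm" in layer.name:
        layer.precision = trt.float32
        for j in range(layer.num_outputs):
            layer.set_output_type(j, trt.float32)

engine_bytes = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(engine_bytes)
```

Even with OBEY_PRECISION_CONSTRAINTS set, the surrounding layers end up in FP32 as described above, which is exactly the behavior I would like to avoid.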
Environment
TensorRT Version: 8.6.11
NVIDIA GPU: RTX 2070 and Drive ORIN-X
NVIDIA Driver Version: 530.41.03
CUDA Version: 11.7 and 12.1
CUDNN Version: 9.7.1
Operating System:
Python Version (if applicable): 3.8
Tensorflow Version (if applicable):
PyTorch Version (if applicable): 1.13.1
Baremetal or Container (if so, version):