I benchmarked bn_fusion with resnet50 and resnet101: performance improves on CPU (about 7%), but there is no improvement on CUDA.
There is no noticeable difference between torch.backends.cudnn.benchmark set to True and False.
My guess is that cuDNN already optimizes this case very well.
My test code: https://github.com/xuyuan/pytorch_bn_fusion/blob/master/test_convert_inference.py
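
For anyone following along, here is roughly what the fusion being timed does: the BatchNorm scale and shift are folded into the weights and bias of the preceding layer, so inference runs one op instead of two. This is only a minimal sketch in plain Python, not the code from the linked repo. It treats a 1x1 conv at a single pixel as a matrix-vector product, and all function names and numbers are made up for illustration.

```python
import math

def linear(weight, bias, x):
    # One output per weight row: y[c] = sum_i weight[c][i] * x[i] + bias[c]
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    # Inference-mode BN using fixed running statistics per channel.
    return [g * (yi - mu) / math.sqrt(v + eps) + bt
            for yi, g, bt, mu, v in zip(y, gamma, beta, mean, var)]

def fuse_bn(weight, bias, gamma, beta, mean, var, eps=1e-5):
    # Hypothetical helper: fold BN into the preceding layer.
    #   scale[c]   = gamma[c] / sqrt(var[c] + eps)
    #   fused_w[c] = scale[c] * weight[c]
    #   fused_b[c] = scale[c] * (bias[c] - mean[c]) + beta[c]
    fused_w, fused_b = [], []
    for row, b, g, bt, mu, v in zip(weight, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)
        fused_w.append([scale * w for w in row])
        fused_b.append(scale * (b - mu) + bt)
    return fused_w, fused_b

# Illustrative values only: 3 input channels, 2 output channels.
weight = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
bias = [0.1, -0.2]
gamma, beta = [1.2, 0.8], [0.05, -0.3]
mean, var = [0.4, -0.1], [2.0, 0.5]
x = [1.0, 2.0, -3.0]

reference = batch_norm(linear(weight, bias, x), gamma, beta, mean, var)
fused_w, fused_b = fuse_bn(weight, bias, gamma, beta, mean, var)
fused = linear(fused_w, fused_b, x)

# The two paths should agree up to floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(reference, fused))
print("max abs difference:", max_diff)
```

Since the fused layer does the same multiply-adds as the original conv, the only saving is the separate BN pass over the activations, which would fit with the gain being modest on CPU.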