During the powerfit search, the rotate_image3d OpenCL kernel rotates the template structure and mask to each rotation that is to be tested. This kernel is quite complex, with a for loop nested three levels deep, and it currently accounts for roughly one third of the processing time.
It might be possible to speed up this kernel and to reduce the gaps between the rotate_image3d kernel and the other kernels. However, I do not yet understand exactly how the kernel works.
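
To help reason about what a kernel of this shape is likely doing, here is a minimal NumPy sketch of one common way to rotate a 3D image with a triple nested loop. It is not the actual rotate_image3d source. The function name rotate_image3d_sketch, the nearest-neighbour sampling, the rotation about the array centre, and the zero fill for out-of-range voxels are all assumptions made for illustration.

```python
import numpy as np


def rotate_image3d_sketch(image, rotmat):
    """Rotate a 3D array about its centre (hypothetical illustration only).

    Assumptions, not taken from powerfit: arrays are indexed [z, y, x],
    rotmat is a 3x3 rotation matrix acting on (x, y, z) vectors, sampling is
    nearest neighbour, and voxels that map outside the input are left at zero.
    """
    image = np.asarray(image)
    nz, ny, nx = image.shape
    cz, cy, cx = (nz - 1) / 2.0, (ny - 1) / 2.0, (nx - 1) / 2.0

    # For a pure rotation the inverse is the transpose. Each output voxel is
    # mapped back into the input so that every output voxel is written once.
    inv = np.asarray(rotmat, dtype=float).T
    out = np.zeros_like(image)

    # The three nested loops visit every output voxel.
    for z in range(nz):
        for y in range(ny):
            for x in range(nx):
                # Output position relative to the centre, in (x, y, z) order.
                v = np.array([x - cx, y - cy, z - cz])
                sx, sy, sz = inv @ v
                # Nearest input voxel to the back-rotated position.
                ix = int(np.floor(sx + cx + 0.5))
                iy = int(np.floor(sy + cy + 0.5))
                iz = int(np.floor(sz + cz + 0.5))
                if 0 <= ix < nx and 0 <= iy < ny and 0 <= iz < nz:
                    out[z, y, x] = image[iz, iy, ix]
    return out


# Usage: a 90 degree rotation about z sends (x, y, z) to (-y, x, z).
img = np.zeros((3, 3, 3))
img[1, 1, 2] = 1.0  # one voxel at offset +x from the centre
rot_z90 = np.array([[0.0, -1.0, 0.0],
                    [1.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0]])
rotated = rotate_image3d_sketch(img, rot_z90)
```

In a formulation like this one, every output voxel is computed independently of the others. If the real kernel follows a similar pattern, how the three loop levels are divided among work-items would be one of the first things to check when looking for a speed-up.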