I've been running various DFT calculations with PSI4 inside Intel's VTune, and gg_fast_transpose popped up as a top hotspot. As a test, I swapped it for gg_naive_transpose and saw a significant speedup (50% for a C60 test) for that function.
It's only 4-5% of the total CPU time for single-point calculations, so it's no real bottleneck to worry about, but I'm wondering why the blocked transpose might be so much slower.
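
For reference, here is a minimal sketch of the two approaches being compared: a plain double loop and a cache-blocked variant. This is not the gau2grid source. The function names, the row-major layout, and the `block` parameter are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers for illustration only; not the gau2grid implementation.
 * Both write the transpose of a row-major n x m matrix `in`
 * into a row-major m x n matrix `out`. */

/* Naive version: one pass, strided writes into `out`. */
void naive_transpose_sketch(size_t n, size_t m, const double *in, double *out) {
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < m; j++) {
            out[j * n + i] = in[i * m + j];
        }
    }
}

/* Blocked version: walks the matrix in block x block tiles so that the
 * rows read and the columns written by one tile can stay in cache. */
void blocked_transpose_sketch(size_t n, size_t m, size_t block,
                              const double *in, double *out) {
    assert(block > 0); /* a zero block size would never advance the loops */
    for (size_t ib = 0; ib < n; ib += block) {
        /* Clamp the tile at the matrix edge. */
        size_t imax = (ib + block < n) ? ib + block : n;
        for (size_t jb = 0; jb < m; jb += block) {
            size_t jmax = (jb + block < m) ? jb + block : m;
            for (size_t i = ib; i < imax; i++) {
                for (size_t j = jb; j < jmax; j++) {
                    out[j * n + i] = in[i * m + j];
                }
            }
        }
    }
}
```

In general, blocking only pays off when the matrix is large enough that the strided accesses of the naive loop miss cache. For small or very skinny matrices, the extra loop and edge-handling overhead can cost more than it saves, and the tiled loops may also be harder for the compiler to vectorize. Whether that is what happens here would need to be checked against the actual shapes passed in.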