While looking at the PFB shape generation, I noticed that running it on the GPU produces a shape that appears shifted (similar to an fftshift). This only seems to happen when the sum step is done with the CUDA kernel. Each of the other steps can be done on the CPU, or this step can be done on the CPU, and the result is as expected. Look into whether the striding of the outputs matches the striding of the inputs.
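One way to confirm the symptom is to compare the kernel output against a CPU reference and test whether the two differ only by a circular shift. The following is a minimal sketch, not the project's code: it assumes the sum step folds the weighted input into `(n_taps, n_chan)` and sums over the taps, and the names `pfb_sum_reference` and `find_circular_shift` are hypothetical. The GPU result is simulated here with `np.fft.fftshift`, since the actual kernel is not shown.

```python
import numpy as np


def pfb_sum_reference(weighted, n_taps, n_chan):
    """CPU reference for the sum step (assumed layout: taps-major, C-contiguous)."""
    # Force a contiguous copy so the reshape does not depend on the input strides.
    folded = np.ascontiguousarray(weighted).reshape(n_taps, n_chan)
    return folded.sum(axis=0)


def find_circular_shift(candidate, reference, rtol=1e-5, atol=1e-8):
    """Return s such that np.roll(reference, s) matches candidate, else None."""
    for s in range(reference.size):
        if np.allclose(np.roll(reference, s), candidate, rtol=rtol, atol=atol):
            return s
    return None


rng = np.random.default_rng(0)
n_taps, n_chan = 4, 8  # illustrative sizes only
weighted = rng.standard_normal(n_taps * n_chan)

reference = pfb_sum_reference(weighted, n_taps, n_chan)

# Stand-in for the suspect kernel output: replace with the real GPU result
# copied back to the host.
suspect = np.fft.fftshift(reference)

shift = find_circular_shift(suspect, reference)
print("circular shift:", shift)
print("input strides:", weighted.strides, "output strides:", suspect.strides)
```

A shift of `n_chan // 2` would be consistent with the fftshift-like appearance. A result of `None` would suggest the mismatch is not a pure rotation, and the striding of the arrays passed to the kernel would be the next thing to inspect.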