See https://github.com/EnzymeAD/Enzyme-JAX/pull/2121 for an example. The current wrap implementation is a pad and two rotates. We should be able to do this with one, maybe two, collective permutes.
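
A rough sketch of the idea, in plain Python with no JAX or XLA calls. It assumes wrap means circular padding of `lhs` and `rhs` elements along one sharded dimension, and that each halo fits inside a single edge shard. `wrap_pad`, `collective_permute`, and `sharded_wrap` are hypothetical helper names for illustration, not Enzyme-JAX APIs, and the sketch ignores that a real partitioner needs uniform shard shapes.

```python
def wrap_pad(x, lhs, rhs):
    """Reference: circular padding of an unsharded 1-D list."""
    return x[len(x) - lhs:] + x + x[:rhs]


def collective_permute(shards, pairs, payload):
    """Simulate a collective permute over a list of per-device shards.

    Each (src, dst) pair sends payload(shards[src]) to device dst.
    Devices that receive nothing get None.
    """
    received = [None] * len(shards)
    for src, dst in pairs:
        received[dst] = payload(shards[src])
    return received


def sharded_wrap(shards, lhs, rhs):
    """Wrap-pad a sharded 1-D array using two simulated permutes.

    Assumes lhs <= len(shards[-1]) and rhs <= len(shards[0]).
    """
    n = len(shards)
    # Permute 1: the last device sends its trailing `lhs` elements to device 0.
    left_halo = collective_permute(
        shards, [(n - 1, 0)], lambda s: s[len(s) - lhs:]
    )
    # Permute 2: device 0 sends its leading `rhs` elements to the last device.
    right_halo = collective_permute(
        shards, [(0, n - 1)], lambda s: s[:rhs]
    )
    out = [list(s) for s in shards]
    out[0] = left_halo[0] + out[0]
    out[-1] = out[-1] + right_halo[n - 1]
    return out


x = list(range(8))
shards = [x[0:2], x[2:4], x[4:6], x[6:8]]
padded = sharded_wrap(shards, lhs=1, rhs=2)
flat = [v for s in padded for v in s]
assert flat == wrap_pad(x, 1, 2)
```

Only the two edge devices exchange data here, so interior shards are untouched. Whether the two transfers can be merged into a single permute with both source/target pairs likely depends on the halos having the same shape; that part is a guess.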