The https://github.com/FALCONN-LIB/FFHT repository implements a specialized SIMD FHT, which it says is faster than FFTW for this problem. (Totally possible, since 2×2×2×⋯ multidimensional transforms are not that optimized in FFTW — they are using size-2 DFT kernels, whereas to be fast they should really use larger multidimensional kernels.) Might be interesting to support using FFHT (which is MIT-licensed) in cases where it is applicable.