Why does rfft transpose use naive FFT instead of irfft? #34553
Replies: 3 comments 15 replies
-
That sounds plausible to me. I suspect we simply didn't realize that when writing the code. PRs welcome!
-
Hello! See my comment above. To get discussion going, here is a minimal example of running the new transpose rule and the corresponding HLO dump. I apologize in advance: I am still learning how to read these dumps, so any help interpreting them is appreciated. In particular, we should make sure that no additional arrays beyond the input are allocated, especially arrays that grow under `vmap`.

```python
import jax
import jax.numpy as jnp

def fn(arr):
    return jnp.fft.rfftn(arr)

arr = jax.random.normal(key=jax.random.key(1234), shape=(10, 10))
primals_out, vjp_fn = jax.vjp(fn, arr)
vjp_fn_jit = jax.jit(vjp_fn)
print(vjp_fn_jit.lower(primals_out).compile().as_text())
```

Please let me know if I am understanding correctly when the new transpose will be invoked.
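Since the concern is allocations that grow under `vmap`, one way to look for them is to dump the HLO of the batched VJP itself; batch-sized temporaries would show up there. A minimal sketch (the batch size of 8, the helper name `apply_vjp`, and the axis choices are my own assumptions, not from the thread):

```python
import jax
import jax.numpy as jnp

def fn(arr):
    return jnp.fft.rfftn(arr)

def apply_vjp(arr, ct):
    # Pull a cotangent back through rfftn; this exercises the transpose rule.
    _, vjp_fn = jax.vjp(fn, arr)
    return vjp_fn(ct)[0]

arrs = jax.random.normal(jax.random.key(0), shape=(8, 10, 10))
cts = jnp.fft.rfftn(arrs, axes=(-2, -1))  # per-example cotangent shape (10, 6)

# HLO text for the vmapped, jitted VJP; scan it for batch-sized buffers.
hlo = jax.jit(jax.vmap(apply_vjp)).lower(arrs, cts).compile().as_text()
print(hlo)
```

The same pattern works for comparing the old and new transpose rules: lower both, diff the dumps, and check whether any allocation scales with the leading batch dimension.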
-
I also tried an implementation as @unalmis suggests:

```python
# Drop-in replacement for _rfft_transpose in jax/_src/lax/fft.py; relies on
# that module's existing imports and helpers (math, lax, fft, FftType, _real_dtype).
def _rfft_transpose(t, fft_lengths):
    if fft_lengths[-1] % 2 == 0:
        t = t.at[..., 1:-1].divide(2.0, indices_are_sorted=True, unique_indices=True)
    else:
        t = t.at[..., 1:].divide(2.0, indices_are_sorted=True, unique_indices=True)
    N = math.prod(fft_lengths)
    out = N * fft(lax.conj(t), FftType.IRFFT, fft_lengths)
    assert out.dtype == _real_dtype(t.dtype), (out.dtype, t.dtype)
    return out
```

Interestingly, this yielded an HLO dump that seems worse. Under `vmap`, there is a constant that gets broadcast to the size of the input array. I find that confusing; does anyone have insight?
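For what it's worth, the mask-then-`irfft` idea can be checked numerically against `jax.linear_transpose` of `rfftn` itself. Below is a standalone sketch of that check, not the `fft.py`-internal code: the helper name `rfft_transpose_via_irfft`, the test shapes, and the tolerances are my own choices.

```python
import math
import jax
import jax.numpy as jnp

def rfft_transpose_via_irfft(t, fft_lengths):
    """Candidate rfftn transpose: halve the Hermitian-doubled bins, then irfftn."""
    if fft_lengths[-1] % 2 == 0:
        t = t.at[..., 1:-1].divide(2.0)  # interior bins only (DC and Nyquist excluded)
    else:
        t = t.at[..., 1:].divide(2.0)    # for odd n, every bin except DC is interior
    n = math.prod(fft_lengths)
    return n * jnp.fft.irfftn(jnp.conj(t), s=fft_lengths)

results = []
for shape in [(5, 6), (5, 7)]:  # cover even and odd last-axis lengths
    x = jax.random.normal(jax.random.key(0), shape)
    out_shape = jnp.fft.rfftn(x).shape
    kr, ki = jax.random.split(jax.random.key(1))
    # A generic complex cotangent with the rfftn output shape.
    t = jax.random.normal(kr, out_shape) + 1j * jax.random.normal(ki, out_shape)
    reference = jax.linear_transpose(lambda a: jnp.fft.rfftn(a, s=shape), x)(t)[0]
    candidate = rfft_transpose_via_irfft(t, shape)
    results.append(bool(jnp.allclose(reference, candidate, rtol=1e-4, atol=1e-4)))
print(results)
```

This relies on `irfft` ignoring the imaginary parts of the DC and Nyquist bins (standard complex-to-real FFT behavior), which is exactly what makes the mask-based transpose line up with the real part taken by the naive `fft`-plus-slice transpose.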
-
I noticed that `_rfft_transpose` in `jax/_src/lax/fft.py` computes the transpose by:

- implementing a naive `rfft` as a full `fft(x)` followed by a slice
- applying `linear_transpose` to that naive implementation

This effectively runs a full complex FFT. However, the transpose can be computed directly using `irfft` with a mask. Since `irfft` exploits Hermitian symmetry, it's roughly 2x faster and uses half the memory of a full FFT.

The comment mentions avoiding "manually building up larger twiddle matrices," but the mask-based approach doesn't require twiddle matrices, just element-wise division by `[1, 2, ..., 2, 1]`.

Is there a correctness concern I'm missing, or would a PR switching to the `irfft`-based transpose be welcome?