Implement Bluestein's algorithm for large primes (>100) #107
dannys4 merged 12 commits into JuliaMath:main
Conversation
Codecov Report

Additional details and impacted files

```
@@           Coverage Diff           @@
##             main     #107   +/-   ##
==========================================
+ Coverage   96.22%   96.66%   +0.44%
==========================================
  Files           4        4
  Lines         424      480      +56
==========================================
+ Hits          408      464      +56
  Misses         16       16
```
This is actually really impressive and gives me some hope for Rader's algorithm if we decide to implement it. Bluestein, to my knowledge, is regarded as relatively impractical outside the discrete transform world (i.e., people like using it for chirp transforms and z-transforms?) because of the big allocations. In fact, they hadn't actually implemented it when the 2005 FFTW paper was released. I agree that Bluestein is significantly easier than Rader's to implement and, for that reason, significantly more maintainable. So this is definitely a good first step.
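For readers unfamiliar with it: Bluestein rewrites a length-N DFT as a circular convolution of length M ≥ 2N−1, which is exactly where the big scratch buffers come from. A minimal, illustrative sketch (not FFTA's implementation; the convolution below is a direct O(M²) sum for clarity, where the real payoff comes from doing it with a power-of-two FFT):

```julia
# Illustrative sketch of Bluestein's trick, NOT FFTA's implementation.
# Using nk = (n^2 + k^2 - (k - n)^2) / 2, the DFT
#   X[k] = sum_n x[n] exp(-2*pi*im*n*k/N)
# becomes X[k] = w[k] * (a ⊛ b)[k], a circular convolution of length
# M >= 2N - 1, which any power-of-two FFT can evaluate.
function bluestein_dft(x::Vector{ComplexF64})
    N = length(x)
    w = [cispi(-n^2 / N) for n in 0:N-1]   # chirp exp(-im*pi*n^2/N)
    M = nextpow(2, 2N - 1)                 # padded convolution length
    a = zeros(ComplexF64, M)
    b = zeros(ComplexF64, M)
    a[1:N] .= x .* w
    b[1:N] .= conj.(w)
    for n in 2:N                           # mirror: b[(-j) mod M] = b[j]
        b[M - n + 2] = b[n]
    end
    # direct circular convolution, O(M^2), for clarity only
    c = [sum(a[m+1] * b[mod(k - m, M) + 1] for m in 0:M-1) for k in 0:M-1]
    return w .* c[1:N]
end
```

Since M is a power of two, the two length-M transforms never hit the prime codepath again, which is what makes the recursion bottom out.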
One thing to consider, however, is the allocation. In an ideal world, I would like all the allocations to take place when the FFTAPlan is created, so that the actual execution of the FFT is entirely arithmetic and memory-bound operations. Further, if you do several FFTs of the same size, you should absolutely pre-allocate the space. Unfortunately, this is really tricky due to Julia's behavior with parallelism. On the other hand, this algorithm is really intended for larger primes, so I don't believe it's fair to consider the allocation time as "free" (at least, the penalty incurred when bouncing around all different areas of memory). I'd like, if possible, to call in the wisdom of @andreasnoack to ask: What is the "best" way of working with pre-allocations for an FFTAPlan? Should we be doing so, or is it better right now to just allocate on-the-fly? Any thoughts?
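As one possible shape for that: the chirp and the padded convolution buffers depend only on N, so they could be computed once at plan time. The struct and names below are hypothetical, not FFTA's actual plan type:

```julia
# Hypothetical workspace (illustrative names, not FFTA's API): everything
# here depends only on N, so it can be built once when the plan is created
# and reused for every execution of that plan.
struct BluesteinWorkspace
    chirp::Vector{ComplexF64}   # w[n] = exp(-im*pi*n^2/N)
    a::Vector{ComplexF64}       # padded input buffer, length M
    b::Vector{ComplexF64}       # padded (mirrored) chirp buffer, length M
end

function BluesteinWorkspace(N::Int)
    w = [cispi(-n^2 / N) for n in 0:N-1]
    M = nextpow(2, 2N - 1)
    a = zeros(ComplexF64, M)
    b = zeros(ComplexF64, M)
    b[1:N] .= conj.(w)
    for n in 2:N
        b[M - n + 2] = b[n]   # mirror so circular convolution sees b[k-n]
    end
    BluesteinWorkspace(w, a, b)
end
```

One workspace per thread (or per task) would sidestep the parallelism issue at the cost of extra memory, since each execution would then scribble into its own buffers.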
dannys4
left a comment
To be clear, I'm not withholding approval---I just wanted to give some feedback/ask questions while considering the allocation thing.
src/callgraph.jl
Outdated
```julia
push!(workspace, T[])
push!(nodes, CallGraphNode(0, 0, DFT, N, s_in, s_out, w))
# use Bluestein's algorithm for big primes
LEAF_ALG = N < 100 ? DFT : BLUESTEIN
```
Is 100 benchmarked, arbitrary, or in "the literature"? I'd like, if possible, to make this a bit configurable, because I'd bet that this is heavily system-dependent.
It's from a bit of eyeballing the benchmarks, yes, and it could easily be made configurable. It works decently well without tuning; maybe it's theoretically justifiable by counting muls and adds (I did not actually count...)
Yeah, it's definitely tricky without auto-tuning for per-system performance. I personally think that an atomic global (to ensure thread safety) that is configurable and only checked here at planning time would be much better without any real overhead.
I was gonna go the keyword argument route, but out of curiosity, how do you do that? Not too important because the cost is easily amortised, but doesn't that still incur the normal penalties for accessing globals?
okay, it occurred to me that atomics are only first class in Julia 1.11+, so definitely just do a kwarg for now. For future reference, you can do something like

```julia
mutable struct AtomicSwitchover @atomic n::Int; end
const BLUESTEIN_SWITCHOVER = AtomicSwitchover(100)
function change_bluestein_switch!(N::Int)
    return (@atomic BLUESTEIN_SWITCHOVER.n = N)
end
```
Btw, I changed the default to 73, because that's the break-even point for my pretty old machine, and it appears to be good on the GitHub runner too. How is it on yours?
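For what it's worth, the kwarg route could look something like the following; `choose_leaf_alg` and `bluestein_cutoff` are illustrative names, not FFTA's actual internals:

```julia
# Hypothetical helper: pick the leaf algorithm at planning time from a
# keyword argument rather than a hard-coded constant or a global. The
# default of 73 is the break-even point mentioned in this thread.
@enum LeafAlg DFT BLUESTEIN

function choose_leaf_alg(N::Int; bluestein_cutoff::Int = 73)
    return N < bluestein_cutoff ? DFT : BLUESTEIN
end
```

Because the choice is made once per plan, the usual global-access penalty never touches the execution hot path, and callers who benchmark their own machine can override the default per call.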
okay, I think a better plan here is probably: 1) make the …
No problem. I did think about pre-allocating stuff, but couldn't immediately see a good place to do that, so I put up the PR first in hopes that you would. Also, I suspected that since Bluestein is so heavy, the memory allocation would be mostly amortised, and hopefully GC is now good enough not to actually free those arrays across a 2D run and to hand back the same memory, so the overhead would still be acceptable despite an alarming number of allocations on paper.
Other than FFTW padding less, I suspect the radix-2/4 and probably radix-3 algorithms might just be more sensitive to numerical errors than whatever FFTW does, probably when computing the …
left `_ispow24` alone just in case.
remove `@muladd` for (Int) index
simplify `fft!` branches, tests `zero(y)`
sorry, been busy the past week.
Actually, shockingly, the following code gives 23 as the crossover for me. I prefer to be conservative here, though, because of different typings etc. etc.

```julia
using FFTA, BenchmarkTools, Primes
p = primes(200)
dft_times = [(@belapsed f*a setup=(a=randn(ComplexF64, $n); f=FFTA._plan_fft(a, 1, FFTA.FFT_FORWARD; BLUESTEIN_CUTOFF=10000))) for n in p]
bl_times = [(@belapsed f*a setup=(a=randn(ComplexF64, $n); f=FFTA._plan_fft(a, 1, FFTA.FFT_FORWARD; BLUESTEIN_CUTOFF=1))) for n in p]
p[findlast(bl_times .> dft_times)]
```
Shocking indeed. OTOH, my machine must be ancient.
Hard to say---if the problem is memory-bound then I believe it's more complicated than "new" vs. "old".
Painful to hear this, if only because I started developing FFTA when 1.6 had just become LTS 🧓
Also because I used the mean time from …
In the process, noticed `mul!` doesn't return the output array, which it is supposed to do according to the docs.

Local benchmarks indicate that this is close enough to FFTW's performance, so not too bad---a big improvement on the previous O(N^2) DFT. Allocates, but that's not too important overall.
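For reference, the documented `LinearAlgebra.mul!` contract is that the output array itself is returned, so callers can chain on the result. A minimal illustration with plain matrices (not FFTA plans):

```julia
using LinearAlgebra

A = [1.0 2.0; 3.0 4.0]
x = [1.0, 1.0]
y = zeros(2)

ret = mul!(y, A, x)   # stores A*x into y and, per the docs, returns y
ret === y             # true: mul!'s return value is its output argument
```

Any custom `mul!` method on a plan type should follow the same convention, otherwise generic code that relies on `mul!`'s return value silently breaks.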
Perhaps Rader could be better here, but Bluestein's algorithm is much easier to write. Doesn't really affect anything else.