
Implement Bluestein's algorithm for large primes (>100)#107

Merged
dannys4 merged 12 commits into JuliaMath:main from wheeheee:bluestein
Feb 23, 2026

Conversation

@wheeheee
Contributor

In the process, I noticed that mul! doesn't return the output array, which it is supposed to do according to the docs.

Local benchmarks indicate that this is close enough to FFTW's performance, so not too bad, and a big improvement on the previous O(N^2) DFT. It allocates, but that's not too important overall.
Perhaps Rader's algorithm could do better here, but Bluestein's is much easier to write. This doesn't really affect anything else.
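For readers unfamiliar with the technique, here is a minimal, self-contained sketch of Bluestein's algorithm (an illustration only, not the code in this PR): the length-N DFT is rewritten as a convolution with a chirp sequence and evaluated with power-of-two FFTs. The toy radix-2 helper `fft_pow2` stands in for the library's real transforms.

```julia
# Toy radix-2 Cooley–Tukey FFT for power-of-two lengths (helper only).
function fft_pow2(x::Vector{ComplexF64})
    N = length(x)
    N == 1 && return copy(x)
    E, O = fft_pow2(x[1:2:end]), fft_pow2(x[2:2:end])
    t = [cispi(-2k / N) * O[k+1] for k in 0:N÷2-1]   # twiddles e^{-2πik/N}
    return vcat(E .+ t, E .- t)
end
ifft_pow2(x) = conj.(fft_pow2(conj.(x))) ./ length(x)

# Bluestein: X_k = b_k · Σ_n (x_n b_n) conj(b_{k-n}) with b_n = e^{-iπn²/N},
# i.e. a linear convolution with a chirp, done via circular convolution
# of length M = nextpow(2, 2N-1).
function bluestein_dft(x::AbstractVector{<:Complex})
    N = length(x)
    b = [cispi(-mod(n^2, 2N) / N) for n in 0:N-1]   # chirp (n² taken mod 2N)
    M = nextpow(2, 2N - 1)
    u = zeros(ComplexF64, M); u[1:N] .= x .* b
    v = zeros(ComplexF64, M); v[1] = 1
    for n in 1:N-1                                   # wrapped conjugate chirp
        v[n+1] = v[M-n+1] = conj(b[n+1])
    end
    w = ifft_pow2(fft_pow2(u) .* fft_pow2(v))        # circular convolution
    return b .* w[1:N]
end
```

Against a naive O(N²) DFT this agrees to roughly machine precision for prime lengths, at the cost of a few power-of-two FFTs of length about 2N.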

@wheeheee wheeheee mentioned this pull request Feb 15, 2026
@codecov

codecov bot commented Feb 15, 2026

Codecov Report

❌ Patch coverage is 99.14530% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 96.66%. Comparing base (da36b22) to head (473680b).
⚠️ Report is 13 commits behind head on main.

Files with missing lines Patch % Lines
src/algos.jl 98.87% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #107      +/-   ##
==========================================
+ Coverage   96.22%   96.66%   +0.44%     
==========================================
  Files           4        4              
  Lines         424      480      +56     
==========================================
+ Hits          408      464      +56     
  Misses         16       16              

@dannys4
Collaborator

dannys4 commented Feb 15, 2026

This is actually really impressive and gives me some hope for Rader's algorithm if we decide to implement it. Bluestein's, to my knowledge, is referred to as relatively impractical outside the discrete transform world (i.e., people like using it for chirp transforms and z-transforms?) because of the big allocations. In fact, the FFTW authors hadn't actually implemented it when the 2005 FFTW paper was released.

I agree that Bluestein's is significantly easier than Rader's to implement and, for that reason, significantly more maintainable. So this is definitely a good first step.

@dannys4
Collaborator

dannys4 commented Feb 15, 2026

One thing to consider, however, is the allocation. In an ideal world, I would like to have the allocations all take place when the FFTAPlan is created so that the actual execution of the FFT is entirely arithmetic and memory-bound operations. Further, if you do several FFTs of the same size, you should absolutely pre-allocate the space. Unfortunately, this is really tricky due to Julia's behavior with parallelism. On the other hand, this algorithm is really intended for larger primes, so I don't believe it's fair to consider the allocation time as "free" (at least, the penalty incurred when bouncing around all different areas of memory).

I'd like, if possible, to call in the wisdom of @andreasnoack to ask: What is the "best" way of working with pre-allocations for an FFTAPlan? Should we be doing so, or is it better right now to just allocate on-the-fly? Any thoughts?
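The pattern described above can be sketched roughly as follows. This is hypothetical code: `BluesteinScratch` and its fields are made-up names, not FFTA's actual types; it only illustrates sizing every buffer at plan-construction time so that executing the plan touches pre-owned memory.

```julia
# Hypothetical sketch -- BluesteinScratch is an illustrative name, not
# FFTA's API. All buffers are sized once, when the plan is built.
struct BluesteinScratch{T}
    chirp::Vector{Complex{T}}   # b[n] = exp(-iπn²/N), length N
    kernel::Vector{Complex{T}}  # wrapped conjugate chirp, length M
    work::Vector{Complex{T}}    # length-M buffer reused on every execution
end

function BluesteinScratch{T}(N::Integer) where {T}
    M = nextpow(2, 2N - 1)
    chirp = [Complex{T}(cispi(-mod(n^2, 2N) / N)) for n in 0:N-1]
    kernel = zeros(Complex{T}, M)
    kernel[1] = 1
    for n in 1:N-1
        kernel[n+1] = kernel[M-n+1] = conj(chirp[n+1])
    end
    # in a real plan, the kernel's FFT would also be computed and cached here
    return BluesteinScratch{T}(chirp, kernel, zeros(Complex{T}, M))
end
```

Execution would then read `chirp`/`kernel` and scribble only in `work`, so repeated transforms of the same size allocate nothing.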

Collaborator

@dannys4 dannys4 left a comment


I'm not *not* approving it; I just wanted to give some feedback and ask some questions while considering the allocation issue.

src/callgraph.jl Outdated
push!(workspace, T[])
push!(nodes, CallGraphNode(0, 0, DFT, N, s_in, s_out, w))
# use Bluestein's algorithm for big primes
LEAF_ALG = N < 100 ? DFT : BLUESTEIN
Collaborator

Is 100 benchmarked, arbitrary, or in "the literature"? I'd like, if possible, to make this a bit configurable, because I'd bet that this is heavily influenced by system.

Contributor Author

It's from a bit of eyeballing the benchmarks, yes, and it could easily be made configurable. It works decently well without tuning; maybe it could be theoretically justified by counting muls and adds (I did not actually count...)

Collaborator

Yeah, it's definitely tricky without auto-tuning for per-system performance. I personally think that an atomic global (to ensure thread safety) that is configurable and only checked here at planning time would be much better without any real overhead.

Contributor Author

I was going to go the keyword argument route, but out of curiosity, how do you do that? It's not too important because the cost is easily amortised, but doesn't that still incur the normal penalties for accessing globals?

Collaborator

Okay, it occurred to me that atomics are only first class in Julia 1.11+, so definitely just do a kwarg for now. For future reference, you can do something like

mutable struct AtomicSwitchover
    @atomic n::Int
end
const BLUESTEIN_SWITCHOVER = AtomicSwitchover(100)
function change_bluestein_switch!(N::Int)
    return (@atomic BLUESTEIN_SWITCHOVER.n = N)
end
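For completeness, a sketch of the read side of that idea: planning code performs a single atomic load of the cutoff, so plan execution never touches the global. The struct and global are repeated from the snippet above so this block runs on its own; `leaf_algorithm` is an illustrative name, not FFTA's API, and per-field `@atomic` needs Julia 1.7+.

```julia
# Same definitions as the snippet above, repeated so this block is
# self-contained.
mutable struct AtomicSwitchover
    @atomic n::Int
end
const BLUESTEIN_SWITCHOVER = AtomicSwitchover(100)

# Hypothetical planning-time check: one atomic read, then an ordinary
# branch. Execution of the resulting plan never reads the global again.
function leaf_algorithm(N::Int)
    cutoff = @atomic BLUESTEIN_SWITCHOVER.n
    return N < cutoff ? :DFT : :BLUESTEIN
end
```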

Contributor Author

Btw, I changed the default to 73, because that's the break-even point for my pretty old machine, and it appears to be good on the GitHub runner too. How is it on yours?

@dannys4
Collaborator

dannys4 commented Feb 15, 2026

Okay, I think a better plan here is probably: 1) make the 100 configurable, 2) add a test that actually calls Bluestein, 3) I'll approve the PR, and then 4) open issues on improving the performance. The allocations (at least in the 2D case) can easily be a real slog, so I want to ensure that the user can keep the old behavior if they would like.

@wheeheee
Contributor Author

No problem. I did think about pre-allocating things, but couldn't immediately see a good place to do that, so I put up the PR first in hopes that you would. Also, I suspected that since Bluestein's is so heavy, the memory allocation would be mostly amortised; hopefully GC is now good enough not to really free those arrays across a 2D run, and to hand back the same memory, so the overhead would still be acceptable despite an alarming number of allocations on paper.

@wheeheee
Contributor Author

Other than FFTW padding less, I suspect the radix-2/4 and probably radix-3 algorithms might just be more sensitive to numerical errors than whatever FFTW does. That's probably from computing the ws by repeated squaring/multiplication from the smallest root of unity, whereas (I think) FFTW precomputes them, although I still don't get to 1e-14 error even after modifying the algorithm to use Double64 for w.
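The accumulation effect being described is easy to reproduce in isolation. This standalone snippet (not FFTA code) compares twiddle factors built by repeated multiplication of the smallest root of unity against directly computed roots:

```julia
# Twiddle factors w^k generated by repeated multiplication drift away
# from the directly computed roots of unity, while each cispi call stays
# at machine precision.
N = 4096
w = cispi(-2 / N)                          # smallest root of unity e^{-2πi/N}
repeated = accumulate(*, fill(w, N))       # w, w², …, wᴺ by repeated multiply
direct   = [cispi(-2k / N) for k in 1:N]   # each root computed directly
err = maximum(abs, repeated .- direct)     # grows roughly like N·eps
```

The drift is tiny in absolute terms, but across many butterflies it is the kind of difference that separates ~1e-15 from ~1e-13 final error.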

@dannys4
Collaborator

dannys4 commented Feb 23, 2026

Sorry, I've been busy this past week.

@dannys4
Collaborator

dannys4 commented Feb 23, 2026

Actually, shockingly, the following code gives 23 as the crossover for me. I prefer to be conservative here, though, because of different typings, etc.

using FFTA, BenchmarkTools, Primes
p = primes(200)
dft_times = [(@belapsed f*a setup=(a=randn(ComplexF64, $n); f=FFTA._plan_fft(a, 1, FFTA.FFT_FORWARD; BLUESTEIN_CUTOFF=10000))) for n in p]
bl_times = [(@belapsed f*a setup=(a=randn(ComplexF64, $n); f=FFTA._plan_fft(a, 1, FFTA.FFT_FORWARD; BLUESTEIN_CUTOFF=1))) for n in p]
p[findlast(bl_times .> dft_times)]

@dannys4 dannys4 added this pull request to the merge queue Feb 23, 2026
Merged via the queue into JuliaMath:main with commit bf2c0eb Feb 23, 2026
7 of 8 checks passed
@wheeheee
Contributor Author

> Actually, shockingly, the following code gives 23 as the crossover for me. I prefer to be conservative here, though, because of different typings etc etc.

Shocking indeed. OTOH, my machine must be ancient.
Btw, the test failure on 1.6 for N=73^2... I hope it's a Julia compiler bug; it's fine on everything else. But 1.6 is out of support anyway, so maybe it's time to consider raising compat to 1.10?

@dannys4
Collaborator

dannys4 commented Feb 23, 2026

> my machine must be ancient

Hard to say---if the problem is memory-bound then I believe it's more complicated than "new" vs. "old".

> 1.6 is out of support

Painful to hear, if only because I started developing FFTA when 1.6 had just become LTS 🧓

@wheeheee
Contributor Author

> Hard to say---if the problem is memory-bound then I believe it's more complicated than "new" vs. "old".

Also because I used the mean time from @benchmark, mul! with a preallocated out array, and there's generally GC jitter. On a second run, the cutoff for me is around 43/47 depending on min/mean. But this all fits comfortably into L1/L2 cache, assuming GC isn't too bad, so I think the difference is mainly due to the CPU.
Anyway, 73 is rather safe and looks fine in the benchmark HTML, so it's probably OK as a default.

