Implement Bluestein's algorithm for large primes (>100) #107
dannys4 merged 12 commits into JuliaMath:main
Conversation
Codecov Report

Additional details and impacted files

```
@@           Coverage Diff           @@
##             main     #107   +/-   ##
==========================================
+ Coverage   96.22%   96.66%   +0.44%
==========================================
  Files           4        4
  Lines         424      480      +56
==========================================
+ Hits          408      464      +56
  Misses         16       16
```
This is actually really impressive and gives me some hope for Rader's algorithm if we decide to implement it. Bluestein, to my knowledge, is regarded as relatively impractical outside the discrete transform world (i.e., people like using it for chirp transforms and z-transforms?) because of the big allocations. In fact, they hadn't actually implemented it when the 2005 FFTW paper was released. I agree that Bluestein is significantly easier than Rader's to implement and, for that reason, significantly more maintainable. So this is definitely a good first step.
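For readers unfamiliar with it: Bluestein rewrites a length-N DFT as a circular convolution of length M ≥ 2N−1, which is exactly where the big scratch buffers come from. A minimal, illustrative sketch (not FFTA's implementation; the convolution below is a direct O(M²) sum for clarity, where the real payoff comes from doing it with a power-of-two FFT):

```julia
# Illustrative sketch of Bluestein's trick, NOT FFTA's implementation.
# Using nk = (n^2 + k^2 - (k - n)^2) / 2, the DFT
#   X[k] = sum_n x[n] exp(-2*pi*im*n*k/N)
# becomes X[k] = w[k] * (a ⊛ b)[k], a circular convolution of length
# M >= 2N - 1, which any power-of-two FFT can evaluate.
function bluestein_dft(x::Vector{ComplexF64})
    N = length(x)
    w = [cispi(-n^2 / N) for n in 0:N-1]   # chirp exp(-im*pi*n^2/N)
    M = nextpow(2, 2N - 1)                 # padded convolution length
    a = zeros(ComplexF64, M)
    b = zeros(ComplexF64, M)
    a[1:N] .= x .* w
    b[1:N] .= conj.(w)
    for n in 2:N                           # mirror: b[(-j) mod M] = b[j]
        b[M - n + 2] = b[n]
    end
    # direct circular convolution, O(M^2), for clarity only
    c = [sum(a[m+1] * b[mod(k - m, M) + 1] for m in 0:M-1) for k in 0:M-1]
    return w .* c[1:N]
end
```

Since M is a power of two, the two length-M transforms never hit the prime codepath again, which is what makes the recursion bottom out.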
One thing to consider, however, is the allocation. In an ideal world, I would like all the allocations to take place when the FFTAPlan is created, so that the actual execution of the FFT is entirely arithmetic and memory-bound operations. Further, if you do several FFTs of the same size, you should absolutely pre-allocate the space. Unfortunately, this is really tricky due to Julia's behavior with parallelism. On the other hand, this algorithm is really intended for larger primes, so I don't believe it's fair to consider the allocation time as "free" (at least, the penalty incurred when bouncing around all different areas of memory). I'd like, if possible, to call in the wisdom of @andreasnoack to ask: What is the "best" way of working with pre-allocations for an FFTAPlan? Should we be doing so, or is it better right now to just allocate on-the-fly? Any thoughts?
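As one possible shape for that: the chirp and the padded convolution buffers depend only on N, so they could be computed once at plan time. The struct and names below are hypothetical, not FFTA's actual plan type:

```julia
# Hypothetical workspace (illustrative names, not FFTA's API): everything
# here depends only on N, so it can be built once when the plan is created
# and reused for every execution of that plan.
struct BluesteinWorkspace
    chirp::Vector{ComplexF64}   # w[n] = exp(-im*pi*n^2/N)
    a::Vector{ComplexF64}       # padded input buffer, length M
    b::Vector{ComplexF64}       # padded (mirrored) chirp buffer, length M
end

function BluesteinWorkspace(N::Int)
    w = [cispi(-n^2 / N) for n in 0:N-1]
    M = nextpow(2, 2N - 1)
    a = zeros(ComplexF64, M)
    b = zeros(ComplexF64, M)
    b[1:N] .= conj.(w)
    for n in 2:N
        b[M - n + 2] = b[n]   # mirror so circular convolution sees b[k-n]
    end
    BluesteinWorkspace(w, a, b)
end
```

One workspace per thread (or per task) would sidestep the parallelism issue at the cost of extra memory, since each execution would then scribble into its own buffers.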
dannys4
left a comment
To be clear, I'm not withholding approval---I just wanted to give some feedback/ask questions while considering the allocation thing.
src/callgraph.jl
Outdated
```julia
push!(workspace, T[])
push!(nodes, CallGraphNode(0, 0, DFT, N, s_in, s_out, w))
# use Bluestein's algorithm for big primes
LEAF_ALG = N < 100 ? DFT : BLUESTEIN
```
Is 100 benchmarked, arbitrary, or in "the literature"? I'd like, if possible, to make this a bit configurable, because I'd bet that this is heavily system-dependent.
It's from a bit of eyeballing the benchmarks, yes, and it could easily be made configurable. It works decently well without tuning; maybe it's theoretically justifiable by counting muls and adds (I did not actually count...)
Yeah, it's definitely tricky without auto-tuning for per-system performance. I personally think that an atomic global (to ensure thread safety) that is configurable and only checked here at planning time would be much better without any real overhead.
I was gonna go the keyword argument route, but out of curiosity, how do you do that? Not too important because the cost is easily amortised, but doesn't that still incur the normal penalties for accessing globals?
okay, it occurred to me that atomics are only first class in Julia 1.11+, so definitely just do a kwarg for now. For future reference, you can do something like

```julia
mutable struct AtomicSwitchover @atomic n::Int; end
const BLUESTEIN_SWITCHOVER = AtomicSwitchover(100)
function change_bluestein_switch!(N::Int)
    return (@atomic BLUESTEIN_SWITCHOVER.n = N)
end
```
Btw, I changed the default to 73, because that's the break-even point for my pretty old machine, and it appears to be good on the GitHub runner too. How is it on yours?
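For what it's worth, the kwarg route could look something like the following; `choose_leaf_alg` and `bluestein_cutoff` are illustrative names, not FFTA's actual internals:

```julia
# Hypothetical helper: pick the leaf algorithm at planning time from a
# keyword argument rather than a hard-coded constant or a global. The
# default of 73 is the break-even point mentioned in this thread.
@enum LeafAlg DFT BLUESTEIN

function choose_leaf_alg(N::Int; bluestein_cutoff::Int = 73)
    return N < bluestein_cutoff ? DFT : BLUESTEIN
end
```

Because the choice is made once per plan, the usual global-access penalty never touches the execution hot path, and callers who benchmark their own machine can override the default per call.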
okay, I think a better plan here is probably: 1) make the …
No problem. I did think about pre-allocating stuff, but couldn't immediately see a good place to do that, so I put up the PR first in hopes that you would. Also, I suspected that since Bluestein is so heavy, the memory allocation would be mostly amortised, and hopefully GC is now good enough not to actually free those arrays across a 2D run and to hand back the same memory, so the overhead would still be acceptable despite an alarming number of allocations on paper.
Other than FFTW padding less, I suspect the radix-2/4 and probably radix-3 algorithms might just be more sensitive to numerical errors than whatever FFTW does, probably when computing the …
left `_ispow24` alone just in case.
remove `@muladd` for (Int) index
simplify `fft!` branches, tests `zero(y)`
sorry, been busy the past week.
Actually, shockingly, the following code gives 23 as the crossover for me. I prefer to be conservative here, though, because of different typings etc. etc.

```julia
using FFTA, BenchmarkTools, Primes
p = primes(200)
dft_times = [(@belapsed f*a setup=(a=randn(ComplexF64, $n); f=FFTA._plan_fft(a, 1, FFTA.FFT_FORWARD; BLUESTEIN_CUTOFF=10000))) for n in p]
bl_times = [(@belapsed f*a setup=(a=randn(ComplexF64, $n); f=FFTA._plan_fft(a, 1, FFTA.FFT_FORWARD; BLUESTEIN_CUTOFF=1))) for n in p]
p[findlast(bl_times .> dft_times)]
```
Shocking indeed. OTOH, my machine must be ancient.
Hard to say---if the problem is memory-bound then I believe it's more complicated than "new" vs. "old".
Painful to hear this, if only because I started developing FFTA when 1.6 had just become LTS 🧓
Also because I used the mean time from …
In the process, noticed `mul!` doesn't return the output array, which it is supposed to do according to the docs.

Local benchmarks indicate that this is close enough to FFTW's performance, so not too bad---a big improvement on the previous O(N^2) DFT. Allocates, but that's not too important overall.
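For reference, the documented `LinearAlgebra.mul!` contract is that the output array itself is returned, so callers can chain on the result. A minimal illustration with plain matrices (not FFTA plans):

```julia
using LinearAlgebra

A = [1.0 2.0; 3.0 4.0]
x = [1.0, 1.0]
y = zeros(2)

ret = mul!(y, A, x)   # stores A*x into y and, per the docs, returns y
ret === y             # true: mul!'s return value is its output argument
```

Any custom `mul!` method on a plan type should follow the same convention, otherwise generic code that relies on `mul!`'s return value silently breaks.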
Perhaps Rader could be better here, but Bluestein's algorithm is much easier to write. Doesn't really affect anything else.