
Faster pow4 kernel #63

Merged
dannys4 merged 2 commits into main from an/fasterpow4 on Mar 25, 2025

Conversation

@andreasnoack
Member

It's a little silly that we have to do so much work that should really be handled by the compiler, but these changes give me a 25% speedup for a size-256 problem.

Current main

julia> @btime FFTA.fft_pow4!(y, x, length(x), 1, 1, 1, 1, cispi(-2/length(x))) setup=(x = complex.(randn(256)); y = similar(x))
  1.742 μs (0 allocations: 0 bytes)

This PR

julia> @btime FFTA.fft_pow4!(y, x, length(x), 1, 1, 1, 1, cispi(-2/length(x))) setup=(x = complex.(randn(256)); y = similar(x))
  1.208 μs (0 allocations: 0 bytes)

@andreasnoack andreasnoack requested a review from dannys4 March 25, 2025 13:47
Collaborator

@dannys4 dannys4 left a comment


To be honest, it's kind of weird that plusi is removed while minusi is kept, since you multiply by the negative when calculating xoe_m_xoo and ỹ_koe_m_ỹ_koo. Changing this doesn't substantially change performance as measured on my personal machine, though, so I don't think it matters too much.
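For readers outside the codebase: helpers like these typically rewrite multiplication by ±im as a component swap-and-negate rather than a general complex multiply. The definitions below are a hypothetical sketch of that idea, not the PR's actual code.

```julia
# Hypothetical sketch (not FFTA's actual definitions): multiply a complex
# number by -im or +im via swap/negate instead of a full complex multiply.
minusi(z::Complex) = complex(imag(z), -real(z))   # equals -im * z
plusi(z::Complex)  = complex(-imag(z), real(z))   # equals  im * z

z = 3.0 + 4.0im
println(minusi(z) == -im * z, " ", plusi(z) == im * z)   # true true
```

With this rewriting, `-minusi(z)` and `plusi(z)` are the same operation up to sign placement, which is why keeping only one of the two helpers is mostly a matter of taste.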

I also looked into the removal of the @muladd. Since most of the muladds are integer ops, the compiler should be able to optimize them without help. The floating-point muladds that were at 220-228 should, I would hope, give better performance (without fastmath, the compiler isn't allowed to reorder your operations to form these FMAs), but according to the experiments that doesn't seem to be the case.
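As a side note on why fusion is not something the compiler can do on its own: `fma` rounds `a*b + c` once, while the plain expression rounds twice, so the two can differ in the last bit and the transformation is not semantics-preserving without fastmath. A minimal, self-contained illustration:

```julia
# Single-rounding vs. double-rounding: the reason a*b + c cannot legally
# be turned into an FMA without fastmath.
a = 1.0 + 2.0^-30          # a^2 = 1 + 2^-29 + 2^-60 exactly
plain = a*a - 1.0          # the product rounds first, dropping the 2^-60 term
fused = fma(a, a, -1.0)    # a single rounding keeps it
println(plain == fused)    # false: the results differ in the low-order bit
```

Julia's `muladd` (and the @muladd macro) merely *permits* this fusion, which is why its effect is hardware and compiler dependent.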

None of these are actionable items; they're just premature optimization on my part, and I wanted to put this in writing for "posterity".

@dannys4 dannys4 added this pull request to the merge queue Mar 25, 2025
Merged via the queue into main with commit 7107d46 Mar 25, 2025
4 checks passed
@andreasnoack andreasnoack deleted the an/fasterpow4 branch March 26, 2025 05:24
@andreasnoack
Member Author

To be honest, it's kind of weird that plusi is removed while minusi is kept, since you multiply by the negative when calculating xoe_m_xoo and ỹ_koe_m_ỹ_koo. Changing this doesn't substantially change performance as measured on my personal machine, though, so I don't think it matters too much.

Indeed, but I initially got the sign wrong and then added the - afterwards.

Regarding the @muladd, it generally seems to be a mixed bag when I profile on my machine: it helps for some of the kernels but not for others. I also noticed that some vector instructions sometimes appear even without the muladd. These things might also be machine dependent. I ended up browsing the FFTW paper and noticed that they gave up on this approach to speeding things up (with SIMD) and instead got their SIMD by splitting the complex FFT into two real FFTs. It would require some work for us to mimic that here, but it is useful to know that we shouldn't expect easy wins from fancy instructions. Most of the gain will probably come from implementing more and larger kernel sizes.
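For the record, the basic identity behind the split is just linearity of the DFT: fft(x) = fft(re(x)) + im*fft(im(x)), where each half is a real-input transform. The sketch below checks it with a naive O(n²) DFT so it needs no package (FFTW's actual win comes from additionally exploiting the conjugate symmetry of each real-input half, which this does not show):

```julia
# Naive O(n^2) DFT, enough to check the complex-to-two-real split.
dft(x) = [sum(x[j+1] * cispi(-2j*k/length(x)) for j in 0:length(x)-1)
          for k in 0:length(x)-1]

x = randn(ComplexF64, 8)
split = dft(real.(x)) .+ im .* dft(imag.(x))
println(dft(x) ≈ split)   # true
```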

