
Faster pow4 kernel #63

Merged
dannys4 merged 2 commits into main from an/fasterpow4 on Mar 25, 2025

Conversation

@andreasnoack
Member

It's a little silly that we have to do so much work that should really be handled by the compiler, but these changes give me a 25% speedup for a size-256 problem.

Current main

julia> @btime FFTA.fft_pow4!(y, x, length(x), 1, 1, 1, 1, cispi(-2/length(x))) setup=(x = complex.(randn(256)); y = similar(x))
  1.742 μs (0 allocations: 0 bytes)

This PR

julia> @btime FFTA.fft_pow4!(y, x, length(x), 1, 1, 1, 1, cispi(-2/length(x))) setup=(x = complex.(randn(256)); y = similar(x))
  1.208 μs (0 allocations: 0 bytes)

@andreasnoack andreasnoack requested a review from dannys4 March 25, 2025 13:47
Collaborator

@dannys4 dannys4 left a comment


To be honest, it's kind of weird that plusi is removed while minusi is kept, since you multiply by the negative when calculating xoe_m_xoo and ỹ_koe_m_ỹ_koo. Changing this doesn't substantially change performance as measured on my personal machine, though, so I don't think it matters too much.
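For readers outside the codebase: helpers like these typically rewrite multiplication by ±im as a component swap-and-negate rather than a general complex multiply. The definitions below are a hypothetical sketch of that idea, not the PR's actual code.

```julia
# Hypothetical sketch (not FFTA's actual definitions): multiply a complex
# number by -im or +im via swap/negate instead of a full complex multiply.
minusi(z::Complex) = complex(imag(z), -real(z))   # equals -im * z
plusi(z::Complex)  = complex(-imag(z), real(z))   # equals  im * z

z = 3.0 + 4.0im
println(minusi(z) == -im * z, " ", plusi(z) == im * z)   # true true
```

With this rewriting, `-minusi(z)` and `plusi(z)` are the same operation up to sign placement, which is why keeping only one of the two helpers is mostly a matter of taste.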

I also looked into the removal of the @muladd. Since most of the muladds are integer ops, the compiler should be able to optimize them without help. The floating-point muladds that were at 220-228 should, I would hope, give better performance (without fastmath, the compiler isn't allowed to reorder your operations to form these FMAs), but according to the experiments that doesn't seem to be the case.
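As a side note on why fusion is not something the compiler can do on its own: `fma` rounds `a*b + c` once, while the plain expression rounds twice, so the two can differ in the last bit and the transformation is not semantics-preserving without fastmath. A minimal, self-contained illustration:

```julia
# Single-rounding vs. double-rounding: the reason a*b + c cannot legally
# be turned into an FMA without fastmath.
a = 1.0 + 2.0^-30          # a^2 = 1 + 2^-29 + 2^-60 exactly
plain = a*a - 1.0          # the product rounds first, dropping the 2^-60 term
fused = fma(a, a, -1.0)    # a single rounding keeps it
println(plain == fused)    # false: the results differ in the low-order bit
```

Julia's `muladd` (and the @muladd macro) merely *permits* this fusion, which is why its effect is hardware and compiler dependent.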

None of these are actionable items; they're just premature optimization on my part, and I wanted to put this in writing for "posterity".

@dannys4 dannys4 added this pull request to the merge queue Mar 25, 2025
Merged via the queue into main with commit 7107d46 Mar 25, 2025
4 checks passed
@andreasnoack andreasnoack deleted the an/fasterpow4 branch March 26, 2025 05:24
@andreasnoack
Member Author

To be honest, it's kind of weird that plusi is removed while minusi is kept, since you multiply by the negative when calculating xoe_m_xoo and ỹ_koe_m_ỹ_koo. Changing this doesn't substantially change performance as measured on my personal machine, though, so I don't think it matters too much.

Indeed, but I initially got the sign wrong and then added the - afterwards.

Regarding the @muladd, it generally seems to be a mixed bag when I profile on my machine: it helps for some of the kernels but not for others. I also noticed that some vector instructions sometimes appear even without the muladd. These things might also be machine dependent. I ended up browsing the FFTW paper and noticed that they gave up on this approach to speeding things up (with SIMD) and instead got their SIMD by splitting the complex FFT into two real FFTs. It would require some work for us to mimic that here, but it is useful to know that we shouldn't expect easy wins from fancy instructions. Most of the gain will probably come from implementing more and larger kernel sizes.
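For the record, the basic identity behind the split is just linearity of the DFT: fft(x) = fft(re(x)) + im*fft(im(x)), where each half is a real-input transform. The sketch below checks it with a naive O(n²) DFT so it needs no package (FFTW's actual win comes from additionally exploiting the conjugate symmetry of each real-input half, which this does not show):

```julia
# Naive O(n^2) DFT, enough to check the complex-to-two-real split.
dft(x) = [sum(x[j+1] * cispi(-2j*k/length(x)) for j in 0:length(x)-1)
          for k in 0:length(x)-1]

x = randn(ComplexF64, 8)
split = dft(real.(x)) .+ im .* dft(imag.(x))
println(dft(x) ≈ split)   # true
```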

