Fix regression due to memory allocations. #831
Conversation
is the first plot on the cluster machine. Could you also post the one from your laptop machine? Your laptop machine does not have the orange one, right?
bench.py takes too much time on my laptop, but this is a summary of what I see: it takes tens of ms to do malloc, but not 100s of ms. This is using FFTW.
@mreineck I added you since this sort-of undoes something you changed. |
I'll check tomorrow! Independent of all this: could you please paste the benchmark script?
Unfortunately the
The script I used is here: Line 98 in 560bce9 (with some changes I'll push tomorrow), and in particular it is that case.
I think I can explain the recent overhead in … It's not that I don't like the approach, which is pretty cool, but this adds (in my opinion) an unnecessarily large maintenance burden in relation to the gains.
We may be able to save some more time by switching from |
It's very machine dependent. Alternatively, we could switch to a better allocator like rpmalloc to avoid the maintenance burden.
ahbarnett
left a comment
Hi Marco, I'm really glad you took the initiative to measure the regression, figure out the cause, and look into a possible solution. The history is that I originally allocated fwBatch in the plan stage, then Martin moved it to execute in 2.5.0, but I'm confused why having the FFTW plan allocate is slower than in 2.2.0 (didn't this do the same?).
Why can't we go back to the 2.2.0-style allocation? (This used more RAM than some users liked, e.g. the t1+t2 plans each used RAM.)
I.e., why can't we just undo the thing that caused the regression?
I thought we were going to discuss having an opts switch for where allocations are done, if Martin's alloc-in-exec turned out to cause slowdowns? (I'm not sure how this would interact with execute_adjoint, which is now a feature we have to maintain.)
Like Martin, I am worried about introducing such low-level platform-specific code into FINUFFT - it has to be maintained for the rest of time, even when platforms change and update. That is a pain, and not many people can do it or understand it. Is there no simpler way to pin such memory? (eg xsimd is maintained and tested by other people, so I'm fine using it... it is not our job long-term).
So, I think we all need to summarize the state of affairs here and discuss as a team before moving ahead... otherwise we'll keep adding more and more complicated Marco code to the project. I want the code to stay as simple as possible while still being somewhat close to best performance.
@@ -0,0 +1,75 @@
#include <finufft.h>
What does this new CI tester do? Documentation at the top of the code is needed.
include:
  - { os: ubuntu-22.04, toolchain: gcc-13 }
  - { os: macos-14, toolchain: llvm }
  - { os: ubuntu-22.04, toolchain: gcc-13, sanitizer: ON }
Is this part of a different PR? Else what is its connection to the PR?
// Note: spreadinterp.cpp compilation time grows with the gap between these bounds...
inline constexpr int min_nc_given_ns(int ns) {
-  return std::max(common::MIN_NC, ns - 4); // note must stay in bounds from constants.h
+  return (std::max)(common::MIN_NC, ns - 4); // note must stay in bounds from constants.h
Why the parens here? std::max doesn't usually need to be written as (std::max). A code comment is needed.
include/finufft/memory.hpp
Outdated
#define WIN32_LEAN_AND_MEAN
#endif
#ifndef NOMINMAX
#define NOMINMAX
@@ -0,0 +1,136 @@
#pragma once

// Cross-platform RAII wrapper for large temporary buffers.
This file scares me - how are we going to maintain this as part of FINUFFT as platforms change? Is there no such service offered in a standard C++ library that someone else supports? (like xsimd). I also thought you said it was going to be "10 lines" :)
Let me try to disentangle the situation a little. It is complicated, but I think it's not as bad as it looks... There are two aspects to the slowdown:
However, in my opinion this is not necessary, since the overhead I could measure was only 5%, in a case that is practically pure FFT with no spreading/interpolation. I would consider that acceptable. Going back to "the plan holds onto the buffers" is something I would only do if there is hard evidence that it is required after the mitigations above have been implemented.
Hi team, thanks for the feedback; this discussion is the reason I opened the PR.

I would say that a 5% overhead is acceptable for now, and it can be recovered by better tuning sigma: a smaller FFT means a smaller allocation.

xsimd is a SIMD wrapper, not a memory-management library, so it is not the right place for this. We could use allocators like rpmalloc, but I do not want to bring in an extra dependency for 10 lines of code (it is the ifdefs and comments that blow it up, as well as the C++ RAII boilerplate). I think this class can live, for example, in POET. The API it uses is either POSIX or POSIX-like; AFAIK it has not changed in the last 15 years at least, and I used the Linux-only version of the class for 10-ish years. They added more options to give more control, but they did not break old code. So I am not worried about maintaining it, more that it is generally useful and should live somewhere else.

I do not like the idea of pre-allocating the FFT scratch: if every library pre-allocates all of its scratches, we will run out of memory very quickly. Also, pre-allocating while maintaining thread safety of execute and const correctness requires static thread_local scratches that destroy then somehow has to clean up? None of this is worth the effort. I'd rather keep thread safety and const correctness.

So, most of the code in this PR is for testing const correctness and thread safety. I suggest we merge it, as it is a good idea to test these assertions. Then, for the scratch, we leave the allocation as-is for now, and if this scratch class is merged into POET we use it from there. This way, if the scratch breaks in future kernel releases, reverting to an aligned vector is a small change.
OK, thanks for the discussion. It's good to know this new memory.hpp has been stable for 10-15 yrs, and yes I think it should sit somewhere else if it's generally useful, and we make that a new header-only dependency.
b2d3af2 to d360f7f (force-push)
Use mmap+MADV_FREE for persistent fwBatch buffer in execute, making concurrent execute calls thread-safe without repeated allocation. Includes lazy fwBatch allocation in makeplan and Windows min/max macro collision fixes.
48be036 to 02de82d (force-push)


I ran some benchmarks recently and found some (hefty) regressions in a couple of cases:


It is possible to see that 2.5.0 is sometimes dominated by memory allocation, and that the checkout on the right recovers all the performance.
I added a new class, finufft::ReclaimableMemory, which reserves the memory but does not commit it, so the plan remains small (the overhead is a few pointers).
I also added a test for the class.
There is now a TSAN job: since 2.5.0 execute is thread-safe, so I tested this property as well.