
Fix regression due to memory allocations. #831

Open
DiamonDinoia wants to merge 1 commit into master from feat/mmap-fwbatch-alloc

Conversation

@DiamonDinoia
Collaborator

@DiamonDinoia DiamonDinoia commented Mar 12, 2026

I ran some benchmarks recently and found some (hefty) regressions in a couple of cases:
(benchmark plots: 192x192x128-type-2-upsamp2 00-precd-thread0, 250x250x250-type-1-upsamp2 00-precd-thread1)

The plots show that 2.5.0 is sometimes dominated by memory allocation, and that the checkout on the right recovers all of the performance.

I added a new class, finufft::ReclaimableMemory, which reserves the memory but does not allocate (commit) it, so the plan remains small (the overhead is a few pointers).

I also added a test for the class.
There is now a TSAN run: since in 2.5.0 execute is thread-safe, I also tested that property.
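The commit message later in this PR describes the mechanism as mmap + MADV_FREE. A minimal POSIX-only sketch of that reserve-but-don't-commit idea is below; the class name, API, and error handling here are illustrative, not the actual PR code:

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <new>

// Sketch of a reclaimable buffer: mmap reserves address space, but
// anonymous pages consume no physical memory until first touched.
class ReclaimableBuffer {
  void *ptr_ = nullptr;
  std::size_t bytes_ = 0;

public:
  explicit ReclaimableBuffer(std::size_t bytes) : bytes_(bytes) {
    ptr_ = ::mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (ptr_ == MAP_FAILED) throw std::bad_alloc();
  }
  // Hint that the contents are disposable: the kernel may reclaim the
  // physical pages under pressure, but the mapping stays valid.
  void reclaim() {
#ifdef MADV_FREE
    ::madvise(ptr_, bytes_, MADV_FREE);
#else
    ::madvise(ptr_, bytes_, MADV_DONTNEED); // older-kernel fallback
#endif
  }
  void *data() { return ptr_; }
  std::size_t size() const { return bytes_; }
  ~ReclaimableBuffer() {
    if (ptr_) ::munmap(ptr_, bytes_);
  }
  ReclaimableBuffer(const ReclaimableBuffer &) = delete;
  ReclaimableBuffer &operator=(const ReclaimableBuffer &) = delete;
};
```

After `reclaim()`, the pointer remains usable; pages are simply re-faulted (possibly zeroed) on next touch, which is what keeps the plan small between executes.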

@lu1and10
Member

Is the first plot from the cluster machine? Could you also post the one from your laptop? Your laptop does not show the orange one, right?

@DiamonDinoia
Collaborator Author

DiamonDinoia commented Mar 12, 2026

| Metric | v2.2.0 | v2.3.0 | v2.4.0 | v2.5.0 | HEAD |
| --- | --- | --- | --- | --- | --- |
| Makeplan (ms) | 5 | 3 | 58 | 63 | 5 |
| Setpts (ms) | 77 | 79 | 57 | 67 | 35 |
| Execute (ms) | 945 | 1040 | 893 | 950 | 914 |
| Amortized (ms) | 1027 | 1122 | 1008 | 1080 | 954 |
| Speedup vs v2.2.0 | 1.000x | 0.915x | 1.019x | 0.951x | 1.077x |

bench.py takes too long to run on my laptop, but this is a summary of what I see. malloc takes tens of ms, but not hundreds of ms. This is using FFTW.

@DiamonDinoia DiamonDinoia requested a review from mreineck March 12, 2026 22:25
@DiamonDinoia
Collaborator Author

@mreineck I added you since this sort-of undoes something you changed.

@mreineck
Collaborator

I'll check tomorrow!

Independent of all this: if malloc or other allocation routines have large overhead, there are probably easier ways to improve things than introducing complicated internal memory management. It can be done by adjusting the behavior of the allocator via a couple of environment variables. I can provide examples if that is an acceptable solution.

Could you please paste the benchmark script?
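One way to read the env-variable suggestion above (this is my illustration, not Martin's actual proposal) is via glibc's malloc tunables, shown here as the equivalent `mallopt()` calls; the thresholds are arbitrary examples:

```cpp
#include <malloc.h> // glibc-specific: mallopt, M_MMAP_THRESHOLD, M_TRIM_THRESHOLD

// Tune glibc malloc so large, repeatedly-allocated buffers are served from
// the heap and kept around for reuse, instead of a fresh mmap/munmap cycle
// on every allocation. Returns true if both knobs were accepted.
inline bool tune_allocator_for_large_buffers() {
  // Raise the size above which malloc switches to mmap.
  bool ok = mallopt(M_MMAP_THRESHOLD, 64 * 1024 * 1024) != 0;
  // Keep freed heap memory instead of trimming it back to the OS.
  ok = mallopt(M_TRIM_THRESHOLD, 64 * 1024 * 1024) != 0 && ok;
  return ok;
}
```

The same effect is available without code changes through the `MALLOC_MMAP_THRESHOLD_` and `MALLOC_TRIM_THRESHOLD_` environment variables, which is presumably what makes this attractive as a zero-maintenance option.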

@mreineck
Collaborator

Unfortunately the M value is cut off in the plot above ... I assume this is a very "sparse" test case (M<<N1*N2*N3)?

@DiamonDinoia
Collaborator Author

The script I used is here:

Params("d", 192, 192, 128, 1, 0, 1e7, 1e-7),

(with some changes I'll push tomorrow), and in particular it is that case.

@mreineck
Collaborator

I think I can explain the recent overhead in makeplan: at some point we switched from raw malloc'ed buffers to vectors, and vectors are zero-initialized. That takes a lot of time, especially since FFTW in ESTIMATE mode doesn't touch the buffer at all. With Marco's recent fix to the planning stage, all the overhead in makeplan should already be gone, without the need for reclaimable memory etc.
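This explanation is easy to check with a small microbenchmark (mine, not from the PR): `std::vector<double>(n)` value-initializes, i.e. performs one big memset, while a raw `malloc` of the same size leaves the pages untouched.

```cpp
#include <chrono>
#include <cstdlib>
#include <vector>

// Milliseconds to construct a zero-initialized vector of n doubles.
inline double vector_ctor_ms(std::size_t n) {
  auto t0 = std::chrono::steady_clock::now();
  std::vector<double> v(n); // value-initializes: one big memset
  auto t1 = std::chrono::steady_clock::now();
  volatile double sink = v[0]; // keep the vector from being optimized away
  (void)sink;
  return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Milliseconds for a raw malloc of the same size (pages stay untouched).
inline double malloc_ms(std::size_t n) {
  auto t0 = std::chrono::steady_clock::now();
  double *p = static_cast<double *>(std::malloc(n * sizeof(double)));
  auto t1 = std::chrono::steady_clock::now();
  std::free(p);
  return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

For buffers in the hundreds of MB, the vector construction typically costs tens of ms (the memset touching every page) while the malloc returns almost immediately, which matches the makeplan numbers in the table above.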

It's not that I don't like the approach, which is pretty cool ... but this adds (in my opinion) an unnecessarily large maintenance burden in relation to the gains.

@mreineck
Collaborator

These are the results I get if I include current master in the benchmarks. I think the loss of 5% in an extreme corner case should be OK...
(benchmark plots: 192x192x128-type-2-upsamp2 00-precd-thread0, 250x250x250-type-2-upsamp2 00-precd-thread1)

@mreineck
Collaborator

We may be able to save some more time by switching from vector types to some sort of raw vectors that don't initialize their memory on construction. Not sure if that's worth it though.
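One standard way to build such a "raw vector" (a sketch under my own naming, not FINUFFT code) is an allocator whose `construct` default-initializes instead of value-initializing, so trivial element types are left uninitialized just like malloc'ed memory:

```cpp
#include <memory>
#include <utility>
#include <vector>

// Allocator that default-constructs elements: for trivial types (double,
// std::complex) this skips the zeroing memset that std::vector normally does.
template <class T, class A = std::allocator<T>>
struct no_init_allocator : A {
  template <class U> struct rebind {
    using other = no_init_allocator<
        U, typename std::allocator_traits<A>::template rebind_alloc<U>>;
  };
  using A::A;
  // Default-initialization: no-op for trivial U, matching raw malloc.
  template <class U> void construct(U *p) noexcept {
    ::new (static_cast<void *>(p)) U;
  }
  // Constructions with arguments behave exactly as usual.
  template <class U, class... Args> void construct(U *p, Args &&...args) {
    ::new (static_cast<void *>(p)) U(std::forward<Args>(args)...);
  }
};

template <class T>
using raw_vector = std::vector<T, no_init_allocator<T>>;
```

With this, `raw_vector<double> buf(n);` is cheap to construct while keeping RAII and the vector interface; the trade-off is that reading an element before writing it is undefined behavior, which is exactly the "not worth it?" question raised above.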

@DiamonDinoia
Collaborator Author

It's very machine-dependent.
I might move the reclaimable buffer into POET, as I think it is generally useful across projects. That would justify the maintenance.

Alternatively, we could switch to a better allocator like rpmalloc to avoid the maintenance.

Collaborator

@ahbarnett ahbarnett left a comment


Hi Marco, I'm really glad you took the initiative to measure the regression, figure out the cause, and look into a possible solution. The history is that I originally allocated fwBatch in the plan stage, then Martin moved it to the execute stage in 2.5.0, but I'm confused why having FFTW allocate at plan time is slower than in 2.2.0 (didn't that do the same?).
Why can't we go back to 2.2.0-style allocation? (This used more RAM than some users liked, e.g. the t1+t2 plans each used RAM.)
I.e., why can't we just undo the thing that caused the regression?

I thought we were going to discuss having an opts switch for where allocations were done, if Martin's alloc-in-exec turned out to cause slowdowns? (I'm not sure how this would interact with the execute_adjoint, which is now a feature we have to maintain).

Like Martin, I am worried about introducing such low-level platform-specific code into FINUFFT - it has to be maintained for the rest of time, even when platforms change and update. That is a pain, and not many people can do it or understand it. Is there no simpler way to pin such memory? (eg xsimd is maintained and tested by other people, so I'm fine using it... it is not our job long-term).

So, I think we all need to summarize the state of affairs here and discuss as a team before moving ahead... otherwise we'll keep adding more and more complicated Marco code to the project. I want the code to stay as simple as possible while still being somewhat close to best performance.

@@ -0,0 +1,75 @@
#include <finufft.h>
Collaborator


What does this new CI tester do? Documentation at the top of the code is needed.

include:
- { os: ubuntu-22.04, toolchain: gcc-13 }
- { os: macos-14, toolchain: llvm }
- { os: ubuntu-22.04, toolchain: gcc-13, sanitizer: ON }
Collaborator


Is this part of a different PR? Else what is its connection to the PR?

// Note: spreadinterp.cpp compilation time grows with the gap between these bounds...
inline constexpr int min_nc_given_ns(int ns) {
return std::max(common::MIN_NC, ns - 4); // note must stay in bounds from constants.h
return (std::max)(common::MIN_NC, ns - 4); // note must stay in bounds from constants.h
Collaborator


why the parens here? std::max doesn't usually need (std::max). A code comment is needed
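For context, the likely answer (consistent with the commit message's mention of "Windows min/max macro collision fixes") is that `<windows.h>` defines `min`/`max` as function-like macros unless `NOMINMAX` is set, and parenthesizing the name suppresses macro expansion. A self-contained demo, simulating the collision rather than including the real header:

```cpp
#include <algorithm>

// Simulate the macro that windows.h defines when NOMINMAX is not set:
#define max(a, b) (((a) > (b)) ? (a) : (b))

int demo(int x, int y) {
  // return std::max(x, y);   // the preprocessor would mangle this call
  return (std::max)(x, y);    // parens: "max" is not followed by "(",
                              // so the function-like macro cannot expand
}
#undef max
```

So the parentheses are a belt-and-braces guard that works even in translation units where `NOMINMAX` was not defined early enough; a one-line code comment saying exactly that would address this review point.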

#define WIN32_LEAN_AND_MEAN
#endif
#ifndef NOMINMAX
#define NOMINMAX
Collaborator


what is this? docs...

@@ -0,0 +1,136 @@
#pragma once

// Cross-platform RAII wrapper for large temporary buffers.
Collaborator


This file scares me - how are we going to maintain this as part of FINUFFT as platforms change? Is there no such service offered in a standard C++ library that someone else supports? (like xsimd). I also thought you said it was going to be "10 lines" :)

@mreineck
Collaborator

Let me try to disentangle the situation a little. It is complicated, but I think it's not as bad as it looks...

There are two aspects to the slowdown:

  • my patch shifted memory allocation to the plan execution stage. That means that for every execution, memory must be allocated - and (due to my sloppiness) I also allocated and deallocated the buffer during the planning stage, when that was actually not necessary (i.e. when FFTW was used in ESTIMATE mode or when ducc was used). This latter part has already been fixed by Marco in workflow draft #629.

  • As part of my patch we also switched to xsimd vectors instead of raw malloc'ed memory, which is more in line with the C++ philosophy that no resource should ever be left uninitialized. This causes an additional slowdown, which could also be fixed without a large effort ... all we need to do is fall back to fftw_malloc when using FFTW, or plain malloc when using ducc.
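A rough sketch of that fallback (my illustration; the function name, 64-byte alignment choice, and use of `std::aligned_alloc` rather than `fftw_malloc` are assumptions for a self-contained example):

```cpp
#include <complex>
#include <cstdlib>
#include <memory>

using cplx = std::complex<double>;

struct AlignedFree {
  void operator()(cplx *p) const { std::free(p); }
};

// Allocate an uninitialized, SIMD-aligned complex buffer with RAII cleanup.
// An FFTW build would call fftw_malloc/fftw_free here instead, which makes
// the same alignment guarantee.
inline std::unique_ptr<cplx[], AlignedFree> alloc_fw_batch(std::size_t n) {
  // aligned_alloc requires the byte count to be a multiple of the alignment.
  std::size_t bytes = ((n * sizeof(cplx) + 63) / 64) * 64;
  auto *p = static_cast<cplx *>(std::aligned_alloc(64, bytes)); // no memset
  return std::unique_ptr<cplx[], AlignedFree>(p);
}
```

This keeps the zero-initialization cost out of the hot path while still being exception-safe, at the price of one small allocation wrapper per FFT backend.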

However in my opinion this is not necessary, since the overhead I could measure was only 5% in a case that's practically pure FFT and no spread/interpolation. I would consider that acceptable.

Going back to "the plan holds onto the buffers" is something I would only do if there is hard evidence that this is required after the mitigations above have been implemented.

@DiamonDinoia
Collaborator Author

DiamonDinoia commented Mar 14, 2026

Hi team,

Thanks for the feedback; this discussion is the reason I opened the PR. I would say that a 5% hit is acceptable for now, and it can be recovered by better tuning sigma: a smaller FFT -> a smaller allocation.

xsimd is a SIMD wrapper, not a memory-management library, so it is not the right place for this. We could use allocators like rpmalloc, but I do not want to bring in an extra dependency for 10 lines of code (it is the ifdefs and comments that blow it up... as well as the C++ RAII boilerplate). I think this class can live, for example, in POET. The API it uses is either POSIX or POSIX-like. AFAIK it has not changed in at least the last 15 years; I used the Linux-only version of the class for 10-ish years. They added more options to give more control, but they did not break old code. So I am not worried about maintaining it; more to the point, it is generally useful and should live somewhere else.

I do not like the idea of pre-allocating the FFT scratch: if all the libraries pre-allocate all their scratch buffers, we will run out of memory very quickly. Also, pre-allocating while maintaining the thread safety of execute and const correctness requires static thread_local scratches, which destroy would then somehow have to clean up. None of this is worth the effort. I'd rather keep thread safety and const correctness.

So, most of the code in this PR is for testing const correctness and thread safety. I suggest we merge it, as it is a good idea to test these assertions. Then, for the scratch, we leave the allocation as-is for now, and if this scratch class is merged into POET we use it from there. That way, if the scratch breaks in future kernel releases, reverting to an aligned vector is a small change.

@ahbarnett
Collaborator

OK, thanks for the discussion. It's good to know this new memory.hpp has been stable for 10-15 yrs, and yes I think it should sit somewhere else if it's generally useful, and we make that a new header-only dependency.
I'll be happy to merge, but only after you've inserted some comments as per my review - should take a few minutes. Thanks! Alex

Use mmap+MADV_FREE for persistent fwBatch buffer in execute,
making concurrent execute calls thread-safe without repeated
allocation. Includes lazy fwBatch allocation in makeplan and
Windows min/max macro collision fixes.
@DiamonDinoia DiamonDinoia force-pushed the feat/mmap-fwbatch-alloc branch from 48be036 to 02de82d on March 16, 2026 20:00