
Fix regression due to memory allocations. #831

Open
DiamonDinoia wants to merge 1 commit into master from feat/mmap-fwbatch-alloc

Conversation

@DiamonDinoia
Collaborator

@DiamonDinoia DiamonDinoia commented Mar 12, 2026

I ran some benchmarks recently and found some (hefty) regressions in a couple of cases:
(benchmark plots: 192x192x128-type-2-upsamp2 00-precd-thread0, 250x250x250-type-1-upsamp2 00-precd-thread1)

The plots show that 2.5.0 is sometimes dominated by memory allocation, and that the checkout on the right recovers all of the performance.

I added a new class, finufft::ReclaimableMemory, which reserves the memory but does not allocate (commit) it, so the plan remains small (the overhead is a few pointers).

I also added a test for the class.
There is now a TSAN run: since in 2.5.0 execute is thread-safe, I also tested that property.
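The commit message later in this PR describes the mechanism as mmap + MADV_FREE. A minimal POSIX-only sketch of that reserve-but-don't-commit idea is below; the class name, API, and error handling here are illustrative, not the actual PR code:

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <new>

// Sketch of a reclaimable buffer: mmap reserves address space, but
// anonymous pages consume no physical memory until first touched.
class ReclaimableBuffer {
  void *ptr_ = nullptr;
  std::size_t bytes_ = 0;

public:
  explicit ReclaimableBuffer(std::size_t bytes) : bytes_(bytes) {
    ptr_ = ::mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (ptr_ == MAP_FAILED) throw std::bad_alloc();
  }
  // Hint that the contents are disposable: the kernel may reclaim the
  // physical pages under pressure, but the mapping stays valid.
  void reclaim() {
#ifdef MADV_FREE
    ::madvise(ptr_, bytes_, MADV_FREE);
#else
    ::madvise(ptr_, bytes_, MADV_DONTNEED); // older-kernel fallback
#endif
  }
  void *data() { return ptr_; }
  std::size_t size() const { return bytes_; }
  ~ReclaimableBuffer() {
    if (ptr_) ::munmap(ptr_, bytes_);
  }
  ReclaimableBuffer(const ReclaimableBuffer &) = delete;
  ReclaimableBuffer &operator=(const ReclaimableBuffer &) = delete;
};
```

After `reclaim()`, the pointer remains usable; pages are simply re-faulted (possibly zeroed) on next touch, which is what keeps the plan small between executes.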

@lu1and10
Member

Is the first plot from the cluster machine? Could you also post the one from your laptop? Your laptop does not show the orange one, right?

@DiamonDinoia
Collaborator Author

DiamonDinoia commented Mar 12, 2026

| Metric | v2.2.0 | v2.3.0 | v2.4.0 | v2.5.0 | HEAD |
| --- | --- | --- | --- | --- | --- |
| Makeplan (ms) | 5 | 3 | 58 | 63 | 5 |
| Setpts (ms) | 77 | 79 | 57 | 67 | 35 |
| Execute (ms) | 945 | 1040 | 893 | 950 | 914 |
| Amortized (ms) | 1027 | 1122 | 1008 | 1080 | 954 |
| Speedup vs v2.2.0 | 1.000x | 0.915x | 1.019x | 0.951x | 1.077x |

bench.py takes too long to run on my laptop, but this is a summary of what I see. malloc takes tens of ms, but not hundreds of ms. This is using FFTW.

@DiamonDinoia DiamonDinoia requested a review from mreineck March 12, 2026 22:25
@DiamonDinoia
Collaborator Author

@mreineck I added you since this sort-of undoes something you changed.

@mreineck
Collaborator

I'll check tomorrow!

Independent of all this: if malloc or other allocation routines have large overhead, there are probably easier ways to improve things than introducing complicated internal memory management. It can be done by adjusting the behavior of the allocator via a couple of environment variables. I can provide examples if that is an acceptable solution.

Could you please paste the benchmark script?
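One way to read the env-variable suggestion above (this is my illustration, not Martin's actual proposal) is via glibc's malloc tunables, shown here as the equivalent `mallopt()` calls; the thresholds are arbitrary examples:

```cpp
#include <malloc.h> // glibc-specific: mallopt, M_MMAP_THRESHOLD, M_TRIM_THRESHOLD

// Tune glibc malloc so large, repeatedly-allocated buffers are served from
// the heap and kept around for reuse, instead of a fresh mmap/munmap cycle
// on every allocation. Returns true if both knobs were accepted.
inline bool tune_allocator_for_large_buffers() {
  // Raise the size above which malloc switches to mmap.
  bool ok = mallopt(M_MMAP_THRESHOLD, 64 * 1024 * 1024) != 0;
  // Keep freed heap memory instead of trimming it back to the OS.
  ok = mallopt(M_TRIM_THRESHOLD, 64 * 1024 * 1024) != 0 && ok;
  return ok;
}
```

The same effect is available without code changes through the `MALLOC_MMAP_THRESHOLD_` and `MALLOC_TRIM_THRESHOLD_` environment variables, which is presumably what makes this attractive as a zero-maintenance option.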

@mreineck
Collaborator

Unfortunately the M value is cut off in the plot above ... I assume this is a very "sparse" test case (M<<N1*N2*N3)?

@DiamonDinoia
Collaborator Author

The script I used is here:

Params("d", 192, 192, 128, 1, 0, 1e7, 1e-7),

(with some changes I'll push tomorrow), and in particular it is that case.

@mreineck
Collaborator

I think I can explain the recent overhead in makeplan: at some point we switched from raw malloc'ed buffers to vectors, and vectors are zero-initialized. That takes a lot of time, especially since FFTW in ESTIMATE mode doesn't touch the buffer at all. With Marco's recent fix to the planning stage, all the overhead in makeplan should already be gone, without the need for reclaimable memory etc.
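This explanation is easy to check with a small microbenchmark (mine, not from the PR): `std::vector<double>(n)` value-initializes, i.e. performs one big memset, while a raw `malloc` of the same size leaves the pages untouched.

```cpp
#include <chrono>
#include <cstdlib>
#include <vector>

// Milliseconds to construct a zero-initialized vector of n doubles.
inline double vector_ctor_ms(std::size_t n) {
  auto t0 = std::chrono::steady_clock::now();
  std::vector<double> v(n); // value-initializes: one big memset
  auto t1 = std::chrono::steady_clock::now();
  volatile double sink = v[0]; // keep the vector from being optimized away
  (void)sink;
  return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Milliseconds for a raw malloc of the same size (pages stay untouched).
inline double malloc_ms(std::size_t n) {
  auto t0 = std::chrono::steady_clock::now();
  double *p = static_cast<double *>(std::malloc(n * sizeof(double)));
  auto t1 = std::chrono::steady_clock::now();
  std::free(p);
  return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

For buffers in the hundreds of MB, the vector construction typically costs tens of ms (the memset touching every page) while the malloc returns almost immediately, which matches the makeplan numbers in the table above.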

It's not that I don't like the approach, which is pretty cool ... but this adds (in my opinion) an unnecessarily large maintenance burden in relation to the gains.

@mreineck
Collaborator

These are the results I get if I include current master in the benchmarks. I think the loss of 5% in an extreme corner case should be OK...
(benchmark plots: 192x192x128-type-2-upsamp2 00-precd-thread0, 250x250x250-type-2-upsamp2 00-precd-thread1)

@mreineck
Collaborator

We may be able to save some more time by switching from vector types to some sort of raw vectors that don't initialize their memory on construction. Not sure if that's worth it though.
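One standard way to build such a "raw vector" (a sketch under my own naming, not FINUFFT code) is an allocator whose `construct` default-initializes instead of value-initializing, so trivial element types are left uninitialized just like malloc'ed memory:

```cpp
#include <memory>
#include <utility>
#include <vector>

// Allocator that default-constructs elements: for trivial types (double,
// std::complex) this skips the zeroing memset that std::vector normally does.
template <class T, class A = std::allocator<T>>
struct no_init_allocator : A {
  template <class U> struct rebind {
    using other = no_init_allocator<
        U, typename std::allocator_traits<A>::template rebind_alloc<U>>;
  };
  using A::A;
  // Default-initialization: no-op for trivial U, matching raw malloc.
  template <class U> void construct(U *p) noexcept {
    ::new (static_cast<void *>(p)) U;
  }
  // Constructions with arguments behave exactly as usual.
  template <class U, class... Args> void construct(U *p, Args &&...args) {
    ::new (static_cast<void *>(p)) U(std::forward<Args>(args)...);
  }
};

template <class T>
using raw_vector = std::vector<T, no_init_allocator<T>>;
```

With this, `raw_vector<double> buf(n);` is cheap to construct while keeping RAII and the vector interface; the trade-off is that reading an element before writing it is undefined behavior, which is exactly the "not worth it?" question raised above.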

@DiamonDinoia
Collaborator Author

It's very machine-dependent.
I might move the reclaimable buffer into POET, as I think it is generally useful across projects. That would justify the maintenance.

Alternatively, we could switch to a better allocator like rpmalloc to avoid the maintenance.

Collaborator

@ahbarnett ahbarnett left a comment


Hi Marco, I'm really glad you took the initiative to measure the regression, figure out the cause, and look into a possible solution. The history is that I originally allocated fwBatch in the plan stage, then Martin moved it to the execute stage in 2.5.0, but I'm confused why having FFTW allocate at plan time is slower than in 2.2.0 (didn't that do the same?).
Why can't we go back to 2.2.0-style allocation? (This used more RAM than some users liked, e.g. the t1+t2 plans each used RAM.)
I.e., why can't we just undo the thing that caused the regression?

I thought we were going to discuss having an opts switch for where allocations were done, if Martin's alloc-in-exec turned out to cause slowdowns? (I'm not sure how this would interact with the execute_adjoint, which is now a feature we have to maintain).

Like Martin, I am worried about introducing such low-level platform-specific code into FINUFFT - it has to be maintained for the rest of time, even when platforms change and update. That is a pain, and not many people can do it or understand it. Is there no simpler way to pin such memory? (eg xsimd is maintained and tested by other people, so I'm fine using it... it is not our job long-term).

So, I think we all need to summarize the state of affairs here and discuss as a team before moving ahead... otherwise we'll keep adding more and more complicated Marco code to the project. I want the code to stay as simple as possible while still being somewhat close to best performance.

@@ -0,0 +1,75 @@
#include <finufft.h>
Collaborator


What does this new CI tester do? Documentation at the top of the code is needed.

include:
- { os: ubuntu-22.04, toolchain: gcc-13 }
- { os: macos-14, toolchain: llvm }
- { os: ubuntu-22.04, toolchain: gcc-13, sanitizer: ON }
Collaborator


Is this part of a different PR? Else what is its connection to the PR?

// Note: spreadinterp.cpp compilation time grows with the gap between these bounds...
inline constexpr int min_nc_given_ns(int ns) {
return std::max(common::MIN_NC, ns - 4); // note must stay in bounds from constants.h
return (std::max)(common::MIN_NC, ns - 4); // note must stay in bounds from constants.h
Collaborator


why the parens here? std::max doesn't usually need (std::max). A code comment is needed
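For context, the likely answer (consistent with the commit message's mention of "Windows min/max macro collision fixes") is that `<windows.h>` defines `min`/`max` as function-like macros unless `NOMINMAX` is set, and parenthesizing the name suppresses macro expansion. A self-contained demo, simulating the collision rather than including the real header:

```cpp
#include <algorithm>

// Simulate the macro that windows.h defines when NOMINMAX is not set:
#define max(a, b) (((a) > (b)) ? (a) : (b))

int demo(int x, int y) {
  // return std::max(x, y);   // the preprocessor would mangle this call
  return (std::max)(x, y);    // parens: "max" is not followed by "(",
                              // so the function-like macro cannot expand
}
#undef max
```

So the parentheses are a belt-and-braces guard that works even in translation units where `NOMINMAX` was not defined early enough; a one-line code comment saying exactly that would address this review point.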

#define WIN32_LEAN_AND_MEAN
#endif
#ifndef NOMINMAX
#define NOMINMAX
Collaborator


what is this? docs...

@@ -0,0 +1,136 @@
#pragma once

// Cross-platform RAII wrapper for large temporary buffers.
Collaborator


This file scares me - how are we going to maintain this as part of FINUFFT as platforms change? Is there no such service offered in a standard C++ library that someone else supports? (like xsimd). I also thought you said it was going to be "10 lines" :)

@mreineck
Collaborator

Let me try to disentangle the situation a little. It is complicated, but I think it's not as bad as it looks...

There are two aspects to the slowdown:

  • my patch shifted memory allocation to the plan execution stage. That means that for every execution, memory must be allocated - and (due to my sloppiness) I also allocated and deallocated the buffer during the planning stage, when that was actually not necessary (i.e. when FFTW was used in ESTIMATE mode or when ducc was used). This latter part has already been fixed by Marco in workflow draft #629.

  • As part of my patch we also switched to xsimd vectors instead of raw malloc'ed memory, which is more in line with the C++ philosophy that no resource should ever be left uninitialized. This causes an additional slowdown, which could also be fixed without a large effort ... all we need to do is fall back to fftw_malloc when using FFTW, or plain malloc when using ducc.
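A rough sketch of that fallback (my illustration; the function name, 64-byte alignment choice, and use of `std::aligned_alloc` rather than `fftw_malloc` are assumptions for a self-contained example):

```cpp
#include <complex>
#include <cstdlib>
#include <memory>

using cplx = std::complex<double>;

struct AlignedFree {
  void operator()(cplx *p) const { std::free(p); }
};

// Allocate an uninitialized, SIMD-aligned complex buffer with RAII cleanup.
// An FFTW build would call fftw_malloc/fftw_free here instead, which makes
// the same alignment guarantee.
inline std::unique_ptr<cplx[], AlignedFree> alloc_fw_batch(std::size_t n) {
  // aligned_alloc requires the byte count to be a multiple of the alignment.
  std::size_t bytes = ((n * sizeof(cplx) + 63) / 64) * 64;
  auto *p = static_cast<cplx *>(std::aligned_alloc(64, bytes)); // no memset
  return std::unique_ptr<cplx[], AlignedFree>(p);
}
```

This keeps the zero-initialization cost out of the hot path while still being exception-safe, at the price of one small allocation wrapper per FFT backend.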

However in my opinion this is not necessary, since the overhead I could measure was only 5% in a case that's practically pure FFT and no spread/interpolation. I would consider that acceptable.

Going back to "the plan holds onto the buffers" is something I would only do if there is hard evidence that this is required after the mitigations above have been implemented.

@DiamonDinoia
Collaborator Author

DiamonDinoia commented Mar 14, 2026

Hi team,

Thanks for the feedback; this discussion is the reason I opened the PR. I would say that a 5% hit is acceptable for now, and it can be recovered by better tuning sigma: a smaller FFT -> a smaller allocation.

xsimd is a SIMD wrapper, not a memory-management library, so it is not the right place for this. We could use allocators like rpmalloc, but I do not want to bring in an extra dependency for 10 lines of code (it is the ifdefs and comments that blow it up... as well as the C++ RAII boilerplate). I think this class can live, for example, in POET. The API it uses is either POSIX or POSIX-like. AFAIK it has not changed in at least the last 15 years; I used the Linux-only version of the class for 10-ish years. They added more options to give more control, but they did not break old code. So I am not worried about maintaining it; more to the point, it is generally useful and should live somewhere else.

I do not like the idea of pre-allocating the FFT scratch: if all the libraries pre-allocate all their scratch buffers, we will run out of memory very quickly. Also, pre-allocating while maintaining the thread safety of execute and const correctness requires static thread_local scratches, which destroy would then somehow have to clean up. None of this is worth the effort. I'd rather keep thread safety and const correctness.

So, most of the code in this PR is for testing const correctness and thread safety. I suggest we merge it, as it is a good idea to test these assertions. Then, for the scratch, we leave the allocation as-is for now, and if this scratch class is merged into POET we use it from there. That way, if the scratch breaks in future kernel releases, reverting to an aligned vector is a small change.

@ahbarnett
Collaborator

OK, thanks for the discussion. It's good to know this new memory.hpp has been stable for 10-15 yrs, and yes I think it should sit somewhere else if it's generally useful, and we make that a new header-only dependency.
I'll be happy to merge, but only after you've inserted some comments as per my review - should take a few minutes. Thanks! Alex

Use mmap+MADV_FREE for persistent fwBatch buffer in execute,
making concurrent execute calls thread-safe without repeated
allocation. Includes lazy fwBatch allocation in makeplan and
Windows min/max macro collision fixes.
@DiamonDinoia DiamonDinoia force-pushed the feat/mmap-fwbatch-alloc branch from 48be036 to 02de82d on March 16, 2026 20:00