
fcmp++: daemon sets max arenas glibc will use #228

Open: j-berman wants to merge 2 commits into seraphis-migration:fcmp++-stage from j-berman:mallopt

Conversation

@j-berman (Collaborator) commented Nov 8, 2025

This setting ensures the daemon won't use more memory than expected when batch verifying FCMP++ txs. It is the crucial setting @SNeedlewoods needs to avoid consistent OOMs when syncing, as noted here.

What was happening: @SNeedlewoods' 2-thread machine with 8 GB RAM must have been using more than 10 arenas, which would mean the allocator was caching over 8 GB of memory instead of releasing it back to the OS, even though the memory used for batch verifying FCMP++ txs is freed properly.

This PR ensures a 2-thread Linux machine using glibc's allocator will use at most 2 arenas, so no more memory gets allocated and cached than expected.

Hopefully the comments and links in the PR are also helpful in explaining the underlying reasoning.
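For context, here is a minimal sketch of the kind of call this is about; the helper name, placement, and thread-count source are illustrative, not the exact diff in this PR:

```cpp
// Sketch only: cap the number of glibc malloc arenas for this process.
#include <algorithm>
#include <thread>

#if defined(__GLIBC__)
#include <malloc.h>  // mallopt, M_ARENA_MAX (glibc-specific)
#endif

// Hypothetical helper; call early in main(), before worker threads start allocating.
void limit_malloc_arenas()
{
#if defined(__GLIBC__)
    const unsigned n_threads = std::max(1u, std::thread::hardware_concurrency());
    // Never let glibc create more arenas than there are hardware threads.
    mallopt(M_ARENA_MAX, static_cast<int>(n_threads));
#endif
}
```

Capping arenas this way bounds how much freed-but-cached memory the allocator can hold across all arenas, which is the behavior described above.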

@jeffro256 (Collaborator)

I thought the high memory usage was inside the Rust code. Does changing glibc parameters actually affect Rust memory allocation?

@j-berman (Collaborator, Author) commented Nov 9, 2025

> I thought the high memory usage was inside the Rust code

The Rust side is doing the allocations, but I believe the issue is that because the calls from C++ are multithreaded, the allocator on the Rust side ends up using multiple arenas. If the C++ side were not multithreaded, I would not expect this change to make a difference.

> Does changing glibc parameters actually affect Rust memory allocation?

The Rust side uses the system allocator, which is glibc's in @SNeedlewoods' case (and the expected default for most Linux systems). Thus, by default this call also affects the Rust side.

Any system using a different memory allocator on the Rust side is not expected to be affected by this same issue.

If the Rust side ends up using a different memory allocator, then whatever decisions that memory allocator makes could resurface the issue on that system.

@j-berman (Collaborator, Author) commented Nov 9, 2025

There is also a case for doing this on the C++ side: it avoids needing another dependency to call glibc's mallopt on the Rust side.

@jeffro256 (Collaborator)

So is the allocation the Rust side is doing a small number of very large chunks, which is why each arena is getting outsized? If so, perhaps this approach would be better suited: pop-os/cosmic-bg#73. It ostensibly sets a size threshold above which large allocations are served via mmap, so they are returned to the OS immediately on a call to free().
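A minimal sketch of that threshold approach, assuming glibc; the 128 KiB value is purely illustrative, and a similar MALLOC_MMAP_THRESHOLD_ setting is reported later in this thread not to help here:

```cpp
// Sketch only: serve allocations above a chosen size via mmap instead of the arenas.
#if defined(__GLIBC__)
#include <malloc.h>  // mallopt, M_MMAP_THRESHOLD (glibc-specific)
#endif

// Hypothetical helper; the 128 KiB threshold is illustrative only.
void prefer_mmap_for_large_allocations()
{
#if defined(__GLIBC__)
    // Allocations larger than the threshold bypass the arenas and use mmap,
    // so they are returned to the OS as soon as they are freed.
    // Note: setting this also disables glibc's dynamic threshold adjustment.
    mallopt(M_MMAP_THRESHOLD, 128 * 1024);
#endif
}
```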

@jeffro256 (Collaborator) commented Nov 9, 2025

Without having done any testing myself, and without any real evidence to back my opinion up yet, it feels wrong to decrease the number of arenas from the default. I fear it may have negative performance impacts for everything that isn't FCMP++ allocations due to a higher rate of contention. It honestly might be fine on a 2-thread system, since when any thread is looking for an arena, at most one other thread is holding one, so each thread will on average only have to try 0.5 times to get an arena. However, on a Ryzen Threadripper 3990X with 128 processing threads, this change might actually become a bottleneck.

@ComputeryPony

Perhaps it would be better to switch to just using mmap/VirtualAlloc for those big allocations in the FCMP++ code (with a fallback to standard allocations on unsupported platforms). That way there's less chance of unintended performance side effects on existing C++ code, and no need to constantly keep this change in mind when trying to optimize memory usage in the future.

@j-berman (Collaborator, Author)

I will look into both of the above ideas some more (I did explore setting a lower threshold in my investigation, but will take another pass at it). My thoughts right now:

The Rust side is doing a large number of small allocations, not a small number of large allocations. That makes the above ideas a bit trickier to implement effectively than anticipated. See this idea by @kayabaNerve to use one large allocation for all the inner vecs, which would make it simpler.

I think the perf concerns are overstated/misunderstood. This PR sets max arenas equal to the number of threads, which the man page basically even recommends as a way to boost performance here:

When employing newer glibc versions, applications may in
some cases exhibit high contention when accessing arenas.
In these cases, it may be beneficial to increase
M_ARENA_MAX to match the number of threads. This is
similar in behavior to strategies taken by tcmalloc and
jemalloc (e.g., per-thread allocation pools).

I think there may be some valid concern this solution isn't portable, and that other systems may have a similar issue depending on their allocator's behavior.

@kayabaNerve

I also suggested the idea of using a single allocation per thread for the Rust code, either static or dynamic, and writing a bespoke allocator that hands out increments of it per batch verification. That way the system only sees n large allocations.
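A minimal sketch of that bespoke-allocator idea (written in C++ for illustration; the real thing would live on the Rust side, and every name here is hypothetical):

```cpp
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Sketch only: a bump allocator over one large preallocated buffer. Every
// allocation is just an increment of an offset; freeing is a single reset at
// the end of a batch verification, so the OS only ever sees one big block.
class BumpArena
{
public:
    explicit BumpArena(std::size_t capacity) : buf_(capacity), offset_(0) {}

    void* allocate(std::size_t size, std::size_t align = alignof(std::max_align_t))
    {
        // Round the current offset up to the requested (power-of-two) alignment.
        const std::size_t aligned = (offset_ + align - 1) & ~(align - 1);
        if (aligned + size > buf_.size())
            throw std::bad_alloc();
        offset_ = aligned + size;
        return buf_.data() + aligned;
    }

    // Drop everything allocated for the current batch in O(1).
    void reset() { offset_ = 0; }

private:
    std::vector<std::uint8_t> buf_;
    std::size_t offset_;
};
```

With one such arena per verification thread, the underlying allocator only ever sees n large, long-lived allocations, regardless of how many small inner allocations a batch verification makes.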

Without using another allocator, writing one, or going through the fcmp++ codebase allocation by allocation, tuning is the available option.

One may make sure Rust is affected by having Rust explicitly use libc's allocator and call mallopt on its end. The Rust-maintained libc bindings shouldn't be a concern.

@j-berman (Collaborator, Author)

> One may make sure Rust is affected by having Rust explicitly use libc's allocator and call mallopt on its end

Personally I think this would be overkill compared to just tuning the setting on the C++ side, for a few reasons:

  • If glibc isn't available on the system, the Rust side won't be able to change M_ARENA_MAX anyway.
  • We don't know if another libc impl will have the issue or not, so we're not necessarily in a better position by forcing libc's allocator when glibc isn't available.
  • If glibc isn't the default system allocator, I don't expect glibc to be available on that system out of the box.
  • Leaning on the assumption that Rust will use the default system allocator seems sane.

I think the above is reasoning enough not to add another dependency on the Rust side, even a Rust-maintained one, nor to take on the additional overhead of calling it over the FFI and modifying the system allocator on the Rust side.

@j-berman (Collaborator, Author)

Unfortunately, even setting MALLOC_MMAP_THRESHOLD_=128 didn't curb the increasing RSS apparently caused by arenas not releasing unused memory. It also slows down verification considerably.

You can observe the increasing memory usage by watching top while running the batch_verify_from_file test from #224.

j-berman marked this pull request as draft on November 13, 2025
@j-berman (Collaborator, Author)

This change may not be necessary thanks to this drastic improvement to the FCMP++ lib's memory usage: kayabaNerve/monero-oxide#4

Will follow up.

j-berman marked this pull request as ready for review on November 19, 2025
@j-berman (Collaborator, Author)

Unfortunately, we received a report of an OOM on a machine with 4 cores and 8 GB of RAM running the improved FCMP++ lib linked above, without this change. That OOM is unsurprising because, by default, glibc can create up to 8 arenas per core (on 64-bit systems). Thus, this PR is still necessary for systems using glibc.
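As a rough worked illustration, using the numbers reported earlier in the thread (more than 10 arenas holding over 8 GB, i.e. on the order of 0.8 GB cached per arena): with 4 cores the default cap is roughly 4 * 8 = 32 arenas, so even a fraction of those arenas caching freed memory at that rate is enough to exhaust 8 GB of RAM.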

I included further rationale in the NWLB channel here.

j-berman mentioned this pull request on Jan 8, 2026