fcmp++: daemon sets max arenas glibc will use #228
j-berman wants to merge 2 commits into seraphis-migration:fcmp++-stage
Conversation
I thought the high memory usage was inside the Rust code. Does changing …
So the Rust side is doing the allocations, but I believe the issue is also that, because the calls from C++ are multithreaded, the allocator on the Rust side is using multiple arenas. If the C++ side were not multithreaded, I would not expect this change to make a difference.
The Rust side uses the system allocator, which is glibc's in @SNeedlewoods' case (and the expected default for most Linux systems). Thus, by default, this call also affects the Rust side. Any system using a different memory allocator on the Rust side is not expected to hit this same issue; however, if the Rust side ends up using a different memory allocator, whatever decisions that allocator makes could resurface the issue on that system.
There is also a case for doing this on the C++ side, to avoid needing another dependency to call glibc's …
So is the allocation the Rust side is doing a small number of very large chunks, which is why each arena is getting outsized? If so, perhaps this approach might be better suited: pop-os/cosmic-bg#73. It ostensibly sets a maximum value for arena usage, after which large allocations use …
Without having done any testing myself, and without any real evidence to back my opinion up yet, it feels wrong to decrease the number of arenas from the default. I fear it may have negative performance impacts for everything that isn't FCMP++ allocations, due to a higher rate of contention. It honestly might be fine on a 2-thread system: if any thread is looking for an arena to use, the max number of other threads holding arenas is 1, so each thread will on average only have to retry 0.5 times to get an arena. However, on a Ryzen Threadripper 3990X with 128 processing threads, this change might actually become a bottleneck.
Perhaps it would be better to switch to just using mmap/VirtualAlloc for those big allocations in the FCMP++ code (with a fallback to standard allocations on unsupported platforms). That way there's less chance of unintended performance side effects on existing C++ code, and no need to constantly keep this change in mind when trying to optimize memory usage in the future.
I will look into both of the above ideas some more (I did explore setting a lower threshold in my investigation, but will take another pass at it). My thoughts right now: the Rust side is doing a large number of small allocations, not a small number of large allocations. That will make the above ideas trickier to implement effectively than anticipated. See this idea by @kayabaNerve to use one large allocation for all the inner vecs, which would make it simpler. I think the perf concerns are overstated/misunderstood: this PR sets max arenas equal to n threads, which the man page essentially recommends to boost performance here:
I think there may be some valid concern that this solution isn't portable, and that other systems may see a similar issue depending on their allocator's behavior.
I also suggested the idea of using a single allocation per thread for the Rust code, either static or dynamic, and writing a bespoke allocator over increments of it per batch verification. That way the system only sees n large allocations. Without using another allocator, without writing one, and without going through the fcmp++ codebase allocation by allocation, tuning is the available option. One can make sure Rust is affected by having Rust explicitly use libc's allocator and call mallopt on its end; the Rust-maintained libc bindings shouldn't be a concern.
Personally, I think this was overkill compared to just tuning the setting on the C++ side, for a few reasons:
I think the above is reasoning enough not to add another dependency on the Rust side, even if Rust-maintained, nor to add the extra overhead of calling it over the FFI and modifying the system allocator on the Rust side.
Unfortunately, setting even … You can observe the increasing memory usage by observing …
This change may not be necessary thanks to this drastic improvement to the FCMP++ lib's memory usage: kayabaNerve/monero-oxide#4. Will follow up.
This PR is still necessary for systems using glibc. Unfortunately, we received a report of an OOM on a machine with 4 cores and 8 GB of RAM with the improved FCMP++ lib linked above, and without this change. That OOM is unfortunately unsurprising because, by default, glibc can use up to … I included further rationale in the NWLB channel here.
This setting ensures the daemon won't use more memory than expected when batch verifying FCMP++ txs. This is the crucial setting necessary for @SNeedlewoods to avoid the consistent OOMs when syncing, as noted here.
What was happening: @SNeedlewoods' 2-thread machine with 8 GB of RAM must have been using more than 10 arenas, which would mean the allocator was holding onto over 8 GB of memory instead of releasing it back to the OS, even though the memory used for batch verifying FCMP++ txs was freed properly.
This PR ensures a 2-thread Linux machine using glibc's allocator will use a max of 2 arenas, so that no more memory gets allocated and cached than expected.
Hopefully the comments in the PR / links are also helpful in explaining the underlying reasoning.