-
Dear .NET Community and Maintainers,

I am reaching out to seek your expertise and insights regarding a performance challenge we are facing with our ASP.NET Core 6.0 application, which is currently under significant load. Our backend service processes billions of requests per hour, and despite our server farm's capacity, we are striving to optimize our backend code for maximum performance and minimal latency.

We have observed that a single instance of our application generates up to 2 GB of allocations per second. Despite our efforts to minimize allocations, a substantial portion remains unavoidable. This leads to our primary concern: the Garbage Collector (GC) heuristics appear to struggle under such intense load, particularly with the volume of items to manage and the resulting heap fragmentation. Our monitoring indicates that the GC never goes longer than one second without triggering, suggesting a possible timeout mechanism that we have not found documented.

When memory consumption reaches its maximum, we experience a "threshold effect": the GC can no longer keep up, causing severe performance degradation. As a temporary measure, we have resorted to forcing full compacting garbage collections at regular intervals (hence the memory spikes on the graph). While this approach mitigates the issue, it is not a viable long-term solution, and in some situations (for example, the screenshot above) it is not enough.

Given this context, we have several questions:
We are open to any suggestions or guidance you can provide. As you can see, outside of the overload situation, CPU usage is only at about 40%, and we also have quite a margin on the network stack. The GC appears to be the current bottleneck, so our goal is to fine-tune the garbage collection process to ensure stability and performance, allowing us to further increase service throughput at low latency. Thank you for your time and assistance!
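For reference, the interval-based workaround described above can be sketched roughly as follows. The `GC.Collect` overload and the LOH compaction setting are real .NET APIs; the `BackgroundService` hosting and the five-minute interval are assumptions for illustration, not what our production code necessarily looks like:

```csharp
using System;
using System.Runtime;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;

// Hypothetical background service that forces a full compacting GC at a
// fixed interval, mirroring the stop-gap described above.
public sealed class PeriodicCompactionService : BackgroundService
{
    private static readonly TimeSpan Interval = TimeSpan.FromMinutes(5); // assumed

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            await Task.Delay(Interval, stoppingToken);

            // Ask the next full GC to also compact the large object heap,
            // which is otherwise only swept (a common source of fragmentation).
            GCSettings.LargeObjectHeapCompactionMode =
                GCLargeObjectHeapCompactionMode.CompactOnce;

            // Blocking, compacting gen2 collection.
            GC.Collect(2, GCCollectionMode.Forced, blocking: true, compacting: true);
        }
    }
}
```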
-
I've read most of @Maoni0's memdoc, but the issue here is not that there are "Too many pauses, ie, too many GCs" or "Long individual pauses"; rather, the GC is not aggressive enough and can no longer catch up once the heap becomes too large and too fragmented. (Btw, swap is disabled on the Linux server used for the screenshots. I am also wondering why it didn't go OOM at some point.)
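Before forcing collections, it may be worth experimenting with the documented GC configuration knobs in `runtimeconfig.json`. A sketch, where the specific values are assumptions to be tuned rather than recommendations (`System.GC.ConserveMemory` trades throughput for a smaller, less fragmented heap; `System.GC.HeapHardLimitPercent` caps the heap before machine-wide memory load becomes critical):

```json
{
  "runtimeOptions": {
    "configProperties": {
      "System.GC.Server": true,
      "System.GC.ConserveMemory": 5,
      "System.GC.HeapHardLimitPercent": 75
    }
  }
}
```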
-
there's no such thing.
if a GC is not a full GC, by design it's not going to collect the full heap. and while a GC is not necessarily compacting, when it does compact it's not going to do only part of the job it set out to do. in other words, if it's a gen1 GC it will collect the whole of gen0/gen1. the best way to illustrate what you are seeing is to share a top level GC trace: https://github.com/Maoni0/mem-doc/blob/master/doc/.NETMemoryPerformanceAnalysis.md#how-to-collect-top-level-gc-metrics
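Concretely, a top-level GC trace as described in the linked doc can be collected with either of the following (the process id is a placeholder):

```
# Cross-platform: lightweight GC-only events via dotnet-trace
dotnet-trace collect --process-id <pid> --profile gc-collect

# Windows: PerfView with only top-level GC events (very low overhead)
PerfView.exe /GCCollectOnly /AcceptEULA /nogui collect
```

The resulting trace can then be opened in PerfView's GCStats view.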
-
Simply installing a 1TB SSD in your dev machine, setting all of it as virtual memory, and disabling the GC during code execution until the work is done solves the problem |
first of all, I don't see any kind of "1s threshold effect". each GC's start and end are recorded in the trace; if you open it in PerfView's GCStats view you'll see the PauseStart for each GC (along with a Pause MSec column that tells you how long that GC pauses the managed threads for). there are no GCs with more than 1s in between them; most of them are 200ms or less.
the big problem I see is kind of the opposite of what you described: GCs happened too often, which causes a very high % time in GC. the reason for this is that you are operating near 85% memory load, and at that point the GC starts to tighten up the gen0 allocation budget. it looks like you have 16GB memory on the machine? t…
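The memory-load threshold being described here can be observed from inside the process via `GC.GetGCMemoryInfo`. A small sketch (the APIs are real .NET APIs; the printout format is arbitrary):

```csharp
using System;

class MemoryLoadProbe
{
    static void Main()
    {
        GCMemoryInfo info = GC.GetGCMemoryInfo();

        // MemoryLoadBytes is the machine-wide memory load the GC saw at the
        // last GC; HighMemoryLoadThresholdBytes is the point at which the GC
        // becomes much more aggressive about keeping the heap small.
        double loadPercent = 100.0 * info.MemoryLoadBytes
                                   / info.TotalAvailableMemoryBytes;

        Console.WriteLine($"memory load: {loadPercent:F1}%");
        Console.WriteLine($"high-load threshold (bytes): {info.HighMemoryLoadThresholdBytes}");
        Console.WriteLine($"heap size (bytes): {info.HeapSizeBytes}");
    }
}
```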