
[BUG] Circuit breaker G1OverLimitStrategy can trap heap above threshold #20420

@bowenlan-amzn

Description


Describe the bug

The G1OverLimitStrategy allocates humongous objects in a loop to force a GC when heap usage exceeds the circuit breaker threshold (humongous allocations can trigger a collection once the IHOP threshold is exceeded). However, when the breaker limit is low, this loop can allocate too much, too fast, keeping old gen memory elevated instead of helping it recover.

We observed a production incident where a node's heap increased from 39.5GB to 93.9GB and did not recover for 30-60 minutes, even though the memory-intensive workload had completed within seconds. JFR shows that most humongous allocations after the spike came from G1OverLimitStrategy.

JVM settings

-XX:+UseG1GC
-XX:InitiatingHeapOccupancyPercent=30
-XX:G1ReservePercent=25
Heap: 125GB
G1 Region Size: 32MB
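
For context, a quick sketch of what these settings imply under standard G1 semantics (the variable names are mine and the arithmetic is back-of-the-envelope, not taken from the incident logs):

long maxHeap = 125L * 1024 * 1024 * 1024;   // 125GB heap
long regionSize = 32L * 1024 * 1024;        // 32MB G1 regions
long humongousMin = regionSize / 2;         // 16MB: any single allocation >= half a region is humongous
long ihopThreshold = maxHeap * 30 / 100;    // ~37.5GB: occupancy at which G1 starts concurrent marking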

After raising indices.breaker.total.limit from 70% to 85%, the problem never recurred.

This works because:

  • After the spike to ~92GB, heap usage stays below 106GB (85% of 125GB); see the arithmetic sketch after this list
  • The circuit breaker's over-limit strategy therefore never fires
  • GC has time to reclaim memory without additional allocation pressure
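
Rough arithmetic for the two breaker limits (my own calculation, assuming indices.breaker.total.limit is taken as a percentage of the 125GB max heap):

long maxHeap = 125L * 1024 * 1024 * 1024;  // 125GB
long limitAt70 = maxHeap * 70 / 100;       // ~87.5GB: a spike into the low 90s of GB trips the breaker,
                                           // so G1OverLimitStrategy starts allocating on top of the spike
long limitAt85 = maxHeap * 85 / 100;       // ~106.25GB: the same spike stays under the limit,
                                           // so the strategy never adds extra allocation pressure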

Relevant code:

int allocationCount = Math.toIntExact((maxHeap - memoryUsed.baseUsage) / g1RegionSize + 1);
// allocations of half-region size become a single humongous alloc, thus taking up a full region
int allocationSize = (int) (g1RegionSize >> 1);  // 16MB with 32MB regions
long maxUsageObserved = memoryUsed.baseUsage;
for (; allocationIndex < allocationCount; ++allocationIndex) {
    long current = currentMemoryUsageSupplier.getAsLong();
    if (current >= maxUsageObserved) {
        maxUsageObserved = current;
    } else {
        // we observed a memory drop, so some GC must have occurred
        break;
    }
    if (initialCollectionCount != gcCountSupplier.getAsLong()) {
        break;
    }
    localBlackHole += new byte[allocationSize].hashCode();
}

When the breaker limit is low (e.g., 70%), the threshold is crossed earlier, so memoryUsed.baseUsage is lower when the strategy triggers. This results in a large allocationCount. For example, on a 125GB heap triggered at 88GB: (125 - 88) / 0.032 + 1 ≈ 1157 iterations × 16MB ≈ 18GB of potential humongous allocations.
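
A standalone version of that calculation with the incident's approximate numbers (hypothetical values; binary units, so the iteration count comes out slightly higher than the decimal ~1157 above, but the total is still roughly 18GB):

long maxHeap = 125L * 1024 * 1024 * 1024;    // 125GB
long baseUsage = 88L * 1024 * 1024 * 1024;   // ~88GB when the strategy triggered
long g1RegionSize = 32L * 1024 * 1024;       // 32MB
int allocationCount = Math.toIntExact((maxHeap - baseUsage) / g1RegionSize + 1);  // 1185
long worstCase = (long) allocationCount * (g1RegionSize / 2);                     // ~18.5GB of humongous allocations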

When the heap is fragmented (evacuation failures), the loop allocates faster than GC can reclaim. This creates a feedback loop where the strategy meant to help actually prevents recovery.

Proposed fix

Cap allocationCount to a reasonable maximum:

private static final int MAX_ALLOCATION_ATTEMPTS = 64;  // ~1GB max per invocation
// ...
int allocationCount = Math.min(
    Math.toIntExact((maxHeap - memoryUsed.baseUsage) / g1RegionSize + 1),
    MAX_ALLOCATION_ATTEMPTS
);

This limits each invocation to ~1GB instead of potentially 18GB, reducing the risk of the strategy worsening heap pressure.

Additional Details

Related component

Search:Resiliency

Affected versions

All versions using G1OverLimitStrategy with G1GC.
