
[BUG] Circuit breaker G1OverLimitStrategy can trap heap above threshold #20420

@bowenlan-amzn

Description


Describe the bug

The G1OverLimitStrategy allocates humongous objects in a loop to force a GC when heap usage exceeds the circuit breaker threshold (humongous allocations can trigger a collection once the IHOP threshold is exceeded). However, when the breaker limit is low, this loop can allocate too much, too fast, keeping old gen memory elevated instead of helping it recover.

We observed a production incident where a node's heap increased from 39.5GB to 93.9GB and did not recover for 30-60 minutes, even though the memory-intensive workload had completed within seconds. JFR shows that most humongous allocations after the spike came from G1OverLimitStrategy.

JVM settings

-XX:+UseG1GC
-XX:InitiatingHeapOccupancyPercent=30
-XX:G1ReservePercent=25
Heap: 125GB
G1 Region Size: 32MB
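
For context, a quick sketch of what these settings imply under standard G1 semantics (the variable names are mine and the arithmetic is back-of-the-envelope, not taken from the incident logs):

long maxHeap = 125L * 1024 * 1024 * 1024;   // 125GB heap
long regionSize = 32L * 1024 * 1024;        // 32MB G1 regions
long humongousMin = regionSize / 2;         // 16MB: any single allocation >= half a region is humongous
long ihopThreshold = maxHeap * 30 / 100;    // ~37.5GB: occupancy at which G1 starts concurrent marking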

After raising indices.breaker.total.limit from 70% to 85%, the problem never recurred.

This works because:

  • After the spike to ~92GB, heap usage stays below 106GB (85% of 125GB); see the arithmetic sketch after this list
  • The circuit breaker's over-limit strategy therefore never fires
  • GC has time to reclaim memory without additional allocation pressure
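
Rough arithmetic for the two breaker limits (my own calculation, assuming indices.breaker.total.limit is taken as a percentage of the 125GB max heap):

long maxHeap = 125L * 1024 * 1024 * 1024;  // 125GB
long limitAt70 = maxHeap * 70 / 100;       // ~87.5GB: a spike into the low 90s of GB trips the breaker,
                                           // so G1OverLimitStrategy starts allocating on top of the spike
long limitAt85 = maxHeap * 85 / 100;       // ~106.25GB: the same spike stays under the limit,
                                           // so the strategy never adds extra allocation pressure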

Relevant code:

int allocationCount = Math.toIntExact((maxHeap - memoryUsed.baseUsage) / g1RegionSize + 1);
// allocations of half-region size become a single humongous alloc, thus taking up a full region
int allocationSize = (int) (g1RegionSize >> 1);  // 16MB with 32MB regions
long maxUsageObserved = memoryUsed.baseUsage;
for (; allocationIndex < allocationCount; ++allocationIndex) {
    long current = currentMemoryUsageSupplier.getAsLong();
    if (current >= maxUsageObserved) {
        maxUsageObserved = current;
    } else {
        // we observed a memory drop, so some GC must have occurred
        break;
    }
    if (initialCollectionCount != gcCountSupplier.getAsLong()) {
        break;
    }
    localBlackHole += new byte[allocationSize].hashCode();
}

When the breaker limit is low (e.g., 70%), the threshold is crossed earlier, so memoryUsed.baseUsage is lower when the strategy triggers. This results in a large allocationCount. For example, on a 125GB heap triggered at 88GB: (125 - 88) / 0.032 + 1 ≈ 1157 iterations × 16MB ≈ 18GB of potential humongous allocations.
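
A standalone version of that calculation with the incident's approximate numbers (hypothetical values; binary units, so the iteration count comes out slightly higher than the decimal ~1157 above, but the total is still roughly 18GB):

long maxHeap = 125L * 1024 * 1024 * 1024;    // 125GB
long baseUsage = 88L * 1024 * 1024 * 1024;   // ~88GB when the strategy triggered
long g1RegionSize = 32L * 1024 * 1024;       // 32MB
int allocationCount = Math.toIntExact((maxHeap - baseUsage) / g1RegionSize + 1);  // 1185
long worstCase = (long) allocationCount * (g1RegionSize / 2);                     // ~18.5GB of humongous allocations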

When the heap is fragmented (evacuation failures), the loop allocates faster than GC can reclaim. This creates a feedback loop where the strategy meant to help actually prevents recovery.

Proposed fix

Cap allocationCount to a reasonable maximum:

private static final int MAX_ALLOCATION_ATTEMPTS = 64;  // ~1GB max per invocation
// ...
int allocationCount = Math.min(
    Math.toIntExact((maxHeap - memoryUsed.baseUsage) / g1RegionSize + 1),
    MAX_ALLOCATION_ATTEMPTS
);

This limits each invocation to ~1GB instead of potentially 18GB, reducing the risk of the strategy worsening heap pressure.

Additional Details

Related component

Search:Resiliency

Affected versions

All versions using G1OverLimitStrategy with G1GC.
