Fix CPU stall in geotile_grid aggregation (#20413) #20461
Conversation
CodeRabbit Walkthrough

Adds a guard condition to GeoGridAggregator that skips documents generating more than 10,000 tile values, preventing excessive CPU consumption during geotile_grid aggregations on complex geometries like LineStrings.

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes
Actionable comments posted: 1
🧹 Nitpick comments (2)

modules/geo/src/main/java/org/opensearch/geo/search/aggregations/bucket/geogrid/GeoGridAggregator.java (2)

97-98: Move the constant to class level for clarity and convention. Declaring `MAX_TILES_PER_DOCUMENT` inside the anonymous inner class works but is unconventional. Moving it to the class level as a `private static final` field improves discoverability and follows standard Java conventions for constants.

Suggested refactor: add `private static final int MAX_TILES_PER_DOCUMENT = 10000;` at class level (after line 60), then remove lines 97-98 from the anonymous class.

97-98: Consider making the threshold configurable. The 10,000-tile limit is a reasonable default, but different deployments may have varying tolerance for CPU usage vs. data completeness. Consider exposing this as a cluster setting or aggregation parameter in a follow-up, allowing operators to tune the threshold for their specific use cases.

Also applies to: 104-108
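As a sketch of what this nitpick suggests (class and method names here are illustrative, not the actual GeoGridAggregator API), the limit could be threaded through a constructor instead of being hard-coded:

```java
// Illustrative sketch only: shows making the tile limit injectable rather
// than a hard-coded constant. Names are hypothetical, not the real API.
class TileLimitGuard {
    private final int maxTilesPerDocument;

    TileLimitGuard(int maxTilesPerDocument) {
        this.maxTilesPerDocument = maxTilesPerDocument;
    }

    // Returns true when a document's tile count exceeds the configured
    // limit and the document should be skipped.
    boolean shouldSkip(long valuesCount) {
        return valuesCount > maxTilesPerDocument;
    }
}
```

A real implementation would source the value from a registered cluster setting or an aggregation request parameter rather than a plain constructor argument.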
🔇 Additional comments (1)

modules/geo/src/main/java/org/opensearch/geo/search/aggregations/bucket/geogrid/GeoGridAggregator.java (1)

100-124: The guard clause effectively addresses the CPU stall issue. The early-exit strategy prevents unbounded tile processing, which solves the reported cluster stability problem. The trade-off of potentially incomplete results vs. cluster availability is reasonable for this edge case with extreme geo_shape complexity.
```java
if (valuesCount > MAX_TILES_PER_DOCUMENT) {
    // Log a warning so the user knows why data is missing
    // logger.warn("Skipping doc [{}] in aggregation [{}] due to excessive tiles: [{}]", doc, name, valuesCount);
    return;
}
```
Silent data omission may mislead users with incomplete results.
Skipping documents without any indication to the user means aggregation results will be incomplete without any warning. Users relying on these results won't know data is missing, which could lead to incorrect conclusions.
Consider:
- Enable the logging (uncomment line 106) so operators can at least see warnings in logs
- Add a response header or metadata to indicate documents were skipped
- Track a metric for observability (e.g., count of skipped documents)

At minimum, the commented-out logging should be enabled rather than left as dead code.
Proposed fix: enable logging

```diff
 if (valuesCount > MAX_TILES_PER_DOCUMENT) {
-    // Log a warning so the user knows why data is missing
-    // logger.warn("Skipping doc [{}] in aggregation [{}] due to excessive tiles: [{}]", doc, name, valuesCount);
+    logger.warn("Skipping doc [{}] in aggregation [{}] due to excessive tiles: [{}]", doc, name(), valuesCount);
     return;
 }
```

Note: You'll need to add a logger field at the class level:

```java
private static final Logger logger = LogManager.getLogger(GeoGridAggregator.class);
```
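The reviewer's third suggestion above (tracking a metric for skipped documents) could take roughly this shape — a minimal sketch with hypothetical names, not the actual aggregator fields:

```java
import java.util.concurrent.atomic.LongAdder;

// Minimal sketch of a skipped-document counter; the class and field names
// are hypothetical. LongAdder keeps increments cheap under concurrency.
class SkippedDocTracker {
    private final LongAdder skippedDocs = new LongAdder();

    // Called from the collect path when a document exceeds the tile limit.
    void recordSkipped() {
        skippedDocs.increment();
    }

    // Read when building the aggregation result, so the count can be
    // attached to response metadata for clients to inspect.
    long skippedCount() {
        return skippedDocs.sum();
    }
}
```

Wiring the count into the aggregation response would still require changes in the result-creation path, which this sketch does not cover.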
❌ Gradle check result for 087b711: FAILURE. Please examine the workflow log, locate and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
…oc (opensearch-project#20413) Signed-off-by: CodeItAlone <subrato213432@gmail.com>
Codecov Report: ❌ Patch coverage is

Additional details and impacted files:

```
@@             Coverage Diff              @@
##               main   #20461      +/-   ##
============================================
+ Coverage     73.28%   73.33%    +0.05%
- Complexity    71825    71925      +100
============================================
  Files          5793     5793
  Lines        328844   328851        +7
  Branches      47343    47344        +1
============================================
+ Hits         240978   241150      +172
+ Misses        68571    68369      -202
- Partials      19295    19332       +37
```

☔ View full report in Codecov by Sentry.
andrross left a comment:
The original issue (#20413) shows this can happen with a single document containing a 3-coordinate LineString. What about that case makes this so complex? (I'll note that I'm not an expert on the geo functionality.)
The original issue also shows that the stuck query is not respecting the timeout/cancellation. That definitely appears to be a bug. If these queries can be long running, then they need to be cancel-able. Have you looked into fixing that?
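For reference, cooperative cancellation of a long-running loop typically follows the generic pattern below. This is a sketch, not OpenSearch's actual mechanism (which checks task cancellation via the search context); the names are illustrative.

```java
import java.util.function.BooleanSupplier;

// Generic cooperative-cancellation sketch: poll a cancellation flag every
// CHECK_INTERVAL iterations so a long loop can honor timeouts/cancellation.
class CancellableLoop {
    static final int CHECK_INTERVAL = 1024; // power of two, so the bitmask below works

    static long process(long iterations, BooleanSupplier cancelled) {
        long done = 0;
        for (long i = 0; i < iterations; i++) {
            // Cheap periodic check instead of checking on every iteration.
            if ((i & (CHECK_INTERVAL - 1)) == 0 && cancelled.getAsBoolean()) {
                throw new RuntimeException("cancelled after " + done + " iterations");
            }
            done++; // stand-in for per-tile work in the aggregator
        }
        return done;
    }
}
```

The key point is that the per-tile loop itself must poll the flag; a timeout set outside the loop cannot interrupt it otherwise.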
```java
if (valuesCount > MAX_TILES_PER_DOCUMENT) {
    // Log a warning so the user knows why data is missing
    logger.warn("Skipping doc [{}] in aggregation [{}] due to excessive tiles: [{}]", doc, name, valuesCount);
```
A user is unlikely to see the server-side logs. Instead they will see their query complete with no indication that they are getting incomplete/incorrect results.
PR Description
Issue Reference
Fixes #20413
Description
This PR resolves a significant performance issue where the geotile_grid aggregation causes a 100% CPU stall (unresponsive cluster) when processing complex geo_shape data (specifically LineString) at high precision levels (e.g., precision 29).
The root cause was identified as an unbounded computational loop in the aggregation logic. When a LineString crosses a large number of tiles at high zoom levels, the engine attempts to calculate and collect every single tile intersection on the main execution thread without any safety limits.
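To make the scale concrete: at geotile precision p the world is split into 2^p × 2^p tiles, so per-document tile counts explode at high precision. The following is a rough back-of-the-envelope estimate, a deliberate simplification of the real tile traversal:

```java
// Rough estimate only: approximate number of geotile columns a line
// spanning `lonDegrees` of longitude crosses at a given precision
// (2^precision tiles per axis). Not the actual traversal logic.
class TileCountEstimate {
    static long tilesAcross(int precision, double lonDegrees) {
        long tilesPerAxis = 1L << precision; // 2^precision
        return (long) (tilesPerAxis * (lonDegrees / 360.0));
    }
}
```

At precision 29, even one degree of longitude spans roughly 1.5 million tile columns (2^29 / 360 ≈ 1,491,308), which is why a short LineString can exceed any reasonable per-document budget.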
Changes
- Implemented a guard clause: added a safety threshold in GeoGridAggregator.java to check the valuesCount of a document before processing.
- Early-exit strategy: if a single document's tile count exceeds the defined limit of 10,000, the aggregator now skips that document and moves to the next, preventing the CPU from entering a long-running stall.
- Circuit-breaker logic: this approach acts as a localized circuit breaker to maintain cluster stability during "heavy" geographic queries.
Testing Performed
- Unit tests: successfully ran modules:geo:test to ensure no regressions in geographic aggregation logic.
- Targeted testing: verified the fix specifically using GeoTileGridAggregatorTests.
- Manual verification: confirmed that queries with complex LineStrings that previously stalled the cluster now return successfully without high CPU overhead.