
Conversation

@CodeItAlone

PR Description
Issue Reference
Fixes #20413

Description
This PR resolves a significant performance issue where the geotile_grid aggregation drives CPU usage to 100% and leaves the cluster unresponsive when processing complex geo_shape data (specifically LineString) at high precision levels (e.g., precision 29).

The root cause was identified as an unbounded computational loop in the aggregation logic. When a LineString crosses a large number of tiles at high zoom levels, the engine attempts to calculate and collect every single tile intersection on the main execution thread without any safety limits.
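
For a sense of scale, here is a rough back-of-the-envelope sketch (an illustration, not taken from the PR; the 100 km line length is a hypothetical example): at precision 29 the world is divided into 2^29 × 2^29 tiles, so even a modest LineString crosses on the order of a million of them.

    // Illustrative only: roughly how many precision-29 tiles a LineString can cross,
    // assuming tiles of equal width at the equator. All numbers are assumptions.
    public final class TileCountEstimate {
        public static void main(String[] args) {
            double earthCircumferenceMeters = 40_075_017.0;                 // equatorial circumference
            double tileWidthMeters = earthCircumferenceMeters / (1 << 29);  // ~0.075 m per tile at precision 29
            double lineLengthMeters = 100_000.0;                            // hypothetical 100 km LineString
            long approxTilesCrossed = (long) (lineLengthMeters / tileWidthMeters);
            System.out.println(approxTilesCrossed);                         // ~1.3 million tiles for a single document
        }
    }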

Changes
Implemented a Guard Clause: Added a safety threshold in GeoGridAggregator.java to check the valuesCount of a document before processing.

Early Exit Strategy: If a single document's tile count exceeds the defined limit of 10,000, the aggregator now skips that document and moves to the next, preventing the CPU from entering a long-running stall.

Circuit Breaker Logic: This approach acts as a localized circuit breaker to maintain cluster stability during "heavy" geographic queries.
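
As a rough sketch of the shape of this guard (the actual one-line check lives inside GeoGridAggregator and is visible in the review diff further down this page; the helper class and method name below are illustrative only):

    // Minimal sketch of the early-exit check described above; names are illustrative.
    final class TileGuardSketch {
        // Safety threshold: maximum number of tile values a single document may contribute.
        static final int MAX_TILES_PER_DOCUMENT = 10_000;

        // True when the aggregator should skip the document instead of collecting its tiles.
        static boolean shouldSkipDocument(int valuesCount) {
            return valuesCount > MAX_TILES_PER_DOCUMENT;
        }
    }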

Testing Performed
Unit Tests: Successfully ran modules:geo:test to ensure no regressions in geographic aggregation logic.

Targeted Testing: Verified the fix specifically using GeoTileGridAggregatorTests.

Manual Verification: Confirmed that queries with complex LineStrings that previously stalled the cluster now return successfully without high CPU overhead.

@CodeItAlone CodeItAlone requested a review from a team as a code owner January 22, 2026 18:48
@github-actions github-actions bot added the bug (Something isn't working), good first issue (Good for newcomers), and Search:Aggregations labels on Jan 22, 2026
@coderabbitai coderabbitai bot (Contributor) commented Jan 22, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Adds a guard condition to GeoGridAggregator that skips documents generating more than 10,000 tile values, preventing excessive CPU consumption during geotile_grid aggregations on complex geometries like LineStrings.

Changes

Cohort / File(s): GeoGrid Aggregator Guard (modules/geo/src/main/java/org/opensearch/geo/search/aggregations/bucket/geogrid/GeoGridAggregator.java)
Summary: Added per-document tile value cap (10,000 threshold) to skip problematic documents during aggregation processing, preventing CPU exhaustion.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)
  • Title check: ✅ Passed. The title clearly and concisely describes the main change: fixing a CPU stall issue in geotile_grid aggregation, which directly matches the core problem addressed in the changeset.
  • Description check: ✅ Passed. The PR description thoroughly explains the issue, root cause, solution implementation, and testing performed, covering all key information needed to understand the change.
  • Linked Issues check: ✅ Passed. The code changes implement the guard clause and early-exit strategy specified in issue #20413 to prevent CPU stalls by skipping documents exceeding 10,000 tile values.
  • Out of Scope Changes check: ✅ Passed. The changes are limited to adding a guard clause in GeoGridAggregator.java, which is directly scoped to fixing the identified CPU stall issue without introducing unrelated modifications.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai bot (Contributor) left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In
`@modules/geo/src/main/java/org/opensearch/geo/search/aggregations/bucket/geogrid/GeoGridAggregator.java`:
- Around line 104-108: The code currently silently returns in GeoGridAggregator
when a document exceeds MAX_TILES_PER_DOCUMENT; restore the commented
logger.warn call and add a class-level logger (e.g., private static final Logger
logger = LogManager.getLogger(GeoGridAggregator.class)) so skipped documents are
recorded (include doc, name, and valuesCount in the message), and additionally
increment a skipped-documents counter field (e.g., skippedDocs or
skippedTilesCount) inside the same branch; finally expose that counter in the
aggregation metadata or response header from the aggregator result creation path
so clients can detect that documents were omitted.
🧹 Nitpick comments (2)
modules/geo/src/main/java/org/opensearch/geo/search/aggregations/bucket/geogrid/GeoGridAggregator.java (2)

97-98: Move constant to class level for clarity and convention.

Declaring MAX_TILES_PER_DOCUMENT inside the anonymous inner class works but is unconventional. Moving it to the class level as a private static final field improves discoverability and follows standard Java conventions for constants.

Suggested refactor

Add at class level (after line 60):

private static final int MAX_TILES_PER_DOCUMENT = 10000;

Then remove lines 97-98 from the anonymous class.


97-98: Consider making the threshold configurable.

The 10,000 tile limit is a reasonable default, but different deployments may have varying tolerance for CPU usage vs. data completeness. Consider exposing this as a cluster setting or aggregation parameter in a follow-up, allowing operators to tune the threshold for their specific use cases.

Also applies to: 104-108
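
If a follow-up did make this configurable, one possible shape is sketched below, assuming OpenSearch's standard Setting API; the setting key, class name, and bounds are hypothetical and nothing like this exists in the PR.

    // Sketch only: a hypothetical dynamic cluster setting for the per-document tile cap.
    import org.opensearch.common.settings.Setting;
    import org.opensearch.common.settings.Setting.Property;

    public final class GeoGridSettingsSketch {
        public static final Setting<Integer> MAX_TILES_PER_DOCUMENT_SETTING = Setting.intSetting(
            "search.aggregations.geo_grid.max_tiles_per_document",  // hypothetical key
            10_000,                                                  // default mirrors the PR's hard-coded threshold
            1,                                                       // minimum allowed value
            Property.NodeScope,
            Property.Dynamic
        );
    }

Such a setting would still need to be registered (typically via the plugin's getSettings()) and read by the aggregator; an aggregation-level parameter would instead require extending the geotile_grid request parsing.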

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 967c809 and e43842c.

📒 Files selected for processing (1)
  • modules/geo/src/main/java/org/opensearch/geo/search/aggregations/bucket/geogrid/GeoGridAggregator.java
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: gradle-check
🔇 Additional comments (1)
modules/geo/src/main/java/org/opensearch/geo/search/aggregations/bucket/geogrid/GeoGridAggregator.java (1)

100-124: The guard clause effectively addresses the CPU stall issue.

The early-exit strategy prevents unbounded tile processing, which solves the reported cluster stability problem. The trade-off of potentially incomplete results vs. cluster availability is reasonable for this edge case with extreme geo_shape complexity.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Comment on lines 104 to 112
    if (valuesCount > MAX_TILES_PER_DOCUMENT) {
        // Log a warning so the user knows why data is missing
        // logger.warn("Skipping doc [{}] in aggregation [{}] due to excessive tiles: [{}]", doc, name, valuesCount);
        return;
    }

⚠️ Potential issue | 🟠 Major

Silent data omission may mislead users with incomplete results.

Skipping documents without any indication to the user means aggregation results will be incomplete without any warning. Users relying on these results won't know data is missing, which could lead to incorrect conclusions.

Consider:

  1. Enable the logging (uncomment line 106) so operators can at least see warnings in logs
  2. Add a response header or metadata to indicate documents were skipped
  3. Track a metric for observability (e.g., count of skipped documents)

At minimum, the commented-out logging should be enabled rather than left as dead code.

Proposed fix: Enable logging
             if (valuesCount > MAX_TILES_PER_DOCUMENT) {
-                // Log a warning so the user knows why data is missing
-                // logger.warn("Skipping doc [{}] in aggregation [{}] due to excessive tiles: [{}]", doc, name, valuesCount);
+                logger.warn("Skipping doc [{}] in aggregation [{}] due to excessive tiles: [{}]", doc, name(), valuesCount);
                 return;
             }

Note: You'll need to add a logger field at the class level:

private static final Logger logger = LogManager.getLogger(GeoGridAggregator.class);
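
For points 2 and 3 above (surfacing the omission and tracking a metric), a minimal sketch of the in-aggregator bookkeeping might look like the following; all names are illustrative and none of this exists in the PR:

    // Sketch only: count skipped documents alongside the warning above.
    // Class-level field in GeoGridAggregator (illustrative name):
    private long skippedDocs = 0;

    // ...and inside the guard branch, next to the logger.warn call from the proposed fix:
    skippedDocs++;
    // Exposing this counter to clients (e.g., via aggregation metadata or a response header)
    // would be wired up separately in the aggregator's result-creation path, which is not
    // sketched here.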

@CodeItAlone CodeItAlone force-pushed the fix/geotile-grid-cpu-stall-20413 branch from e43842c to 087b711 on January 22, 2026 19:23
@github-actions (Contributor)

❌ Gradle check result for 087b711: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@CodeItAlone CodeItAlone force-pushed the fix/geotile-grid-cpu-stall-20413 branch from 087b711 to f8113fa on January 22, 2026 20:10
@github-actions (Contributor)

✅ Gradle check result for f8113fa: SUCCESS

@codecov codecov bot commented Jan 22, 2026

Codecov Report

❌ Patch coverage is 40.00000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.33%. Comparing base (be7b387) to head (f8113fa).
⚠️ Report is 17 commits behind head on main.

Files with missing lines | Patch % | Lines
...aggregations/bucket/geogrid/GeoGridAggregator.java | 40.00% | 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #20461      +/-   ##
============================================
+ Coverage     73.28%   73.33%   +0.05%     
- Complexity    71825    71925     +100     
============================================
  Files          5793     5793              
  Lines        328844   328851       +7     
  Branches      47343    47344       +1     
============================================
+ Hits         240978   241150     +172     
+ Misses        68571    68369     -202     
- Partials      19295    19332      +37     

☔ View full report in Codecov by Sentry.

@andrross andrross (Member) left a comment


The original issue (#20413) shows this issue can happen with a single document with a 3 coordinate line string. What about that case makes this so complex? (I'll note that I'm not an expert on the geo functionality)

The original issue also shows that the stuck query is not respecting the timeout/cancellation. That definitely appears to be a bug. If these queries can be long running, then they need to be cancel-able. Have you looked into fixing that?
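
As a generic illustration of that cancellation point (this is not an OpenSearch API; the supplier, check interval, and exception below are all hypothetical), a long tile-collection loop can be made cooperative by checking a cancellation flag periodically:

    // Generic sketch of a cancellation-aware collection loop; names and types are illustrative.
    import java.util.function.BooleanSupplier;
    import java.util.function.LongConsumer;

    final class CancellableCollectSketch {
        static void collectTiles(long tileCount, BooleanSupplier isCancelled, LongConsumer collectOne) {
            for (long i = 0; i < tileCount; i++) {
                // Check the cancellation flag every 1024 iterations to keep the overhead negligible.
                if ((i & 0x3FF) == 0 && isCancelled.getAsBoolean()) {
                    throw new IllegalStateException("query cancelled");  // real code would use the engine's cancellation exception
                }
                collectOne.accept(i);
            }
        }
    }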


    if (valuesCount > MAX_TILES_PER_DOCUMENT) {
        // Log a warning so the user knows why data is missing
        logger.warn("Skipping doc [{}] in aggregation [{}] due to excessive tiles: [{}]", doc, name, valuesCount);

A user is unlikely to see the server-side logs. Instead they will see their query complete with no indication that they are getting incomplete/incorrect results.


Labels

bug (Something isn't working), good first issue (Good for newcomers), Search:Aggregations


Development

Successfully merging this pull request may close these issues.

[BUG] geotile_grid aggregation on LineString maxes out CPU and stalls cluster

2 participants