Fix CPU stall in geotile_grid aggregation (#20413) #20461
Conversation
CodeRabbit Walkthrough

Adds a guard condition to GeoGridAggregator that skips documents generating more than 10,000 tile values, preventing excessive CPU consumption during geotile_grid aggregations on complex geometries like LineStrings.

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes
Actionable comments posted: 1
🧹 Nitpick comments (2)

modules/geo/src/main/java/org/opensearch/geo/search/aggregations/bucket/geogrid/GeoGridAggregator.java (2)

97-98: Move the constant to class level for clarity and convention. Declaring `MAX_TILES_PER_DOCUMENT` inside the anonymous inner class works but is unconventional. Moving it to the class level as a `private static final` field improves discoverability and follows standard Java conventions for constants.

Suggested refactor: add `private static final int MAX_TILES_PER_DOCUMENT = 10000;` at class level (after line 60), then remove lines 97-98 from the anonymous class.

97-98: Consider making the threshold configurable. The 10,000-tile limit is a reasonable default, but different deployments may have varying tolerance for CPU usage vs. data completeness. Consider exposing this as a cluster setting or aggregation parameter in a follow-up, allowing operators to tune the threshold for their specific use cases.

Also applies to: 104-108
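As a sketch of what this nitpick suggests (class and method names here are illustrative, not the actual GeoGridAggregator API), the limit could be threaded through a constructor instead of being hard-coded:

```java
// Illustrative sketch only: shows making the tile limit injectable rather
// than a hard-coded constant. Names are hypothetical, not the real API.
class TileLimitGuard {
    private final int maxTilesPerDocument;

    TileLimitGuard(int maxTilesPerDocument) {
        this.maxTilesPerDocument = maxTilesPerDocument;
    }

    // Returns true when a document's tile count exceeds the configured
    // limit and the document should be skipped.
    boolean shouldSkip(long valuesCount) {
        return valuesCount > maxTilesPerDocument;
    }
}
```

A real implementation would source the value from a registered cluster setting or an aggregation request parameter rather than a plain constructor argument.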
🔇 Additional comments (1)

modules/geo/src/main/java/org/opensearch/geo/search/aggregations/bucket/geogrid/GeoGridAggregator.java (1)

100-124: The guard clause effectively addresses the CPU stall issue. The early-exit strategy prevents unbounded tile processing, which solves the reported cluster stability problem. The trade-off of potentially incomplete results vs. cluster availability is reasonable for this edge case with extreme geo_shape complexity.
```java
if (valuesCount > MAX_TILES_PER_DOCUMENT) {
    // Log a warning so the user knows why data is missing
    // logger.warn("Skipping doc [{}] in aggregation [{}] due to excessive tiles: [{}]", doc, name, valuesCount);
    return;
}
```
Silent data omission may mislead users with incomplete results.
Skipping documents without any indication to the user means aggregation results will be incomplete without any warning. Users relying on these results won't know data is missing, which could lead to incorrect conclusions.
Consider:
- Enable the logging (uncomment line 106) so operators can at least see warnings in logs
- Add a response header or metadata to indicate documents were skipped
- Track a metric for observability (e.g., count of skipped documents)

At minimum, the commented-out logging should be enabled rather than left as dead code.
Proposed fix: enable logging

```diff
 if (valuesCount > MAX_TILES_PER_DOCUMENT) {
-    // Log a warning so the user knows why data is missing
-    // logger.warn("Skipping doc [{}] in aggregation [{}] due to excessive tiles: [{}]", doc, name, valuesCount);
+    logger.warn("Skipping doc [{}] in aggregation [{}] due to excessive tiles: [{}]", doc, name(), valuesCount);
     return;
 }
```

Note: You'll need to add a logger field at the class level:

```java
private static final Logger logger = LogManager.getLogger(GeoGridAggregator.class);
```
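The reviewer's third suggestion above (tracking a metric for skipped documents) could take roughly this shape — a minimal sketch with hypothetical names, not the actual aggregator fields:

```java
import java.util.concurrent.atomic.LongAdder;

// Minimal sketch of a skipped-document counter; the class and field names
// are hypothetical. LongAdder keeps increments cheap under concurrency.
class SkippedDocTracker {
    private final LongAdder skippedDocs = new LongAdder();

    // Called from the collect path when a document exceeds the tile limit.
    void recordSkipped() {
        skippedDocs.increment();
    }

    // Read when building the aggregation result, so the count can be
    // attached to response metadata for clients to inspect.
    long skippedCount() {
        return skippedDocs.sum();
    }
}
```

Wiring the count into the aggregation response would still require changes in the result-creation path, which this sketch does not cover.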
❌ Gradle check result for 087b711: FAILURE. Please examine the workflow log, locate and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
…oc (opensearch-project#20413) Signed-off-by: CodeItAlone <subrato213432@gmail.com>
Codecov Report: ❌ Patch coverage is

Additional details and impacted files:

```
@@             Coverage Diff              @@
##               main   #20461      +/-   ##
============================================
+ Coverage     73.28%   73.33%    +0.05%
- Complexity    71825    71925      +100
============================================
  Files          5793     5793
  Lines        328844   328851        +7
  Branches      47343    47344        +1
============================================
+ Hits         240978   241150      +172
+ Misses        68571    68369      -202
- Partials      19295    19332       +37
```

☔ View full report in Codecov by Sentry.
andrross left a comment:
The original issue (#20413) shows this can happen with a single document containing a 3-coordinate LineString. What about that case makes this so complex? (I'll note that I'm not an expert on the geo functionality.)
The original issue also shows that the stuck query is not respecting the timeout/cancellation. That definitely appears to be a bug. If these queries can be long running, then they need to be cancel-able. Have you looked into fixing that?
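For reference, cooperative cancellation of a long-running loop typically follows the generic pattern below. This is a sketch, not OpenSearch's actual mechanism (which checks task cancellation via the search context); the names are illustrative.

```java
import java.util.function.BooleanSupplier;

// Generic cooperative-cancellation sketch: poll a cancellation flag every
// CHECK_INTERVAL iterations so a long loop can honor timeouts/cancellation.
class CancellableLoop {
    static final int CHECK_INTERVAL = 1024; // power of two, so the bitmask below works

    static long process(long iterations, BooleanSupplier cancelled) {
        long done = 0;
        for (long i = 0; i < iterations; i++) {
            // Cheap periodic check instead of checking on every iteration.
            if ((i & (CHECK_INTERVAL - 1)) == 0 && cancelled.getAsBoolean()) {
                throw new RuntimeException("cancelled after " + done + " iterations");
            }
            done++; // stand-in for per-tile work in the aggregator
        }
        return done;
    }
}
```

The key point is that the per-tile loop itself must poll the flag; a timeout set outside the loop cannot interrupt it otherwise.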
```java
if (valuesCount > MAX_TILES_PER_DOCUMENT) {
    // Log a warning so the user knows why data is missing
    logger.warn("Skipping doc [{}] in aggregation [{}] due to excessive tiles: [{}]", doc, name, valuesCount);
```
A user is unlikely to see the server-side logs. Instead they will see their query complete with no indication that they are getting incomplete/incorrect results.
PR Description
Issue Reference
Fixes #20413
Description
This PR resolves a significant performance issue where the geotile_grid aggregation causes a 100% CPU stall (unresponsive cluster) when processing complex geo_shape data (specifically LineString) at high precision levels (e.g., precision 29).
The root cause was identified as an unbounded computational loop in the aggregation logic. When a LineString crosses a large number of tiles at high zoom levels, the engine attempts to calculate and collect every single tile intersection on the main execution thread without any safety limits.
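To make the scale concrete: at geotile precision p the world is split into 2^p × 2^p tiles, so per-document tile counts explode at high precision. The following is a rough back-of-the-envelope estimate, a deliberate simplification of the real tile traversal:

```java
// Rough estimate only: approximate number of geotile columns a line
// spanning `lonDegrees` of longitude crosses at a given precision
// (2^precision tiles per axis). Not the actual traversal logic.
class TileCountEstimate {
    static long tilesAcross(int precision, double lonDegrees) {
        long tilesPerAxis = 1L << precision; // 2^precision
        return (long) (tilesPerAxis * (lonDegrees / 360.0));
    }
}
```

At precision 29, even one degree of longitude spans roughly 1.5 million tile columns (2^29 / 360 ≈ 1,491,308), which is why a short LineString can exceed any reasonable per-document budget.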
Changes
- Implemented a guard clause: added a safety threshold in GeoGridAggregator.java to check the valuesCount of a document before processing.
- Early-exit strategy: if a single document's tile count exceeds the defined limit of 10,000, the aggregator now skips that document and moves to the next, preventing the CPU from entering a long-running stall.
- Circuit-breaker logic: this approach acts as a localized circuit breaker to maintain cluster stability during "heavy" geographic queries.
Testing Performed
- Unit tests: successfully ran modules:geo:test to ensure no regressions in geographic aggregation logic.
- Targeted testing: verified the fix specifically using GeoTileGridAggregatorTests.
- Manual verification: confirmed that queries with complex LineStrings that previously stalled the cluster now return successfully without high CPU overhead.