Skip to content

Fix geotile_grid aggregation CPU stall on large geo_shapes#20929

Draft
harshavamsi wants to merge 1 commit intoopensearch-project:mainfrom
harshavamsi:fix/geotile-grid-linestring-cpu-stall
Draft

Fix geotile_grid aggregation CPU stall on large geo_shapes#20929
harshavamsi wants to merge 1 commit intoopensearch-project:mainfrom
harshavamsi:fix/geotile-grid-linestring-cpu-stall

Conversation

@harshavamsi
Copy link
Contributor

No description provided.

@github-actions
Copy link
Contributor

github-actions bot commented Mar 19, 2026

PR Reviewer Guide 🔍

(Review updated until commit 3fe7d96)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Fix geotile_grid CPU stall with tile limit and cancellation

Relevant files:

  • server/src/main/java/org/opensearch/search/aggregations/bucket/GeoTileUtils.java
  • modules/geo/src/main/java/org/opensearch/geo/search/aggregations/bucket/geogrid/GeoTileGridAggregatorFactory.java
  • server/src/test/java/org/opensearch/search/aggregations/bucket/geogrid/GeoTileUtilsTests.java
  • modules/geo/src/test/java/org/opensearch/geo/search/aggregations/bucket/geogrid/GeoTileGridAggregatorTests.java

Sub-PR theme: Add cooperative cancellation to geohash_grid aggregation

Relevant files:

  • modules/geo/src/main/java/org/opensearch/geo/search/aggregations/bucket/geogrid/util/GeoShapeHashUtil.java
  • modules/geo/src/main/java/org/opensearch/geo/search/aggregations/bucket/geogrid/GeoHashGridAggregatorFactory.java

⚡ Recommended focus areas for review

Hard Limit Behavior

The hard limit of MAX_TILES_PER_SHAPE (65536) throws an IllegalArgumentException when exceeded. This is a breaking change for users who previously ran large geo_shape aggregations successfully (even if slowly). Consider whether a warning log + graceful degradation (e.g., skipping the shape or using a coarser bounding-box approximation) would be more appropriate than a hard failure, especially since the error surfaces as an aggregation failure rather than a partial result.

if (potentialTiles > MAX_TILES_PER_SHAPE) {
    throw new IllegalArgumentException(
        String.format(
            Locale.ROOT,
            "Tile aggregation on GeoShape field would generate too many tiles (%d) at precision %d. "
                + "The shape spans %d x-tiles and %d y-tiles. Maximum allowed is %d tiles per shape. "
                + "Consider using a lower precision or a smaller shape.",
            potentialTiles,
            precision,
            xTileRange,
            yTileRange,
            MAX_TILES_PER_SHAPE
        )
    );
}
Cancellation Redundancy

With the hard tile limit in place, the cancellation check inside the nested loop (lines 361-363) can only fire for shapes that pass the MAX_TILES_PER_SHAPE guard (i.e., ≤65536 tiles). At CANCELLATION_CHECK_INTERVAL=1024, the check fires at most ~64 times per shape. This is fine, but the two mechanisms (hard limit + cooperative cancellation) serve overlapping purposes. The cancellation path is now only reachable for shapes within the limit, so its practical value is reduced. Consider documenting this interaction clearly.

int iterationCount = 0;
for (int x = minXTile; x <= maxXTile; x++) {
    for (int y = minYTile; y <= maxYTile; y++) {
        if (++iterationCount % CANCELLATION_CHECK_INTERVAL == 0) {
            checkCancelled.run();
        }
Missing Tile Limit

The GeoShapeHashUtil (geohash_grid path) only adds cooperative cancellation but does NOT add the equivalent hard tile/cell limit that GeoTileUtils received. A large geo_shape at high geohash precision can still iterate over a huge number of cells before the cancellation callback fires (every 1024 iterations). This asymmetry means the geohash_grid aggregation remains vulnerable to the same CPU stall that motivated this fix.

int iterationCount = 0;
while (true) {
    if (++iterationCount % CANCELLATION_CHECK_INTERVAL == 0) {
        checkCancelled.run();
    }
Test Exception Type

The test testLineStringWithLargeBoundingBoxAtHighPrecision expects an IllegalArgumentException. If the aggregation framework wraps the exception before surfacing it (e.g., in a SearchPhaseExecutionException or AggregationExecutionException), the test will fail unexpectedly. Verify that the exception propagates unwrapped through searchAndReduce.

Exception exception = expectThrows(
    IllegalArgumentException.class,
    () -> searchAndReduce(searcher, new MatchAllDocsQuery(), aggBuilder, fieldType)
);
assertTrue("Exception should mention tile limit", exception.getMessage().contains("would generate too many tiles"));

@github-actions
Copy link
Contributor

github-actions bot commented Mar 19, 2026

PR Code Suggestions ✨

Latest suggestions up to 3fe7d96

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
General
Add hard limit guard to geohash encoding path

The GeoShapeHashUtil.encodeShape method adds periodic cancellation checks but does
not add the same hard tile-count limit that GeoTileUtils.encodeShape has. Without a
pre-check on the bounding box size, a large geo_shape can still cause a CPU stall in
the geohash path — the cancellation only helps if the task is already cancelled
externally. Consider adding a similar MAX_HASHES_PER_SHAPE guard before the loop,
consistent with the tile implementation.

modules/geo/src/main/java/org/opensearch/geo/search/aggregations/bucket/geogrid/util/GeoShapeHashUtil.java [59-63]

+// Calculate the potential number of hashes to prevent excessive iteration
+// (similar guard as in GeoTileUtils.encodeShape)
+// ... add bounding box size check here before entering the loop ...
+
 int iterationCount = 0;
 while (true) {
     if (++iterationCount % CANCELLATION_CHECK_INTERVAL == 0) {
         checkCancelled.run();
     }
Suggestion importance[1-10]: 6

__

Why: The PR adds a hard tile limit to GeoTileUtils.encodeShape but not to GeoShapeHashUtil.encodeShape, creating an inconsistency where the geohash path remains vulnerable to CPU stalls. The improved_code is only a placeholder comment, but the underlying concern is valid and important for consistency.

Low
Use appropriate exception type for data limit exceeded

Throwing an IllegalArgumentException for a runtime data condition (a document's
shape being too large) is semantically incorrect — this is not an invalid argument
to the method, but rather a data-driven limit being exceeded. This will also surface
as an unhandled exception to the user rather than being treated as a search error.
Consider using a dedicated exception type or returning an empty list with a warning,
depending on the desired behavior.

server/src/main/java/org/opensearch/search/aggregations/bucket/GeoTileUtils.java [341-355]

 if (potentialTiles > MAX_TILES_PER_SHAPE) {
-    throw new IllegalArgumentException(
+    throw new OpenSearchException(
         String.format(
             Locale.ROOT,
             "Tile aggregation on GeoShape field would generate too many tiles (%d) at precision %d. "
                 + "The shape spans %d x-tiles and %d y-tiles. Maximum allowed is %d tiles per shape. "
                 + "Consider using a lower precision or a smaller shape.",
             potentialTiles,
             precision,
             xTileRange,
             yTileRange,
             MAX_TILES_PER_SHAPE
         )
     );
 }
Suggestion importance[1-10]: 5

__

Why: The suggestion raises a valid semantic concern — IllegalArgumentException implies a bad method argument, not a data-driven limit. However, the test code explicitly expects IllegalArgumentException, so changing the exception type would require updating tests too. The suggestion's improved_code uses OpenSearchException which is a reasonable alternative, but the impact depends on how callers handle this exception.

Low
Avoid swallowing IOException in test method

The outer try-catch for IOException in
testLineStringWithLargeBoundingBoxAtHighPrecision swallows the exception and calls
fail(), but the inner try-with-resources block already handles IOException via the
method signature. This pattern can hide real I/O failures and is inconsistent with
testLineStringWithModerateBoundingBox which properly declares throws IOException.
The test method should declare throws IOException instead.

modules/geo/src/test/java/org/opensearch/geo/search/aggregations/bucket/geogrid/GeoTileGridAggregatorTests.java [112-114]

-} catch (IOException e) {
-    fail("IOException during test: " + e.getMessage());
+public void testLineStringWithLargeBoundingBoxAtHighPrecision() throws IOException {
+    final String fieldName = "location";
+    Line line = new Line(new double[] { -170.0, 170.0 }, new double[] { -60.0, 60.0 });
+
+    try (Directory dir = newDirectory(); RandomIndexWriter writer = new RandomIndexWriter(random(), dir)) {
+        Document doc = new Document();
+        doc.add(LatLonShape.createDocValueField(fieldName, GeoShapeUtils.toLuceneLine(line)));
+        writer.addDocument(doc);
+
+        MappedFieldType fieldType = new GeoShapeFieldMapper.GeoShapeFieldType(fieldName);
+
+        GeoTileGridAggregationBuilder aggBuilder = new GeoTileGridAggregationBuilder("geotile_grid");
+        aggBuilder.field(fieldName).precision(15);
+
+        try (IndexReader reader = writer.getReader()) {
+            IndexSearcher searcher = new IndexSearcher(reader);
+
+            Exception exception = expectThrows(
+                IllegalArgumentException.class,
+                () -> searchAndReduce(searcher, new MatchAllDocsQuery(), aggBuilder, fieldType)
+            );
+            assertTrue("Exception should mention tile limit", exception.getMessage().contains("would generate too many tiles"));
+        }
+    }
 }
Suggestion importance[1-10]: 4

__

Why: The inconsistency between testLineStringWithLargeBoundingBoxAtHighPrecision (catching IOException) and testLineStringWithModerateBoundingBox (declaring throws IOException) is a valid style issue. Declaring throws IOException is cleaner and consistent with the other test method.

Low
Possible issue
Guard against negative tile range values

The calculation of xTileRange and yTileRange could produce negative values if
minXTile > maxXTile or minYTile > maxYTile due to edge cases (e.g., shapes crossing
the antimeridian or poles). A negative range would make potentialTiles negative,
bypassing the limit check entirely. Add guards to ensure ranges are non-negative
before computing potentialTiles.

server/src/main/java/org/opensearch/search/aggregations/bucket/GeoTileUtils.java [336-338]

-long xTileRange = (long) maxXTile - minXTile + 1;
-long yTileRange = (long) maxYTile - minYTile + 1;
+long xTileRange = Math.max(0L, (long) maxXTile - minXTile + 1);
+long yTileRange = Math.max(0L, (long) maxYTile - minYTile + 1);
 long potentialTiles = xTileRange * yTileRange;
Suggestion importance[1-10]: 5

__

Why: While edge cases with antimeridian-crossing shapes could theoretically produce unexpected values, the existing getXTile/getYTile methods likely clamp values, making this a low-probability issue. Still, the defensive Math.max(0L, ...) guard is a reasonable safety measure.

Low

Previous suggestions

Suggestions up to commit 3d421f9
CategorySuggestion                                                                                                                                    Impact
General
Avoid failing entire aggregation for oversized shapes

Throwing an IllegalArgumentException for a shape that exceeds the tile limit is a
breaking change in behavior — callers processing many documents will have the entire
aggregation fail rather than just skipping the oversized shape. Consider logging a
warning and returning an empty list (or a sampled subset) instead, so that other
documents in the aggregation are still processed correctly.

server/src/main/java/org/opensearch/search/aggregations/bucket/GeoTileUtils.java [341-355]

 if (potentialTiles > MAX_TILES_PER_SHAPE) {
-    throw new IllegalArgumentException(
-        String.format(
-            Locale.ROOT,
-            "Tile aggregation on GeoShape field would generate too many tiles (%d) at precision %d. "
-                + "The shape spans %d x-tiles and %d y-tiles. Maximum allowed is %d tiles per shape. "
-                + "Consider using a lower precision or a smaller shape.",
-            potentialTiles,
-            precision,
-            xTileRange,
-            yTileRange,
-            MAX_TILES_PER_SHAPE
-        )
-    );
+    // Skip shapes that would generate too many tiles rather than failing the entire aggregation
+    return Collections.emptyList();
 }
Suggestion importance[1-10]: 6

__

Why: Throwing IllegalArgumentException for shapes exceeding the tile limit will fail the entire aggregation rather than gracefully skipping the problematic document. Returning an empty list would be more resilient, though the test in GeoTileGridAggregatorTests explicitly expects the exception, so this is a design trade-off worth considering.

Low
Possible issue
Guard against negative tile range values

The calculation of xTileRange and yTileRange could be negative if minXTile >
maxXTile or minYTile > maxYTile due to edge cases (e.g., shapes crossing the
antimeridian or poles). This would result in a negative potentialTiles value,
bypassing the limit check entirely. Add guards to ensure the ranges are non-negative
before computing potentialTiles.

server/src/main/java/org/opensearch/search/aggregations/bucket/GeoTileUtils.java [336-338]

-long xTileRange = (long) maxXTile - minXTile + 1;
-long yTileRange = (long) maxYTile - minYTile + 1;
+long xTileRange = Math.max(0L, (long) maxXTile - minXTile + 1);
+long yTileRange = Math.max(0L, (long) maxYTile - minYTile + 1);
 long potentialTiles = xTileRange * yTileRange;
Suggestion importance[1-10]: 4

__

Why: While theoretically possible, in practice minXTile should always be <= maxXTile since they're derived from minX and maxX of a bounding rectangle. The Math.max(0L, ...) guard is a defensive improvement but the scenario is unlikely given the bounding rectangle contract.

Low
Prevent integer overflow in iteration counter

The iterationCount variable is an int, which will overflow to negative values after
~2 billion iterations. Once negative, iterationCount % CANCELLATION_CHECK_INTERVAL
will never equal 0 for negative odd multiples, causing cancellation checks to be
skipped. Use a long for iterationCount to avoid this overflow.

modules/geo/src/main/java/org/opensearch/geo/search/aggregations/bucket/geogrid/util/GeoShapeHashUtil.java [59-63]

-int iterationCount = 0;
+long iterationCount = 0;
 while (true) {
     if (++iterationCount % CANCELLATION_CHECK_INTERVAL == 0) {
         checkCancelled.run();
     }
Suggestion importance[1-10]: 3

__

Why: While technically correct that int overflow could cause issues, the MAX_TILES_PER_SHAPE limit (65536) in GeoTileUtils and the bounding box iteration in GeoShapeHashUtil make it practically impossible to reach 2 billion iterations. The suggestion is valid but has negligible real-world impact.

Low
Suggestions up to commit c660272
CategorySuggestion                                                                                                                                    Impact
General
Avoid breaking existing queries with hard exception

Throwing an IllegalArgumentException for a shape that exceeds the tile limit is a
breaking change for existing users who may have valid queries with large shapes at
high precision. This will silently fail aggregations rather than returning partial
or degraded results. Consider using the checkCancelled callback mechanism already in
place, or logging a warning and capping the iteration, rather than throwing a hard
exception that breaks existing functionality.

server/src/main/java/org/opensearch/search/aggregations/bucket/GeoTileUtils.java [341-355]

 if (potentialTiles > MAX_TILES_PER_SHAPE) {
-    throw new IllegalArgumentException(
-        String.format(
-            Locale.ROOT,
-            "Tile aggregation on GeoShape field would generate too many tiles (%d) at precision %d. "
-                + "The shape spans %d x-tiles and %d y-tiles. Maximum allowed is %d tiles per shape. "
-                + "Consider using a lower precision or a smaller shape.",
-            potentialTiles,
-            precision,
-            xTileRange,
-            yTileRange,
-            MAX_TILES_PER_SHAPE
-        )
-    );
+    // Skip encoding this shape silently or log a warning; avoid breaking existing queries
+    return Collections.emptyList();
 }
Suggestion importance[1-10]: 6

__

Why: Throwing IllegalArgumentException for shapes exceeding the tile limit is a breaking change for existing users with valid large-shape queries. The suggestion to return an empty list or use a softer degradation strategy is architecturally sound, though the improved code silently drops results which may also be undesirable. The concern about backward compatibility is legitimate and important.

Low
Possible issue
Guard against negative tile range values

The calculation of xTileRange and yTileRange could produce negative values if
minXTile > maxXTile or minYTile > maxYTile due to edge cases (e.g., shapes crossing
the antimeridian or poles). A negative range would make potentialTiles negative,
bypassing the limit check entirely. Add a guard to ensure ranges are non-negative
before computing potentialTiles.

server/src/main/java/org/opensearch/search/aggregations/bucket/GeoTileUtils.java [336-338]

-long xTileRange = (long) maxXTile - minXTile + 1;
-long yTileRange = (long) maxYTile - minYTile + 1;
+long xTileRange = Math.max(0L, (long) maxXTile - minXTile + 1);
+long yTileRange = Math.max(0L, (long) maxYTile - minYTile + 1);
 long potentialTiles = xTileRange * yTileRange;
Suggestion importance[1-10]: 5

__

Why: While the concern about negative ranges is valid in theory, the getXTile and getYTile methods should return consistent values for valid bounding rectangles. However, edge cases like antimeridian-crossing shapes could potentially cause issues, making this a reasonable defensive guard with moderate impact.

Low
Prevent integer overflow in iteration counter

The iterationCount variable is declared as int, which will overflow to negative
values after ~2 billion iterations. Once negative, iterationCount %
CANCELLATION_CHECK_INTERVAL will never equal 0 for certain values, causing
cancellation checks to be skipped. Declare iterationCount as long to prevent
overflow.

modules/geo/src/main/java/org/opensearch/geo/search/aggregations/bucket/geogrid/util/GeoShapeHashUtil.java [59-63]

-int iterationCount = 0;
+long iterationCount = 0;
 while (true) {
     if (++iterationCount % CANCELLATION_CHECK_INTERVAL == 0) {
         checkCancelled.run();
     }
Suggestion importance[1-10]: 3

__

Why: While technically correct that int can overflow after ~2 billion iterations, in practice the geohash iteration loop would never reach that count due to the nature of geohash grids. The overflow scenario is extremely unlikely in real usage, making this a low-impact theoretical fix.

Low
Suggestions up to commit 27f9bcb
CategorySuggestion                                                                                                                                    Impact
General
Avoid breaking aggregation for oversized shapes

Throwing an IllegalArgumentException for a runtime data condition (a document's
shape being too large) is semantically incorrect — this is not an invalid argument
to the method, but rather a data-driven limit being exceeded. This could also
surface as an unexpected error to users. Consider using a dedicated exception type
or simply returning an empty list with a log warning, to avoid breaking the
aggregation for a single oversized document.

server/src/main/java/org/opensearch/search/aggregations/bucket/GeoTileUtils.java [341-355]

 if (potentialTiles > MAX_TILES_PER_SHAPE) {
-    throw new IllegalArgumentException(
-        String.format(
-            Locale.ROOT,
-            "Tile aggregation on GeoShape field would generate too many tiles (%d) at precision %d. "
-                + "The shape spans %d x-tiles and %d y-tiles. Maximum allowed is %d tiles per shape. "
-                + "Consider using a lower precision or a smaller shape.",
-            potentialTiles,
-            precision,
-            xTileRange,
-            yTileRange,
-            MAX_TILES_PER_SHAPE
-        )
-    );
+    // Skip shapes that would generate too many tiles to prevent CPU stalls.
+    // Log a warning and return an empty list rather than failing the entire aggregation.
+    return Collections.emptyList();
 }
Suggestion importance[1-10]: 6

__

Why: Throwing IllegalArgumentException for a data-driven limit (a document's shape being too large) is semantically incorrect and could break the entire aggregation for a single oversized document. Returning an empty list with a warning would be more appropriate behavior, though the test in the PR explicitly expects this exception, so adopting this suggestion would require test changes too.

Low
Propagate IOException instead of swallowing it

The outer try-catch for IOException swallows the exception and calls fail(), but the
expectThrows block inside already handles the expected exception. If an IOException
occurs during setup (e.g., writer.addDocument), the test will fail with a less
informative message. The method signature should declare throws IOException instead,
allowing the test framework to report the actual cause clearly.

modules/geo/src/test/java/org/opensearch/geo/search/aggregations/bucket/geogrid/GeoTileGridAggregatorTests.java [112-114]

-} catch (IOException e) {
-    fail("IOException during test: " + e.getMessage());
+public void testLineStringWithLargeBoundingBoxAtHighPrecision() throws IOException {
+    final String fieldName = "location";
+    Line line = new Line(new double[] { -170.0, 170.0 }, new double[] { -60.0, 60.0 });
+
+    try (Directory dir = newDirectory(); RandomIndexWriter writer = new RandomIndexWriter(random(), dir)) {
+        Document doc = new Document();
+        doc.add(LatLonShape.createDocValueField(fieldName, GeoShapeUtils.toLuceneLine(line)));
+        writer.addDocument(doc);
+
+        MappedFieldType fieldType = new GeoShapeFieldMapper.GeoShapeFieldType(fieldName);
+
+        GeoTileGridAggregationBuilder aggBuilder = new GeoTileGridAggregationBuilder("geotile_grid");
+        aggBuilder.field(fieldName).precision(15);
+
+        try (IndexReader reader = writer.getReader()) {
+            IndexSearcher searcher = new IndexSearcher(reader);
+
+            Exception exception = expectThrows(
+                IllegalArgumentException.class,
+                () -> searchAndReduce(searcher, new MatchAllDocsQuery(), aggBuilder, fieldType)
+            );
+            assertTrue("Exception should mention tile limit", exception.getMessage().contains("would generate too many tiles"));
+        }
+    }
 }
Suggestion importance[1-10]: 5

__

Why: Catching IOException and calling fail() produces less informative error messages compared to declaring throws IOException on the method. The improved code correctly restructures the test to propagate the exception, improving test diagnostics.

Low
Possible issue
Guard against negative tile range values

The calculation of xTileRange and yTileRange could produce negative values if
minXTile > maxXTile or minYTile > maxYTile due to edge cases (e.g., shapes crossing
the antimeridian or poles). This would cause potentialTiles to be negative,
bypassing the limit check entirely. Add guards to ensure the ranges are non-negative
before computing potentialTiles.

server/src/main/java/org/opensearch/search/aggregations/bucket/GeoTileUtils.java [336-338]

-long xTileRange = (long) maxXTile - minXTile + 1;
-long yTileRange = (long) maxYTile - minYTile + 1;
+long xTileRange = Math.max(0L, (long) maxXTile - minXTile + 1);
+long yTileRange = Math.max(0L, (long) maxYTile - minYTile + 1);
 long potentialTiles = xTileRange * yTileRange;
Suggestion importance[1-10]: 5

__

Why: If minXTile > maxXTile or minYTile > maxYTile due to edge cases, potentialTiles could be negative, bypassing the limit check. Using Math.max(0L, ...) is a reasonable defensive guard, though such edge cases may be rare in practice.

Low
Prevent integer overflow in iteration counter

The iterationCount variable is declared as int, which will overflow to negative
values after ~2 billion iterations and cause the modulo check to stop working
correctly. For very large shapes, this could silently disable cancellation checks.
Use a long instead to safely handle large iteration counts.

modules/geo/src/main/java/org/opensearch/geo/search/aggregations/bucket/geogrid/util/GeoShapeHashUtil.java [59-63]

-int iterationCount = 0;
+long iterationCount = 0;
 while (true) {
     if (++iterationCount % CANCELLATION_CHECK_INTERVAL == 0) {
         checkCancelled.run();
     }
Suggestion importance[1-10]: 4

__

Why: Using int for iterationCount could theoretically overflow after ~2 billion iterations, silently disabling cancellation checks. However, given the MAX_TILES_PER_SHAPE limit added in GeoTileUtils, extremely large iteration counts are unlikely in practice, making this a low-priority defensive improvement.

Low
Suggestions up to commit fb92419
CategorySuggestion                                                                                                                                    Impact
Possible issue
Guard against negative tile range values

The calculation of xTileRange and yTileRange can produce negative values when
minXTile > maxXTile or minYTile > maxYTile (e.g., shapes crossing the antimeridian),
which would make potentialTiles negative and bypass the guard. Add a guard to ensure
the ranges are non-negative before computing the product.

server/src/main/java/org/opensearch/search/aggregations/bucket/GeoTileUtils.java [336-338]

-long xTileRange = (long) maxXTile - minXTile + 1;
-long yTileRange = (long) maxYTile - minYTile + 1;
+long xTileRange = Math.max(0L, (long) maxXTile - minXTile + 1);
+long yTileRange = Math.max(0L, (long) maxYTile - minYTile + 1);
 long potentialTiles = xTileRange * yTileRange;
Suggestion importance[1-10]: 6

__

Why: If minXTile > maxXTile (e.g., shapes crossing the antimeridian), xTileRange would be negative, making potentialTiles negative and bypassing the guard. Using Math.max(0L, ...) is a valid defensive fix, though such edge cases may already be handled upstream.

Low
General
Avoid hard failure for oversized shapes

Throwing an IllegalArgumentException for a shape that exceeds the tile limit will
surface as a hard error to the user for any large geo_shape document, even when the
search is not cancelled. This is a breaking behavioral change — previously large
shapes were processed (slowly). Consider returning a truncated result or logging a
warning instead, or at minimum use a dedicated exception type that can be caught and
handled gracefully upstream.

server/src/main/java/org/opensearch/search/aggregations/bucket/GeoTileUtils.java [341-355]

 if (potentialTiles > MAX_TILES_PER_SHAPE) {
-    throw new IllegalArgumentException(
-        ...
-    );
+    // Log a warning and return an empty list to avoid breaking existing queries
+    // while still preventing CPU stalls
+    return Collections.emptyList();
 }
Suggestion importance[1-10]: 5

__

Why: Throwing IllegalArgumentException for large shapes is a breaking behavioral change that surfaces as a hard error to users. Returning an empty list or a warning would be less disruptive, though the PR intentionally chose the hard-fail approach to prevent CPU stalls. This is a valid design concern but the improved_code changes the behavior significantly.

Low
Improve test exception handling consistency

The outer try-catch for IOException swallows the exception and calls fail(), but the
expectThrows call inside the inner try block already handles the expected exception.
If an unexpected IOException is thrown during setup or teardown, the test will fail
with a less informative message. The method signature should declare throws
IOException and let the framework handle it, consistent with
testLineStringWithModerateBoundingBox.

modules/geo/src/test/java/org/opensearch/geo/search/aggregations/bucket/geogrid/GeoTileGridAggregatorTests.java [112-114]

-} catch (IOException e) {
-    fail("IOException during test: " + e.getMessage());
+public void testLineStringWithLargeBoundingBoxAtHighPrecision() throws IOException {
+    final String fieldName = "location";
+    Line line = new Line(new double[] { -170.0, 170.0 }, new double[] { -60.0, 60.0 });
+
+    try (Directory dir = newDirectory(); RandomIndexWriter writer = new RandomIndexWriter(random(), dir)) {
+        Document doc = new Document();
+        doc.add(LatLonShape.createDocValueField(fieldName, GeoShapeUtils.toLuceneLine(line)));
+        writer.addDocument(doc);
+
+        MappedFieldType fieldType = new GeoShapeFieldMapper.GeoShapeFieldType(fieldName);
+
+        GeoTileGridAggregationBuilder aggBuilder = new GeoTileGridAggregationBuilder("geotile_grid");
+        aggBuilder.field(fieldName).precision(15);
+
+        try (IndexReader reader = writer.getReader()) {
+            IndexSearcher searcher = new IndexSearcher(reader);
+
+            Exception exception = expectThrows(
+                IllegalArgumentException.class,
+                () -> searchAndReduce(searcher, new MatchAllDocsQuery(), aggBuilder, fieldType)
+            );
+            assertTrue("Exception should mention tile limit", exception.getMessage().contains("would generate too many tiles"));
+        }
+    }
 }
Suggestion importance[1-10]: 5

__

Why: The outer try-catch for IOException produces less informative failure messages compared to declaring throws IOException. The suggestion correctly aligns the method signature with testLineStringWithModerateBoundingBox for consistency and better error reporting.

Low
Prevent integer overflow in iteration counter

The cancellation check uses iterationCount % CANCELLATION_CHECK_INTERVAL == 0, but
since iterationCount starts at 0 and is pre-incremented, the first check fires at
iteration 1024. However, if the loop runs for more than Integer.MAX_VALUE
iterations, iterationCount will overflow and the modulo check may behave
unexpectedly. Consider using a long for iterationCount to avoid overflow on very
large shapes.

modules/geo/src/main/java/org/opensearch/geo/search/aggregations/bucket/geogrid/util/GeoShapeHashUtil.java [59-63]

-int iterationCount = 0;
+long iterationCount = 0;
 while (true) {
     if (++iterationCount % CANCELLATION_CHECK_INTERVAL == 0) {
         checkCancelled.run();
     }
Suggestion importance[1-10]: 3

__

Why: While technically valid, integer overflow in iterationCount would require over 2 billion iterations, which is extremely unlikely in practice given the MAX_TILES_PER_SHAPE guard in GeoTileUtils. The risk is minimal and the suggestion has low practical impact.

Low
Suggestions up to commit 7a67afe
CategorySuggestion                                                                                                                                    Impact
Possible issue
Guard against negative tile range values

The calculation of xTileRange and yTileRange can produce negative values when
minXTile > maxXTile or minYTile > maxYTile (e.g., shapes crossing the antimeridian),
which would make potentialTiles negative and bypass the guard. Add a guard to ensure
the ranges are non-negative before multiplying.

server/src/main/java/org/opensearch/search/aggregations/bucket/GeoTileUtils.java [336-338]

-long xTileRange = (long) maxXTile - minXTile + 1;
-long yTileRange = (long) maxYTile - minYTile + 1;
+long xTileRange = Math.max(0L, (long) maxXTile - minXTile + 1);
+long yTileRange = Math.max(0L, (long) maxYTile - minYTile + 1);
 long potentialTiles = xTileRange * yTileRange;
Suggestion importance[1-10]: 6

__

Why: If minXTile > maxXTile (e.g., antimeridian-crossing shapes), xTileRange would be negative, making potentialTiles negative and bypassing the guard. Using Math.max(0L, ...) is a valid defensive fix, though such edge cases may already be handled upstream.

Low
General
Use appropriate exception type for resource limit

Throwing an IllegalArgumentException for a runtime data condition (a shape that is
too large) is semantically incorrect and will surface as a 400 Bad Request to the
user, even though the request itself is valid. A dedicated exception such as
OpenSearchException or a circuit-breaker-style exception would be more appropriate
to signal a resource limit was hit at query time.

server/src/main/java/org/opensearch/search/aggregations/bucket/GeoTileUtils.java [341-355]

 if (potentialTiles > MAX_TILES_PER_SHAPE) {
-    throw new IllegalArgumentException(
+    throw new OpenSearchException(
         String.format(
             Locale.ROOT,
             "Tile aggregation on GeoShape field would generate too many tiles (%d) at precision %d. "
                 + "The shape spans %d x-tiles and %d y-tiles. Maximum allowed is %d tiles per shape. "
                 + "Consider using a lower precision or a smaller shape.",
             potentialTiles,
             precision,
             xTileRange,
             yTileRange,
             MAX_TILES_PER_SHAPE
         )
     );
 }
Suggestion importance[1-10]: 5

__

Why: Throwing IllegalArgumentException for a runtime data condition (shape too large) is semantically misleading and may surface as a 400 to users with valid requests. Using OpenSearchException or a resource-limit exception would be more appropriate, though the test in the PR explicitly expects IllegalArgumentException, so changing this would require test updates too.

Low
Propagate IOException instead of swallowing it

The outer try-catch (IOException) swallows the exception and calls fail(), but the
try-with-resources block already handles IOException via the throws mechanism. This
pattern prevents the test framework from properly reporting the root cause. The
method should declare throws IOException instead, consistent with
testLineStringWithModerateBoundingBox.

modules/geo/src/test/java/org/opensearch/geo/search/aggregations/bucket/geogrid/GeoTileGridAggregatorTests.java [112-114]

-} catch (IOException e) {
-    fail("IOException during test: " + e.getMessage());
+public void testLineStringWithLargeBoundingBoxAtHighPrecision() throws IOException {
+    // ... (remove the outer catch block)
 }
Suggestion importance[1-10]: 5

__

Why: The outer catch (IOException) block swallows the exception and calls fail(), which hides the root cause. Declaring throws IOException (as done in testLineStringWithModerateBoundingBox) is cleaner and consistent with the other test method in the same class.

Low

@github-actions
Copy link
Contributor

Persistent review updated to latest commit 2f10a2c

@github-actions
Copy link
Contributor

Persistent review updated to latest commit 9cd2b04

@github-actions
Copy link
Contributor

❌ Gradle check result for 9cd2b04: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Contributor

Persistent review updated to latest commit 198502d

@github-actions
Copy link
Contributor

❌ Gradle check result for 198502d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Contributor

Persistent review updated to latest commit d530b3d

@github-actions
Copy link
Contributor

❌ Gradle check result for d530b3d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Contributor

Persistent review updated to latest commit be97b27

@github-actions
Copy link
Contributor

❌ Gradle check result for be97b27: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Contributor

Persistent review updated to latest commit 015f624

@github-actions
Copy link
Contributor

❌ Gradle check result for 015f624: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Contributor

Persistent review updated to latest commit 77c70c4

@github-actions
Copy link
Contributor

❌ Gradle check result for 77c70c4: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Contributor

Persistent review updated to latest commit 77c70c4

@github-actions
Copy link
Contributor

❌ Gradle check result for 77c70c4: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@harshavamsi harshavamsi reopened this Mar 20, 2026
@github-actions
Copy link
Contributor

Persistent review updated to latest commit 77c70c4

@github-actions
Copy link
Contributor

❌ Gradle check result for 77c70c4: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@harshavamsi harshavamsi reopened this Mar 20, 2026
@github-actions
Copy link
Contributor

Persistent review updated to latest commit 77c70c4

@github-actions
Copy link
Contributor

❌ Gradle check result for 77c70c4: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Contributor

Persistent review updated to latest commit 2e6c034

@github-actions
Copy link
Contributor

❌ Gradle check result for 2e6c034: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Contributor

Persistent review updated to latest commit 6cb98d1

@github-actions
Copy link
Contributor

❌ Gradle check result for 6cb98d1: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Contributor

Persistent review updated to latest commit 7a67afe

@github-actions
Copy link
Contributor

❌ Gradle check result for 7a67afe: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Contributor

Persistent review updated to latest commit fb92419

@github-actions
Copy link
Contributor

❌ Gradle check result for fb92419: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@harshavamsi harshavamsi reopened this Mar 20, 2026
@harshavamsi harshavamsi force-pushed the fix/geotile-grid-linestring-cpu-stall branch from fb92419 to 27f9bcb Compare March 20, 2026 17:53
@github-actions
Copy link
Contributor

Persistent review updated to latest commit 27f9bcb

@github-actions
Copy link
Contributor

❌ Gradle check result for 27f9bcb: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@harshavamsi harshavamsi reopened this Mar 20, 2026
@harshavamsi harshavamsi force-pushed the fix/geotile-grid-linestring-cpu-stall branch from 27f9bcb to c660272 Compare March 20, 2026 19:18
@github-actions
Copy link
Contributor

Persistent review updated to latest commit c660272

@github-actions
Copy link
Contributor

❌ Gradle check result for c660272: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@harshavamsi harshavamsi force-pushed the fix/geotile-grid-linestring-cpu-stall branch from c660272 to 3d421f9 Compare March 20, 2026 20:28
@github-actions
Copy link
Contributor

Persistent review updated to latest commit 3d421f9

@github-actions
Copy link
Contributor

❌ Gradle check result for 3d421f9: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Harsha Vamsi Kalluri <harshavamsi096@gmail.com>
@harshavamsi harshavamsi force-pushed the fix/geotile-grid-linestring-cpu-stall branch from 3d421f9 to 3fe7d96 Compare March 20, 2026 21:39
@github-actions
Copy link
Contributor

Persistent review updated to latest commit 3fe7d96

@github-actions
Copy link
Contributor

✅ Gradle check result for 3fe7d96: SUCCESS

@codecov
Copy link

codecov bot commented Mar 20, 2026

Codecov Report

❌ Patch coverage is 55.17241% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.34%. Comparing base (9f5d0e1) to head (3fe7d96).
⚠️ Report is 13 commits behind head on main.

Files with missing lines Patch % Lines
...s/bucket/geogrid/GeoHashGridAggregatorFactory.java 0.00% 5 Missing ⚠️
...egations/bucket/geogrid/util/GeoShapeHashUtil.java 0.00% 4 Missing ⚠️
...s/bucket/geogrid/GeoTileGridAggregatorFactory.java 40.00% 3 Missing ⚠️
...earch/search/aggregations/bucket/GeoTileUtils.java 93.33% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #20929      +/-   ##
============================================
+ Coverage     73.30%   73.34%   +0.03%     
- Complexity    72484    72576      +92     
============================================
  Files          5819     5819              
  Lines        331155   331381     +226     
  Branches      47840    47878      +38     
============================================
+ Hits         242769   243054     +285     
+ Misses        68876    68795      -81     
- Partials      19510    19532      +22     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant