Address test issue in QueryPhaseTimeoutTests #124327

javanna · 2025-03-07T14:20:57Z

The test that is verifying timeout handling when pulling a scorer supplier is going to always score the entire segment. The test needs to be adjusted accordingly. While at it, I added docs and clarified the tests a bit, as well as added a couple new tests that cover for timeout handling when retrieving a scorer from the scorer supplier.

Closes #124140

The test that is verifying timeout handling when pulling a scorer supplier is going to always score the entire segment. The test needs to be adjusted accordingly. While at it, I added docs and clarified the tests a bit, as well as added a couple new tests that cover for timeout handling when retrieving a scorer from the scorer supplier. Closes elastic#124140

elasticsearchmachine · 2025-03-07T14:21:27Z

Pinging @elastic/es-search-foundations (Team:Search Foundations)

cbuescher · 2025-03-07T16:28:09Z

server/src/test/java/org/elasticsearch/search/query/QueryPhaseTimeoutTests.java

        assumeTrue("Test requires more than one segment", reader.leaves().size() > 1);
        int size = randomBoolean() ? 0 : randomIntBetween(100, 500);
-        scorerTimeoutTest(size, context -> {
+        scorerSupplierTimeoutTest(size, context -> {


I've been reading TimeoutQuery and MatchAllWeight defined in this test class but I'm still not sure how the function passed in triggers the timeout. At least that's what I'm supposing its doing given its name. Maybe its possible to explain this for future readers somehow?

I think I found something in ContextIndexSearcher#getTimeoutCheck that pulls things from the search context that we override in "createSearchContextWithTimeout", is that the mechanism? Not that obvious unfortunately if it is.

I thought the new javadocs would help explaining this.

Test that a timeout is appropriately handled when the (exitable) directory reader raises it while loading terms enum as the scorer is retrieved from the scorer supplier.

The idea of these tests is that a timeout can be raised from different places: either because we wrap the bulk scorer with a cancellable one, or via the exitable directory reader that wraps the original reader. Because a timeout can be raised from different places, it's easy to miss and to forget handling the corresponding exception. The reason why the timeout is thrown is because we flip shouldTimeout in the search context. I am adding a bit more comments.

cbuescher

Thanks for looking into this test. I took a first pass and left mostly questions and suggestions to maybe improve some documentation as I find this test hard to understand.
Another question because I saw the backport labels: from investigating the issue in #124140 I thought we might have introduced a change somewhere with Lucene 10.1, but now it sounds like the test assumptions might have been wrong even before that? Why does the repro line from the issue then clearly reproduce the failure after the Lucene 10.1 update but not before, do you have any theory on that? Or am I missunderstanding something?

cbuescher · 2025-03-07T16:28:49Z

server/src/test/java/org/elasticsearch/search/query/QueryPhaseTimeoutTests.java

                assertTrue(context.queryResult().searchTimedOut());
                int firstSegmentMaxDoc = reader.leaves().get(0).reader().maxDoc();
-                assertEquals(Math.min(2048, firstSegmentMaxDoc), context.queryResult().topDocs().topDocs.totalHits.value());
+                // we are artificially raising the timeout when pulling the scorer supplier.


same here: I've been trying to grok this but am most likely missing where the timeout is happening and why.

cbuescher · 2025-03-07T16:35:53Z

server/src/test/java/org/elasticsearch/search/query/QueryPhaseTimeoutTests.java

        assumeTrue("Test requires more than one segment", reader.leaves().size() > 1);
        int size = randomBoolean() ? 0 : randomIntBetween(100, 500);
-        scorerTimeoutTest(size, context -> {
+        scorerSupplierTimeoutTest(size, context -> {


I think I found something in ContextIndexSearcher#getTimeoutCheck that pulls things from the search context that we override in "createSearchContextWithTimeout", is that the mechanism? Not that obvious unfortunately if it is.

cbuescher · 2025-03-07T16:56:41Z

server/src/test/java/org/elasticsearch/search/query/QueryPhaseTimeoutTests.java

+                        ScorerSupplier scorerSupplier = super.scorerSupplier(context);
+                        return new ScorerSupplier() {
+                            @Override
+                            public Scorer get(long leadCost) throws IOException {


It took me a bit to see the difference between this and the previous override of "ScorerSupplier scorerSupplier(LeafReaderContext context) earlier in this test (L192), maybe its worth adding some more comments around whats different here and why this deserves a second pair of tests.

can you make a suggestion? I am not sure what I can do to clarify this.

Two things that would have helped me understand this test quicker:

understand better how the timeout is actually triggered. I think I now understand that the TimeoutQuery.shouldTimeout flag controls this via the modified TestSearchContext that depending on the flags value returns 0/1 in the overwritten "getRelativeTimeInMillis". Mentioning that right next to the flag or maybe somewhere in "createSearchContextWithTimeout" would have helped me find my way quicker I think

realizing quicker that the getting the ScoreSupplier in "newMatchAllScorerSupplierTimeoutQuery" triggers the timeout vs. getting the Scorer in "newMatchAllScorerGetTimeoutQuery". Its obvious after you understand the trigger a bit and looked hard enough, but method comments on both of these helpers that point out the difference would have helped a lot. e.g.

// create a TimeoutQuery that triggers the timeout when pulling the scorer supplier after the first segment. // Note this is different from "newMatchAllScorerGetTimeoutQuery" where we trigger the timeout when we // get the Scorer from the ScoreSupplier private static TimeoutQuery newMatchAllScorerSupplierTimeoutQuery( [...]

and vice-versa. I might have this still wrong because I'm not 100% sure about the details here.

I also just realized the inline comments in the second parts "scorerGetTimeoutTest()" and "scorerSupplierTimeoutTest()" are similar, I think the former would benefit from a slight adjustment to underline the differences as well.

I thought this was made clear by the javadocs already:

as the scorer is retrieved from the scorer supplier

as opposed to

as the scorer supplier is requested

methods are also named that way to suggest the difference around pulling the supplier as opposed to getting the scorer from the supplier. I totally see that this is tricky.

javanna · 2025-03-10T08:05:49Z

I thought we might have introduced a change somewhere with Lucene 10.1, but now it sounds like the test assumptions might have been wrong even before that? Why does the repro line from the issue then clearly reproduce the failure after the Lucene 10.1 update but not before, do you have any theory on that? Or am I missunderstanding something?

Correct, this has been a bug in this test since day 1. I am not sure why it has not reproduced earlier to be honest. I don't see anything that has changed with Lucene 10.0 or 10.1, but the fact that a certain seed does not reproduce across Lucene 10 and Lucene 9.12 is not shocking, because the sequence of calls to randomize things certainly changed in LuceneTestCase.

javanna · 2025-03-10T18:59:43Z

Hey @cbuescher I addressed your comments, I went ahead and merged this otherwise I'd have gotten new merge conflicts on the muted-tests file that would require another merge, another test run etc.

elasticsearchmachine · 2025-03-10T19:00:17Z

💔 Backport failed

Status	Branch	Result
❌	9.0	Commit could not be cherrypicked due to conflicts
❌	8.18	Commit could not be cherrypicked due to conflicts
❌	8.x	Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 124327

This was added as a result of merging #124327, via a bad merge. Fix that with this commit.

javanna · 2025-03-10T20:00:42Z

muted-tests.yml

-  issue: https://github.com/elastic/elasticsearch/issues/124141
+- class: org.elasticsearch.compute.data.BlockMultiValuedTests
+  method: testToMask {elementType=BOOLEAN nullAllowed=true}
+  issue: https://github.com/elastic/elasticsearch/issues/124165


This was caused by a bad merge, I fixed that directly on main with dfe2b24

javanna · 2025-03-10T20:06:51Z

Correct, this has been a bug in this test since day 1. I am not sure why it has not reproduced earlier to be honest. I don't see anything that has changed with Lucene 10.0 or 10.1, but the fact that a certain seed does not reproduce across Lucene 10 and Lucene 9.12 is not shocking, because the sequence of calls to randomize things certainly changed in LuceneTestCase.

Mystery solved, it's the test has changed with the Lucene 10 upgrade, see 88b3516#diff-ff445eb9c0ccebe705452387b4f946dd3675db92b1ec513e4a6c2da9e17e67bd from the lucene_snapshot branch. The test was not previously testing timeouts when the scorer supplier is retrieved for each leaf context, but rather when the bulk scorer or the scorer were. It's only the former that needs a different assertion. That means this fix does not need backporting to 8.x.

The test that is verifying timeout handling when pulling a scorer supplier is going to always score the entire segment. The test needs to be adjusted accordingly. While at it, I added docs and clarified the tests a bit, as well as added a couple new tests that cover for timeout handling when retrieving a scorer from the scorer supplier. Closes elastic#124140

cbuescher · 2025-03-10T20:23:20Z

I addressed your comments, I went ahead and merged this

Yes, thanks for adding some more comments, appreciate it.

The test that is verifying timeout handling when pulling a scorer supplier is going to always score the entire segment. The test needs to be adjusted accordingly. While at it, I added docs and clarified the tests a bit, as well as added a couple new tests that cover for timeout handling when retrieving a scorer from the scorer supplier. Closes #124140

The test that is verifying timeout handling when pulling a scorer supplier is going to always score the entire segment. The test needs to be adjusted accordingly. While at it, I added docs and clarified the tests a bit, as well as added a couple new tests that cover for timeout handling when retrieving a scorer from the scorer supplier. Closes elastic#124140 * [CI] Auto commit changes from spotless * iter * iter * spotless --------- Co-authored-by: elasticsearchmachine <[email protected]>

This was added as a result of merging elastic#124327, via a bad merge. Fix that with this commit.

The test that is verifying timeout handling when pulling a scorer supplier is going to always score the entire segment. The test needs to be adjusted accordingly. While at it, I added docs and clarified the tests a bit, as well as added a couple new tests that cover for timeout handling when retrieving a scorer from the scorer supplier. Closes elastic#124140 * [CI] Auto commit changes from spotless * iter * iter * spotless --------- Co-authored-by: elasticsearchmachine <[email protected]>

This was added as a result of merging elastic#124327, via a bad merge. Fix that with this commit.

The test that is verifying timeout handling when pulling a scorer supplier is going to always score the entire segment. The test needs to be adjusted accordingly. While at it, I added docs and clarified the tests a bit, as well as added a couple new tests that cover for timeout handling when retrieving a scorer from the scorer supplier. Closes elastic#124140 * [CI] Auto commit changes from spotless * iter * iter * spotless --------- Co-authored-by: elasticsearchmachine <[email protected]>

This was added as a result of merging elastic#124327, via a bad merge. Fix that with this commit.

javanna added >test Issues or PRs that are addressing/adding tests :Search Foundations/Search Catch all for Search Foundations v9.0.0 v8.18.1 v8.19.0 v9.1.0 labels Mar 7, 2025

javanna requested a review from cbuescher March 7, 2025 14:20

elasticsearchmachine added the Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch label Mar 7, 2025

[CI] Auto commit changes from spotless

6bcc1b1

cbuescher reviewed Mar 7, 2025

View reviewed changes

javanna added 4 commits March 10, 2025 11:55

iter

6db9028

iter

0da5e10

Merge branch 'main' into test/query_phase_timeout_tests_num_docs

c9dd3b8

spotless

8de7a80

javanna force-pushed the test/query_phase_timeout_tests_num_docs branch from 95c51aa to 8de7a80 Compare March 10, 2025 15:59

javanna added the auto-backport Automatically create backport pull requests when merged label Mar 10, 2025

javanna merged commit 22d7f39 into elastic:main Mar 10, 2025
17 checks passed

javanna deleted the test/query_phase_timeout_tests_num_docs branch March 10, 2025 18:58

elasticsearchmachine added the backport pending label Mar 10, 2025

javanna added a commit that referenced this pull request Mar 10, 2025

[TEST] Remove BlockMultiValuedTests from muted tests list

dfe2b24

This was added as a result of merging #124327, via a bad merge. Fix that with this commit.

javanna commented Mar 10, 2025

View reviewed changes

javanna removed backport pending v8.18.1 labels Mar 10, 2025

javanna removed the v8.19.0 label Mar 10, 2025

georgewallace pushed a commit to georgewallace/elasticsearch that referenced this pull request Mar 11, 2025

[TEST] Remove BlockMultiValuedTests from muted tests list

4721d1c

This was added as a result of merging elastic#124327, via a bad merge. Fix that with this commit.

albertzaharovits pushed a commit to albertzaharovits/elasticsearch that referenced this pull request Mar 13, 2025

[TEST] Remove BlockMultiValuedTests from muted tests list

1abd303

This was added as a result of merging elastic#124327, via a bad merge. Fix that with this commit.

jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request Mar 13, 2025

[TEST] Remove BlockMultiValuedTests from muted tests list

49fd11f

This was added as a result of merging elastic#124327, via a bad merge. Fix that with this commit.

Address test issue in QueryPhaseTimeoutTests #124327

Address test issue in QueryPhaseTimeoutTests #124327

Uh oh!

Conversation

javanna commented Mar 7, 2025

Uh oh!

elasticsearchmachine commented Mar 7, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cbuescher left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

javanna commented Mar 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

javanna commented Mar 10, 2025

Uh oh!

elasticsearchmachine commented Mar 10, 2025

💔 Backport failed

Uh oh!

Choose a reason for hiding this comment

Uh oh!

javanna commented Mar 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cbuescher commented Mar 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

javanna commented Mar 10, 2025 •

edited

Loading

javanna commented Mar 10, 2025 •

edited

Loading

cbuescher commented Mar 10, 2025 •

edited

Loading