Skip to content

Conversation

@javanna
Copy link
Member

@javanna javanna commented Mar 7, 2025

The test that is verifying timeout handling when pulling a scorer supplier is going to always score the entire segment. The test needs to be adjusted accordingly. While at it, I added docs and clarified the tests a bit, as well as added a couple new tests that cover for timeout handling when retrieving a scorer from the scorer supplier.

Closes #124140

The test that is verifying timeout handling when pulling a scorer supplier is
going to always score the entire segment. The test needs to be adjusted
accordingly. While at it, I added docs and clarified the tests a bit, as well
as added a couple new tests that cover for timeout handling when retrieving a
scorer from the scorer supplier.

Closes elastic#124140
@javanna javanna added >test Issues or PRs that are addressing/adding tests :Search Foundations/Search Catch all for Search Foundations v9.0.0 v8.18.1 v8.19.0 v9.1.0 labels Mar 7, 2025
@javanna javanna requested a review from cbuescher March 7, 2025 14:20
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

@elasticsearchmachine elasticsearchmachine added the Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch label Mar 7, 2025
assumeTrue("Test requires more than one segment", reader.leaves().size() > 1);
int size = randomBoolean() ? 0 : randomIntBetween(100, 500);
scorerTimeoutTest(size, context -> {
scorerSupplierTimeoutTest(size, context -> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been reading TimeoutQuery and MatchAllWeight defined in this test class but I'm still not sure how the function passed in triggers the timeout. At least that's what I'm supposing its doing given its name. Maybe its possible to explain this for future readers somehow?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I found something in ContextIndexSearcher#getTimeoutCheck that pulls things from the search context that we override in "createSearchContextWithTimeout", is that the mechanism? Not that obvious unfortunately if it is.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the new javadocs would help explaining this.

Test that a timeout is appropriately handled when the (exitable) directory reader raises it while loading terms enum as the scorer is retrieved from the scorer supplier.

The idea of these tests is that a timeout can be raised from different places: either because we wrap the bulk scorer with a cancellable one, or via the exitable directory reader that wraps the original reader. Because a timeout can be raised from different places, it's easy to miss and to forget handling the corresponding exception. The reason why the timeout is thrown is because we flip shouldTimeout in the search context. I am adding a bit more comments.

Copy link
Member

@cbuescher cbuescher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking into this test. I took a first pass and left mostly questions and suggestions to maybe improve some documentation as I find this test hard to understand.
Another question because I saw the backport labels: from investigating the issue in #124140 I thought we might have introduced a change somewhere with Lucene 10.1, but now it sounds like the test assumptions might have been wrong even before that? Why does the repro line from the issue then clearly reproduce the failure after the Lucene 10.1 update but not before, do you have any theory on that? Or am I missunderstanding something?

assertTrue(context.queryResult().searchTimedOut());
int firstSegmentMaxDoc = reader.leaves().get(0).reader().maxDoc();
assertEquals(Math.min(2048, firstSegmentMaxDoc), context.queryResult().topDocs().topDocs.totalHits.value());
// we are artificially raising the timeout when pulling the scorer supplier.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here: I've been trying to grok this but am most likely missing where the timeout is happening and why.

assumeTrue("Test requires more than one segment", reader.leaves().size() > 1);
int size = randomBoolean() ? 0 : randomIntBetween(100, 500);
scorerTimeoutTest(size, context -> {
scorerSupplierTimeoutTest(size, context -> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I found something in ContextIndexSearcher#getTimeoutCheck that pulls things from the search context that we override in "createSearchContextWithTimeout", is that the mechanism? Not that obvious unfortunately if it is.

ScorerSupplier scorerSupplier = super.scorerSupplier(context);
return new ScorerSupplier() {
@Override
public Scorer get(long leadCost) throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a bit to see the difference between this and the previous override of "ScorerSupplier scorerSupplier(LeafReaderContext context) earlier in this test (L192), maybe its worth adding some more comments around whats different here and why this deserves a second pair of tests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make a suggestion? I am not sure what I can do to clarify this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things that would have helped me understand this test quicker:

  • understand better how the timeout is actually triggered. I think I now understand that the TimeoutQuery.shouldTimeout flag controls this via the modified TestSearchContext that depending on the flags value returns 0/1 in the overwritten "getRelativeTimeInMillis". Mentioning that right next to the flag or maybe somewhere in "createSearchContextWithTimeout" would have helped me find my way quicker I think
  • realizing quicker that the getting the ScoreSupplier in "newMatchAllScorerSupplierTimeoutQuery" triggers the timeout vs. getting the Scorer in "newMatchAllScorerGetTimeoutQuery". Its obvious after you understand the trigger a bit and looked hard enough, but method comments on both of these helpers that point out the difference would have helped a lot. e.g.
// create a TimeoutQuery that triggers the timeout when pulling the scorer supplier after the first segment.
// Note this is different from "newMatchAllScorerGetTimeoutQuery" where we trigger the timeout when we
// get the Scorer from the ScoreSupplier
private static TimeoutQuery newMatchAllScorerSupplierTimeoutQuery(
[...]

and vice-versa. I might have this still wrong because I'm not 100% sure about the details here.

I also just realized the inline comments in the second parts "scorerGetTimeoutTest()" and "scorerSupplierTimeoutTest()" are similar, I think the former would benefit from a slight adjustment to underline the differences as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought this was made clear by the javadocs already:

as the scorer is retrieved from the scorer supplier

as opposed to

as the scorer supplier is requested

methods are also named that way to suggest the difference around pulling the supplier as opposed to getting the scorer from the supplier. I totally see that this is tricky.

@javanna
Copy link
Member Author

javanna commented Mar 10, 2025

I thought we might have introduced a change somewhere with Lucene 10.1, but now it sounds like the test assumptions might have been wrong even before that? Why does the repro line from the issue then clearly reproduce the failure after the Lucene 10.1 update but not before, do you have any theory on that? Or am I missunderstanding something?

Correct, this has been a bug in this test since day 1. I am not sure why it has not reproduced earlier to be honest. I don't see anything that has changed with Lucene 10.0 or 10.1, but the fact that a certain seed does not reproduce across Lucene 10 and Lucene 9.12 is not shocking, because the sequence of calls to randomize things certainly changed in LuceneTestCase.

@javanna javanna force-pushed the test/query_phase_timeout_tests_num_docs branch from 95c51aa to 8de7a80 Compare March 10, 2025 15:59
@javanna javanna added the auto-backport Automatically create backport pull requests when merged label Mar 10, 2025
@javanna javanna merged commit 22d7f39 into elastic:main Mar 10, 2025
17 checks passed
@javanna javanna deleted the test/query_phase_timeout_tests_num_docs branch March 10, 2025 18:58
@javanna
Copy link
Member Author

javanna commented Mar 10, 2025

Hey @cbuescher I addressed your comments, I went ahead and merged this otherwise I'd have gotten new merge conflicts on the muted-tests file that would require another merge, another test run etc.

@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

Status Branch Result
9.0 Commit could not be cherrypicked due to conflicts
8.18 Commit could not be cherrypicked due to conflicts
8.x Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 124327

javanna added a commit that referenced this pull request Mar 10, 2025
This was added as a result of merging #124327, via a bad merge. Fix that with this commit.
issue: https://github.com/elastic/elasticsearch/issues/124141
- class: org.elasticsearch.compute.data.BlockMultiValuedTests
method: testToMask {elementType=BOOLEAN nullAllowed=true}
issue: https://github.com/elastic/elasticsearch/issues/124165
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was caused by a bad merge, I fixed that directly on main with dfe2b24

@javanna
Copy link
Member Author

javanna commented Mar 10, 2025

Correct, this has been a bug in this test since day 1. I am not sure why it has not reproduced earlier to be honest. I don't see anything that has changed with Lucene 10.0 or 10.1, but the fact that a certain seed does not reproduce across Lucene 10 and Lucene 9.12 is not shocking, because the sequence of calls to randomize things certainly changed in LuceneTestCase.

Mystery solved, it's the test has changed with the Lucene 10 upgrade, see 88b3516#diff-ff445eb9c0ccebe705452387b4f946dd3675db92b1ec513e4a6c2da9e17e67bd from the lucene_snapshot branch. The test was not previously testing timeouts when the scorer supplier is retrieved for each leaf context, but rather when the bulk scorer or the scorer were. It's only the former that needs a different assertion. That means this fix does not need backporting to 8.x.

@javanna javanna removed the v8.19.0 label Mar 10, 2025
javanna added a commit to javanna/elasticsearch that referenced this pull request Mar 10, 2025
The test that is verifying timeout handling when pulling a scorer supplier is
going to always score the entire segment. The test needs to be adjusted
accordingly. While at it, I added docs and clarified the tests a bit, as well
as added a couple new tests that cover for timeout handling when retrieving a
scorer from the scorer supplier.

Closes elastic#124140
@cbuescher
Copy link
Member

cbuescher commented Mar 10, 2025

I addressed your comments, I went ahead and merged this

Yes, thanks for adding some more comments, appreciate it.

elasticsearchmachine pushed a commit that referenced this pull request Mar 10, 2025
The test that is verifying timeout handling when pulling a scorer supplier is
going to always score the entire segment. The test needs to be adjusted
accordingly. While at it, I added docs and clarified the tests a bit, as well
as added a couple new tests that cover for timeout handling when retrieving a
scorer from the scorer supplier.

Closes #124140
georgewallace pushed a commit to georgewallace/elasticsearch that referenced this pull request Mar 11, 2025
The test that is verifying timeout handling when pulling a scorer supplier is
going to always score the entire segment. The test needs to be adjusted
accordingly. While at it, I added docs and clarified the tests a bit, as well
as added a couple new tests that cover for timeout handling when retrieving a
scorer from the scorer supplier.

Closes elastic#124140

* [CI] Auto commit changes from spotless

* iter

* iter

* spotless

---------

Co-authored-by: elasticsearchmachine <[email protected]>
georgewallace pushed a commit to georgewallace/elasticsearch that referenced this pull request Mar 11, 2025
This was added as a result of merging elastic#124327, via a bad merge. Fix that with this commit.
albertzaharovits pushed a commit to albertzaharovits/elasticsearch that referenced this pull request Mar 13, 2025
The test that is verifying timeout handling when pulling a scorer supplier is
going to always score the entire segment. The test needs to be adjusted
accordingly. While at it, I added docs and clarified the tests a bit, as well
as added a couple new tests that cover for timeout handling when retrieving a
scorer from the scorer supplier.

Closes elastic#124140

* [CI] Auto commit changes from spotless

* iter

* iter

* spotless

---------

Co-authored-by: elasticsearchmachine <[email protected]>
albertzaharovits pushed a commit to albertzaharovits/elasticsearch that referenced this pull request Mar 13, 2025
This was added as a result of merging elastic#124327, via a bad merge. Fix that with this commit.
jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request Mar 13, 2025
The test that is verifying timeout handling when pulling a scorer supplier is
going to always score the entire segment. The test needs to be adjusted
accordingly. While at it, I added docs and clarified the tests a bit, as well
as added a couple new tests that cover for timeout handling when retrieving a
scorer from the scorer supplier.

Closes elastic#124140

* [CI] Auto commit changes from spotless

* iter

* iter

* spotless

---------

Co-authored-by: elasticsearchmachine <[email protected]>
jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request Mar 13, 2025
This was added as a result of merging elastic#124327, via a bad merge. Fix that with this commit.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-backport Automatically create backport pull requests when merged :Search Foundations/Search Catch all for Search Foundations Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch >test Issues or PRs that are addressing/adding tests v9.0.0 v9.1.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CI] QueryPhaseTimeoutTests testScorerTimeoutPoints failing

3 participants