-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Address test issue in QueryPhaseTimeoutTests #124327
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Address test issue in QueryPhaseTimeoutTests #124327
Conversation
The test that is verifying timeout handling when pulling a scorer supplier is going to always score the entire segment. The test needs to be adjusted accordingly. While at it, I added docs and clarified the tests a bit, as well as added a couple new tests that cover for timeout handling when retrieving a scorer from the scorer supplier. Closes elastic#124140
|
Pinging @elastic/es-search-foundations (Team:Search Foundations) |
| assumeTrue("Test requires more than one segment", reader.leaves().size() > 1); | ||
| int size = randomBoolean() ? 0 : randomIntBetween(100, 500); | ||
| scorerTimeoutTest(size, context -> { | ||
| scorerSupplierTimeoutTest(size, context -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been reading TimeoutQuery and MatchAllWeight defined in this test class but I'm still not sure how the function passed in triggers the timeout. At least that's what I'm supposing its doing given its name. Maybe its possible to explain this for future readers somehow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I found something in ContextIndexSearcher#getTimeoutCheck that pulls things from the search context that we override in "createSearchContextWithTimeout", is that the mechanism? Not that obvious unfortunately if it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought the new javadocs would help explaining this.
Test that a timeout is appropriately handled when the (exitable) directory reader raises it while loading terms enum as the scorer is retrieved from the scorer supplier.
The idea of these tests is that a timeout can be raised from different places: either because we wrap the bulk scorer with a cancellable one, or via the exitable directory reader that wraps the original reader. Because a timeout can be raised from different places, it's easy to miss and to forget handling the corresponding exception. The reason why the timeout is thrown is because we flip shouldTimeout in the search context. I am adding a bit more comments.
cbuescher
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for looking into this test. I took a first pass and left mostly questions and suggestions to maybe improve some documentation as I find this test hard to understand.
Another question because I saw the backport labels: from investigating the issue in #124140 I thought we might have introduced a change somewhere with Lucene 10.1, but now it sounds like the test assumptions might have been wrong even before that? Why does the repro line from the issue then clearly reproduce the failure after the Lucene 10.1 update but not before, do you have any theory on that? Or am I missunderstanding something?
| assertTrue(context.queryResult().searchTimedOut()); | ||
| int firstSegmentMaxDoc = reader.leaves().get(0).reader().maxDoc(); | ||
| assertEquals(Math.min(2048, firstSegmentMaxDoc), context.queryResult().topDocs().topDocs.totalHits.value()); | ||
| // we are artificially raising the timeout when pulling the scorer supplier. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here: I've been trying to grok this but am most likely missing where the timeout is happening and why.
| assumeTrue("Test requires more than one segment", reader.leaves().size() > 1); | ||
| int size = randomBoolean() ? 0 : randomIntBetween(100, 500); | ||
| scorerTimeoutTest(size, context -> { | ||
| scorerSupplierTimeoutTest(size, context -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I found something in ContextIndexSearcher#getTimeoutCheck that pulls things from the search context that we override in "createSearchContextWithTimeout", is that the mechanism? Not that obvious unfortunately if it is.
| ScorerSupplier scorerSupplier = super.scorerSupplier(context); | ||
| return new ScorerSupplier() { | ||
| @Override | ||
| public Scorer get(long leadCost) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It took me a bit to see the difference between this and the previous override of "ScorerSupplier scorerSupplier(LeafReaderContext context) earlier in this test (L192), maybe its worth adding some more comments around whats different here and why this deserves a second pair of tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make a suggestion? I am not sure what I can do to clarify this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two things that would have helped me understand this test quicker:
- understand better how the timeout is actually triggered. I think I now understand that the TimeoutQuery.shouldTimeout flag controls this via the modified TestSearchContext that depending on the flags value returns 0/1 in the overwritten "getRelativeTimeInMillis". Mentioning that right next to the flag or maybe somewhere in "createSearchContextWithTimeout" would have helped me find my way quicker I think
- realizing quicker that the getting the ScoreSupplier in "newMatchAllScorerSupplierTimeoutQuery" triggers the timeout vs. getting the Scorer in "newMatchAllScorerGetTimeoutQuery". Its obvious after you understand the trigger a bit and looked hard enough, but method comments on both of these helpers that point out the difference would have helped a lot. e.g.
// create a TimeoutQuery that triggers the timeout when pulling the scorer supplier after the first segment.
// Note this is different from "newMatchAllScorerGetTimeoutQuery" where we trigger the timeout when we
// get the Scorer from the ScoreSupplier
private static TimeoutQuery newMatchAllScorerSupplierTimeoutQuery(
[...]
and vice-versa. I might have this still wrong because I'm not 100% sure about the details here.
I also just realized the inline comments in the second parts "scorerGetTimeoutTest()" and "scorerSupplierTimeoutTest()" are similar, I think the former would benefit from a slight adjustment to underline the differences as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought this was made clear by the javadocs already:
as the scorer is retrieved from the scorer supplier
as opposed to
as the scorer supplier is requested
methods are also named that way to suggest the difference around pulling the supplier as opposed to getting the scorer from the supplier. I totally see that this is tricky.
Correct, this has been a bug in this test since day 1. I am not sure why it has not reproduced earlier to be honest. I don't see anything that has changed with Lucene 10.0 or 10.1, but the fact that a certain seed does not reproduce across Lucene 10 and Lucene 9.12 is not shocking, because the sequence of calls to randomize things certainly changed in LuceneTestCase. |
95c51aa to
8de7a80
Compare
|
Hey @cbuescher I addressed your comments, I went ahead and merged this otherwise I'd have gotten new merge conflicts on the muted-tests file that would require another merge, another test run etc. |
💔 Backport failed
You can use sqren/backport to manually backport by running |
This was added as a result of merging #124327, via a bad merge. Fix that with this commit.
| issue: https://github.com/elastic/elasticsearch/issues/124141 | ||
| - class: org.elasticsearch.compute.data.BlockMultiValuedTests | ||
| method: testToMask {elementType=BOOLEAN nullAllowed=true} | ||
| issue: https://github.com/elastic/elasticsearch/issues/124165 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was caused by a bad merge, I fixed that directly on main with dfe2b24
Mystery solved, it's the test has changed with the Lucene 10 upgrade, see 88b3516#diff-ff445eb9c0ccebe705452387b4f946dd3675db92b1ec513e4a6c2da9e17e67bd from the lucene_snapshot branch. The test was not previously testing timeouts when the scorer supplier is retrieved for each leaf context, but rather when the bulk scorer or the scorer were. It's only the former that needs a different assertion. That means this fix does not need backporting to 8.x. |
The test that is verifying timeout handling when pulling a scorer supplier is going to always score the entire segment. The test needs to be adjusted accordingly. While at it, I added docs and clarified the tests a bit, as well as added a couple new tests that cover for timeout handling when retrieving a scorer from the scorer supplier. Closes elastic#124140
Yes, thanks for adding some more comments, appreciate it. |
The test that is verifying timeout handling when pulling a scorer supplier is going to always score the entire segment. The test needs to be adjusted accordingly. While at it, I added docs and clarified the tests a bit, as well as added a couple new tests that cover for timeout handling when retrieving a scorer from the scorer supplier. Closes #124140
The test that is verifying timeout handling when pulling a scorer supplier is going to always score the entire segment. The test needs to be adjusted accordingly. While at it, I added docs and clarified the tests a bit, as well as added a couple new tests that cover for timeout handling when retrieving a scorer from the scorer supplier. Closes elastic#124140 * [CI] Auto commit changes from spotless * iter * iter * spotless --------- Co-authored-by: elasticsearchmachine <[email protected]>
This was added as a result of merging elastic#124327, via a bad merge. Fix that with this commit.
The test that is verifying timeout handling when pulling a scorer supplier is going to always score the entire segment. The test needs to be adjusted accordingly. While at it, I added docs and clarified the tests a bit, as well as added a couple new tests that cover for timeout handling when retrieving a scorer from the scorer supplier. Closes elastic#124140 * [CI] Auto commit changes from spotless * iter * iter * spotless --------- Co-authored-by: elasticsearchmachine <[email protected]>
This was added as a result of merging elastic#124327, via a bad merge. Fix that with this commit.
The test that is verifying timeout handling when pulling a scorer supplier is going to always score the entire segment. The test needs to be adjusted accordingly. While at it, I added docs and clarified the tests a bit, as well as added a couple new tests that cover for timeout handling when retrieving a scorer from the scorer supplier. Closes elastic#124140 * [CI] Auto commit changes from spotless * iter * iter * spotless --------- Co-authored-by: elasticsearchmachine <[email protected]>
This was added as a result of merging elastic#124327, via a bad merge. Fix that with this commit.
The test that is verifying timeout handling when pulling a scorer supplier is going to always score the entire segment. The test needs to be adjusted accordingly. While at it, I added docs and clarified the tests a bit, as well as added a couple new tests that cover for timeout handling when retrieving a scorer from the scorer supplier.
Closes #124140