Aggs: Fix Filters agg cancellation flaky test #130810
Pinging @elastic/es-analytical-engine (Team:Analytics)
assertThat(searchRequestFuture.isCancelled(), equalTo(false));
assertThat(searchRequestFuture.isDone(), equalTo(false));
The message looks better this way IMO, no functional change here
assertBusy(() -> {
    assertTrue(searchRequestFuture.isDone());
    assertThat(getSearchTasks(), empty());
    assertThat("Search request didn't finish", searchRequestFuture.isDone(), equalTo(true));
Isn't there an assertTrue that also takes a String?
Scratch that, I see now that you used it to get the hamcrest error message, and apparently assertTrue(msg, bool) is from junit, not hamcrest
Yeah. The message with assertTrue/False was just "AssertionError: null", while the equalTo(true) was like "false != true". Which isn't much better, but at least tells you something. Anyway, with the message they should be similar, but I changed them anyway XD
return PauseScriptPlugin.class;
@Override
public Settings nodeSettings(int nodeOrdinal, Settings otherSettings) {
    return Settings.builder().put(super.nodeSettings(nodeOrdinal, otherSettings)).put("thread_pool.search.size", 4).build();
Out of curiosity, why 4?
Good question! That was the core of the issue:
Some context first:
- The CancellableBulkScorer we use to break the execution is called per search thread in the query.
- It breaks the "for each doc" loop into blocks of 4096 docs (x2 every iteration), and checks for cancellation between blocks.
- The test uses 100000 docs.
- The test adds around 99000 permits to a semaphore, so only 99000 docs can be processed before the threads get blocked (and the test fails).
- Which is what the test checks: it expects cancelled queries to not reach that many processed docs, and to break earlier.
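To make the block-wise cancellation check concrete, here is a minimal sketch of such a loop. It is not the actual CancellableBulkScorer code: the class name, method name, and the uncapped doubling are assumptions made for illustration, and only the 4096 starting block size and the "check between blocks" behavior come from the explanation above.

```java
import java.util.function.BooleanSupplier;
import java.util.function.IntConsumer;

/** Hypothetical sketch; NOT the real Elasticsearch CancellableBulkScorer. */
final class BlockwiseCancellationSketch {
    // Starting block size quoted in the discussion above.
    static final int INITIAL_BLOCK = 4096;

    /**
     * Visits docs [0, maxDoc) block by block and checks for cancellation
     * only after each block. Returns how many docs were visited.
     */
    static int visitInBlocks(int maxDoc, BooleanSupplier isCancelled, IntConsumer perDoc) {
        long block = INITIAL_BLOCK; // long, so doubling cannot overflow an int
        int doc = 0;
        while (doc < maxDoc) {
            int end = (int) Math.min((long) doc + block, (long) maxDoc);
            for (; doc < end; doc++) {
                perDoc.accept(doc);
            }
            // Cancellation is only observed here, between blocks.
            if (isCancelled.getAsBoolean()) {
                break;
            }
            block *= 2; // assumption: plain doubling, no upper cap
        }
        return doc;
    }
}
```

Under these assumptions, a thread whose task is already cancelled still processes one full block of 4096 docs before it notices, which is the per-thread cost the rest of the explanation relies on.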
Now, if there are 25 threads, they could consume up to 25*4096 = 102400 docs before the first cancellation check, which is more than the semaphore permits, so threads would get blocked. That is what happened sometimes.
To the specific question: 4 is just a small number that stays clear of this case by a great margin.
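The arithmetic in this comment can be checked with a few lines of Java. This is only a sketch of the reasoning, with hypothetical names; it assumes every search thread processes exactly one initial block of 4096 docs before its first cancellation check.

```java
/** Hypothetical helper that restates the worst-case arithmetic from the comment. */
final class PermitBudgetSketch {
    /** Docs that all threads together may process before any of them checks for cancellation. */
    static long worstCaseDocsBeforeFirstCheck(int searchThreads, int firstBlockSize) {
        return (long) searchThreads * firstBlockSize;
    }

    /** True when the threads could use up every permit before noticing the cancellation. */
    static boolean canExhaustPermits(int searchThreads, int firstBlockSize, long permits) {
        return worstCaseDocsBeforeFirstCheck(searchThreads, firstBlockSize) > permits;
    }

    public static void main(String[] args) {
        long permits = 99_000; // approximate permit count quoted above
        for (int threads : new int[] { 25, 4 }) {
            System.out.println(threads + " threads -> "
                + worstCaseDocsBeforeFirstCheck(threads, 4096) + " docs, can exhaust permits: "
                + canExhaustPermits(threads, 4096, permits));
        }
    }
}
```

With 25 threads the worst case is 102400 docs, above the roughly 99000 permits; with 4 threads it is 16384 docs, well below them.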
Suggestion: Either add a short version of this explanation as a comment, or replace 4 with some computation based on the total number of documents.
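One way the second option could look is sketched below. It is not what the PR ended up doing (the author added a javadoc instead); the method name and the safety factor are hypothetical, and it assumes each thread consumes one initial block of 4096 docs before its first cancellation check.

```java
/** Hypothetical sketch of deriving the search pool size instead of hard-coding 4. */
final class SearchPoolSizeSketch {
    /**
     * Largest thread count whose combined first blocks fit into
     * permits / safetyFactor. Always returns at least 1, since one thread is
     * needed to search at all; in that edge case the budget is not guaranteed.
     */
    static int maxSafeSearchThreads(long permits, int firstBlockSize, int safetyFactor) {
        if (permits < 0 || firstBlockSize <= 0 || safetyFactor <= 0) {
            throw new IllegalArgumentException("permits must be >= 0, sizes and factor must be > 0");
        }
        long budget = permits / safetyFactor;
        long threads = Math.max(1L, budget / firstBlockSize);
        return (int) Math.min(threads, (long) Integer.MAX_VALUE);
    }
}
```

For about 99000 permits and a 4096-doc block, a safety factor of 1 gives 24 threads (24*4096 = 98304), and a factor of 4 gives 6, which is in the same range as the hard-coded value.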
Added a test javadoc explaining what it does, and how it does it (Including those magic constants)
Fixes elastic#130770 Both a bigger thread pool and not draining the semaphore permits were leading to failing tests sometimes because of blocked threads (Too many threads searching would end up draining all permits in parallel, and getting stuck). The test was added in elastic#130452
Fixes #130770
Both a bigger thread pool and not draining the semaphore permits were leading to failing tests sometimes because of blocked threads (Too many threads searching would end up draining all permits in parallel, and getting stuck).
The test was added in #130452
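For readers unfamiliar with draining permits, the sketch below shows the standard `java.util.concurrent.Semaphore` calls involved. How the actual test manages its semaphore is not shown in this conversation, so the helper name and the reset-then-refill pattern are assumptions for illustration only.

```java
import java.util.concurrent.Semaphore;

/** Hypothetical sketch of resetting a permit budget; not the code from the PR. */
final class PermitResetSketch {
    /**
     * Discards any permits left over from earlier use, then grants a fresh
     * budget. Returns the number of leftover permits that were discarded.
     */
    static int resetPermits(Semaphore semaphore, int freshPermits) {
        int leftover = semaphore.drainPermits(); // acquires and returns all available permits
        semaphore.release(freshPermits);         // start again from a known budget
        return leftover;
    }
}
```

Draining first means the number of available permits no longer depends on what an earlier run left behind.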