Skip to content

Conversation

ivancea
Copy link
Contributor

@ivancea ivancea commented Jul 8, 2025

Fixes #130770

Both a bigger thread pool and not draining the semaphore permits were leading to failing tests sometimes because of blocked threads (Too many threads searching would end up draining all permits in parallel, and getting stuck).

The test was added in #130452

@ivancea ivancea requested a review from nik9000 July 8, 2025 10:39
@ivancea ivancea added >test Issues or PRs that are addressing/adding tests :Analytics/Aggregations Aggregations Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) auto-backport Automatically create backport pull requests when merged v9.0.4 v8.18.4 v8.17.9 v9.2.0 v9.1.1 v8.19.1 labels Jul 8, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

Comment on lines +140 to +141
assertThat(searchRequestFuture.isCancelled(), equalTo(false));
assertThat(searchRequestFuture.isDone(), equalTo(false));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The message looks better this way IMO, no functional change here

assertBusy(() -> {
assertTrue(searchRequestFuture.isDone());
assertThat(getSearchTasks(), empty());
assertThat("Search request didn't finish", searchRequestFuture.isDone(), equalTo(true));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't there an assertTrue that also takes a String?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scratch that, I see now that you used it to get the hamcrest error message, and apparently assertTrue(msg, bool) is from junit, not hamcrest

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. The message with assertTrue/False was just "AssertionError: null", while the equalTo(true) was like "false != true". Which isn't much better, but at least tells you something. Anyway, with the message they should be similar, but I changed them anyway XD

return PauseScriptPlugin.class;
@Override
public Settings nodeSettings(int nodeOrdinal, Settings otherSettings) {
return Settings.builder().put(super.nodeSettings(nodeOrdinal, otherSettings)).put("thread_pool.search.size", 4).build();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, why 4?

Copy link
Contributor Author

@ivancea ivancea Jul 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question! That was the core of the issue:

Some context first:

  • The CancellableBulkScorer we use to break the execution is called per search thread in the query.
  • It breaks the "for each doc" into blocks of 4096 docs (x2 every iteration), and checks for cancellation between blocks
  • The test uses 100.000 docs.
  • The test adds around 99000 permits to a semaphore, so only 99000 docs can be processed before the threads getting blocked (And the test failing)
  • Which is what the test does: It expect cancelled queries to not reach that many processed docs, and break earlier.

Now, if there are 25 threads, it would consume up to 25*4096 = 102400 docs (before checking for cancellation), which would be more than the semaphore permits, and threads would get blocked. Which is what happened sometimes.

To the specific question: 4 is just a small number that shouldn't trigger this case by a great margin.

Copy link
Contributor

@GalLalouche GalLalouche Jul 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: Either add a short version of this explanation as a comment, or replace 4 with some computation based on the total number of documents.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a test javadoc explaining what it does, and how it does it (Including those magic constants)

@ivancea ivancea requested a review from GalLalouche July 8, 2025 15:55
@ivancea ivancea enabled auto-merge (squash) July 8, 2025 16:01
@ivancea ivancea merged commit 4659e89 into elastic:main Jul 9, 2025
33 checks passed
@ivancea ivancea deleted the fix-filters-agg-test branch July 9, 2025 11:13
ivancea added a commit to ivancea/elasticsearch that referenced this pull request Jul 9, 2025
Fixes elastic#130770

Both a bigger thread pool and not draining the semaphore permits were leading to failing tests sometimes because of blocked threads (Too many threads searching would end up draining all permits in parallel, and getting stuck).

The test was added in elastic#130452
ivancea added a commit to ivancea/elasticsearch that referenced this pull request Jul 9, 2025
Fixes elastic#130770

Both a bigger thread pool and not draining the semaphore permits were leading to failing tests sometimes because of blocked threads (Too many threads searching would end up draining all permits in parallel, and getting stuck).

The test was added in elastic#130452
@elasticsearchmachine
Copy link
Collaborator

💚 Backport successful

Status Branch Result
9.0
8.18
8.17
9.1
8.19

ivancea added a commit to ivancea/elasticsearch that referenced this pull request Jul 9, 2025
Fixes elastic#130770

Both a bigger thread pool and not draining the semaphore permits were leading to failing tests sometimes because of blocked threads (Too many threads searching would end up draining all permits in parallel, and getting stuck).

The test was added in elastic#130452
ivancea added a commit to ivancea/elasticsearch that referenced this pull request Jul 9, 2025
Fixes elastic#130770

Both a bigger thread pool and not draining the semaphore permits were leading to failing tests sometimes because of blocked threads (Too many threads searching would end up draining all permits in parallel, and getting stuck).

The test was added in elastic#130452
ivancea added a commit to ivancea/elasticsearch that referenced this pull request Jul 9, 2025
Fixes elastic#130770

Both a bigger thread pool and not draining the semaphore permits were leading to failing tests sometimes because of blocked threads (Too many threads searching would end up draining all permits in parallel, and getting stuck).

The test was added in elastic#130452
elasticsearchmachine pushed a commit that referenced this pull request Jul 9, 2025
Fixes #130770

Both a bigger thread pool and not draining the semaphore permits were leading to failing tests sometimes because of blocked threads (Too many threads searching would end up draining all permits in parallel, and getting stuck).

The test was added in #130452
elasticsearchmachine pushed a commit that referenced this pull request Jul 9, 2025
Fixes #130770

Both a bigger thread pool and not draining the semaphore permits were leading to failing tests sometimes because of blocked threads (Too many threads searching would end up draining all permits in parallel, and getting stuck).

The test was added in #130452
elasticsearchmachine pushed a commit that referenced this pull request Jul 9, 2025
Fixes #130770

Both a bigger thread pool and not draining the semaphore permits were leading to failing tests sometimes because of blocked threads (Too many threads searching would end up draining all permits in parallel, and getting stuck).

The test was added in #130452
elasticsearchmachine pushed a commit that referenced this pull request Jul 9, 2025
Fixes #130770

Both a bigger thread pool and not draining the semaphore permits were leading to failing tests sometimes because of blocked threads (Too many threads searching would end up draining all permits in parallel, and getting stuck).

The test was added in #130452
elasticsearchmachine pushed a commit that referenced this pull request Jul 9, 2025
Fixes #130770

Both a bigger thread pool and not draining the semaphore permits were leading to failing tests sometimes because of blocked threads (Too many threads searching would end up draining all permits in parallel, and getting stuck).

The test was added in #130452
mridula-s109 pushed a commit to mridula-s109/elasticsearch that referenced this pull request Jul 17, 2025
Fixes elastic#130770

Both a bigger thread pool and not draining the semaphore permits were leading to failing tests sometimes because of blocked threads (Too many threads searching would end up draining all permits in parallel, and getting stuck).

The test was added in elastic#130452
mridula-s109 pushed a commit to mridula-s109/elasticsearch that referenced this pull request Jul 17, 2025
Fixes elastic#130770

Both a bigger thread pool and not draining the semaphore permits were leading to failing tests sometimes because of blocked threads (Too many threads searching would end up draining all permits in parallel, and getting stuck).

The test was added in elastic#130452
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/Aggregations Aggregations auto-backport Automatically create backport pull requests when merged Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) >test Issues or PRs that are addressing/adding tests v8.17.9 v8.18.4 v8.19.1 v9.0.4 v9.1.1 v9.2.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CI] FiltersCancellationIT testFiltersSubAggsCancellation failing

4 participants