Skip to content

Conversation

@craigtaverner
Copy link
Contributor

@craigtaverner craigtaverner commented Oct 19, 2025

Some recent test performance updates flushed out test fragility due to non-thread safety and other concurrency issues with test setup, and so two optimizations were reverted. A recent PR at #136780 added improved resiliency to concurrency issues in test setup which should allow us to bring back the performance optimizations.

The resiliency fix included in #136780 was originally extracted from the Views PR at #134995, which, due to early issues in view creation/deleting during test setup was reliably exposing the same errors seen in #134736, and so it seems likely it will also fix the issues seen in the optimizations.

Checklist:

Fixes #134890

@craigtaverner craigtaverner marked this pull request as ready for review October 20, 2025 12:59
@elasticsearchmachine elasticsearchmachine added the needs:triage Requires assignment of a team area label label Oct 20, 2025
@craigtaverner craigtaverner added >non-issue Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) :Analytics/ES|QL AKA ESQL >test Issues or PRs that are addressing/adding tests and removed needs:triage Requires assignment of a team area label >non-issue labels Oct 20, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

}
}

private static final Protected INGEST = new Protected();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the Views PR we use a separate VIEWS protector, semi-decoupling views from index creation. We could break the index parts into finer grained stuff too, but I see risks with that.

@craigtaverner
Copy link
Contributor Author

Sadly, we already see failures from the parallel data loading. I think the changes to EsqlSpecTestCase do give slightly better error messages, but not enough to determine the underlying cause. I'll do some local investigation to see if I can figure this out, otherwise revert that particular commit, and we deal with parallel data loading separately.

@craigtaverner craigtaverner mentioned this pull request Oct 21, 2025
20 tasks
});
try {
Response remote = remoteResponse.get();
Response local = localResponse.get();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: you could use PlainActionFuture instead of anonymous listeners implementations here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this code comes from a revert of a revert of an approved PR, I do not want to fiddle with it.

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't tell what exactly is the novelty in this PR compared to the two previous ones it reverts. You had this commit initially, but the code didn't make it in the final variant of the PR.

It would help me, at least, to compare what we did before and what this PR does new now, if you could point out in your changes or with a comment the relevant bits. Thanks.

@craigtaverner
Copy link
Contributor Author

I can't tell what exactly is the novelty in this PR compared to the two previous ones it reverts. You had this commit initially, but the code didn't make it in the final variant of the PR.

Very good point. It turns out that the first commit you pointed out was actually also part of another PR, the one at #136780, which was merged two weeks ago, and this this PR got those updates when it merged main, effectively erasing the first commit, since it replaced it. So the real fix has been in for a few weeks now, and this PR now represents only the two optimizations being reverted.

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@craigtaverner craigtaverner changed the title Make EsqlSpecTestCase more resilient and re-enable some performance updates Re-enable some performance updates to ES|QL spec tests after resilience improvement to EsqlSpecTestCase Nov 3, 2025
@craigtaverner craigtaverner merged commit 4d3e27f into elastic:main Nov 3, 2025
34 checks passed

private static <T> void executeInParallel(List<T> items, IOConsumer<T> consumer, String errorMessage) {
Semaphore semaphore = new Semaphore(PARALLEL_THREADS);
ESTestCase.runInParallel(items.size(), i -> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ESTestCase.runInParallel internally creates a thread per each task.
Today we have 47 data sets but we only want to populate 10 at the time. I believe it would be much cheaper to use Executors.newFixedThreadPool(threads) rather than create a thread per task, especially if we want to limit parallelism.

craigtaverner added a commit to craigtaverner/elasticsearch that referenced this pull request Nov 3, 2025
…resilience improvement to EsqlSpecTestCase (elastic#136781)"

This reverts commit 4d3e27f.
elasticsearchmachine pushed a commit that referenced this pull request Nov 3, 2025
This reverts commit 4d3e27f. And then
re-applies the revert to
#134738, so we only get an
effective revert of 
#134511.
craigtaverner added a commit to craigtaverner/elasticsearch that referenced this pull request Nov 4, 2025
…resilience improvement to EsqlSpecTestCase (elastic#136781)"

This reverts commit 4d3e27f.
elasticsearchmachine pushed a commit that referenced this pull request Nov 4, 2025
…resilience improvement to EsqlSpecTestCase (#136781)" (#137589)

This reverts commit 4d3e27f.

Yesterday we merged two performance updates to ES|QL Spec tests that had
been reverted two months back after noticing some increased test
flakiness. However, we immediately saw increased failures due to one of
them, so we reverted it. Having done that we continue to see other
failures, that may or may not be related to the remaining performance
update. This PR reverts that too.
phananh1010 added a commit to phananh1010/elasticsearch that referenced this pull request Nov 6, 2025
BASE=9770442d77ceac0dda6f6e6a07d0d386885ab52a
HEAD=955fbccc5328132bd5bcbb95bab60ca5b355e8ca
Branch=main
phananh1010 added a commit to phananh1010/elasticsearch that referenced this pull request Nov 7, 2025
BASE=9770442d77ceac0dda6f6e6a07d0d386885ab52a
HEAD=955fbccc5328132bd5bcbb95bab60ca5b355e8ca
Branch=main
craigtaverner added a commit to craigtaverner/elasticsearch that referenced this pull request Nov 10, 2025
… resilience improvement to EsqlSpecTestCase (elastic#136781)" (elastic#137589)

This reverts commit 8e4ad76.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/ES|QL AKA ESQL Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) >test Issues or PRs that are addressing/adding tests v9.3.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ESQL: re-apply the parallel loading of test data

4 participants