@gmarouli gmarouli commented Jun 18, 2025

Recently we have been seeing an increase in cluster initialisation failures. For example:

It appears that starting a node exceeds the 3-minute timeout:

-----> Test starting
[2025-06-16T01:16:35,965][INFO ][o.e.t.c.l.AbstractLocalClusterFactory] [[test-cluster-node-executor-0]] Starting Elasticsearch node 'test-cluster-1'
[2025-06-16T01:16:35,967][INFO ][o.e.t.c.l.AbstractLocalClusterFactory] [[test-cluster-node-executor-1]] Starting Elasticsearch node 'test-cluster-0'

-----> test-cluster-0 started up 3 minutes and 3 seconds later.
[2025-06-16T01:19:38,478][INFO ][o.e.t.TransportService   ] [test-cluster-0] publish_address {127.0.0.1:45839}, bound_addresses {[::1]:34877}, {127.0.0.1:45839}

-----> test-cluster-1 has not started within the timeout.
[2025-06-16T01:19:45,043][INFO ][o.e.n.Node               ] [test-cluster-1] starting ...

In the last two months a lot of tests were converted to the newer REST test framework. Some tests start 1 node, others start 3 nodes, others even more. The framework runs tests in parallel, but it doesn't know how many nodes its tests need, so running 3 tests in parallel can mean very different loads depending on whether they use single-node or 3-node clusters. During this execution we saw roughly 3x more CPU load than we would ideally want.
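The load mismatch above can be sketched with a bit of arithmetic (a hypothetical illustration, not the framework's actual scheduling logic):

```python
# Illustrative sketch: the test runner fixes how many *tests* run in
# parallel, but the node count per test is invisible to it, so the actual
# number of concurrent Elasticsearch nodes varies widely for the same
# concurrency setting.

def concurrent_nodes(nodes_per_test: list[int]) -> int:
    """Total nodes running at once for one batch of parallel tests."""
    return sum(nodes_per_test)

# Three single-node tests in parallel: 3 nodes total.
light_batch = concurrent_nodes([1, 1, 1])

# Three 3-node tests in parallel: 9 nodes, 3x the load at the
# same test-level concurrency.
heavy_batch = concurrent_nodes([3, 3, 3])

print(light_batch, heavy_batch)  # 3 9
```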

Currently there is no good solution for this: if we dial down the concurrency we will use the workers inefficiently, but if we keep the concurrency where it is we risk longer startup times. Since Elasticsearch's startup time is unrelated to what these tests verify, we choose to increase the timeout to reduce the noise.
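The fix itself is just a larger wait bound. A minimal sketch of the startup-wait pattern, with hypothetical names (the real framework's `AbstractLocalClusterFactory` is Java code and is not reproduced here):

```python
import time

def wait_for_startup(is_started, timeout_s: float, poll_s: float = 0.1) -> bool:
    """Poll until the node reports started or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_started():
            return True
        time.sleep(poll_s)
    return False

# Under heavy parallel load a healthy node can legitimately take longer
# than the old bound to bind its ports, so the fix raises timeout_s rather
# than treating a slow start as a failure.
started_at = time.monotonic() + 0.3  # simulate a node that starts after 0.3s
ok = wait_for_startup(lambda: time.monotonic() >= started_at, timeout_s=2.0)
print(ok)  # True
```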

Fixes:

@elasticsearchmachine elasticsearchmachine added needs:triage Requires assignment of a team area label v9.1.0 labels Jun 18, 2025
@gmarouli gmarouli added >test Issues or PRs that are addressing/adding tests auto-backport Automatically create backport pull requests when merged v8.19.0 :Delivery/Build Build or test infrastructure labels Jun 18, 2025
@elasticsearchmachine elasticsearchmachine added Team:Delivery Meta label for Delivery team and removed needs:triage Requires assignment of a team area label labels Jun 18, 2025
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-delivery (Team:Delivery)

@gmarouli gmarouli merged commit ee5d652 into elastic:main Jun 19, 2025
27 checks passed
@gmarouli gmarouli deleted the increase-starting-cluster-timeout branch June 19, 2025 14:37
@elasticsearchmachine (Collaborator)

💚 Backport successful

Branch 8.19: successful

gmarouli added a commit to gmarouli/elasticsearch that referenced this pull request Jun 19, 2025
elasticsearchmachine pushed a commit that referenced this pull request Jun 19, 2025
…29720)
kderusso pushed a commit to kderusso/elasticsearch that referenced this pull request Jun 23, 2025
mridula-s109 pushed a commit to mridula-s109/elasticsearch that referenced this pull request Jun 25, 2025
Labels

auto-backport Automatically create backport pull requests when merged :Delivery/Build Build or test infrastructure Team:Delivery Meta label for Delivery team >test Issues or PRs that are addressing/adding tests v8.19.0 v9.1.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CI] MixedClusterDownsampleRestIT class failing
[CI] HuggingFaceServiceUpgradeIT class failing

3 participants