[CI] MultiClusterSpecIT tests fail sometimes with SocketTimeoutException

### CI Link

https://gradle-enterprise.elastic.co/s/hixbz3k3fauom

### Repro line

```
./gradlew ":x-pack:plugin:esql:qa:server:multi-clusters:v9.0.7#newToOld" -Dtests.class="org.elasticsearch.xpack.esql.ccq.MultiClusterSpecIT" -Dtests.method="test {csv-spec:k8s-timeseries-avg-over-time.Avg_over_time_of_integer}" -Dtests.seed=3DF7A01082630BF4 -Dtests.bwc=true -Dtests.locale=ann-Latn-NG -Dtests.timezone=Indian/Christmas -Druntime.java=24
```

### Does it reproduce?

No

### Applicable branches

main

### Failure history

_No response_

### Failure excerpt

```
MultiClusterSpecIT > test {csv-spec:k8s-timeseries-avg-over-time.Avg_over_time_of_integer} FAILED
    java.net.SocketTimeoutException: 60,000 milliseconds timeout on connection http-outgoing-8 [ACTIVE]
        at __randomizedtesting.SeedInfo.seed([3DF7A01082630BF4:B5A39FCA2C9F660C]:0)
        at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:387)
        at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:98)
        at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:40)
        at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175)
        at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:261)
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:506)
        at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:211)
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280)
        at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
        at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591)
        at java.base/java.lang.Thread.run(Thread.java:1447)

REPRODUCE WITH: ./gradlew ":x-pack:plugin:esql:qa:server:multi-clusters:v9.0.7#newToOld" -Dtests.class="org.elasticsearch.xpack.esql.ccq.MultiClusterSpecIT" -Dtests.method="test {csv-spec:stats.MaxOfByte}" -Dtests.seed=3DF7A01082630BF4 -Dtests.bwc=true -Dtests.locale=ann-Latn-NG -Dtests.timezone=Indian/Christmas -Druntime.java=24

MultiClusterSpecIT > test {csv-spec:stats.MaxOfByte} FAILED
    org.elasticsearch.client.ResponseException: method [PUT], host [http://[::1]:45821], URI [/languages_lookup_non_unique_key], status line [HTTP/1.1 400 Bad Request]
    {"error":{"root_cause":[{"type":"resource_already_exists_exception","reason":"index [languages_lookup_non_unique_key/e2f-qB8aRhmgXqaEQFESnA] already exists","index_uuid":"e2f-qB8aRhmgXqaEQFESnA","index":"languages_lookup_non_unique_key"}],"type":"resource_already_exists_exception","reason":"index [languages_lookup_non_unique_key/e2f-qB8aRhmgXqaEQFESnA] already exists","index_uuid":"e2f-qB8aRhmgXqaEQFESnA","index":"languages_lookup_non_unique_key"},"status":400}
        at __randomizedtesting.SeedInfo.seed([3DF7A01082630BF4:B5A39FCA2C9F660C]:0)
        at app//org.elasticsearch.client.RestClient.convertResponse(RestClient.java:351)
        at app//org.elasticsearch.client.RestClient.access$1900(RestClient.java:109)
        at app//org.elasticsearch.client.RestClient$1.completed(RestClient.java:401)
        at app//org.elasticsearch.client.RestClient$1.completed(RestClient.java:397)
        at app//org.apache.http.concurrent.BasicFuture.completed(BasicFuture.java:122)
```

This type of failure started happening around Sept 10th with [this failure](https://gradle-enterprise.elastic.co/s/r6g2ulaom7kw2) and has kept recurring in some CI runs. I _think_ most of them (at least the 3-4 I looked at) were `intake/main/[version]/bwc-snapshots` runs. Another example is [here](https://gradle-enterprise.elastic.co/s/hixbz3k3fauom).

I can't explain why it happens, but:
- the failures started at almost the same time PR https://github.com/elastic/elasticsearch/pull/134086 was merged. That PR adds parallel loading for `MultiClusterSpecIT`
- in the logs of a recent failure of this kind, I noticed long delays that I cannot explain:
```
[2025-09-15T21:07:15,784][INFO ][o.e.x.e.c.MultiClusterSpecIT][test] [csv-spec:k8s-timeseries-avg-over-time.Avg_over_time_of_integer] before test
...............
[2025-09-15T21:07:29,274][INFO ][o.e.c.m.MetadataCreateIndexService] [remote_cluster-1] [multi_column_joinable_lookup] creating index, cause [api], templates [], shards [1]/[1]
[2025-09-15T21:07:29,300][INFO ][o.e.c.m.MetadataCreateIndexService] [local_cluster-0] creating index [multi_column_joinable_lookup] in project [default], cause [api], templates [], shards [1]/[1]
[2025-09-15T21:07:29,592][INFO ][o.e.c.r.a.AllocationService] [local_cluster-0] current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[multi_column_joinable_lookup][0]]])." previous.health="YELLOW" reason="shards started [[multi_column_joinable_lookup][0]]"
[2025-09-15T21:07:29,623][INFO ][o.e.c.r.a.AllocationService] [remote_cluster-1] current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[multi_column_joinable_lookup][0]]])." previous.health="YELLOW" reason="shards started [[multi_column_joinable_lookup][0]]"
[2025-09-15T21:08:29,603][INFO ][o.e.x.e.CsvTestsDataLoader][test] Data loading of [2918] bytes into [multi_column_joinable_lookup] OK
[2025-09-15T21:08:29,621][INFO ][o.e.c.m.MetadataCreateIndexService] [local_cluster-0] creating index [clientips] in project [default], cause [api], templates [], shards [1]/[1]
[2025-09-15T21:08:29,623][INFO ][o.e.c.m.MetadataCreateIndexService] [remote_cluster-1] [clientips] creating index, cause [api], templates [], shards [1]/[1]
[2025-09-15T21:08:29,779][INFO ][o.e.x.e.CsvTestsDataLoader][test] Data loading of [392] bytes into [clientips] OK
....
[2025-09-15T21:08:47,935][INFO ][o.e.x.e.EnrichPolicyRunner] [remote_cluster-0] Policy [heights_policy]: Policy execution complete
[2025-09-15T21:08:47,995][INFO ][o.e.x.e.EnrichPolicyRunner] [local_cluster-1] Policy [heights_policy]: Policy execution complete
[2025-09-15T21:09:48,102][INFO ][o.e.x.e.c.MultiClusterSpecIT][test] [csv-spec:k8s-timeseries-avg-over-time.Avg_over_time_of_integer] after test
```

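There are two big gaps in the logs (21:07:29,623 - 21:08:29,603 and 21:08:47,995 - 21:09:48,102), each roughly 60 seconds long (59.98 s and 60.11 s). That is very close to the 60,000 ms socket timeout reported in the stack trace above.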
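For anyone who wants to scan other CI logs for similar stalls, here is a minimal Python sketch that reports large gaps between consecutive timestamped log lines. The function name `find_gaps` and the 30-second threshold are my own choices for illustration, and the sample lines are abbreviated copies of the excerpt above (the message text is cut to `...`).

```python
import re
from datetime import datetime

# Matches the leading "[2025-09-15T21:07:15,784]" timestamp of a log line.
TIMESTAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3})\]")


def find_gaps(lines, threshold_seconds=30.0):
    """Return (start, end, seconds) for consecutive timestamped lines
    that are at least threshold_seconds apart (threshold is an assumption)."""
    gaps = []
    previous = None
    for line in lines:
        match = TIMESTAMP.match(line)
        if match is None:
            continue  # skip elided ("....") or non-timestamped lines
        current = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S,%f")
        if previous is not None:
            delta = (current - previous).total_seconds()
            if delta >= threshold_seconds:
                gaps.append((previous, current, delta))
        previous = current
    return gaps


# Abbreviated lines from the log excerpt in this issue.
log_excerpt = [
    "[2025-09-15T21:07:29,623][INFO ][o.e.c.r.a.AllocationService] [remote_cluster-1] ...",
    "[2025-09-15T21:08:29,603][INFO ][o.e.x.e.CsvTestsDataLoader][test] ...",
    "[2025-09-15T21:08:47,995][INFO ][o.e.x.e.EnrichPolicyRunner] [local_cluster-1] ...",
    "[2025-09-15T21:09:48,102][INFO ][o.e.x.e.c.MultiClusterSpecIT][test] ...",
]

for start, end, seconds in find_gaps(log_excerpt):
    print(f"{start.time()} -> {end.time()}: {seconds:.3f}s")
```

On these four lines it should flag the two gaps mentioned above and skip the 18-second step between them. Running it over a full test log would show whether the stalls always land near the client's 60-second timeout.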
