[CI] MultiClusterSpecIT tests fail sometimes with SocketTimeoutException #134736

@astefan

Description

CI Link

https://gradle-enterprise.elastic.co/s/hixbz3k3fauom

Repro line

./gradlew ":x-pack:plugin:esql:qa:server:multi-clusters:v9.0.7#newToOld" -Dtests.class="org.elasticsearch.xpack.esql.ccq.MultiClusterSpecIT" -Dtests.method="test {csv-spec:k8s-timeseries-avg-over-time.Avg_over_time_of_integer}" -Dtests.seed=3DF7A01082630BF4 -Dtests.bwc=true -Dtests.locale=ann-Latn-NG -Dtests.timezone=Indian/Christmas -Druntime.java=24

Does it reproduce?

No

Applicable branches

main

Failure history

No response

Failure excerpt

MultiClusterSpecIT > test {csv-spec:k8s-timeseries-avg-over-time.Avg_over_time_of_integer} FAILED
    java.net.SocketTimeoutException: 60,000 milliseconds timeout on connection http-outgoing-8 [ACTIVE]
        at __randomizedtesting.SeedInfo.seed([3DF7A01082630BF4:B5A39FCA2C9F660C]:0)
        at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:387)
        at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:98)
        at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:40)
        at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175)
        at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:261)
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:506)
        at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:211)
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280)
        at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
        at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591)
        at java.base/java.lang.Thread.run(Thread.java:1447)

REPRODUCE WITH: ./gradlew ":x-pack:plugin:esql:qa:server:multi-clusters:v9.0.7#newToOld" -Dtests.class="org.elasticsearch.xpack.esql.ccq.MultiClusterSpecIT" -Dtests.method="test {csv-spec:stats.MaxOfByte}" -Dtests.seed=3DF7A01082630BF4 -Dtests.bwc=true -Dtests.locale=ann-Latn-NG -Dtests.timezone=Indian/Christmas -Druntime.java=24

MultiClusterSpecIT > test {csv-spec:stats.MaxOfByte} FAILED
    org.elasticsearch.client.ResponseException: method [PUT], host [http://[::1]:45821], URI [/languages_lookup_non_unique_key], status line [HTTP/1.1 400 Bad Request]
    {"error":{"root_cause":[{"type":"resource_already_exists_exception","reason":"index [languages_lookup_non_unique_key/e2f-qB8aRhmgXqaEQFESnA] already exists","index_uuid":"e2f-qB8aRhmgXqaEQFESnA","index":"languages_lookup_non_unique_key"}],"type":"resource_already_exists_exception","reason":"index [languages_lookup_non_unique_key/e2f-qB8aRhmgXqaEQFESnA] already exists","index_uuid":"e2f-qB8aRhmgXqaEQFESnA","index":"languages_lookup_non_unique_key"},"status":400}
        at __randomizedtesting.SeedInfo.seed([3DF7A01082630BF4:B5A39FCA2C9F660C]:0)
        at app//org.elasticsearch.client.RestClient.convertResponse(RestClient.java:351)
        at app//org.elasticsearch.client.RestClient.access$1900(RestClient.java:109)
        at app//org.elasticsearch.client.RestClient$1.completed(RestClient.java:401)
        at app//org.elasticsearch.client.RestClient$1.completed(RestClient.java:397)
        at app//org.apache.http.concurrent.BasicFuture.completed(BasicFuture.java:122)

This type of failure started happening around Sept 10th with this failure, and it has kept recurring in some CI runs. Most of them (at least the 3-4 I looked at) were intake/main/[version]/bwc-snapshots runs. Another example here.

I couldn't explain why it happens, but:

[2025-09-15T21:07:15,784][INFO ][o.e.x.e.c.MultiClusterSpecIT][test] [csv-spec:k8s-timeseries-avg-over-time.Avg_over_time_of_integer] before test
...............
[2025-09-15T21:07:29,274][INFO ][o.e.c.m.MetadataCreateIndexService] [remote_cluster-1] [multi_column_joinable_lookup] creating index, cause [api], templates [], shards [1]/[1]
[2025-09-15T21:07:29,300][INFO ][o.e.c.m.MetadataCreateIndexService] [local_cluster-0] creating index [multi_column_joinable_lookup] in project [default], cause [api], templates [], shards [1]/[1]
[2025-09-15T21:07:29,592][INFO ][o.e.c.r.a.AllocationService] [local_cluster-0] current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[multi_column_joinable_lookup][0]]])." previous.health="YELLOW" reason="shards started [[multi_column_joinable_lookup][0]]"
[2025-09-15T21:07:29,623][INFO ][o.e.c.r.a.AllocationService] [remote_cluster-1] current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[multi_column_joinable_lookup][0]]])." previous.health="YELLOW" reason="shards started [[multi_column_joinable_lookup][0]]"
[2025-09-15T21:08:29,603][INFO ][o.e.x.e.CsvTestsDataLoader][test] Data loading of [2918] bytes into [multi_column_joinable_lookup] OK
[2025-09-15T21:08:29,621][INFO ][o.e.c.m.MetadataCreateIndexService] [local_cluster-0] creating index [clientips] in project [default], cause [api], templates [], shards [1]/[1]
[2025-09-15T21:08:29,623][INFO ][o.e.c.m.MetadataCreateIndexService] [remote_cluster-1] [clientips] creating index, cause [api], templates [], shards [1]/[1]
[2025-09-15T21:08:29,779][INFO ][o.e.x.e.CsvTestsDataLoader][test] Data loading of [392] bytes into [clientips] OK
....
[2025-09-15T21:08:47,935][INFO ][o.e.x.e.EnrichPolicyRunner] [remote_cluster-0] Policy [heights_policy]: Policy execution complete
[2025-09-15T21:08:47,995][INFO ][o.e.x.e.EnrichPolicyRunner] [local_cluster-1] Policy [heights_policy]: Policy execution complete
[2025-09-15T21:09:48,102][INFO ][o.e.x.e.c.MultiClusterSpecIT][test] [csv-spec:k8s-timeseries-avg-over-time.Avg_over_time_of_integer] after test

There are two big gaps in the logs (21:07:29,623 to 21:08:29,603 and 21:08:47,995 to 21:09:48,102), each roughly 60 seconds long, which matches the 60,000 ms timeout reported in the SocketTimeoutException.
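
The size of the two gaps can be checked directly from the timestamps quoted in the log excerpt. This is a minimal Python sketch, not part of the test suite: the helper name `gap_seconds` and the constant `LOG_TS_FORMAT` are made up for illustration, and the timestamps are copied by hand from the log lines.

```python
from datetime import datetime

# Timestamp layout of the log lines in the excerpt (comma before the milliseconds).
LOG_TS_FORMAT = "%Y-%m-%dT%H:%M:%S,%f"


def gap_seconds(start: str, end: str) -> float:
    """Return the elapsed seconds between two log timestamps (hypothetical helper)."""
    t0 = datetime.strptime(start, LOG_TS_FORMAT)
    t1 = datetime.strptime(end, LOG_TS_FORMAT)
    return (t1 - t0).total_seconds()


# Start and end of each silent period, copied from the log excerpt.
gaps = [
    gap_seconds("2025-09-15T21:07:29,623", "2025-09-15T21:08:29,603"),
    gap_seconds("2025-09-15T21:08:47,995", "2025-09-15T21:09:48,102"),
]

for gap in gaps:
    print(f"{gap:.3f} s")
```

The first gap comes out just under 60 seconds and the second just over, so both sit within about 0.1 s of the 60,000 ms timeout in the exception message.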

Metadata

Assignees

No one assigned

    Labels

    :Analytics/ES|QL - AKA ESQL
    >test-failure - Triaged test failures from CI
    Team:Analytics - Meta label for analytical engine team (ESQL/Aggs/Geo)
    low-risk - An open issue or test failure that is a low risk to future releases
