[BUG] Remote cluster state compatibility failures #20910

@andrross

Describe the bug

A BWC test for remote cluster state was added in #20221. This is failing intermittently:

https://build.ci.opensearch.org/job/gradle-check/72744/consoleText
https://build.ci.opensearch.org/job/gradle-check/72748/consoleText

  1. Build failure (top-level):

Task :qa:rolling-upgrade:v2.19.6-remote#twoThirdsUpgradedTest FAILED

Execution failed for task ':qa:rolling-upgrade:v2.19.6-remote#twoThirdsUpgradedTest'.

process was found dead while waiting for cluster health yellow, cluster{:qa:rolling-upgrade:v2.19.6-remote}

  2. IndexMetadata XContent deserialization failure (old node reading index metadata blobs written by upgraded cluster-manager):
[2026-03-18T11:33:45,850][ERROR][o.o.g.r.RemoteClusterStateService] [v2.19.6-remote-2] Failed to read cluster state from remote
org.opensearch.gateway.remote.RemoteStateTransferException: Download failed for java_for_range
        at org.opensearch.gateway.remote.RemoteIndexMetadataManager.lambda$getWrappedReadListener$3(RemoteIndexMetadataManager.java:159)
        at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90)
        at org.opensearch.common.remote.RemoteWriteableEntityBlobStore.lambda$readAsync$0(RemoteWriteableEntityBlobStore.java:87)
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:955)
        ...
Caused by: java.lang.IllegalStateException: Can't get text on a START_ARRAY at -1:702
        at org.opensearch.common.xcontent.json.JsonXContentParser.text(JsonXContentParser.java:99)
        at org.opensearch.core.xcontent.AbstractXContentParser.map(AbstractXContentParser.java:298)
        at org.opensearch.core.xcontent.AbstractXContentParser.mapStrings(AbstractXContentParser.java:282)
        at org.opensearch.cluster.metadata.IndexMetadata$Builder.fromXContent(IndexMetadata.java:2013)
        at org.opensearch.cluster.metadata.IndexMetadata.fromXContent(IndexMetadata.java:1080)
        at org.opensearch.repositories.blobstore.ChecksumBlobStoreFormat.deserialize(ChecksumBlobStoreFormat.java:144)
        at org.opensearch.gateway.remote.model.RemoteIndexMetadata.deserialize(RemoteIndexMetadata.java:136)
        at org.opensearch.gateway.remote.model.RemoteIndexMetadata.deserialize(RemoteIndexMetadata.java:35)
        at org.opensearch.common.remote.RemoteWriteableEntityBlobStore.read(RemoteWriteableEntityBlobStore.java:77)
        at org.opensearch.common.remote.RemoteWriteableEntityBlobStore.lambda$readAsync$0(RemoteWriteableEntityBlobStore.java:85)

This repeats for every index in the cluster (test_index, test_recovery, index_with_replicas, test_index_old, geo_shape_index_old, test-index-segrep, etc.).
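
For illustration, `Can't get text on a START_ARRAY` is the kind of error a strings-only map reader produces when one of the values turns out to be an array. Below is a minimal, self-contained sketch of that failure mode. It does not use any OpenSearch classes: `StringMapReader`, `readStringMap`, and the sample keys are hypothetical, and the sketch makes no claim about which field in the index metadata blob actually triggers the failure.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a parser that expects a map of string values.
final class StringMapReader {

    // Strict reader: every value must be scalar text, as a reader written
    // against an older format (strings only) would assume.
    static Map<String, String> readStringMap(Map<String, Object> parsed) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : parsed.entrySet()) {
            Object value = entry.getValue();
            if (value instanceof List) {
                // The old reader has no branch for arrays, so it fails here.
                throw new IllegalStateException(
                    "Can't get text on a START_ARRAY (field: " + entry.getKey() + ")");
            }
            out.put(entry.getKey(), String.valueOf(value));
        }
        return out;
    }

    public static void main(String[] args) {
        // Shape the old reader understands: string values only.
        Map<String, Object> oldShape = new LinkedHashMap<>();
        oldShape.put("some_key", "some_value");
        System.out.println("old shape parsed: " + readStringMap(oldShape));

        // Assumed newer shape: the same key now carries an array.
        Map<String, Object> newShape = new LinkedHashMap<>();
        newShape.put("some_key", List.of("a", "b"));
        try {
            readStringMap(newShape);
        } catch (IllegalStateException e) {
            System.out.println("old reader failed: " + e.getMessage());
        }
    }
}
```

If the real cause has this shape, the old node cannot recover by retrying, since every read of the same blob fails the same way. That would be consistent with the error repeating for every index.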

  3. DiscoveryNodes binary deserialization failure (old node reading discovery nodes blob written by upgraded cluster-manager):
[2026-03-18T11:33:45,859][ERROR][o.o.g.r.RemoteClusterStateService] [v2.19.6-remote-2] Failed to read cluster state from remote
org.opensearch.gateway.remote.RemoteStateTransferException: Download failed for nodes
        at org.opensearch.gateway.remote.RemoteClusterStateAttributesManager.lambda$getWrappedReadListener$3(RemoteClusterStateAttributesManager.java:103)
        at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90)
        at org.opensearch.common.remote.RemoteWriteableEntityBlobStore.lambda$readAsync$0(RemoteWriteableEntityBlobStore.java:87)
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:955)
        ...
Caused by: java.lang.IllegalStateException: unexpected byte [0x08]
        at org.opensearch.core.common.io.stream.StreamInput.readBoolean(StreamInput.java:596)
        at org.opensearch.core.common.io.stream.StreamInput.readBoolean(StreamInput.java:586)
        at org.opensearch.cluster.node.DiscoveryNode.<init>(DiscoveryNode.java:344)
        at org.opensearch.cluster.node.DiscoveryNodes.readFrom(DiscoveryNodes.java:777)
        at org.opensearch.gateway.remote.model.RemoteDiscoveryNodes.lambda$static$0(RemoteDiscoveryNodes.java:37)
        at org.opensearch.repositories.blobstore.ChecksumWritableBlobStoreFormat.deserialize(ChecksumWritableBlobStoreFormat.java:105)
        at org.opensearch.gateway.remote.model.RemoteDiscoveryNodes.deserialize(RemoteDiscoveryNodes.java:101)
        at org.opensearch.gateway.remote.model.RemoteDiscoveryNodes.deserialize(RemoteDiscoveryNodes.java:32)
        at org.opensearch.common.remote.RemoteWriteableEntityBlobStore.read(RemoteWriteableEntityBlobStore.java:77)
        at org.opensearch.common.remote.RemoteWriteableEntityBlobStore.lambda$readAsync$0(RemoteWriteableEntityBlobStore.java:85)
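
The `unexpected byte [0x08]` from `readBoolean` is typical of a reader and writer that disagree on the binary layout: the reader expects a boolean (0 or 1) at a position where the writer put something else. The sketch below reproduces that class of failure with plain `java.io` streams. The two layouts and all names (`StrictBooleanDemo`, `writeOldLayout`, `writeNewLayout`, `readWithOldReader`) are hypothetical. The actual field that differs in `DiscoveryNode` is not identified in this report.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class StrictBooleanDemo {

    // Strict boolean read: only 0 and 1 are valid, anything else is an error.
    static boolean readStrictBoolean(DataInputStream in) throws IOException {
        int b = in.readUnsignedByte();
        if (b == 0) return false;
        if (b == 1) return true;
        throw new IllegalStateException(String.format("unexpected byte [0x%02x]", b));
    }

    // Hypothetical old layout: [boolean flag]
    static byte[] writeOldLayout(boolean flag) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeBoolean(flag);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Hypothetical new layout: [byte extra][boolean flag]
    static byte[] writeNewLayout(int extra, boolean flag) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeByte(extra);
            out.writeBoolean(flag);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Old reader: knows only the old layout, so it reads the flag first.
    static boolean readWithOldReader(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return readStrictBoolean(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("old layout: " + readWithOldReader(writeOldLayout(true)));
        try {
            readWithOldReader(writeNewLayout(0x08, true));
        } catch (IllegalStateException e) {
            System.out.println("new layout read by old reader: " + e.getMessage());
        }
    }
}
```

If the real cause has this shape, the upgraded cluster-manager would need to write a layout that the oldest node in the cluster can still read.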

Related component

Cluster Manager

To Reproduce

Not deterministic. I think it requires a mixed-version cluster in which a new-version node is elected cluster manager and then publishes a new cluster state, which the old-version nodes then have to read from remote.

Expected behavior

The tests should pass every time.
