
Test remote state BWC in both CM version directions #20912

Closed
andrross wants to merge 1 commit into opensearch-project:main from andrross:RemotePublicationClusterStateIT-deterministic-failure

Conversation

@andrross
Member

The RemotePublicationClusterStateIT test was not deterministically exercising the case where a new-version cluster-manager writes remote cluster state that old-version nodes must deserialize. This meant backwards-incompatible serialization changes could be merged without being caught, since the test would pass whenever an old-version node happened to win the cluster-manager election.

In the mixed-version (one-third upgraded) phase, the test now explicitly tests both directions:

  1. Old CM writes state, new nodes read from remote store
  2. New CM writes state, old nodes read from remote store

To force a specific version to be cluster-manager, the test uses the voting config exclusions API to repeatedly exclude the current CM until a node of the desired version wins the election. Exclusions are cleared immediately after each re-election so no node leaves the cluster.

Also changes the remote test cluster's dependency on the non-remote test cluster from dependsOn to mustRunAfter, so the remote tests can be run independently without first running the full non-remote suite.
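The build-script change described above can be sketched roughly as follows. This is a hypothetical fragment (the actual task names in the OpenSearch build are assumed, not taken from the PR); the point is that mustRunAfter only imposes ordering when both tasks are requested, whereas dependsOn always forces the non-remote suite to run first:

```groovy
// Hypothetical Gradle fragment; task names are assumptions for illustration.
tasks.named("remoteClusterTest") {
    // before: dependsOn "nonRemoteClusterTest"  (always runs the non-remote suite)
    // after: only orders the tasks if both are on the task graph
    mustRunAfter "nonRemoteClusterTest"
}
```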

Related Issues

Related to #20910

Check List

  • Functionality includes testing.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Contributor

github-actions bot commented Mar 18, 2026

PR Reviewer Guide 🔍

(Review updated until commit f1489d0)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Infinite Loop Risk

In ensureClusterManagerVersion, if the cluster only has nodes of one version (e.g., all old or all new), the while loop will spin until the 1-minute deadline. In a mixed cluster with only one node of the desired version, repeated exclusions may not reliably elect that node. The logic should also verify that a node of the desired version actually exists before attempting re-elections.

private void ensureClusterManagerVersion(boolean newVersion) throws Exception {
    String versionLabel = newVersion ? "new" : "old";
    long deadline = System.nanoTime() + TimeUnit.MINUTES.toNanos(1);
    int attempt = 0;
    while (isClusterManagerOnNewVersion() != newVersion) {
        if (System.nanoTime() > deadline) {
            fail("Failed to get cluster-manager on " + versionLabel + " version after 1 minute");
        }
        String cmName = getClusterManagerNodeName();
        logger.info("Attempt {} to get {} version CM, excluding current CM [{}]", ++attempt, versionLabel, cmName);

        Request exclude = new Request("POST", "/_cluster/voting_config_exclusions");
        exclude.addParameter("node_names", cmName);
        exclude.addParameter("timeout", "30s");
        assertOK(client().performRequest(exclude));

        // Wait for a different node to become CM
        assertBusy(() -> assertNotEquals(cmName, getClusterManagerNodeName()));

        // Clear exclusion immediately so the node stays in the cluster
        clearVotingConfigExclusions();
    }
    logger.info("Cluster manager is on {} version: [{}]", versionLabel, getClusterManagerNodeName());
}
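One way to address the "verify a node of the desired version actually exists" point is a pure pre-check over the already-parsed `_nodes` response, called once before entering the re-election loop. This is a minimal sketch, not the PR's code; `VersionGuard` and `anyNodeMatches` are hypothetical names, and the version predicate stands in for the real `Version.after(UPGRADE_FROM_VERSION)` comparison:

```java
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical guard: before looping on re-elections, confirm the parsed
// _nodes response contains at least one node whose version matches the
// desired predicate, so an impossible goal fails fast with a clear message.
final class VersionGuard {
    private VersionGuard() {}

    @SuppressWarnings("unchecked")
    static boolean anyNodeMatches(Map<String, Object> nodesInfo, Predicate<String> versionMatches) {
        Map<String, Object> nodes = (Map<String, Object>) nodesInfo.get("nodes");
        // Each entry under "nodes" is a per-node map carrying a "version" field.
        return nodes.values().stream()
            .map(n -> (String) ((Map<String, Object>) n).get("version"))
            .anyMatch(versionMatches);
    }
}
```

The caller would `fail(...)` immediately when this returns false, instead of spinning until the 1-minute deadline.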
Duplicate API Calls

Both isClusterManagerOnNewVersion() and getClusterManagerNodeName() independently call _cluster/state and _nodes, resulting in duplicate HTTP requests. These could be consolidated into a single helper that returns a structured object, reducing overhead especially inside the polling loop of ensureClusterManagerVersion.

private boolean isClusterManagerOnNewVersion() throws IOException {
    Map<String, Object> clusterState = entityAsMap(client().performRequest(new Request("GET", "_cluster/state")));
    String clusterManagerNodeId = (String) clusterState.get("master_node");

    Map<String, Object> nodesInfo = entityAsMap(client().performRequest(new Request("GET", "_nodes")));
    Map<String, Object> nodes = (Map<String, Object>) nodesInfo.get("nodes");
    Map<String, Object> cmNode = (Map<String, Object>) nodes.get(clusterManagerNodeId);
    Version cmVersion = Version.fromString((String) cmNode.get("version"));
    return cmVersion.after(UPGRADE_FROM_VERSION);
}

/**
 * Ensures the cluster-manager is on the desired version by repeatedly excluding the current CM
 * to trigger re-elections until a node of the desired version wins.
 */
private void ensureClusterManagerVersion(boolean newVersion) throws Exception {
    String versionLabel = newVersion ? "new" : "old";
    long deadline = System.nanoTime() + TimeUnit.MINUTES.toNanos(1);
    int attempt = 0;
    while (isClusterManagerOnNewVersion() != newVersion) {
        if (System.nanoTime() > deadline) {
            fail("Failed to get cluster-manager on " + versionLabel + " version after 1 minute");
        }
        String cmName = getClusterManagerNodeName();
        logger.info("Attempt {} to get {} version CM, excluding current CM [{}]", ++attempt, versionLabel, cmName);

        Request exclude = new Request("POST", "/_cluster/voting_config_exclusions");
        exclude.addParameter("node_names", cmName);
        exclude.addParameter("timeout", "30s");
        assertOK(client().performRequest(exclude));

        // Wait for a different node to become CM
        assertBusy(() -> assertNotEquals(cmName, getClusterManagerNodeName()));

        // Clear exclusion immediately so the node stays in the cluster
        clearVotingConfigExclusions();
    }
    logger.info("Cluster manager is on {} version: [{}]", versionLabel, getClusterManagerNodeName());
}

private String getClusterManagerNodeName() throws IOException {
    Map<String, Object> clusterState = entityAsMap(client().performRequest(new Request("GET", "_cluster/state")));
    String cmNodeId = (String) clusterState.get("master_node");
    Map<String, Object> nodesInfo = entityAsMap(client().performRequest(new Request("GET", "_nodes")));
    Map<String, Object> nodes = (Map<String, Object>) nodesInfo.get("nodes");
    Map<String, Object> cmNode = (Map<String, Object>) nodes.get(cmNodeId);
    return (String) cmNode.get("name");
}
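The consolidation suggested above could look like the following sketch: a small value object built from the two parsed responses, so `_cluster/state` and `_nodes` are each fetched once per call site rather than once per helper. `ClusterManagerInfo` is a hypothetical name; the map keys (`master_node`, `nodes`, `name`, `version`) mirror the real API responses used in the code above:

```java
import java.util.Map;

// Hypothetical consolidated helper: one structured result per lookup,
// replacing the separate isClusterManagerOnNewVersion() and
// getClusterManagerNodeName() round-trips inside the polling loop.
final class ClusterManagerInfo {
    final String name;
    final String version;

    ClusterManagerInfo(String name, String version) {
        this.name = name;
        this.version = version;
    }

    // Pure function over the already-parsed response maps; also guards the
    // null-node case flagged elsewhere in this review.
    @SuppressWarnings("unchecked")
    static ClusterManagerInfo from(Map<String, Object> clusterState, Map<String, Object> nodesInfo) {
        String cmNodeId = (String) clusterState.get("master_node");
        Map<String, Object> nodes = (Map<String, Object>) nodesInfo.get("nodes");
        Map<String, Object> cmNode = (Map<String, Object>) nodes.get(cmNodeId);
        if (cmNode == null) {
            throw new IllegalStateException("Cluster manager node [" + cmNodeId + "] not found in nodes info");
        }
        return new ClusterManagerInfo((String) cmNode.get("name"), (String) cmNode.get("version"));
    }
}
```

The loop would then call this once per iteration and derive both the name (for exclusion) and the version check from the same snapshot, which also removes the window where the two helpers could observe different cluster-managers.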
Race Condition

In ensureClusterManagerVersion, after assertBusy(() -> assertNotEquals(cmName, getClusterManagerNodeName())) confirms a new CM was elected and exclusions are cleared, the next call to isClusterManagerOnNewVersion() at the top of the while loop may observe a transient state. There is a potential race between clearing exclusions and the cluster stabilizing, which could cause the loop to exclude the wrong node.

while (isClusterManagerOnNewVersion() != newVersion) {
    if (System.nanoTime() > deadline) {
        fail("Failed to get cluster-manager on " + versionLabel + " version after 1 minute");
    }
    String cmName = getClusterManagerNodeName();
    logger.info("Attempt {} to get {} version CM, excluding current CM [{}]", ++attempt, versionLabel, cmName);

    Request exclude = new Request("POST", "/_cluster/voting_config_exclusions");
    exclude.addParameter("node_names", cmName);
    exclude.addParameter("timeout", "30s");
    assertOK(client().performRequest(exclude));

    // Wait for a different node to become CM
    assertBusy(() -> assertNotEquals(cmName, getClusterManagerNodeName()));

    // Clear exclusion immediately so the node stays in the cluster
    clearVotingConfigExclusions();
}
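One generic way to narrow the race window described above is to treat a polled value as settled only after it has been observed unchanged for several consecutive reads. This is an illustrative sketch under that assumption (names are hypothetical, and in the real test the supplier would wrap `getClusterManagerNodeName()`):

```java
import java.util.Objects;
import java.util.function.Supplier;

// Sketch of a "stable read" guard: a polled value counts as settled only
// once it repeats for requiredRepeats consecutive polls, reducing the
// chance of excluding a cluster-manager observed mid-transition.
final class StableRead {
    private StableRead() {}

    static <T> T awaitStable(Supplier<T> poll, int requiredRepeats, int maxPolls) {
        T last = null;
        int streak = 0;
        for (int i = 0; i < maxPolls; i++) {
            T current = poll.get();
            if (Objects.equals(current, last)) {
                streak++;
            } else {
                last = current;
                streak = 1;
            }
            if (streak >= requiredRepeats) {
                return current;
            }
        }
        throw new IllegalStateException("Value did not stabilize after " + maxPolls + " polls");
    }
}
```

In the loop, `awaitStable(this::getClusterManagerNodeName, 2, 30)` (with a small sleep between polls in practice) would replace the single read before each exclusion.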
Transient Settings Leak

makeClusterStateChange sets a transient cluster setting (cluster.routing.allocation.exclude._name) but never cleans it up. These transient settings will persist for the remainder of the test run and could interfere with subsequent test phases or other tests.

private void makeClusterStateChange(String suffix) throws IOException {
    Request putSettings = new Request("PUT", "_cluster/settings");
    putSettings.setJsonEntity(String.format(Locale.ROOT, """
        {
            "transient": {
                "cluster.routing.allocation.exclude._name": "nonexistent_node_%s"
            }
        }""", suffix));
    assertOK(client().performRequest(putSettings));
}

@github-actions
Contributor

github-actions bot commented Mar 18, 2026

PR Code Suggestions ✨

Latest suggestions up to f1489d0
Explore these optional code suggestions:

Category | Suggestion | Impact
Possible issue
Ensure voting exclusions are always cleared

If clearVotingConfigExclusions() throws an exception after a voting exclusion is
added, the excluded node will remain excluded permanently, potentially leaving the
cluster in a broken state for subsequent test steps. The exclusion cleanup should be
placed in a try/finally block to guarantee it always runs.

qa/rolling-upgrade/src/test/java/org/opensearch/upgrades/RemotePublicationClusterStateIT.java [126-143]

 while (isClusterManagerOnNewVersion() != newVersion) {
         if (System.nanoTime() > deadline) {
             fail("Failed to get cluster-manager on " + versionLabel + " version after 1 minute");
         }
         String cmName = getClusterManagerNodeName();
         logger.info("Attempt {} to get {} version CM, excluding current CM [{}]", ++attempt, versionLabel, cmName);
 
         Request exclude = new Request("POST", "/_cluster/voting_config_exclusions");
         exclude.addParameter("node_names", cmName);
         exclude.addParameter("timeout", "30s");
         assertOK(client().performRequest(exclude));
 
-        // Wait for a different node to become CM
-        assertBusy(() -> assertNotEquals(cmName, getClusterManagerNodeName()));
-
-        // Clear exclusion immediately so the node stays in the cluster
-        clearVotingConfigExclusions();
+        try {
+            // Wait for a different node to become CM
+            assertBusy(() -> assertNotEquals(cmName, getClusterManagerNodeName()));
+        } finally {
+            // Clear exclusion so the node stays in the cluster
+            clearVotingConfigExclusions();
+        }
     }
Suggestion importance[1-10]: 6


Why: Valid defensive programming suggestion - wrapping assertBusy in a try/finally ensures clearVotingConfigExclusions() is always called even if the assertion fails, preventing the cluster from being left in a broken state. This is a real concern in test infrastructure.

Low
Guard against null cluster-manager node lookup

If clusterManagerNodeId is not found in the nodes map (e.g., due to a race condition
between fetching cluster state and nodes info), cmNode will be null, causing a
NullPointerException. A null check should be added before accessing cmNode.

qa/rolling-upgrade/src/test/java/org/opensearch/upgrades/RemotePublicationClusterStateIT.java [113-115]

 Map<String, Object> cmNode = (Map<String, Object>) nodes.get(clusterManagerNodeId);
+    if (cmNode == null) {
+        throw new IOException("Cluster manager node [" + clusterManagerNodeId + "] not found in nodes info");
+    }
     Version cmVersion = Version.fromString((String) cmNode.get("version"));
     return cmVersion.after(UPGRADE_FROM_VERSION);
Suggestion importance[1-10]: 5


Why: A valid defensive check - cmNode could be null due to a race condition between fetching cluster state and nodes info, which would cause a NullPointerException. Adding a null check with a meaningful error message improves robustness and debuggability.

Low
General
Clean up transient cluster settings after test

The transient routing exclusion settings set by makeClusterStateChange accumulate
across calls and are never cleaned up. After the test,
cluster.routing.allocation.exclude._name will still be set to
"nonexistent_node_new_cm", which could interfere with subsequent test phases or
other tests. The transient setting should be cleared after each use or at the end of
the block.

qa/rolling-upgrade/src/test/java/org/opensearch/upgrades/RemotePublicationClusterStateIT.java [79-87]

 ensureClusterManagerVersion(false);
         makeClusterStateChange("old_cm");
         ensureAllNodesHealthy();
         verifyClusterState();
 
         ensureClusterManagerVersion(true);
         makeClusterStateChange("new_cm");
         ensureAllNodesHealthy();
         verifyClusterState();
 
+        // Clean up transient routing exclusion settings
+        Request clearSettings = new Request("PUT", "_cluster/settings");
+        clearSettings.setJsonEntity("""
+            {
+                "transient": {
+                    "cluster.routing.allocation.exclude._name": null
+                }
+            }""");
+        assertOK(client().performRequest(clearSettings));
+
Suggestion importance[1-10]: 5


Why: The transient cluster.routing.allocation.exclude._name setting set by makeClusterStateChange is never cleared, which could interfere with subsequent test phases. Cleaning up the setting after use is good test hygiene and prevents potential test pollution.

Low

Previous suggestions

Suggestions up to commit 892518d
Category | Suggestion | Impact
Possible issue
Ensure voting exclusions are always cleared

If clearVotingConfigExclusions() throws an exception after a voting exclusion is
added, the excluded node will remain excluded permanently, potentially leaving the
cluster in a broken state for subsequent test steps. The exclusion cleanup should be
placed in a try/finally block to guarantee it always runs.

qa/rolling-upgrade/src/test/java/org/opensearch/upgrades/RemotePublicationClusterStateIT.java [126-143]

 while (isClusterManagerOnNewVersion() != newVersion) {
         if (System.nanoTime() > deadline) {
             fail("Failed to get cluster-manager on " + versionLabel + " version after 1 minute");
         }
         String cmName = getClusterManagerNodeName();
         logger.info("Attempt {} to get {} version CM, excluding current CM [{}]", ++attempt, versionLabel, cmName);
 
         Request exclude = new Request("POST", "/_cluster/voting_config_exclusions");
         exclude.addParameter("node_names", cmName);
         exclude.addParameter("timeout", "30s");
         assertOK(client().performRequest(exclude));
 
-        // Wait for a different node to become CM
-        assertBusy(() -> assertNotEquals(cmName, getClusterManagerNodeName()));
-
-        // Clear exclusion immediately so the node stays in the cluster
-        clearVotingConfigExclusions();
+        try {
+            // Wait for a different node to become CM
+            assertBusy(() -> assertNotEquals(cmName, getClusterManagerNodeName()));
+        } finally {
+            // Clear exclusion so the node stays in the cluster
+            clearVotingConfigExclusions();
+        }
     }
Suggestion importance[1-10]: 7


Why: This is a valid defensive improvement - if assertBusy throws, the voting exclusion would remain, potentially breaking subsequent test steps. Using try/finally ensures cleanup always happens.

Medium
Guard against null cluster-manager node lookup

If cmNodeId is not found in the nodes map (e.g., due to a race condition where the
CM changes between the two API calls), cmNode will be null and cmNode.get("name")
will throw a NullPointerException. A null check with a meaningful error message
should be added.

qa/rolling-upgrade/src/test/java/org/opensearch/upgrades/RemotePublicationClusterStateIT.java [152-153]

 Map<String, Object> cmNode = (Map<String, Object>) nodes.get(cmNodeId);
+    if (cmNode == null) {
+        throw new IllegalStateException("Cluster manager node [" + cmNodeId + "] not found in nodes info response");
+    }
     return (String) cmNode.get("name");
Suggestion importance[1-10]: 5


Why: A race condition between the two API calls could result in cmNode being null, causing a NullPointerException. Adding a null check with a meaningful error message improves robustness and debuggability.

Low
General
Clean up transient settings after use

The method accumulates transient cluster.routing.allocation.exclude._name settings
across calls (once with "old_cm" and once with "new_cm"), but never cleans them up.
These leftover transient settings could interfere with subsequent test phases or
other tests. The setting should be cleared after each use or at the end of the test.

qa/rolling-upgrade/src/test/java/org/opensearch/upgrades/RemotePublicationClusterStateIT.java [160-169]

 private void makeClusterStateChange(String suffix) throws IOException {
     Request putSettings = new Request("PUT", "_cluster/settings");
     putSettings.setJsonEntity(String.format(Locale.ROOT, """
         {
             "transient": {
                 "cluster.routing.allocation.exclude._name": "nonexistent_node_%s"
             }
         }""", suffix));
     assertOK(client().performRequest(putSettings));
+
+    // Clean up the transient setting after the state change is published
+    Request clearSettings = new Request("PUT", "_cluster/settings");
+    clearSettings.setJsonEntity("""
+        {
+            "transient": {
+                "cluster.routing.allocation.exclude._name": null
+            }
+        }""");
+    assertOK(client().performRequest(clearSettings));
 }
Suggestion importance[1-10]: 6


Why: The transient cluster.routing.allocation.exclude._name settings accumulate across calls and are never cleaned up, which could interfere with ensureAllNodesHealthy() or subsequent test phases. Cleaning up after each call is a valid improvement.

Low

@github-actions
Contributor

❌ Gradle check result for 892518d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

The RemotePublicationClusterStateIT test was not deterministically
exercising the case where a new-version cluster-manager writes remote
cluster state that old-version nodes must deserialize. This meant
backwards-incompatible serialization changes could be merged without
being caught, since the test would pass whenever an old-version node
happened to win the cluster-manager election.

In the mixed-version (one-third upgraded) phase, the test now
explicitly tests both directions:
1. Old CM writes state, new nodes read from remote store
2. New CM writes state, old nodes read from remote store

To force a specific version to be cluster-manager, the test uses the
voting config exclusions API to repeatedly exclude the current CM
until a node of the desired version wins the election. Exclusions are
cleared immediately after each re-election so no node leaves the
cluster.

Also changes the remote test cluster's dependency on the non-remote
test cluster from dependsOn to mustRunAfter, so the remote tests can
be run independently without first running the full non-remote suite.

Signed-off-by: Andrew Ross <andrross@amazon.com>
@andrross andrross force-pushed the RemotePublicationClusterStateIT-deterministic-failure branch from 892518d to f1489d0 on March 19, 2026 at 16:29
@github-actions
Contributor

Persistent review updated to latest commit f1489d0

@github-actions
Contributor

❌ Gradle check result for f1489d0: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@andrross
Member Author

Closing as forward compatibility is explicitly not supported here. See #20948

@andrross andrross closed this Mar 25, 2026