
Test remote state BWC in both CM version directions #20912

Closed
andrross wants to merge 1 commit into opensearch-project:main from andrross:RemotePublicationClusterStateIT-deterministic-failure

Conversation

@andrross
Member

The RemotePublicationClusterStateIT test was not deterministically exercising the case where a new-version cluster-manager writes remote cluster state that old-version nodes must deserialize. This meant backwards-incompatible serialization changes could be merged without being caught, since the test would pass whenever an old-version node happened to win the cluster-manager election.

In the mixed-version (one-third upgraded) phase, the test now explicitly tests both directions:

  1. Old CM writes state, new nodes read from remote store
  2. New CM writes state, old nodes read from remote store

To force a specific version to be cluster-manager, the test uses the voting config exclusions API to repeatedly exclude the current CM until a node of the desired version wins the election. Exclusions are cleared immediately after each re-election so no node leaves the cluster.

Also changes the remote test cluster's dependency on the non-remote test cluster from dependsOn to mustRunAfter, so the remote tests can be run independently without first running the full non-remote suite.
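The build-script change described above can be sketched roughly as follows. This is a hypothetical fragment (the actual task names in the OpenSearch build are assumed, not taken from the PR); the point is that mustRunAfter only imposes ordering when both tasks are requested, whereas dependsOn always forces the non-remote suite to run first:

```groovy
// Hypothetical Gradle fragment; task names are assumptions for illustration.
tasks.named("remoteClusterTest") {
    // before: dependsOn "nonRemoteClusterTest"  (always runs the non-remote suite)
    // after: only orders the tasks if both are on the task graph
    mustRunAfter "nonRemoteClusterTest"
}
```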

Related Issues

Related to #20910

Check List

  • Functionality includes testing.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Contributor

github-actions bot commented Mar 18, 2026

PR Reviewer Guide 🔍

(Review updated until commit f1489d0)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Infinite Loop Risk

In ensureClusterManagerVersion, if the cluster only has nodes of one version (e.g., all old or all new), the while loop will spin until the 1-minute deadline. In a mixed cluster with only one node of the desired version, repeated exclusions may not reliably elect that node. The logic should also verify that a node of the desired version actually exists before attempting re-elections.

private void ensureClusterManagerVersion(boolean newVersion) throws Exception {
    String versionLabel = newVersion ? "new" : "old";
    long deadline = System.nanoTime() + TimeUnit.MINUTES.toNanos(1);
    int attempt = 0;
    while (isClusterManagerOnNewVersion() != newVersion) {
        if (System.nanoTime() > deadline) {
            fail("Failed to get cluster-manager on " + versionLabel + " version after 1 minute");
        }
        String cmName = getClusterManagerNodeName();
        logger.info("Attempt {} to get {} version CM, excluding current CM [{}]", ++attempt, versionLabel, cmName);

        Request exclude = new Request("POST", "/_cluster/voting_config_exclusions");
        exclude.addParameter("node_names", cmName);
        exclude.addParameter("timeout", "30s");
        assertOK(client().performRequest(exclude));

        // Wait for a different node to become CM
        assertBusy(() -> assertNotEquals(cmName, getClusterManagerNodeName()));

        // Clear exclusion immediately so the node stays in the cluster
        clearVotingConfigExclusions();
    }
    logger.info("Cluster manager is on {} version: [{}]", versionLabel, getClusterManagerNodeName());
}
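One way to address the "verify a node of the desired version actually exists" point is a pure pre-check over the already-parsed `_nodes` response, called once before entering the re-election loop. This is a minimal sketch, not the PR's code; `VersionGuard` and `anyNodeMatches` are hypothetical names, and the version predicate stands in for the real `Version.after(UPGRADE_FROM_VERSION)` comparison:

```java
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical guard: before looping on re-elections, confirm the parsed
// _nodes response contains at least one node whose version matches the
// desired predicate, so an impossible goal fails fast with a clear message.
final class VersionGuard {
    private VersionGuard() {}

    @SuppressWarnings("unchecked")
    static boolean anyNodeMatches(Map<String, Object> nodesInfo, Predicate<String> versionMatches) {
        Map<String, Object> nodes = (Map<String, Object>) nodesInfo.get("nodes");
        // Each entry under "nodes" is a per-node map carrying a "version" field.
        return nodes.values().stream()
            .map(n -> (String) ((Map<String, Object>) n).get("version"))
            .anyMatch(versionMatches);
    }
}
```

The caller would `fail(...)` immediately when this returns false, instead of spinning until the 1-minute deadline.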
Duplicate API Calls

Both isClusterManagerOnNewVersion() and getClusterManagerNodeName() independently call _cluster/state and _nodes, resulting in duplicate HTTP requests. These could be consolidated into a single helper that returns a structured object, reducing overhead especially inside the polling loop of ensureClusterManagerVersion.

private boolean isClusterManagerOnNewVersion() throws IOException {
    Map<String, Object> clusterState = entityAsMap(client().performRequest(new Request("GET", "_cluster/state")));
    String clusterManagerNodeId = (String) clusterState.get("master_node");

    Map<String, Object> nodesInfo = entityAsMap(client().performRequest(new Request("GET", "_nodes")));
    Map<String, Object> nodes = (Map<String, Object>) nodesInfo.get("nodes");
    Map<String, Object> cmNode = (Map<String, Object>) nodes.get(clusterManagerNodeId);
    Version cmVersion = Version.fromString((String) cmNode.get("version"));
    return cmVersion.after(UPGRADE_FROM_VERSION);
}

/**
 * Ensures the cluster-manager is on the desired version by repeatedly excluding the current CM
 * to trigger re-elections until a node of the desired version wins.
 */
private void ensureClusterManagerVersion(boolean newVersion) throws Exception {
    String versionLabel = newVersion ? "new" : "old";
    long deadline = System.nanoTime() + TimeUnit.MINUTES.toNanos(1);
    int attempt = 0;
    while (isClusterManagerOnNewVersion() != newVersion) {
        if (System.nanoTime() > deadline) {
            fail("Failed to get cluster-manager on " + versionLabel + " version after 1 minute");
        }
        String cmName = getClusterManagerNodeName();
        logger.info("Attempt {} to get {} version CM, excluding current CM [{}]", ++attempt, versionLabel, cmName);

        Request exclude = new Request("POST", "/_cluster/voting_config_exclusions");
        exclude.addParameter("node_names", cmName);
        exclude.addParameter("timeout", "30s");
        assertOK(client().performRequest(exclude));

        // Wait for a different node to become CM
        assertBusy(() -> assertNotEquals(cmName, getClusterManagerNodeName()));

        // Clear exclusion immediately so the node stays in the cluster
        clearVotingConfigExclusions();
    }
    logger.info("Cluster manager is on {} version: [{}]", versionLabel, getClusterManagerNodeName());
}

private String getClusterManagerNodeName() throws IOException {
    Map<String, Object> clusterState = entityAsMap(client().performRequest(new Request("GET", "_cluster/state")));
    String cmNodeId = (String) clusterState.get("master_node");
    Map<String, Object> nodesInfo = entityAsMap(client().performRequest(new Request("GET", "_nodes")));
    Map<String, Object> nodes = (Map<String, Object>) nodesInfo.get("nodes");
    Map<String, Object> cmNode = (Map<String, Object>) nodes.get(cmNodeId);
    return (String) cmNode.get("name");
}
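The consolidation suggested above could look like the following sketch: a small value object built from the two parsed responses, so `_cluster/state` and `_nodes` are each fetched once per call site rather than once per helper. `ClusterManagerInfo` is a hypothetical name; the map keys (`master_node`, `nodes`, `name`, `version`) mirror the real API responses used in the code above:

```java
import java.util.Map;

// Hypothetical consolidated helper: one structured result per lookup,
// replacing the separate isClusterManagerOnNewVersion() and
// getClusterManagerNodeName() round-trips inside the polling loop.
final class ClusterManagerInfo {
    final String name;
    final String version;

    ClusterManagerInfo(String name, String version) {
        this.name = name;
        this.version = version;
    }

    // Pure function over the already-parsed response maps; also guards the
    // null-node case flagged elsewhere in this review.
    @SuppressWarnings("unchecked")
    static ClusterManagerInfo from(Map<String, Object> clusterState, Map<String, Object> nodesInfo) {
        String cmNodeId = (String) clusterState.get("master_node");
        Map<String, Object> nodes = (Map<String, Object>) nodesInfo.get("nodes");
        Map<String, Object> cmNode = (Map<String, Object>) nodes.get(cmNodeId);
        if (cmNode == null) {
            throw new IllegalStateException("Cluster manager node [" + cmNodeId + "] not found in nodes info");
        }
        return new ClusterManagerInfo((String) cmNode.get("name"), (String) cmNode.get("version"));
    }
}
```

The loop would then call this once per iteration and derive both the name (for exclusion) and the version check from the same snapshot, which also removes the window where the two helpers could observe different cluster-managers.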
Race Condition

In ensureClusterManagerVersion, after assertBusy(() -> assertNotEquals(cmName, getClusterManagerNodeName())) confirms a new CM was elected and exclusions are cleared, the next call to isClusterManagerOnNewVersion() at the top of the while loop may observe a transient state. There is a potential race between clearing exclusions and the cluster stabilizing, which could cause the loop to exclude the wrong node.

while (isClusterManagerOnNewVersion() != newVersion) {
    if (System.nanoTime() > deadline) {
        fail("Failed to get cluster-manager on " + versionLabel + " version after 1 minute");
    }
    String cmName = getClusterManagerNodeName();
    logger.info("Attempt {} to get {} version CM, excluding current CM [{}]", ++attempt, versionLabel, cmName);

    Request exclude = new Request("POST", "/_cluster/voting_config_exclusions");
    exclude.addParameter("node_names", cmName);
    exclude.addParameter("timeout", "30s");
    assertOK(client().performRequest(exclude));

    // Wait for a different node to become CM
    assertBusy(() -> assertNotEquals(cmName, getClusterManagerNodeName()));

    // Clear exclusion immediately so the node stays in the cluster
    clearVotingConfigExclusions();
}
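One generic way to narrow the race window described above is to treat a polled value as settled only after it has been observed unchanged for several consecutive reads. This is an illustrative sketch under that assumption (names are hypothetical, and in the real test the supplier would wrap `getClusterManagerNodeName()`):

```java
import java.util.Objects;
import java.util.function.Supplier;

// Sketch of a "stable read" guard: a polled value counts as settled only
// once it repeats for requiredRepeats consecutive polls, reducing the
// chance of excluding a cluster-manager observed mid-transition.
final class StableRead {
    private StableRead() {}

    static <T> T awaitStable(Supplier<T> poll, int requiredRepeats, int maxPolls) {
        T last = null;
        int streak = 0;
        for (int i = 0; i < maxPolls; i++) {
            T current = poll.get();
            if (Objects.equals(current, last)) {
                streak++;
            } else {
                last = current;
                streak = 1;
            }
            if (streak >= requiredRepeats) {
                return current;
            }
        }
        throw new IllegalStateException("Value did not stabilize after " + maxPolls + " polls");
    }
}
```

In the loop, `awaitStable(this::getClusterManagerNodeName, 2, 30)` (with a small sleep between polls in practice) would replace the single read before each exclusion.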
Transient Settings Leak

makeClusterStateChange sets a transient cluster setting (cluster.routing.allocation.exclude._name) but never cleans it up. These transient settings will persist for the remainder of the test run and could interfere with subsequent test phases or other tests.

private void makeClusterStateChange(String suffix) throws IOException {
    Request putSettings = new Request("PUT", "_cluster/settings");
    putSettings.setJsonEntity(String.format(Locale.ROOT, """
        {
            "transient": {
                "cluster.routing.allocation.exclude._name": "nonexistent_node_%s"
            }
        }""", suffix));
    assertOK(client().performRequest(putSettings));
}

@github-actions
Contributor

github-actions bot commented Mar 18, 2026

PR Code Suggestions ✨

Latest suggestions up to f1489d0
Explore these optional code suggestions:

Category | Suggestion | Impact
Possible issue
Ensure voting exclusions are always cleared

If clearVotingConfigExclusions() throws an exception after a voting exclusion is
added, the excluded node will remain excluded permanently, potentially leaving the
cluster in a broken state for subsequent test steps. The exclusion cleanup should be
placed in a try/finally block to guarantee it always runs.

qa/rolling-upgrade/src/test/java/org/opensearch/upgrades/RemotePublicationClusterStateIT.java [126-143]

 while (isClusterManagerOnNewVersion() != newVersion) {
         if (System.nanoTime() > deadline) {
             fail("Failed to get cluster-manager on " + versionLabel + " version after 1 minute");
         }
         String cmName = getClusterManagerNodeName();
         logger.info("Attempt {} to get {} version CM, excluding current CM [{}]", ++attempt, versionLabel, cmName);
 
         Request exclude = new Request("POST", "/_cluster/voting_config_exclusions");
         exclude.addParameter("node_names", cmName);
         exclude.addParameter("timeout", "30s");
         assertOK(client().performRequest(exclude));
 
-        // Wait for a different node to become CM
-        assertBusy(() -> assertNotEquals(cmName, getClusterManagerNodeName()));
-
-        // Clear exclusion immediately so the node stays in the cluster
-        clearVotingConfigExclusions();
+        try {
+            // Wait for a different node to become CM
+            assertBusy(() -> assertNotEquals(cmName, getClusterManagerNodeName()));
+        } finally {
+            // Clear exclusion so the node stays in the cluster
+            clearVotingConfigExclusions();
+        }
     }
Suggestion importance[1-10]: 6


Why: Valid defensive programming suggestion - wrapping assertBusy in a try/finally ensures clearVotingConfigExclusions() is always called even if the assertion fails, preventing the cluster from being left in a broken state. This is a real concern in test infrastructure.

Low
Guard against null cluster-manager node lookup

If clusterManagerNodeId is not found in the nodes map (e.g., due to a race condition
between fetching cluster state and nodes info), cmNode will be null, causing a
NullPointerException. A null check should be added before accessing cmNode.

qa/rolling-upgrade/src/test/java/org/opensearch/upgrades/RemotePublicationClusterStateIT.java [113-115]

 Map<String, Object> cmNode = (Map<String, Object>) nodes.get(clusterManagerNodeId);
+    if (cmNode == null) {
+        throw new IOException("Cluster manager node [" + clusterManagerNodeId + "] not found in nodes info");
+    }
     Version cmVersion = Version.fromString((String) cmNode.get("version"));
     return cmVersion.after(UPGRADE_FROM_VERSION);
Suggestion importance[1-10]: 5


Why: A valid defensive check - cmNode could be null due to a race condition between fetching cluster state and nodes info, which would cause a NullPointerException. Adding a null check with a meaningful error message improves robustness and debuggability.

Low
General
Clean up transient cluster settings after test

The transient routing exclusion settings set by makeClusterStateChange accumulate
across calls and are never cleaned up. After the test,
cluster.routing.allocation.exclude._name will still be set to
"nonexistent_node_new_cm", which could interfere with subsequent test phases or
other tests. The transient setting should be cleared after each use or at the end of
the block.

qa/rolling-upgrade/src/test/java/org/opensearch/upgrades/RemotePublicationClusterStateIT.java [79-87]

 ensureClusterManagerVersion(false);
         makeClusterStateChange("old_cm");
         ensureAllNodesHealthy();
         verifyClusterState();
 
         ensureClusterManagerVersion(true);
         makeClusterStateChange("new_cm");
         ensureAllNodesHealthy();
         verifyClusterState();
 
+        // Clean up transient routing exclusion settings
+        Request clearSettings = new Request("PUT", "_cluster/settings");
+        clearSettings.setJsonEntity("""
+            {
+                "transient": {
+                    "cluster.routing.allocation.exclude._name": null
+                }
+            }""");
+        assertOK(client().performRequest(clearSettings));
+
Suggestion importance[1-10]: 5


Why: The transient cluster.routing.allocation.exclude._name setting set by makeClusterStateChange is never cleared, which could interfere with subsequent test phases. Cleaning up the setting after use is good test hygiene and prevents potential test pollution.

Low

Previous suggestions

Suggestions up to commit 892518d
Category | Suggestion | Impact
Possible issue
Ensure voting exclusions are always cleared

If clearVotingConfigExclusions() throws an exception after a voting exclusion is
added, the excluded node will remain excluded permanently, potentially leaving the
cluster in a broken state for subsequent test steps. The exclusion cleanup should be
placed in a try/finally block to guarantee it always runs.

qa/rolling-upgrade/src/test/java/org/opensearch/upgrades/RemotePublicationClusterStateIT.java [126-143]

 while (isClusterManagerOnNewVersion() != newVersion) {
         if (System.nanoTime() > deadline) {
             fail("Failed to get cluster-manager on " + versionLabel + " version after 1 minute");
         }
         String cmName = getClusterManagerNodeName();
         logger.info("Attempt {} to get {} version CM, excluding current CM [{}]", ++attempt, versionLabel, cmName);
 
         Request exclude = new Request("POST", "/_cluster/voting_config_exclusions");
         exclude.addParameter("node_names", cmName);
         exclude.addParameter("timeout", "30s");
         assertOK(client().performRequest(exclude));
 
-        // Wait for a different node to become CM
-        assertBusy(() -> assertNotEquals(cmName, getClusterManagerNodeName()));
-
-        // Clear exclusion immediately so the node stays in the cluster
-        clearVotingConfigExclusions();
+        try {
+            // Wait for a different node to become CM
+            assertBusy(() -> assertNotEquals(cmName, getClusterManagerNodeName()));
+        } finally {
+            // Clear exclusion so the node stays in the cluster
+            clearVotingConfigExclusions();
+        }
     }
Suggestion importance[1-10]: 7


Why: This is a valid defensive improvement - if assertBusy throws, the voting exclusion would remain, potentially breaking subsequent test steps. Using try/finally ensures cleanup always happens.

Medium
Guard against null cluster-manager node lookup

If cmNodeId is not found in the nodes map (e.g., due to a race condition where the
CM changes between the two API calls), cmNode will be null and cmNode.get("name")
will throw a NullPointerException. A null check with a meaningful error message
should be added.

qa/rolling-upgrade/src/test/java/org/opensearch/upgrades/RemotePublicationClusterStateIT.java [152-153]

 Map<String, Object> cmNode = (Map<String, Object>) nodes.get(cmNodeId);
+    if (cmNode == null) {
+        throw new IllegalStateException("Cluster manager node [" + cmNodeId + "] not found in nodes info response");
+    }
     return (String) cmNode.get("name");
Suggestion importance[1-10]: 5


Why: A race condition between the two API calls could result in cmNode being null, causing a NullPointerException. Adding a null check with a meaningful error message improves robustness and debuggability.

Low
General
Clean up transient settings after use

The method accumulates transient cluster.routing.allocation.exclude._name settings
across calls (once with "old_cm" and once with "new_cm"), but never cleans them up.
These leftover transient settings could interfere with subsequent test phases or
other tests. The setting should be cleared after each use or at the end of the test.

qa/rolling-upgrade/src/test/java/org/opensearch/upgrades/RemotePublicationClusterStateIT.java [160-169]

 private void makeClusterStateChange(String suffix) throws IOException {
     Request putSettings = new Request("PUT", "_cluster/settings");
     putSettings.setJsonEntity(String.format(Locale.ROOT, """
         {
             "transient": {
                 "cluster.routing.allocation.exclude._name": "nonexistent_node_%s"
             }
         }""", suffix));
     assertOK(client().performRequest(putSettings));
+
+    // Clean up the transient setting after the state change is published
+    Request clearSettings = new Request("PUT", "_cluster/settings");
+    clearSettings.setJsonEntity("""
+        {
+            "transient": {
+                "cluster.routing.allocation.exclude._name": null
+            }
+        }""");
+    assertOK(client().performRequest(clearSettings));
 }
Suggestion importance[1-10]: 6


Why: The transient cluster.routing.allocation.exclude._name settings accumulate across calls and are never cleaned up, which could interfere with ensureAllNodesHealthy() or subsequent test phases. Cleaning up after each call is a valid improvement.

Low

@github-actions
Contributor

❌ Gradle check result for 892518d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

The RemotePublicationClusterStateIT test was not deterministically
exercising the case where a new-version cluster-manager writes remote
cluster state that old-version nodes must deserialize. This meant
backwards-incompatible serialization changes could be merged without
being caught, since the test would pass whenever an old-version node
happened to win the cluster-manager election.

In the mixed-version (one-third upgraded) phase, the test now
explicitly tests both directions:
1. Old CM writes state, new nodes read from remote store
2. New CM writes state, old nodes read from remote store

To force a specific version to be cluster-manager, the test uses the
voting config exclusions API to repeatedly exclude the current CM
until a node of the desired version wins the election. Exclusions are
cleared immediately after each re-election so no node leaves the
cluster.

Also changes the remote test cluster's dependency on the non-remote
test cluster from dependsOn to mustRunAfter, so the remote tests can
be run independently without first running the full non-remote suite.

Signed-off-by: Andrew Ross <andrross@amazon.com>
@andrross andrross force-pushed the RemotePublicationClusterStateIT-deterministic-failure branch from 892518d to f1489d0 on March 19, 2026 at 16:29
@github-actions
Contributor

Persistent review updated to latest commit f1489d0

@github-actions
Contributor

❌ Gradle check result for f1489d0: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@andrross
Member Author

Closing as forward compatibility is explicitly not supported here. See #20948

@andrross andrross closed this Mar 25, 2026