
Fix flaky ShardsLimitAllocationDeciderIT#20856

Merged
andrross merged 1 commit into opensearch-project:main from andrross:shards-tight-capacity-test
Mar 13, 2026

Conversation

@andrross
Member

RemoteShardsBalancer.balance() only rebalances by primary shard count, not total shard count. In tight-capacity scenarios where cluster-wide shard limits leave minimal spare slots, this prevents the balancer from redistributing replicas to free space on other nodes, leaving assignable shards unassigned. I believe this is the cause of the flakiness in ShardsLimitAllocationDeciderIT.

This change introduces a unit test that deliberately targets the tightly packed scenario. The RemoteShardsBalancer variant of the new test will reliably fail if run repeatedly, though there is still some non-determinism. Regardless, it is muted along with the existing integration test. If/when we improve the intelligence of RemoteShardsBalancer then we can unmute these tests. For now the tests exist to document this known limitation.
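To make the failure mode concrete, here is a minimal, dependency-free sketch of the tight-capacity scenario. This is not OpenSearch code: the node names, the `CAP` constant, and the `p*`/`r*` shard naming are all invented for illustration. It models two nodes with a per-node shard cap of 3; node A has a free slot but already holds the replica's primary, and node B is full, so the replica stays unassigned. A balancer that moves only primaries cannot help (the only destination for a primary, B, is full), while moving a replica off B frees the needed slot:

```java
import java.util.*;

/** Toy model of shard placement under a per-node shard limit. */
public class TightCapacitySketch {
    static final int CAP = 3; // hypothetical per-node total-shard limit

    /** Returns a node that can host the replica ("rN"), or null if none can. */
    static String unassignedReplicaFits(Map<String, List<String>> nodes, String replica) {
        String primary = "p" + replica.substring(1); // "r1" -> its primary "p1"
        for (Map.Entry<String, List<String>> e : nodes.entrySet()) {
            List<String> shards = e.getValue();
            // Eligible: has a free slot and does not already hold the primary copy.
            if (shards.size() < CAP && !shards.contains(primary)) {
                return e.getKey();
            }
        }
        return null; // stuck: every eligible node is at its shard limit
    }

    public static void main(String[] args) {
        Map<String, List<String>> nodes = new LinkedHashMap<>();
        nodes.put("A", new ArrayList<>(List.of("p0", "p1")));       // 2/3 slots used
        nodes.put("B", new ArrayList<>(List.of("p2", "r0", "r2"))); // 3/3 slots used
        // r1 is unassigned: A has room but holds p1; B is full.
        System.out.println("r1 fits on: " + unassignedReplicaFits(nodes, "r1")); // null

        // A primary-only balancer is stuck too: moving p0 or p1 to B fails (B is full).
        // A replica-aware balancer can move r2 from B to A, freeing a slot on B:
        nodes.get("B").remove("r2");
        nodes.get("A").add("r2");
        System.out.println("after moving r2: r1 fits on: " + unassignedReplicaFits(nodes, "r1")); // B
    }
}
```

The sketch only illustrates the shape of the deadlock the PR description points at; the real decision logic lives in ShardsLimitAllocationDecider and RemoteShardsBalancer.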

Related Issues

Resolves #19726

Check List

  • Functionality includes testing.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@andrross andrross requested a review from a team as a code owner March 12, 2026 20:22
@github-actions github-actions bot added the >test-failure (Test failure from CI, local build, etc.), autocut, and flaky-test (Random test failure that succeeds on second run) labels Mar 12, 2026
@github-actions
Contributor

github-actions bot commented Mar 12, 2026

PR Reviewer Guide 🔍

(Review updated until commit 6e0a3dc)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Mute flaky integration test for remote/warm scenario

Relevant files:

  • server/src/internalClusterTest/java/org/opensearch/cluster/routing/allocation/decider/ShardsLimitAllocationDeciderIT.java

Sub-PR theme: Add unit tests documenting tight-capacity balancer limitation

Relevant files:

  • server/src/test/java/org/opensearch/cluster/routing/allocation/ShardsBalancerTightCapacityTests.java

⚡ Recommended focus areas for review

Convergence Limit

The allocateToConvergence method caps iterations at 10. If the allocation scenario requires more than 10 rounds to converge, the assertion assertTrue("Expected no shards to be INITIALIZING", ...) will fail with a misleading message rather than indicating a convergence timeout. Consider adding a more descriptive failure message or increasing the iteration cap if needed.

private ClusterState allocateToConvergence(AllocationService service, ClusterState clusterState) {
    clusterState = service.reroute(clusterState, "reroute");
    int iterations = 0;
    while (clusterState.getRoutingNodes().shardsWithState(INITIALIZING).isEmpty() == false && iterations < 10) {
        clusterState = startInitializingShardsAndReroute(service, clusterState);
        iterations++;
    }
    assertTrue("Expected no shards to be INITIALIZING", clusterState.getRoutingNodes().shardsWithState(INITIALIZING).isEmpty());
    return clusterState;
}
Remote Index Type

Using IndexModule.Type.REMOTE_SNAPSHOT as the store type for remote nodes may not accurately represent a typical remote-capable (warm) index scenario. REMOTE_SNAPSHOT is typically for searchable snapshots, not general remote/warm indices. This could cause the test to not correctly exercise the RemoteShardsBalancer path as intended.

if (remote) {
    sb.put(IndexModule.INDEX_STORE_TYPE_SETTING.getKey(), IndexModule.Type.REMOTE_SNAPSHOT.getSettingsKey());
}

@andrross
Member Author

FYI @Gagan6164 @gbbafna

I believe the flakiness here was introduced by #19532 due to the fact that RemoteShardsBalancer is not as sophisticated as LocalShardsBalancer. See my findings here: #19726 (comment)

@github-actions
Contributor

github-actions bot commented Mar 12, 2026

PR Code Suggestions ✨

Latest suggestions up to 6e0a3dc

Explore these optional code suggestions:

Category: General
Improve convergence failure message clarity

If the loop exits due to hitting the iteration limit (10) rather than convergence,
the assertTrue will fail with a misleading message. Consider adding a guard or a
more descriptive failure message that indicates the iteration limit was reached, so
it's clear the balancer failed to converge rather than an unexpected state occurred.

server/src/test/java/org/opensearch/cluster/routing/allocation/ShardsBalancerTightCapacityTests.java [122-126]

 while (clusterState.getRoutingNodes().shardsWithState(INITIALIZING).isEmpty() == false && iterations < 10) {
     clusterState = startInitializingShardsAndReroute(service, clusterState);
     iterations++;
 }
-assertTrue("Expected no shards to be INITIALIZING", clusterState.getRoutingNodes().shardsWithState(INITIALIZING).isEmpty());
+assertTrue(
+    "Expected no shards to be INITIALIZING after " + iterations + " iterations (max 10); balancer may not have converged",
+    clusterState.getRoutingNodes().shardsWithState(INITIALIZING).isEmpty()
+);
Suggestion importance[1-10]: 4


Why: The suggestion improves the assertion failure message to include iteration count, making it clearer when the balancer fails to converge vs. an unexpected state. This is a minor readability/debuggability improvement with limited functional impact.

Impact: Low
Avoid mutable builder reuse risk in index settings

The indexSettings(boolean remote) method returns a Settings.Builder, and calling
.put(indexLimitKey, 1) on it mutates the builder in place. Since indexSettings
creates a new builder each time, this is safe here, but the method signature
returning Settings.Builder instead of Settings is unconventional and could lead to
accidental mutation if reused. Consider returning Settings from indexSettings and
using Settings.builder().put(...).put(indexLimitKey, 1).build() at the call site for
clarity.

server/src/test/java/org/opensearch/cluster/routing/allocation/ShardsBalancerTightCapacityTests.java [80]

-mb.put(IndexMetadata.builder("test1").settings(indexSettings(remote).put(indexLimitKey, 1)).numberOfShards(3).numberOfReplicas(1));
+mb.put(IndexMetadata.builder("test1")
+    .settings(Settings.builder().put(indexSettings(remote)).put(indexLimitKey, 1))
+    .numberOfShards(3).numberOfReplicas(1));
Suggestion importance[1-10]: 2


Why: The concern about mutable builder reuse is largely theoretical since indexSettings creates a new builder each call. The improved_code also changes the method signature indirectly without showing the full refactor, making this suggestion incomplete and of marginal value.

Impact: Low

Previous suggestions

Suggestions up to commit 4e26181
Category: Possible issue
Assert convergence after iteration limit

The iteration limit of 10 may be too low for complex allocation scenarios and could
cause the test to return a partially-converged state without any indication. If the
loop exits due to hitting the limit rather than full convergence, the test may
produce misleading results. Consider adding an assertion or a warning after the loop
to detect this case.

server/src/test/java/org/opensearch/cluster/routing/allocation/ShardsBalancerTightCapacityTests.java [122-125]

 while (clusterState.getRoutingNodes().shardsWithState(INITIALIZING).isEmpty() == false && iterations < 10) {
     clusterState = startInitializingShardsAndReroute(service, clusterState);
     iterations++;
 }
+assertTrue("Allocation did not converge within iteration limit", clusterState.getRoutingNodes().shardsWithState(INITIALIZING).isEmpty());
Suggestion importance[1-10]: 5


Why: Adding an assertion after the loop is a valid improvement to detect non-convergence, but the testTightCapacityConvergenceWithRemoteShards test is explicitly marked @AwaitsFix and expected to not converge, so adding this assertion would break that test. The suggestion is only applicable to the local shards test path.

Impact: Low
Suggestions up to commit e813254
Category: Possible issue
Guard against incomplete allocation convergence

The iteration limit of 10 may be too low for complex allocation scenarios, causing
the test to return a partially converged state without any indication that
convergence was not reached. Consider adding an assertion or a warning after the
loop to detect when the limit is hit, or increase the limit to ensure full
convergence.

server/src/test/java/org/opensearch/cluster/routing/allocation/ShardsBalancerTightCapacityTests.java [129-132]

-while (clusterState.getRoutingNodes().shardsWithState(INITIALIZING).isEmpty() == false && iterations < 10) {
+while (clusterState.getRoutingNodes().shardsWithState(INITIALIZING).isEmpty() == false && iterations < 100) {
     clusterState = startInitializingShardsAndReroute(service, clusterState);
     iterations++;
 }
+assertFalse("Allocation did not converge within iteration limit",
+    clusterState.getRoutingNodes().shardsWithState(INITIALIZING).isEmpty() == false);
Suggestion importance[1-10]: 5


Why: Adding a convergence check after the loop is a valid improvement to detect when the iteration limit is hit, but the current limit of 10 is likely sufficient for the described scenario. The suggestion to increase to 100 and add an assertion improves test robustness, though the impact is moderate.

Impact: Low
Return built Settings instead of builder

The method returns a Settings.Builder instead of a built Settings object, which
means callers must chain .build() or additional .put() calls. However, when used in
IndexMetadata.builder(...).settings(...), if the method signature expects a Settings
object, this could cause a compile error or unintended behavior. Verify that the
settings() method on IndexMetadata.Builder accepts a Settings.Builder, otherwise
return sb.build().

server/src/test/java/org/opensearch/cluster/routing/allocation/ShardsBalancerTightCapacityTests.java [118-124]

-private Settings.Builder indexSettings(boolean remote) {
+private Settings indexSettings(boolean remote) {
     final Settings.Builder sb = settings(Version.CURRENT);
     if (remote) {
         sb.put(IndexModule.INDEX_STORE_TYPE_SETTING.getKey(), IndexModule.Type.REMOTE_SNAPSHOT.getSettingsKey());
     }
-    return sb;
+    return sb.build();
 }
Suggestion importance[1-10]: 4


Why: The suggestion raises a valid concern about returning Settings.Builder vs Settings, but looking at the usage in lines 81-87, the callers chain .put(indexLimitKey, 1) on the returned builder before passing to .settings(), so returning a builder is intentional and necessary. The suggestion would break the existing chaining pattern.

Impact: Low

@github-actions
Contributor

❌ Gradle check result for e813254: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@andrross andrross force-pushed the shards-tight-capacity-test branch from e813254 to 4e26181 on March 12, 2026 21:21
@github-actions
Contributor

Persistent review updated to latest commit 4e26181

Signed-off-by: Andrew Ross <andrross@amazon.com>
@andrross andrross force-pushed the shards-tight-capacity-test branch from 4e26181 to 6e0a3dc on March 12, 2026 21:23
@github-actions
Contributor

Persistent review updated to latest commit 6e0a3dc

@github-actions
Contributor

❌ Gradle check result for 6e0a3dc: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


@github-actions
Contributor

✅ Gradle check result for 6e0a3dc: SUCCESS

@codecov

codecov bot commented Mar 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.27%. Comparing base (4b9021f) to head (6e0a3dc).
⚠️ Report is 12 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #20856      +/-   ##
============================================
- Coverage     73.32%   73.27%   -0.06%     
+ Complexity    72267    72230      -37     
============================================
  Files          5795     5795              
  Lines        330056   330057       +1     
  Branches      47643    47643              
============================================
- Hits         242030   241859     -171     
- Misses        68584    68777     +193     
+ Partials      19442    19421      -21     


Member

@cwperks cwperks left a comment


Nice to see a unit test setup for this. This looks like a test we can parameterize in the future to test more scenarios more extensively.

For my own knowledge, I see that we only have the warm node role when remote is set to true. Is that because it only makes sense in that context? Any harm in having the list of node roles be identical for both cases but it's a noop when remote is false?

@andrross andrross merged commit 4d6ccf1 into opensearch-project:main Mar 13, 2026
41 of 45 checks passed
@andrross andrross deleted the shards-tight-capacity-test branch March 13, 2026 20:23
@andrross
Member Author

Any harm in having the list of node roles be identical for both cases but it's a noop when remote is false?

That may be possible. The key difference is that a warm node is configured to use a portion of its disk to swap data from remote, which is a requirement to host the remote shards.

shayush622 pushed a commit to shayush622/OpenSearch that referenced this pull request Mar 16, 2026
Signed-off-by: Andrew Ross <andrross@amazon.com>

Labels

autocut, flaky-test (Random test failure that succeeds on second run), skip-changelog, >test-failure (Test failure from CI, local build, etc.)


Development

Successfully merging this pull request may close these issues.

[AUTOCUT] Gradle Check Flaky Test Report for ShardsLimitAllocationDeciderIT
