Fix flaky ShardsLimitAllocationDeciderIT#20856
andrross merged 1 commit into opensearch-project:main
Conversation
FYI @Gagan6164 @gbbafna I believe the flakiness here was introduced by #19532 due to the fact that RemoteShardsBalancer is not as sophisticated as LocalShardsBalancer. See my findings here: #19726 (comment)
❌ Gradle check result for e813254: FAILURE
Force-pushed from e813254 to 4e26181
Persistent review updated to latest commit 4e26181
RemoteShardsBalancer.balance() only rebalances by primary shard count, not total shard count. In tight-capacity scenarios where cluster-wide shard limits leave minimal spare slots, this prevents the balancer from redistributing replicas to free space on other nodes, leaving assignable shards unassigned. I believe this is the cause of the flakiness in ShardsLimitAllocationDeciderIT.

This change introduces a unit test that deliberately targets the tightly packed scenario. The RemoteShardsBalancer variant of the new test will reliably fail if run repeatedly, though there is still some non-determinism. Regardless, it is muted along with the existing integration test. If/when we improve the intelligence of RemoteShardsBalancer then we can unmute these tests. For now the tests exist to document this known limitation.

Signed-off-by: Andrew Ross <andrross@amazon.com>
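To make the failure mode concrete, here is a small, hypothetical simulation (not OpenSearch code; the node names, shard tuples, and per-node limit are invented for illustration). It shows how a cluster that is "balanced" by primary count can still strand an assignable replica under a total-shards-per-node limit, while a balancer that considers total shard count would free a slot:

```python
# Hypothetical sketch of the limitation described above: a balancer that
# only equalizes primary counts can leave a replica unassigned in a
# tightly packed cluster, where a total-count balancer would not.

TOTAL_SHARDS_PER_NODE = 2  # invented tight per-node limit

# Packed state: 3 indices, each with 1 primary ("P") and 1 replica ("R").
nodes = {
    "A": [("P", 1), ("R", 2)],  # full
    "B": [("P", 2), ("R", 1)],  # full
    "C": [("P", 3)],            # one free slot, but holds P3
}

def can_assign(node, shard):
    """Limit decider plus same-shard-copy rule, simplified."""
    if len(nodes[node]) >= TOTAL_SHARDS_PER_NODE:
        return False
    # A replica may not share a node with another copy of the same shard.
    return all(idx != shard[1] for _, idx in nodes[node])

def try_assign(shard):
    for node in nodes:
        if can_assign(node, shard):
            nodes[node].append(shard)
            return True
    return False

# R3 is unassignable: A and B are full, and C already holds P3.
assert not try_assign(("R", 3))

# A primary-count-only balancer sees one primary per node and considers
# the cluster balanced, so it moves nothing and R3 stays unassigned.
primaries = {n: sum(1 for kind, _ in s if kind == "P") for n, s in nodes.items()}
assert len(set(primaries.values())) == 1  # "balanced" by primary count

# A total-shard-count balancer would notice A (2 shards) vs C (1 shard),
# move R2 from A to C, and free a slot on A for R3.
nodes["A"].remove(("R", 2))
nodes["C"].append(("R", 2))
assert try_assign(("R", 3))
print("R3 assigned after total-count rebalance:", ("R", 3) in nodes["A"])
```

Under these invented constraints the primary-count view reports the cluster as balanced even though one node is full and another has spare capacity, which mirrors the unassigned-replica symptom the muted tests document.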
Force-pushed from 4e26181 to 6e0a3dc
Persistent review updated to latest commit 6e0a3dc
❌ Gradle check result for 6e0a3dc: FAILURE
Codecov Report ✅ All modified and coverable lines are covered by tests.

```
@@             Coverage Diff              @@
##               main    #20856     +/-  ##
============================================
- Coverage     73.32%    73.27%   -0.06%
+ Complexity    72267     72230      -37
============================================
  Files          5795      5795
  Lines        330056    330057       +1
  Branches      47643     47643
============================================
- Hits         242030    241859     -171
- Misses        68584     68777     +193
+ Partials      19442     19421      -21
```

View full report in Codecov by Sentry.
Nice to see a unit test setup for this. This looks like a test we can parameterize in the future to test more scenarios more extensively.
For my own knowledge, I see that we only have the warm node role when remote is set to true. Is that because it only makes sense in that context? Any harm in having the list of node roles be identical for both cases but it's a noop when remote is false?
That may be possible. The key difference is that a warm node is configured to use a portion of its disk to swap data from remote, which is a requirement to host the remote shards.
Related Issues
Resolves #19726
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.