Rebalancing Fix : Handle All Nodes Disabled Case #121

Open
ngngwr wants to merge 3 commits into dev from ngangwar/rebalance_all_nodes_disabled

@ngngwr ngngwr commented Feb 16, 2026

Issues

  • My PR addresses the following Helix issues and references them in the PR description:

Description

SCENARIO

  • 3 instances
  • All instances disabled

Observation (Without Fix)

  • Last replica was not dropped from IDEALSTATE and EXTERNALVIEW

Reason:

  • The checkBestPossibleStateCalculation method treats an empty preference list as a rebalance failure. In this scenario an empty preference list is a legitimate result: all nodes are disabled, so there should be no assignment.
  • Because the check fails, no best possible state is computed for the resource (no DROPPED/OFFLINE assignments).
  • With nothing to transition, MessageGenerationPhase generates no messages.
  • No state transition messages are sent to localhost_12000.
  • The participant stays in LEADER indefinitely because it never receives a message to go to STANDBY → OFFLINE.
  • The External View (which reflects the actual current state reported by participants) keeps showing LEADER.

BEFORE

[image]

AFTER

[image]

  1. New Change (Commit 6c5f23b): Enhanced checkBestPossibleStateCalculation() to support the "all nodes disabled" scenario
    - Added hasCurrentStateForResource() helper method
    - Allows empty preference lists when existing replicas need cleanup (all nodes disabled)
    - Distinguishes between "not initialized" (reject) vs "all disabled" (allow for cleanup)
  2. Test Coverage (Commit 2cb9cae): Added comprehensive unit tests in TestBestPossibleStateCalcStage
    - Test case for all nodes disabled with current state (should allow)
    - Test case for uninitialized resource (should reject)
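
As a rough illustration of the "not initialized" vs "all disabled" distinction above, the helper could look something like the following. This is a minimal sketch, not the code from the commit: the class name, the method name hasCurrentState, the resource partition name TestDB_0, and the use of a plain nested Map as a stand-in for CurrentStateOutput are all assumptions made for the example.

```java
import java.util.Collections;
import java.util.Map;

public class CurrentStateCheckSketch {

  /**
   * Hypothetical stand-in for hasCurrentStateForResource().
   * currentStates maps partition name -> (instance name -> state); the real
   * helper presumably reads this from CurrentStateOutput (assumption).
   */
  public static boolean hasCurrentState(Map<String, Map<String, String>> currentStates) {
    if (currentStates == null) {
      return false;
    }
    for (Map<String, String> instanceStates : currentStates.values()) {
      // Any partition with at least one reported state means replicas still exist.
      if (instanceStates != null && !instanceStates.isEmpty()) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // A replica is still LEADER on a disabled node: cleanup is needed.
    Map<String, Map<String, String>> stillHosted =
        Map.of("TestDB_0", Map.of("localhost_12000", "LEADER"));
    System.out.println(hasCurrentState(stillHosted)); // prints true

    // Nothing was ever assigned: the resource is not initialized.
    System.out.println(hasCurrentState(Collections.emptyMap())); // prints false
  }
}
```

The point of the check is only to tell "replicas exist and must be dropped" apart from "nothing was ever assigned"; it says nothing about why the preference list came back empty.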

Code Changes:

  • Modified: helix-core/src/main/java/org/apache/helix/controller/stages/BestPossibleStateCalcStage.java
    • Enhanced checkBestPossibleStateCalculation() method signature to accept Resource and CurrentStateOutput
    • Added hasCurrentStateForResource() helper method
    • Added logic to distinguish between three scenarios:
      i. All lists empty + current state exists → Allow (all nodes disabled cleanup)
      ii. Some lists empty + maxPartitionsPerInstance set → Allow (capacity-limited)
      iii. Some lists empty + no maxPartitionsPerInstance → Reject (inconsistent state)
  • Modified: helix-core/src/test/java/org/apache/helix/integration/rebalancer/TestAutoRebalancePartitionLimit.java
    • Enhanced test timing to use TestHelper.DEFAULT_REBALANCE_PROCESSING_WAIT_TIME
    • Added Thread.sleep() calls to ensure controller processes state changes
    • Increased verifier timeout from 10s to 60s for better CI stability
  • Added: Tests in helix-core/src/test/java/org/apache/helix/controller/stages/TestBestPossibleStateCalcStage.java
    • Unit tests for "all nodes disabled" scenario
    • Unit tests for uninitialized resource rejection
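
The three scenarios listed under BestPossibleStateCalcStage.java can be sketched as a small decision function. This is an illustrative sketch under stated assumptions, not the merged implementation: the class, the Verdict enum, and the classify method are hypothetical names, the two boolean parameters stand in for the real Resource, CurrentStateOutput, and IdealState lookups, and treating a null or empty map the same as "all lists empty" is an assumption.

```java
import java.util.List;
import java.util.Map;

public class PreferenceListCheckSketch {

  /** Hypothetical outcome labels for the check. */
  public enum Verdict { OK, ALLOW_CLEANUP, ALLOW_CAPACITY_LIMITED, REJECT }

  public static Verdict classify(Map<String, List<String>> preferenceLists,
      boolean hasCurrentState, boolean maxPartitionsLimitSet) {
    int total = preferenceLists == null ? 0 : preferenceLists.size();
    int emptyListCount = 0;
    if (preferenceLists != null) {
      for (List<String> list : preferenceLists.values()) {
        if (list == null || list.isEmpty()) {
          emptyListCount++;
        }
      }
    }

    if (total > 0 && emptyListCount == 0) {
      return Verdict.OK; // every partition has an assignment
    }
    if (emptyListCount == total) {
      // Scenario i: all lists empty (assumption: a missing map counts too).
      // Allow only when replicas still exist and need to be cleaned up.
      return hasCurrentState ? Verdict.ALLOW_CLEANUP : Verdict.REJECT;
    }
    // Scenarios ii and iii: some, but not all, lists are empty.
    return maxPartitionsLimitSet ? Verdict.ALLOW_CAPACITY_LIMITED : Verdict.REJECT;
  }

  public static void main(String[] args) {
    Map<String, List<String>> allEmpty =
        Map.of("p0", List.<String>of(), "p1", List.<String>of());
    Map<String, List<String>> someEmpty =
        Map.of("p0", List.<String>of(), "p1", List.of("localhost_12000"));

    System.out.println(classify(allEmpty, true, false));   // prints ALLOW_CLEANUP
    System.out.println(classify(allEmpty, false, false));  // prints REJECT
    System.out.println(classify(someEmpty, false, true));  // prints ALLOW_CAPACITY_LIMITED
    System.out.println(classify(someEmpty, false, false)); // prints REJECT
  }
}
```

maxPartitionsLimitSet is passed in as a boolean on purpose: how "set" is detected depends on the default that getMaxPartitionsPerInstance() returns, which is exactly what the review asks to confirm.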

Tests

  • TestBestPossibleStateCalcStage.testCheckBestPossibleStateCalculation_AllNodesDisabled_WithCurrentState() - Validates "all nodes disabled" feature
  • TestBestPossibleStateCalcStage.testCheckBestPossibleStateCalculation_ResourceNotInitialized() - Ensures uninitialized resources are rejected

@ngngwr ngngwr changed the title from "[Draft] Ngangwar/rebalance all nodes disabled" to "Rebalancing Fix : Handle All Nodes Disabled Case" Feb 17, 2026

if (idealState.getRebalanceMode() == IdealState.RebalanceMode.FULL_AUTO
    && !idealState.getReplicas().equals("0")) {
  Map<String, List<String>> preferenceLists = idealState.getPreferenceLists();
  if (preferenceLists == null || preferenceLists.isEmpty()) {

When will the preference list be null?

Could the rebalancer fail to calculate the preference list because of some other issue, so that we end up dropping all partitions? That would be risky behaviour.

}
// Some but not all lists empty: this is valid when maxPartitionsPerInstance limits capacity.
// Only reject when maxPartitionsPerInstance is NOT set and we have inconsistent empty lists.
if (emptyListCount > 0 && idealState.getMaxPartitionsPerInstance() > 0) {

Please confirm the default value returned by getMaxPartitionsPerInstance().

@laxman-ch laxman-ch left a comment

Looks good to me
