
fix(shard-distributor): add an immediate retry after a failure of the rebalancing loop#7721

Merged
arzonus merged 1 commit into cadence-workflow:master from arzonus:fix-rebalance-loop-error
Feb 19, 2026

Conversation

Contributor

@arzonus arzonus commented Feb 19, 2026

What changed?

  • Added an immediate retry after a failure of the rebalancing loop

Why?
We observed that the next run of the rebalancing loop is delayed when a previous rebalance fails, because nothing triggers it again except a state change or the periodic timer. In that case the loop should be retried as soon as possible, but with a cooldown to avoid rebalance storms when the underlying issue is persistent.

How did you test it?

  • Ran unit tests locally
  • Ran on a dev cluster

Potential risks
N/A

Release notes
N/A

Documentation Changes
N/A


Reviewer Validation

PR Description Quality (check these before reviewing code):

  • "What changed" provides a clear 1-2 line summary
    • Project Issue is linked
  • "Why" explains the full motivation with sufficient context
  • Testing is documented:
    • Unit test commands are included (with exact go test invocation)
    • Integration test setup/commands included (if integration tests were run)
    • Canary testing details included (if canary was mentioned)
  • Potential risks section is thoughtfully filled out (or legitimately N/A)
  • Release notes included if this completes a user-facing feature
  • Documentation needs are addressed (or noted if uncertain)

Comment on lines 190 to 194
// Perform an initial rebalance on startup.
err := p.rebalanceShards(ctx)
if err != nil {
    p.logger.Error("initial rebalance failed", tag.Error(err))
}

💡 Edge Case: Initial rebalance failure doesn't trigger immediate retry

When the initial rebalance at line 191 fails, the error is logged but no retry is queued to triggerChan. The new retry-on-failure logic (lines 220-226) only applies to failures within the main loop, not the initial attempt.

This means if the initial rebalance fails, the system must wait until the first periodic tick (cfg.Period, default 1s) or a state change to retry. This is likely acceptable since the periodic trigger fires quickly, but it's inconsistent with the PR's stated goal of "immediate retry after failure."

If desired, you could queue a retry after the initial failure:

if err != nil {
    p.logger.Error("initial rebalance failed", tag.Error(err))
    select {
    case triggerChan <- "Initial rebalance failed":
    default:
    }
}


@arzonus arzonus force-pushed the fix-rebalance-loop-error branch from 0e64e6a to e2f5c18 on February 19, 2026 at 11:59
@gitar-bot

gitar-bot bot commented Feb 19, 2026

Code Review: 👍 Approved with suggestions (1 resolved / 2 findings)

The critical race condition (send on closed channel) has been resolved by removing defer close(triggerChan). The retry-on-failure mechanism is well-designed with proper non-blocking sends and cooldown enforcement. One minor pre-existing suggestion about initial rebalance failure retry remains open.

💡 Edge Case: Initial rebalance failure doesn't trigger immediate retry

📄 service/sharddistributor/leader/process/processor.go:190-194

(Same finding as the inline comment above: the initial rebalance failure is only logged, and the retry-on-failure logic applies to the main loop but not the initial attempt.)
✅ 1 resolved
Bug: Race: send on closed triggerChan causes panic

📄 service/sharddistributor/leader/process/processor.go:188 📄 service/sharddistributor/leader/process/processor.go:249-254
Moving channel ownership from rebalanceTriggeringLoop to runRebalancingLoop introduces a send-on-closed-channel panic.

Scenario: When the context is cancelled, both goroutines race on ctx.Done(). If runRebalancingLoop selects ctx.Done() first and returns, defer close(triggerChan) fires. Meanwhile, rebalanceTriggeringLoop may still be running — Go's select is non-deterministic, so it could select a ticker tick or an updateChan message before ctx.Done(). When tryTriggerRebalancing then executes triggerChan <- reason (line 251), it panics with send on closed channel.

Previously, this was safe because rebalanceTriggeringLoop owned and closed the channel itself — writes could never happen after close.

Suggested fix: Remove defer close(triggerChan) and instead synchronize the goroutine's exit before closing, or simply don't close the channel at all (it will be GC'd when both goroutines exit and no references remain). For example:

func (p *namespaceProcessor) runRebalancingLoop(ctx context.Context) {
    triggerChan := make(chan string, 1)
    // No defer close — channel will be GC'd when both goroutines exit.
    
    // ... rest of the function ...
}

If you need to close the channel (e.g., for signaling), you must wait for the triggering goroutine to exit first using a sync.WaitGroup or similar mechanism.


Member

@jakobht jakobht left a comment

Looks good! Great change.

I think we should refactor this goroutine the same way that was done here #7697

But in a different PR :)

@arzonus
Contributor Author

arzonus commented Feb 19, 2026

> Looks good! Great change.
>
> I think we should refactor this goroutine the same way that was done here #7697
>
> But in a different PR :)

Yeah, I also think it is worth separating the rebalancer runner and the rebalance itself, it should make testing easier.

@arzonus arzonus merged commit f551af0 into cadence-workflow:master Feb 19, 2026
42 checks passed
@arzonus arzonus deleted the fix-rebalance-loop-error branch February 19, 2026 14:03
