
fix(shard-distributor): add an immediate retry after a failure of the rebalancing loop#7721

Merged
arzonus merged 1 commit into cadence-workflow:master from arzonus:fix-rebalance-loop-error
Feb 19, 2026

Conversation

Contributor

@arzonus arzonus commented Feb 19, 2026

What changed?

  • Added an immediate retry after a failure of the rebalancing loop

Why?
We observed that the next run of the rebalancing loop is delayed when a previous rebalance fails, because nothing triggers it again except a state change or the periodic timer. In that case the loop should be retried as soon as possible, but with a cooldown to avoid rebalance storms when the underlying issue is persistent.

How did you test it?

  • Ran unit tests locally
  • Ran on a dev cluster

Potential risks
N/A

Release notes
N/A

Documentation Changes
N/A


Reviewer Validation

PR Description Quality (check these before reviewing code):

  • "What changed" provides a clear 1-2 line summary
    • Project Issue is linked
  • "Why" explains the full motivation with sufficient context
  • Testing is documented:
    • Unit test commands are included (with exact go test invocation)
    • Integration test setup/commands included (if integration tests were run)
    • Canary testing details included (if canary was mentioned)
  • Potential risks section is thoughtfully filled out (or legitimately N/A)
  • Release notes included if this completes a user-facing feature
  • Documentation needs are addressed (or noted if uncertain)

Comment on lines 190 to 194
// Perform an initial rebalance on startup.
err := p.rebalanceShards(ctx)
if err != nil {
    p.logger.Error("initial rebalance failed", tag.Error(err))
}

💡 Edge Case: Initial rebalance failure doesn't trigger immediate retry

When the initial rebalance at line 191 fails, the error is logged but no retry is queued to triggerChan. The new retry-on-failure logic (lines 220-226) only applies to failures within the main loop, not the initial attempt.

This means if the initial rebalance fails, the system must wait until the first periodic tick (cfg.Period, default 1s) or a state change to retry. This is likely acceptable since the periodic trigger fires quickly, but it's inconsistent with the PR's stated goal of "immediate retry after failure."

If desired, you could queue a retry after the initial failure:

if err != nil {
    p.logger.Error("initial rebalance failed", tag.Error(err))
    select {
    case triggerChan <- "Initial rebalance failed":
    default:
    }
}


@arzonus arzonus force-pushed the fix-rebalance-loop-error branch from 0e64e6a to e2f5c18 on February 19, 2026 at 11:59
@gitar-bot

gitar-bot bot commented Feb 19, 2026

Code Review: 👍 Approved with suggestions (1 resolved / 2 findings)

The critical race condition (send on closed channel) has been resolved by removing defer close(triggerChan). The retry-on-failure mechanism is well-designed with proper non-blocking sends and cooldown enforcement. One minor pre-existing suggestion about initial rebalance failure retry remains open.

💡 Edge Case: Initial rebalance failure doesn't trigger immediate retry

📄 service/sharddistributor/leader/process/processor.go:190-194

(Same finding as the inline comment above: the initial rebalance failure is only logged, and the retry-on-failure logic applies to the main loop but not the initial attempt.)
✅ 1 resolved
Bug: Race: send on closed triggerChan causes panic

📄 service/sharddistributor/leader/process/processor.go:188 📄 service/sharddistributor/leader/process/processor.go:249-254
Moving channel ownership from rebalanceTriggeringLoop to runRebalancingLoop introduces a send-on-closed-channel panic.

Scenario: When the context is cancelled, both goroutines race on ctx.Done(). If runRebalancingLoop selects ctx.Done() first and returns, defer close(triggerChan) fires. Meanwhile, rebalanceTriggeringLoop may still be running — Go's select is non-deterministic, so it could select a ticker tick or an updateChan message before ctx.Done(). When tryTriggerRebalancing then executes triggerChan <- reason (line 251), it panics with send on closed channel.

Previously, this was safe because rebalanceTriggeringLoop owned and closed the channel itself — writes could never happen after close.

Suggested fix: Remove defer close(triggerChan) and instead synchronize the goroutine's exit before closing, or simply don't close the channel at all (it will be GC'd when both goroutines exit and no references remain). For example:

func (p *namespaceProcessor) runRebalancingLoop(ctx context.Context) {
    triggerChan := make(chan string, 1)
    // No defer close — channel will be GC'd when both goroutines exit.
    
    // ... rest of the function ...
}

If you need to close the channel (e.g., for signaling), you must wait for the triggering goroutine to exit first using a sync.WaitGroup or similar mechanism.


Member

@jakobht jakobht left a comment

Looks good! Great change.

I think we should refactor this goroutine the same way that was done here #7697

But in a different PR :)

@arzonus
Contributor Author

arzonus commented Feb 19, 2026

> Looks good! Great change.
>
> I think we should refactor this goroutine the same way that was done here #7697
>
> But in a different PR :)

Yeah, I also think it is worth separating the rebalancer runner and the rebalance itself, it should make testing easier.

@arzonus arzonus merged commit f551af0 into cadence-workflow:master Feb 19, 2026
42 checks passed
@arzonus arzonus deleted the fix-rebalance-loop-error branch February 19, 2026 14:03
