fix(shard-distributor): add an immediate retry after a failure of the rebalancing loop #7721
Conversation
// Perform an initial rebalance on startup.
err := p.rebalanceShards(ctx)
if err != nil {
	p.logger.Error("initial rebalance failed", tag.Error(err))
}
💡 Edge Case: Initial rebalance failure doesn't trigger immediate retry
When the initial rebalance at line 191 fails, the error is logged but no retry is queued to triggerChan. The new retry-on-failure logic (lines 220-226) only applies to failures within the main loop, not the initial attempt.
This means if the initial rebalance fails, the system must wait until the first periodic tick (cfg.Period, default 1s) or a state change to retry. This is likely acceptable since the periodic trigger fires quickly, but it's inconsistent with the PR's stated goal of "immediate retry after failure."
If desired, you could queue a retry after the initial failure:
if err != nil {
	p.logger.Error("initial rebalance failed", tag.Error(err))
	select {
	case triggerChan <- "Initial rebalance failed":
	default:
	}
}
Code Review 👍 Approved with suggestions — 1 resolved / 2 findings

The critical race condition (send on closed channel) has been resolved.

💡 Edge Case: Initial rebalance failure doesn't trigger immediate retry
📄 service/sharddistributor/leader/process/processor.go:190-194

Same finding as the comment above: when the initial rebalance fails, no retry is queued to triggerChan, so the system waits for the first periodic tick (cfg.Period) or a state change before retrying.

✅ 1 resolved — Bug: Race: send on closed triggerChan causes panic
Yeah, I also think it is worth separating the rebalancer runner and the rebalance itself; it should make testing easier.
What changed?
Why?
We observed that the rebalancing loop is delayed after a failed rebalance, because there are no triggers other than a state change or the periodic timer. We expect the rebalancing loop to be retriggered as soon as possible in this case, but with a cooldown to avoid rebalance storms if the underlying issue is persistent.
How did you test it?
Potential risks
N/A
Release notes
N/A
Documentation Changes
N/A
Reviewer Validation
PR Description Quality (check these before reviewing code):
`go test` invocation