Port Pekko ShardStopped handler + handoff safety net #8055
Open
Aaronontheweb wants to merge 2 commits into akkadotnet:dev
Shards can fail to HandOff indefinitely during scale-up when the RebalanceWorker times out before receiving ShardStopped. The coordinator never deallocates the shard, causing an endless GetShardHome/ShardHome loop.

- Add ShardStopped handler to ShardCoordinator.Active() (Pekko port): cleans up unAckedHostShards and performs late deallocation when no rebalance is in progress for the shard
- ShardRegion sends backup ShardStopped to coordinator on handoff completion, ensuring the coordinator learns about it even when the RebalanceWorker has already timed out
Aaronontheweb (Member, Author) commented on Feb 25, 2026
Aaronontheweb
left a comment
Detailed my changes
// Safety net: if no rebalance is in progress for this shard (RebalanceWorker
// already timed out), deallocate the shard so it can be reallocated elsewhere.
// This prevents the shard from being endlessly recreated via GetShardHome/ShardHome.
if (!_rebalanceInProgress.ContainsKey(m.Shard) && State.Shards.ContainsKey(m.Shard))
Aaronontheweb (Member, Author) commented:
I have seen a ton of GetShardHome / ShardHome spam in my apps, and I assumed Phobos' sharding metric polling was responsible for it. Apparently this bug is also a big contributor.
// has already timed out and missed the ShardStopped from HandOffStopper.
// The coordinator's Active handler will only deallocate if no rebalance
// is currently in progress for this shard.
_coordinator?.Tell(new ShardCoordinator.ShardStopped(shard));
Aaronontheweb (Member, Author) commented:
This allows us to double-tap the ShardStopped message handling in case the RebalanceWorker has already died.
Summary
Fixes #7500 - Shards can fail to HandOff indefinitely during scale-up events.

Root cause: When a RebalanceWorker times out (single 60s timer covers both BeginHandOffAck + ShardStopped phases), the ShardStopped message from HandOffStopper goes to the dead RebalanceWorker and is lost. The coordinator never deallocates the shard, so entity traffic triggers GetShardHome → ShardHome → shard recreation → repeat (10-30 minutes of unhandled HandOff messages).

Changes:

- ShardCoordinator.cs: Added ShardStopped handler to Active() (ported from Pekko). Cleans up _unAckedHostShards and performs late deallocation when no RebalanceWorker is active for the shard (!_rebalanceInProgress.ContainsKey). This is the safety net: if the worker already timed out, the coordinator can still deallocate and reallocate the shard.
- ShardRegion.cs: HandleTerminated() now sends a backup ShardStopped to the coordinator when a handoff completes. This ensures the coordinator receives the stop notification even when the RebalanceWorker has already timed out. The coordinator handler is idempotent: if a rebalance is still in progress, the deallocation is skipped (the worker handles it).

Test plan

- dotnet build -c Release -warnaserror: 0 warnings, 0 errors
- Akka.Cluster.Sharding.Tests: 188/190 passed (2 failures are pre-existing RememberEntitiesStarterSpec flakes, unrelated)
- ShardStopped is an internal message; no public API surface changes
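
For reviewers who want to reason about the coordinator-side safety net without spinning up a cluster, the decision logic can be sketched as a small standalone model. This is an illustration under stated assumptions, not the code in this PR: CoordinatorModel, its method names and the string shard ids are hypothetical, the three collections only stand in for State.Shards, _rebalanceInProgress and _unAckedHostShards, and the model assumes that a worker timeout clears the shard's in-progress entry.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, Akka-free model of the coordinator bookkeeping described above.
public sealed class CoordinatorModel
{
    // shard id -> hosting region (stand-in for State.Shards)
    private readonly Dictionary<string, string> _shards = new Dictionary<string, string>();
    // shards with a live rebalance worker (stand-in for _rebalanceInProgress)
    private readonly HashSet<string> _rebalanceInProgress = new HashSet<string>();
    // shards whose host request is not yet acknowledged (stand-in for _unAckedHostShards)
    private readonly HashSet<string> _unAckedHostShards = new HashSet<string>();

    public void Allocate(string shard, string region)
    {
        _shards[shard] = region;
        _unAckedHostShards.Add(shard);
    }

    public void BeginRebalance(string shard) => _rebalanceInProgress.Add(shard);

    // Assumption: a timed-out worker no longer counts as "in progress".
    public void WorkerTimedOut(string shard) => _rebalanceInProgress.Remove(shard);

    public bool IsAllocated(string shard) => _shards.ContainsKey(shard);

    // Returns true only when this call performed the late deallocation.
    public bool OnShardStopped(string shard)
    {
        _unAckedHostShards.Remove(shard);

        // Safety net: deallocate only if no worker is handling this shard.
        if (!_rebalanceInProgress.Contains(shard) && _shards.ContainsKey(shard))
        {
            _shards.Remove(shard);
            return true;
        }
        return false;
    }
}

public static class Demo
{
    public static void Main()
    {
        var coordinator = new CoordinatorModel();
        coordinator.Allocate("shard-1", "region-A");
        coordinator.BeginRebalance("shard-1");

        // Worker still active: the backup message is ignored.
        Console.WriteLine(coordinator.OnShardStopped("shard-1")); // prints "False"

        // Worker timed out: the backup message triggers late deallocation.
        coordinator.WorkerTimedOut("shard-1");
        Console.WriteLine(coordinator.OnShardStopped("shard-1")); // prints "True"

        // A duplicate message is harmless.
        Console.WriteLine(coordinator.OnShardStopped("shard-1")); // prints "False"
        Console.WriteLine(coordinator.IsAllocated("shard-1"));    // prints "False"
    }
}
```

The third call is the property that makes the region's backup ShardStopped safe to send unconditionally: once the shard is gone from the allocation map, a repeated message changes nothing.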