
Port Pekko ShardStopped handler + handoff safety net #8055

Open

Aaronontheweb wants to merge 2 commits into akkadotnet:dev from Aaronontheweb:fix/shard-handoff-safety-net

Conversation

@Aaronontheweb
Member

Summary

Fixes #7500 - Shards can fail to HandOff indefinitely during scale-up events.

Root cause: when a RebalanceWorker times out (a single 60s timer covers both the BeginHandOffAck and ShardStopped phases), the ShardStopped message from the HandOffStopper is delivered to the now-dead RebalanceWorker and lost. The coordinator never deallocates the shard, so entity traffic triggers a GetShardHome → ShardHome → shard recreation loop that can repeat for 10-30 minutes of unhandled HandOff messages.
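The failure window can be illustrated with a minimal Python model. This is not Akka.NET code: the class names mirror the real actors, but the state layout, method names, and timing values are simplifying assumptions made for illustration.

```python
# Illustrative model of the bug: one timer spans BOTH handoff phases,
# so a late ShardStopped lands in a dead worker's mailbox and is lost.

REBALANCE_TIMEOUT = 60.0  # single timer covering BeginHandOffAck + ShardStopped

class Coordinator:
    def __init__(self):
        self.shards = {"shard-1": "region-A"}  # shard -> allocated region
        self.rebalance_in_progress = {}        # shard -> active RebalanceWorker

    def get_shard_home(self, shard):
        # While the shard stays allocated, every entity message re-resolves to
        # the old home and recreates the shard: the GetShardHome/ShardHome loop.
        return self.shards.get(shard)

class RebalanceWorker:
    def __init__(self, coordinator, shard):
        self.coordinator = coordinator
        self.shard = shard
        self.alive = True
        coordinator.rebalance_in_progress[shard] = self

    def on_timeout(self, elapsed):
        if elapsed >= REBALANCE_TIMEOUT:
            self.alive = False  # worker stops; later messages are dead letters
            del self.coordinator.rebalance_in_progress[self.shard]

    def on_shard_stopped(self):
        if not self.alive:
            return False  # message lost: coordinator never deallocates
        del self.coordinator.shards[self.shard]
        return True

coord = Coordinator()
worker = RebalanceWorker(coord, "shard-1")

# Suppose BeginHandOffAck took 55s and HandOffStopper another 10s: 65s > 60s.
worker.on_timeout(elapsed=65.0)
delivered = worker.on_shard_stopped()  # ShardStopped arrives too late

assert delivered is False
assert coord.get_shard_home("shard-1") == "region-A"  # stuck allocation
```

The two assertions at the end capture the stuck state: the notification was dropped, yet the coordinator still answers GetShardHome with the old region, so the shard keeps getting recreated.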

Changes:

  • ShardCoordinator.cs: Added ShardStopped handler to Active() (ported from Pekko). Cleans up _unAckedHostShards and performs late deallocation when no RebalanceWorker is active for the shard (!_rebalanceInProgress.ContainsKey). This is the safety net — if the worker already timed out, the coordinator can still deallocate and reallocate the shard.

  • ShardRegion.cs: HandleTerminated() now sends a backup ShardStopped to the coordinator when a handoff completes. This ensures the coordinator receives the stop notification even when the RebalanceWorker has already timed out. The coordinator handler is idempotent — if a rebalance is still in progress, the deallocation is skipped (the worker handles it).
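Continuing the same illustrative Python model (again, not the actual C# implementation — field names like `un_acked_host_shards` are stand-ins for `_unAckedHostShards` and the method bodies are assumptions), the coordinator-side safety net and its idempotence can be sketched as:

```python
# Sketch of the fix: the coordinator itself handles ShardStopped, cleaning up
# un-acked bookkeeping and deallocating ONLY when no RebalanceWorker is active.

class Coordinator:
    def __init__(self):
        self.shards = {"shard-1": "region-A"}  # shard -> allocated region
        self.un_acked_host_shards = {"shard-1"}
        self.rebalance_in_progress = {}        # shard -> worker, while rebalancing

    def on_shard_stopped(self, shard):
        # Always clean up the un-acked bookkeeping for this shard...
        self.un_acked_host_shards.discard(shard)
        # ...but only deallocate when no RebalanceWorker owns the shard.
        # If one is still running, it performs the deallocation itself; that
        # check is what makes the region's backup ShardStopped safe to send.
        if shard not in self.rebalance_in_progress and shard in self.shards:
            del self.shards[shard]
            return True  # late deallocation: shard can now be reallocated
        return False     # no-op: duplicate message or active rebalance

coord = Coordinator()

# The RebalanceWorker timed out earlier, so rebalance_in_progress is empty
# when the backup ShardStopped from the ShardRegion arrives.
assert coord.on_shard_stopped("shard-1") is True   # safety net deallocates
assert coord.on_shard_stopped("shard-1") is False  # duplicate is a no-op
assert "shard-1" not in coord.shards
```

The second call modeling a duplicate delivery (worker notification plus region backup) returning `False` is the idempotence property the PR relies on.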

Test plan

  • dotnet build -c Release -warnaserror — 0 warnings, 0 errors
  • Akka.Cluster.Sharding.Tests — 188/190 passed (2 failures are pre-existing RememberEntitiesStarterSpec flakes, unrelated)
  • ShardStopped is an internal message — no public API surface changes
  • Manual verification: coordinator correctly deallocates shards that complete handoff after RebalanceWorker timeout

Shards can fail to HandOff indefinitely during scale-up when the
RebalanceWorker times out before receiving ShardStopped. The coordinator
never deallocates the shard, causing an endless GetShardHome/ShardHome loop.

- Add ShardStopped handler to ShardCoordinator.Active() (Pekko port):
  cleans up unAckedHostShards and performs late deallocation when no
  rebalance is in progress for the shard
- ShardRegion sends backup ShardStopped to coordinator on handoff
  completion, ensuring the coordinator learns about it even when the
  RebalanceWorker has already timed out
Member Author

@Aaronontheweb left a comment


Detailed my changes

// Safety net: if no rebalance is in progress for this shard (RebalanceWorker
// already timed out), deallocate the shard so it can be reallocated elsewhere.
// This prevents the shard from being endlessly recreated via GetShardHome/ShardHome.
if (!_rebalanceInProgress.ContainsKey(m.Shard) && State.Shards.ContainsKey(m.Shard))
Member Author


I have seen a ton of GetShardHome / ShardHome spam in my apps and I assumed it was Phobos' sharding metric polling responsible for that. Apparently this bug is also a big contributor.

// Send a backup ShardStopped to the coordinator in case the RebalanceWorker
// has already timed out and missed the ShardStopped from HandOffStopper.
// The coordinator's Active handler will only deallocate if no rebalance
// is currently in progress for this shard.
_coordinator?.Tell(new ShardCoordinator.ShardStopped(shard));
Member Author


allows us to double-tap the ShardStopped message handling in case the RebalanceWorker has died already



Successfully merging this pull request may close these issues.

Akka.Cluster.Sharding: Shard can fail to HandOff indefinitely
