
Port Pekko ShardStopped handler + handoff safety net #8055

Open

Aaronontheweb wants to merge 2 commits into akkadotnet:dev from Aaronontheweb:fix/shard-handoff-safety-net

Conversation

@Aaronontheweb
Member

Summary

Fixes #7500 - Shards can fail to HandOff indefinitely during scale-up events.

Root cause: when a RebalanceWorker times out (a single 60s timer covers both the BeginHandOffAck and ShardStopped phases), the ShardStopped message from the HandOffStopper is delivered to the now-dead RebalanceWorker and lost. The coordinator never deallocates the shard, so entity traffic triggers a GetShardHome → ShardHome → shard recreation loop that can repeat for 10-30 minutes of unhandled HandOff messages.
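The failure window can be illustrated with a minimal Python model. This is not Akka.NET code: the class names mirror the real actors, but the state layout, method names, and timing values are simplifying assumptions made for illustration.

```python
# Illustrative model of the bug: one timer spans BOTH handoff phases,
# so a late ShardStopped lands in a dead worker's mailbox and is lost.

REBALANCE_TIMEOUT = 60.0  # single timer covering BeginHandOffAck + ShardStopped

class Coordinator:
    def __init__(self):
        self.shards = {"shard-1": "region-A"}  # shard -> allocated region
        self.rebalance_in_progress = {}        # shard -> active RebalanceWorker

    def get_shard_home(self, shard):
        # While the shard stays allocated, every entity message re-resolves to
        # the old home and recreates the shard: the GetShardHome/ShardHome loop.
        return self.shards.get(shard)

class RebalanceWorker:
    def __init__(self, coordinator, shard):
        self.coordinator = coordinator
        self.shard = shard
        self.alive = True
        coordinator.rebalance_in_progress[shard] = self

    def on_timeout(self, elapsed):
        if elapsed >= REBALANCE_TIMEOUT:
            self.alive = False  # worker stops; later messages are dead letters
            del self.coordinator.rebalance_in_progress[self.shard]

    def on_shard_stopped(self):
        if not self.alive:
            return False  # message lost: coordinator never deallocates
        del self.coordinator.shards[self.shard]
        return True

coord = Coordinator()
worker = RebalanceWorker(coord, "shard-1")

# Suppose BeginHandOffAck took 55s and HandOffStopper another 10s: 65s > 60s.
worker.on_timeout(elapsed=65.0)
delivered = worker.on_shard_stopped()  # ShardStopped arrives too late

assert delivered is False
assert coord.get_shard_home("shard-1") == "region-A"  # stuck allocation
```

The two assertions at the end capture the stuck state: the notification was dropped, yet the coordinator still answers GetShardHome with the old region, so the shard keeps getting recreated.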

Changes:

  • ShardCoordinator.cs: Added ShardStopped handler to Active() (ported from Pekko). Cleans up _unAckedHostShards and performs late deallocation when no RebalanceWorker is active for the shard (!_rebalanceInProgress.ContainsKey). This is the safety net — if the worker already timed out, the coordinator can still deallocate and reallocate the shard.

  • ShardRegion.cs: HandleTerminated() now sends a backup ShardStopped to the coordinator when a handoff completes. This ensures the coordinator receives the stop notification even when the RebalanceWorker has already timed out. The coordinator handler is idempotent — if a rebalance is still in progress, the deallocation is skipped (the worker handles it).
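Continuing the same illustrative Python model (again, not the actual C# implementation — field names like `un_acked_host_shards` are stand-ins for `_unAckedHostShards` and the method bodies are assumptions), the coordinator-side safety net and its idempotence can be sketched as:

```python
# Sketch of the fix: the coordinator itself handles ShardStopped, cleaning up
# un-acked bookkeeping and deallocating ONLY when no RebalanceWorker is active.

class Coordinator:
    def __init__(self):
        self.shards = {"shard-1": "region-A"}  # shard -> allocated region
        self.un_acked_host_shards = {"shard-1"}
        self.rebalance_in_progress = {}        # shard -> worker, while rebalancing

    def on_shard_stopped(self, shard):
        # Always clean up the un-acked bookkeeping for this shard...
        self.un_acked_host_shards.discard(shard)
        # ...but only deallocate when no RebalanceWorker owns the shard.
        # If one is still running, it performs the deallocation itself; that
        # check is what makes the region's backup ShardStopped safe to send.
        if shard not in self.rebalance_in_progress and shard in self.shards:
            del self.shards[shard]
            return True  # late deallocation: shard can now be reallocated
        return False     # no-op: duplicate message or active rebalance

coord = Coordinator()

# The RebalanceWorker timed out earlier, so rebalance_in_progress is empty
# when the backup ShardStopped from the ShardRegion arrives.
assert coord.on_shard_stopped("shard-1") is True   # safety net deallocates
assert coord.on_shard_stopped("shard-1") is False  # duplicate is a no-op
assert "shard-1" not in coord.shards
```

The second call modeling a duplicate delivery (worker notification plus region backup) returning `False` is the idempotence property the PR relies on.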

Test plan

  • dotnet build -c Release -warnaserror — 0 warnings, 0 errors
  • Akka.Cluster.Sharding.Tests — 188/190 passed (2 failures are pre-existing RememberEntitiesStarterSpec flakes, unrelated)
  • ShardStopped is an internal message — no public API surface changes
  • Manual verification: coordinator correctly deallocates shards that complete handoff after RebalanceWorker timeout

Shards can fail to HandOff indefinitely during scale-up when the
RebalanceWorker times out before receiving ShardStopped. The coordinator
never deallocates the shard, causing an endless GetShardHome/ShardHome loop.

- Add ShardStopped handler to ShardCoordinator.Active() (Pekko port):
  cleans up unAckedHostShards and performs late deallocation when no
  rebalance is in progress for the shard
- ShardRegion sends backup ShardStopped to coordinator on handoff
  completion, ensuring the coordinator learns about it even when the
  RebalanceWorker has already timed out
Member Author

@Aaronontheweb left a comment


Detailed my changes

// Safety net: if no rebalance is in progress for this shard (RebalanceWorker
// already timed out), deallocate the shard so it can be reallocated elsewhere.
// This prevents the shard from being endlessly recreated via GetShardHome/ShardHome.
if (!_rebalanceInProgress.ContainsKey(m.Shard) && State.Shards.ContainsKey(m.Shard))
Member Author


I have seen a ton of GetShardHome / ShardHome spam in my apps and I assumed it was Phobos' sharding metric polling responsible for that. Apparently this bug is also a big contributor.

// Send a backup ShardStopped to the coordinator in case the RebalanceWorker
// has already timed out and missed the ShardStopped from HandOffStopper.
// The coordinator's Active handler will only deallocate if no rebalance
// is currently in progress for this shard.
_coordinator?.Tell(new ShardCoordinator.ShardStopped(shard));
Member Author


allows us to double-tap the ShardStopped message handling in case the RebalanceWorker has died already



Successfully merging this pull request may close these issues.

Akka.Cluster.Sharding: Shard can fail to HandOff indefinitely
