Description
Problem
ClusterMessageSerializer.GossipToProto can throw ArgumentException: Unknown address when serializing gossip because Seen contains UniqueAddress entries that are not present in Members.
Invariant violated: Seen ⊆ Members (all addresses in Seen must be members of the cluster)
Root Cause Analysis
The gossip protocol has a vulnerability where removed members can be reintroduced during gossip merging, and their Seen entries can persist after the member is removed from Members.
Key issue: When a member is removed from the cluster:
- The member is removed from Members
- The VectorClock is pruned via Prune()
- But Seen and Reachability are NOT explicitly cleaned
- If stale gossip arrives with the removed member, it can be re-merged
The MergeSeen method (called when VectorClocks compare as Same) performs a blind union:
public Gossip MergeSeen(Gossip that)
{
    return Copy(overview: _overview.Copy(seen: _overview.Seen.Union(that._overview.Seen)));
}
This can introduce addresses into Seen that are no longer in Members.
Related Issue
This is the underlying cause of #8009.
Proposed Solution
Phase 1: Defensive Fix (Immediate)
Apply a defensive fix to MergeSeen that filters the merged result:
public Gossip MergeSeen(Gossip that)
{
    // Only addresses that are still members may remain in Seen.
    var memberAddresses = _members.Select(m => m.UniqueAddress).ToImmutableHashSet();
    var mergedSeen = _overview.Seen.Union(that._overview.Seen).Intersect(memberAddresses);
    return Copy(overview: _overview.Copy(seen: mergedSeen));
}
This ensures the invariant Seen ⊆ Members is always maintained, regardless of what state arrives via gossip.
Characteristics:
- Zero breaking changes
- Wire-compatible with all existing versions
- Defense-in-depth against corruption from any source
- Minimal performance impact (building the member address set and intersecting are both linear in cluster size)
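To make the difference between the two MergeSeen variants concrete, here is a minimal standalone sketch. It is not the Akka.NET code: the class name SeenMergeSketch and its methods are hypothetical, and plain strings stand in for UniqueAddress so the example needs no Akka types.

```csharp
using System;
using System.Collections.Immutable;

// Hypothetical illustration only; strings stand in for UniqueAddress.
public static class SeenMergeSketch
{
    // Blind union, mirroring the current MergeSeen:
    // may keep addresses that are no longer members.
    public static ImmutableHashSet<string> BlindMerge(
        ImmutableHashSet<string> localSeen,
        ImmutableHashSet<string> remoteSeen)
        => localSeen.Union(remoteSeen);

    // Defensive variant: union first, then keep only current members.
    public static ImmutableHashSet<string> FilteredMerge(
        ImmutableHashSet<string> members,
        ImmutableHashSet<string> localSeen,
        ImmutableHashSet<string> remoteSeen)
        => localSeen.Union(remoteSeen).Intersect(members);

    // The invariant from this issue: Seen must be a subset of Members.
    public static bool InvariantHolds(
        ImmutableHashSet<string> members,
        ImmutableHashSet<string> seen)
        => seen.IsSubsetOf(members);
}
```

With members {A, B}, a local Seen of {A}, and a stale remote Seen of {A, C}, BlindMerge keeps C and breaks the invariant, while FilteredMerge returns only {A}.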
Phase 2: Tombstones (1.6.0)
The proper fix requires implementing tombstones - markers that track removed members to prevent their reintroduction during gossip merging.
Why Tombstones Are Needed
Without tombstones, the following can occur:
- Node X is removed from cluster
- Leader removes X from Members and prunes VectorClock
- Stale gossip containing X arrives from a slow/partitioned node
- Member.PickHighestPriority may re-add X because there's no record that X was intentionally removed
- VClock entries for X get re-merged, affecting comparison ordering
- MergeSeen path gets triggered when it shouldn't
Implementation Spec
1. Add tombstones field to Gossip
public sealed class Gossip
{
    private readonly ImmutableDictionary<UniqueAddress, DateTime> _tombstones;
    public ImmutableDictionary<UniqueAddress, DateTime> Tombstones => _tombstones;
}
2. Add configuration
akka.cluster {
    # Time after which tombstones for removed members are pruned
    prune-gossip-tombstones-after = 24h
}
3. Add Remove() method for atomic member removal
public Gossip Remove(UniqueAddress node, DateTime timestamp)
{
    // Atomically:
    // - Remove from Members
    // - Remove from Seen
    // - Remove from Reachability
    // - Prune VectorClock
    // - Add tombstone
}
4. Modify Merge() to handle tombstones
- Merge tombstone maps (keep latest timestamp for each address)
- Prune VectorClock entries for ALL tombstoned nodes
- Filter PickHighestPriority results against tombstones
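The first and third merge rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the planned implementation: TombstoneMergeSketch and its method names are hypothetical, strings stand in for UniqueAddress, and the member set is reduced to a set of addresses. VectorClock pruning is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Collections.Immutable;
using System.Linq;

// Hypothetical illustration only; strings stand in for UniqueAddress.
public static class TombstoneMergeSketch
{
    // Merge two tombstone maps, keeping the latest timestamp per address.
    public static ImmutableDictionary<string, DateTime> MergeTombstones(
        ImmutableDictionary<string, DateTime> left,
        ImmutableDictionary<string, DateTime> right)
    {
        var merged = left;
        foreach (var entry in right)
        {
            // Take the entry from the right side if it is new or more recent.
            if (!merged.TryGetValue(entry.Key, out var existing) || entry.Value > existing)
                merged = merged.SetItem(entry.Key, entry.Value);
        }
        return merged;
    }

    // Drop any merged member whose address has been tombstoned.
    public static ImmutableHashSet<string> FilterTombstoned(
        IEnumerable<string> mergedMembers,
        ImmutableDictionary<string, DateTime> tombstones)
        => mergedMembers.Where(m => !tombstones.ContainsKey(m)).ToImmutableHashSet();
}
```

Keeping the latest timestamp makes the map merge commutative and idempotent, so the result does not depend on the order in which gossip arrives.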
5. Modify Member.PickHighestPriority() to check tombstones
// Reject members that have been tombstoned
if (tombstones.ContainsKey(member.UniqueAddress))
    continue;
6. Add tombstone pruning in leader actions
- Periodically remove tombstones older than prune-gossip-tombstones-after
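A pruning pass of this kind could look roughly like the sketch below. The class TombstonePruneSketch and its Prune method are hypothetical, strings again stand in for UniqueAddress, and the caller is assumed to pass the current time and the configured prune-gossip-tombstones-after value.

```csharp
using System;
using System.Collections.Immutable;
using System.Linq;

// Hypothetical illustration only; strings stand in for UniqueAddress.
public static class TombstonePruneSketch
{
    // Remove tombstones that are older than pruneAfter, measured from now.
    public static ImmutableDictionary<string, DateTime> Prune(
        ImmutableDictionary<string, DateTime> tombstones,
        DateTime now,
        TimeSpan pruneAfter)
    {
        var expired = tombstones
            .Where(kv => now - kv.Value > pruneAfter)
            .Select(kv => kv.Key)
            .ToList();
        return tombstones.RemoveRange(expired);
    }
}
```

Passing the clock in as a parameter keeps the function deterministic, which makes the pruning rule easy to unit test.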
7. Update serialization
- Add TombstoneEntry message to protobuf
- Tombstones field is optional for backward compatibility during rolling upgrades
- Full tombstone behavior only activates when all cluster members support it
Behavioral Compatibility Note
Tombstones change gossip merge semantics. In a mixed cluster (1.5.x and 1.6.x nodes):
- 1.6.x nodes will send tombstones that 1.5.x nodes ignore
- 1.6.x nodes should detect mixed-version clusters and fall back to defensive-only behavior
- Full tombstone behavior requires cluster-wide upgrade to 1.6.x
This aligns with other protocol changes planned for 1.6.0.
Test Coverage
Existing tests in GossipMergeSeenCorruptionTests.cs cover:
- Normal operations maintain invariants
- Buggy MergeSeen propagates corruption
- Fixed MergeSeen filters corruption
- Fix preserves valid entries
- Fix is idempotent when no corruption exists
Additional tests needed for tombstones:
- Tombstone merge logic
- PickHighestPriority with tombstones
- Tombstone pruning
- Mixed-version cluster behavior
- Serialization round-trip
Tasks
- Apply defensive MergeSeen fix to Gossip.cs
- Add telemetry/logging when filtering removes entries (aids diagnostics)
- Document the Seen ⊆ Members invariant in XML doc comments
- Implement tombstones (1.6.0)
- Add protocol version negotiation for tombstone support (1.6.0)
- Update serialization for tombstones (1.6.0)