Gossip protocol can propagate corrupted Seen state - implement defensive fix and tombstones #8015

@Aaronontheweb

Problem

ClusterMessageSerializer.GossipToProto can throw ArgumentException: Unknown address when serializing gossip because Seen contains UniqueAddress entries that are not present in Members.

Invariant violated: Seen ⊆ Members (all addresses in Seen must be members of the cluster)

Root Cause Analysis

The gossip protocol has a vulnerability where removed members can be reintroduced during gossip merging, and their Seen entries can persist after the member is removed from Members.

Key issue: When a member is removed from the cluster:

  1. The member is removed from Members
  2. The VectorClock is pruned via Prune()
  3. But Seen and Reachability are NOT explicitly cleaned
  4. If stale gossip that still contains the removed member arrives later, the member's entries can be merged back in

The MergeSeen method (called when VectorClocks compare as Same) performs a blind union:

public Gossip MergeSeen(Gossip that)
{
    return Copy(overview: _overview.Copy(seen: _overview.Seen.Union(that._overview.Seen)));
}

This can introduce addresses into Seen that are no longer in Members.

Related Issue

This is the underlying cause of #8009.

Proposed Solution

Phase 1: Defensive Fix (Immediate)

Apply a defensive fix to MergeSeen that filters the merged result:

public Gossip MergeSeen(Gossip that)
{
    var memberAddresses = _members.Select(m => m.UniqueAddress).ToImmutableHashSet();
    var mergedSeen = _overview.Seen.Union(that._overview.Seen).Intersect(memberAddresses);
    return Copy(overview: _overview.Copy(seen: mergedSeen));
}

This ensures the invariant Seen ⊆ Members is always maintained, regardless of what state arrives via gossip.

Characteristics:

  • Zero breaking changes
  • Wire-compatible with all existing versions
  • Defense-in-depth against corruption from any source
  • Minimal performance impact (building the member address set and intersecting are each O(n) in the number of addresses)
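
The difference between the blind union and the filtered merge can be seen on a reduced model. The following is a minimal sketch, not the Akka.NET implementation: plain strings stand in for UniqueAddress, and SeenMergeSketch, MergeBlind and MergeFiltered are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Immutable;

// Simplified stand-in for the Seen/Members sets: strings instead of UniqueAddress.
public static class SeenMergeSketch
{
    // Blind union, mirroring the current MergeSeen behavior.
    public static ImmutableHashSet<string> MergeBlind(
        ImmutableHashSet<string> localSeen,
        ImmutableHashSet<string> remoteSeen)
        => localSeen.Union(remoteSeen);

    // Union followed by a filter against the local member set,
    // so the result is always a subset of members.
    public static ImmutableHashSet<string> MergeFiltered(
        ImmutableHashSet<string> members,
        ImmutableHashSet<string> localSeen,
        ImmutableHashSet<string> remoteSeen)
        => localSeen.Union(remoteSeen).Intersect(members);
}
```

With members {A, B}, a local Seen of {A} and a remote Seen of {B, X}, MergeBlind yields {A, B, X}, which breaks the invariant, while MergeFiltered yields {A, B}.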

Phase 2: Tombstones (1.6.0)

The proper fix requires implementing tombstones - markers that track removed members to prevent their reintroduction during gossip merging.

Why Tombstones Are Needed

Without tombstones, the following can occur:

  1. Node X is removed from cluster
  2. Leader removes X from Members and prunes VectorClock
  3. Stale gossip containing X arrives from a slow/partitioned node
  4. Member.PickHighestPriority may re-add X because there's no record that X was intentionally removed
  5. VectorClock entries for X get re-merged, affecting comparison ordering
  6. The clocks can then compare as Same, which triggers the MergeSeen path when it should not run

Implementation Spec

1. Add tombstones field to Gossip

public sealed class Gossip
{
    private readonly ImmutableDictionary<UniqueAddress, DateTime> _tombstones;
    public ImmutableDictionary<UniqueAddress, DateTime> Tombstones => _tombstones;
}

2. Add configuration

akka.cluster {
    # Time after which tombstones for removed members are pruned
    prune-gossip-tombstones-after = 24h
}

3. Add Remove() method for atomic member removal

public Gossip Remove(UniqueAddress node, DateTime timestamp)
{
    // Atomically:
    // - Remove from Members
    // - Remove from Seen  
    // - Remove from Reachability
    // - Prune VectorClock
    // - Add tombstone
}
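
One possible shape for the atomic removal is sketched below on a reduced model. Everything here is an assumption for illustration: GossipSketch is a hypothetical record, strings replace UniqueAddress, and Reachability and the VectorClock are modeled as plain sets of node names rather than their real types.

```csharp
using System;
using System.Collections.Immutable;

// Hypothetical, simplified gossip state. The real Gossip type carries richer
// Reachability and VectorClock structures; sets of node names stand in for them.
public sealed record GossipSketch(
    ImmutableHashSet<string> Members,
    ImmutableHashSet<string> Seen,
    ImmutableHashSet<string> ReachabilityNodes,
    ImmutableHashSet<string> ClockNodes,
    ImmutableDictionary<string, DateTime> Tombstones)
{
    // Produces one new immutable state in which the node is gone from every
    // collection and a tombstone records when it was removed.
    public GossipSketch Remove(string node, DateTime timestamp) =>
        this with
        {
            Members = Members.Remove(node),
            Seen = Seen.Remove(node),
            ReachabilityNodes = ReachabilityNodes.Remove(node),
            ClockNodes = ClockNodes.Remove(node),
            Tombstones = Tombstones.SetItem(node, timestamp)
        };
}
```

Because the state is immutable and replaced in a single step, no caller can observe a Gossip in which the member is gone but its Seen entry remains.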

4. Modify Merge() to handle tombstones

  • Merge tombstone maps (keep latest timestamp for each address)
  • Prune VectorClock entries for ALL tombstoned nodes
  • Filter PickHighestPriority results against tombstones
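
The first bullet, merging tombstone maps while keeping the latest timestamp per address, could look roughly like this. It is a sketch under assumptions: TombstoneMergeSketch is a hypothetical helper, and strings stand in for UniqueAddress.

```csharp
using System;
using System.Collections.Immutable;

public static class TombstoneMergeSketch
{
    // Combines two tombstone maps. When both sides know the same node,
    // the later removal timestamp wins.
    public static ImmutableDictionary<string, DateTime> Merge(
        ImmutableDictionary<string, DateTime> local,
        ImmutableDictionary<string, DateTime> remote)
    {
        var builder = local.ToBuilder();
        foreach (var entry in remote)
        {
            if (!builder.TryGetValue(entry.Key, out var existing) || entry.Value > existing)
                builder[entry.Key] = entry.Value;
        }
        return builder.ToImmutable();
    }
}
```

Taking the maximum per key makes the merge commutative and idempotent, so the order in which gossip arrives does not change the result.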

5. Modify Member.PickHighestPriority() to check tombstones

// Reject members that have been tombstoned
if (tombstones.ContainsKey(member.UniqueAddress))
    continue;
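
For context, the check above would sit inside the loop that selects surviving members. The sketch below shows only that filtering step on a reduced model; it does not reproduce the priority logic of Member.PickHighestPriority, and TombstoneFilterSketch and RejectTombstoned are hypothetical names with strings in place of Member.

```csharp
using System;
using System.Collections.Generic;
using System.Collections.Immutable;

public static class TombstoneFilterSketch
{
    // Keeps only candidates that have no tombstone recorded for them.
    public static ImmutableHashSet<string> RejectTombstoned(
        IEnumerable<string> candidates,
        ImmutableDictionary<string, DateTime> tombstones)
    {
        var accepted = ImmutableHashSet.CreateBuilder<string>();
        foreach (var member in candidates)
        {
            // Reject members that have been tombstoned
            if (tombstones.ContainsKey(member))
                continue;
            accepted.Add(member);
        }
        return accepted.ToImmutable();
    }
}
```

If stale gossip offers {A, B, X} and X has a tombstone, only {A, B} survive the merge.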

6. Add tombstone pruning in leader actions

  • Periodically remove tombstones older than prune-gossip-tombstones-after
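
A pruning pass might be as small as the following sketch. It assumes timestamps are stored in UTC and that the configured prune-gossip-tombstones-after value has already been read into a TimeSpan; TombstonePruneSketch is a hypothetical helper, and strings stand in for UniqueAddress.

```csharp
using System;
using System.Collections.Immutable;
using System.Linq;

public static class TombstonePruneSketch
{
    // Drops every tombstone whose age strictly exceeds pruneAfter.
    public static ImmutableDictionary<string, DateTime> Prune(
        ImmutableDictionary<string, DateTime> tombstones,
        DateTime nowUtc,
        TimeSpan pruneAfter)
    {
        var expired = tombstones
            .Where(entry => nowUtc - entry.Value > pruneAfter)
            .Select(entry => entry.Key);
        return tombstones.RemoveRange(expired);
    }
}
```

Passing the current time in as a parameter, rather than reading the system clock inside the method, keeps the pruning logic deterministic and easy to unit test.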

7. Update serialization

  • Add TombstoneEntry message to protobuf
  • Tombstones field is optional for backward compatibility during rolling upgrades
  • Full tombstone behavior only activates when all cluster members support it

Behavioral Compatibility Note

Tombstones change gossip merge semantics. In a mixed cluster (1.5.x and 1.6.x nodes):

  • 1.6.x nodes will send tombstones that 1.5.x nodes ignore
  • 1.6.x nodes should detect mixed-version clusters and fall back to defensive-only behavior
  • Full tombstone behavior requires cluster-wide upgrade to 1.6.x

This aligns with other protocol changes planned for 1.6.0.

Test Coverage

Existing tests in GossipMergeSeenCorruptionTests.cs cover:

  • Normal operations maintain invariants
  • Buggy MergeSeen propagates corruption
  • Fixed MergeSeen filters corruption
  • Fix preserves valid entries
  • Fix is idempotent when no corruption exists

Additional tests needed for tombstones:

  • Tombstone merge logic
  • PickHighestPriority with tombstones
  • Tombstone pruning
  • Mixed-version cluster behavior
  • Serialization round-trip

Tasks

  • Apply defensive MergeSeen fix to Gossip.cs
  • Add telemetry/logging when filtering removes entries (aids diagnostics)
  • Document the Seen ⊆ Members invariant in XML doc comments
  • Implement tombstones (1.6.0)
  • Add protocol version negotiation for tombstone support (1.6.0)
  • Update serialization for tombstones (1.6.0)
