CASSANDRA-20476 & CASSANDRA-20736 Handle CMS member addresses changing concurrently #4613

Open
beobal wants to merge 23 commits into apache:trunk from beobal:samt/CASSANDRA-20476

beobal commented Feb 13, 2026

Changing a node's broadcast address has always been supported, but it requires the node to inform the CMS of the change at startup. If a majority of the CMS members attempt to do this concurrently, they have no way to establish the quorum required to commit those metadata changes, so startup deadlocks.
This is addressed by the combination of two patchsets:

  • CASSANDRA-20736 modifies ClusterMetadata to represent the CMS membership as a set of node ids, rather than addresses.
  • CASSANDRA-20476 introduces a protocol that lets a starting node discover the current addresses of CMS members whose addresses changed while that node was down. The node then constructs a temporary address lookup, which it uses to contact CMS members and fetch or update the latest agreed ClusterMetadata. When the starting node is itself a CMS member, this lookup enables it to form a consensus group with the other members so that address changes can be durably committed and disseminated.
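
To illustrate the temporary lookup described in the second bullet, here is a minimal Java sketch. It is not the code from this patch: `TemporaryAddressLookup`, `recordConfirmed`, `resolve` and `hasQuorum` are hypothetical names, plain `int` ids and `InetSocketAddress` stand in for `NodeId` and `InetAddressAndPort`, and the simple majority check is an assumption about what forming a consensus group requires.

```java
import java.net.InetSocketAddress;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only; not the class introduced by this PR. */
final class TemporaryAddressLookup
{
    // Addresses of CMS members as recorded in the locally persisted metadata.
    private final Map<Integer, InetSocketAddress> lastKnown;
    // Addresses confirmed during discovery; these take precedence over lastKnown.
    private final Map<Integer, InetSocketAddress> confirmed = new HashMap<>();

    TemporaryAddressLookup(Map<Integer, InetSocketAddress> lastKnown)
    {
        this.lastKnown = Map.copyOf(lastKnown);
    }

    /** Record the current address of a CMS member; ids outside the CMS are ignored. */
    void recordConfirmed(int nodeId, InetSocketAddress current)
    {
        if (lastKnown.containsKey(nodeId))
            confirmed.put(nodeId, current);
    }

    /** Prefer a confirmed address, falling back to the last known one. */
    InetSocketAddress resolve(int nodeId)
    {
        InetSocketAddress override = confirmed.get(nodeId);
        return override != null ? override : lastKnown.get(nodeId);
    }

    /** Assumption: a simple majority of CMS members must have confirmed addresses. */
    boolean hasQuorum()
    {
        return confirmed.size() >= lastKnown.size() / 2 + 1;
    }

    public static void main(String[] args)
    {
        Map<Integer, InetSocketAddress> known = new HashMap<>();
        known.put(1, InetSocketAddress.createUnresolved("10.0.0.1", 7000));
        known.put(2, InetSocketAddress.createUnresolved("10.0.0.2", 7000));
        known.put(3, InetSocketAddress.createUnresolved("10.0.0.3", 7000));

        TemporaryAddressLookup lookup = new TemporaryAddressLookup(known);
        lookup.recordConfirmed(1, InetSocketAddress.createUnresolved("10.0.1.1", 7000));
        lookup.recordConfirmed(2, InetSocketAddress.createUnresolved("10.0.1.2", 7000));

        System.out.println(lookup.resolve(1) + " quorum=" + lookup.hasQuorum());
    }
}
```

Keying the lookup by node id rather than by address mirrors the CASSANDRA-20736 change: the identity of a CMS member stays stable even when every address in the map is stale.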

beobal and others added 23 commits February 2, 2026 11:40
beobal requested review from krummas and removed request for krummas February 13, 2026 18:10
beobal changed the title from "CASSANDRA-20476 & CASSANDRA-20736 Handle all CMS member addresses changing concurrently" to "CASSANDRA-20476 & CASSANDRA-20736 Handle CMS member addresses changing concurrently" Feb 13, 2026

public ImmutableSet<NodeId> joiningMembers()
{
    return ImmutableSet.copyOf(joiningMembers);
Member

nit: BTreeSet is already immutable, so we probably don't need to copy it


/**
 * Used to derive a CMSMembership when deserializing a ClusterMetadata instance written with a metadata version
 * prior to V7. At that time, CMS membership was always inferred from the data placements of the distributed
Member

"prior to V9" I think?

}

private final Map<NodeId, Pair<InetAddressAndPort, InetAddressAndPort>> overrides;
private final BiMap<InetAddressAndPort, InetAddressAndPort> addressMap;
Member

addressMap is only used in the toString method

return new InitialBuilder(metadata);
}

private final Map<NodeId, Pair<InetAddressAndPort, InetAddressAndPort>> overrides;
Member

this should probably be an ImmutableMap for clarity?

and if we make InitialBuilder and rebuild (below) build ImmutableMaps, we can avoid the copying

return state == State.ACTIVE;
}

public InetAddressAndPort getAddressOverride(NodeId id)
Member

unused

else
{
    // This cluster did not previously upgrade from a gossip based version (i.e. pre-6.0) but did at some point
    // run a version prior to MetadataVersion.V7 where we started to encode CMS membership directly. This
Member

V9

// so we can derive the CMSMembership using the data placement and directory.
DataPlacement placement = placements.get(metadataKs.params.replication);
cmsMembership = CMSMembership.reconstruct(placement, dir);
placements = placements.unbuild().without(metadataKs.params.replication).build();
Member

I think this is unnecessary: we do the same thing directly after the if statement

int currentRound = 0;
long roundTimeNanos = Math.min(TimeUnit.SECONDS.toNanos(4),
                               DatabaseDescriptor.getDiscoveryTimeout(TimeUnit.NANOSECONDS) / maxRounds);
// TODO a non-CMS node only needs to be able to contact a single CMS member to commit its STARTUP
Member

should we fix this? It feels like we'll most often discover the full CMS if it's up

and if it is not yet up, it might be better to wait here before trying to commit Startup?


int maxRounds = 5;
int currentRound = 0;
long roundTimeNanos = Math.min(TimeUnit.SECONDS.toNanos(4),
Member

is 4s enough here? Should we add another "discover survey" config setting?
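
For context on the question above, the per-round budget in the quoted snippet can be worked through for a few timeouts. This is a sketch only: `DiscoveryRoundBudget` and `roundTimeNanos(long, int)` are hypothetical names, the timeout values are made up for illustration, and it assumes discovery runs at most `maxRounds` rounds of `roundTimeNanos` each, which the quoted fragment suggests but does not show.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper reproducing the arithmetic of the quoted snippet. */
final class DiscoveryRoundBudget
{
    // Same expression as in the diff: each round gets an equal share of the
    // overall discovery timeout, capped at 4 seconds.
    static long roundTimeNanos(long discoveryTimeoutNanos, int maxRounds)
    {
        return Math.min(TimeUnit.SECONDS.toNanos(4), discoveryTimeoutNanos / maxRounds);
    }

    public static void main(String[] args)
    {
        int maxRounds = 5;
        // Illustrative timeouts in seconds; not defaults taken from Cassandra.
        for (long timeoutSeconds : new long[]{ 10, 20, 60 })
        {
            long perRound = roundTimeNanos(TimeUnit.SECONDS.toNanos(timeoutSeconds), maxRounds);
            long total = perRound * maxRounds;
            System.out.printf("timeout=%ds perRound=%dms total=%dms%n",
                              timeoutSeconds,
                              TimeUnit.NANOSECONDS.toMillis(perRound),
                              TimeUnit.NANOSECONDS.toMillis(total));
        }
    }
}
```

With five rounds, the 4s cap takes effect once the configured timeout exceeds 20s, so under the stated assumption a larger timeout would not lengthen discovery unless the cap itself became configurable.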

Map<NodeId, InetAddressAndPort> confirmedCMS = new HashMap<>();

Set<InetAddressAndPort> candidates = new HashSet<>(previousCMS.values());
candidates.add(newAddress);
Member

any reason we don't add the seeds to candidates here? Feels like it could save us a discovery round
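
A rough sketch of what this suggestion could look like, for discussion only. `DiscoveryCandidates` and `initialCandidates` are hypothetical names, `InetSocketAddress` stands in for `InetAddressAndPort`, and the `seeds` parameter is an assumed input; how the patch would actually obtain the seed list is not shown in this diff.

```java
import java.net.InetSocketAddress;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of seeding the candidate set; not code from this PR. */
final class DiscoveryCandidates
{
    static Set<InetSocketAddress> initialCandidates(Collection<InetSocketAddress> previousCms,
                                                    InetSocketAddress newAddress,
                                                    Collection<InetSocketAddress> seeds)
    {
        // Start from the last known CMS addresses, as in the quoted snippet.
        Set<InetSocketAddress> candidates = new HashSet<>(previousCms);
        candidates.add(newAddress);
        // Suggested addition: also contact seeds in the first round. Duplicates
        // collapse because the candidates are held in a Set.
        candidates.addAll(seeds);
        return candidates;
    }

    public static void main(String[] args)
    {
        List<InetSocketAddress> previousCms = List.of(InetSocketAddress.createUnresolved("10.0.0.1", 7000),
                                                      InetSocketAddress.createUnresolved("10.0.0.2", 7000));
        List<InetSocketAddress> seeds = List.of(InetSocketAddress.createUnresolved("10.0.0.2", 7000),
                                                InetSocketAddress.createUnresolved("10.0.0.9", 7000));
        InetSocketAddress self = InetSocketAddress.createUnresolved("10.0.1.1", 7000);
        System.out.println(initialCandidates(previousCms, self, seeds).size() + " candidates");
    }
}
```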
