Skip to content

Conversation

andrewjstone
Copy link
Contributor

@andrewjstone andrewjstone commented Jul 31, 2025

Builds upon #8682

This PR implements the ability to reconfigure the trust quorum after a commit. This includes the ability to fetch shares for the most recently committed configuration to recompute the rack secret and then include that in an encrypted form in the new configuration for key rotation purposes.

The cluster proptest was enhanced to allow this, and it generates enough races - even without crashing and restarting nodes - that it forced the handling of CommitAdvance messages to be implemented. This implementation includes the ability to construct key shares for a new configuration when a node misses a prepare and commit for that configuration. This required adding a KeyShareComputer which collects key shares for the configuration returned in a CommitAdvance so that it can construct its own key share and commit the newly learned configuration.

Importantly, constructing a key share and coordinating a reconfiguration are mutually exclusive, and so a new invariant was added to the cluster test.

We also start keeping track of expunged nodes in the cluster test, although we don't yet inform them that they are expunged if they reach out to other nodes.

There are a few places in the code where a runtime invariant is violated and an error message is logged. This always occurs on message receipt and we don't want to panic at runtime because of an errant message and take down the sled-agent. However, we'd like to be able to report these upstream. The first step here is to be able to report when these situations are hit and put the node in an Alarm state such that it is stuck until remedied via support. We should never see an Alarm in practice, but since the states are possible to reach, we should manage them appropriately. This will come in a follow up PR and be similar to what I implemented in #8062.

@andrewjstone andrewjstone force-pushed the tq-commit-and-prepare-ack branch from 071f1cf to 4bd87c7 Compare August 26, 2025 14:36
Base automatically changed from tq-commit-and-prepare-ack to main August 27, 2025 15:02
Builds upon #8682

This PR implements the ability to reconfigure the trust quorum after a
commit. This includes the ability to fetch shares for the most recently
committed configuration to recompute the rack secret and then include
that in an encrypted form in the new configuration for key rotation
purposes.

The cluster proptest was enhanced to allow this, and it generates
enough races - even without crashing and restarting nodes that it
forced the handling of `CommitAdvance` messages to be implemented.
This implementation includes the ability to construct key shares for
a new configuration when a node misses a prepare and commit for that
configuration. This required adding a `KeyShareComputer` which collects
key shares for the configuration returned in a `CommitAdvance` so
that it can construct its own key share and commit the newly learned
configuration.

Importantly, constructing a key share and coordinating a reconfiguration
are mutually exclusive, and so a new invariant was added to the cluster
test.

We also start keeping track of expunged nodes in the cluster test,
although we don't yet inform them that they are expunged if they reach
out to other nodes.

There are a few places in the code where a runtime invariant is violated
and an error message is logged. This always occurs on message receipt
and we don't want to panic at runtime because of an errant message
and take down the sled-agent. However, we'd like to be able to report
these upstream. The first step here is to be able to report when these
situations are hit and put the node in an `Alarm` state such that it
is stuck until remedied via support. We should *never* see an Alarm in
practice, but since the states are possible to reach, we should manage
them appropriately. This will come in a follow up PR and be similar to
what I implemented in #8062.
@andrewjstone andrewjstone enabled auto-merge (squash) August 27, 2025 16:40
@andrewjstone andrewjstone merged commit 718ded7 into main Aug 27, 2025
16 checks passed
@andrewjstone andrewjstone deleted the tq-reconfigure branch August 27, 2025 19:35
andrewjstone added a commit that referenced this pull request Aug 27, 2025
This builds on #8741

An alarm represents a protocol invariant violation. It's unclear exactly
what should be done about these other than recording them and allowing
them to be reported upstack, which is what is done in this PR. An
argument could be made for "freezing" the state machine such that trust
quorum nodes stop working and the only thing they can do is report alarm
status. However, that would block the trust quorum from operating at
all, and it's unclear if this should cause an outage on that node.

I'm also somewhat hesitant to put the alarms into the persistent state
as that would prevent unlock in the case of a sled/rack reboot.

On the flip side of just recording is the possible danger resulting from
operating with an invariant violation. This could potentially be risky,
and since we shouldn't ever see these maybe pausing for a support call
is the right thing. TBD, once more work is done on the protocol.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant