TQ: Add support for reconfiguration #8741

andrewjstone · 2025-07-31T22:24:17Z

Builds upon #8682

This PR implements the ability to reconfigure the trust quorum after a commit. This includes the ability to fetch shares for the most recently committed configuration, recompute the rack secret from them, and include that secret in encrypted form in the new configuration for key rotation purposes.

The cluster proptest was enhanced to allow this, and it generates enough races, even without crashing and restarting nodes, that it forced the handling of `CommitAdvance` messages to be implemented. This implementation includes the ability to construct key shares for a new configuration when a node misses a prepare and commit for that configuration. This required adding a `KeyShareComputer`, which collects key shares for the configuration returned in a `CommitAdvance` so that the node can construct its own key share and commit the newly learned configuration.
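As a rough illustration of the collection step described above, here is a minimal sketch of a share collector. It is not the code from this PR: the `NodeId`, `Epoch`, `Share`, and `Configuration` types, the `threshold` field, and the `handle_share` method are all hypothetical stand-ins, and the actual reconstruction of the node's own key share is left out.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical stand-ins for the real protocol types.
pub type NodeId = u32;
pub type Epoch = u64;
pub type Share = Vec<u8>;

/// A configuration as learned from a `CommitAdvance` (fields assumed).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Configuration {
    pub epoch: Epoch,
    pub members: BTreeSet<NodeId>,
    /// Number of distinct shares assumed necessary before this node
    /// can construct its own key share.
    pub threshold: usize,
}

/// Collects key shares for one configuration until enough have arrived.
#[derive(Debug)]
pub struct KeyShareComputer {
    config: Configuration,
    collected: BTreeMap<NodeId, Share>,
}

impl KeyShareComputer {
    pub fn new(config: Configuration) -> Self {
        Self { config, collected: BTreeMap::new() }
    }

    /// Record a share received from `from` for `epoch`.
    ///
    /// Shares from non-members or for a different epoch are ignored, and
    /// a repeated share from the same node overwrites the earlier one.
    /// Returns `true` once at least `threshold` distinct members have
    /// contributed, at which point the caller could compute its own key
    /// share and commit the configuration.
    pub fn handle_share(&mut self, from: NodeId, epoch: Epoch, share: Share) -> bool {
        if epoch == self.config.epoch && self.config.members.contains(&from) {
            self.collected.insert(from, share);
        }
        self.collected.len() >= self.config.threshold
    }
}
```

Keying the collected shares by sender keeps duplicates from counting twice toward the threshold, which matters when messages can be redelivered or race with one another.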

Importantly, constructing a key share and coordinating a reconfiguration are mutually exclusive, and so a new invariant was added to the cluster test.
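The mutual-exclusion invariant could be checked along the following lines. This is only a sketch: `NodeState`, its two fields, and `check_mutually_exclusive` are hypothetical names chosen for illustration and do not reflect how the cluster test actually models node state.

```rust
/// Hypothetical view of the parts of a node's state relevant to the invariant.
#[derive(Debug, Default)]
pub struct NodeState {
    /// Epoch of the reconfiguration this node is coordinating, if any.
    pub coordinating: Option<u64>,
    /// Epoch for which this node is collecting shares to construct its
    /// own key share, if any.
    pub computing_key_share: Option<u64>,
}

/// Returns an error if the node is coordinating a reconfiguration and
/// constructing a key share at the same time.
pub fn check_mutually_exclusive(node: &NodeState) -> Result<(), String> {
    match (node.coordinating, node.computing_key_share) {
        (Some(c), Some(k)) => Err(format!(
            "node is coordinating epoch {} while computing a key share for epoch {}",
            c, k
        )),
        _ => Ok(()),
    }
}
```

A property test would call a check like this on every node after each generated action and fail the run on the first `Err`.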

We also start keeping track of expunged nodes in the cluster test, although we don't yet inform them that they are expunged if they reach out to other nodes.

There are a few places in the code where a runtime invariant is violated and an error message is logged. This always occurs on message receipt, and we don't want to panic at runtime because of an errant message and take down the sled-agent. However, we'd like to be able to report these upstream. The first step here is to be able to report when these situations are hit and put the node in an `Alarm` state such that it is stuck until remedied via support. We should *never* see an `Alarm` in practice, but since the states are possible to reach, we should manage them appropriately. This will come in a follow-up PR and be similar to what I implemented in #8062.
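To make the idea concrete, the sketch below shows one way a node might record an alarm on message receipt instead of panicking. Everything here is an assumption for illustration: the `Alarm` variant, the `Node` fields, the `on_commit` handler, and the "epoch went backwards" condition itself, which is not claimed to be one of the protocol's real invariants.

```rust
/// Hypothetical alarm type; the variant is illustrative only.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Alarm {
    EpochWentBackwards { from: u32, got: u64, committed: u64 },
}

#[derive(Debug, Default)]
pub struct Node {
    pub committed_epoch: u64,
    /// Set once an invariant violation has been observed.
    pub alarm: Option<Alarm>,
}

impl Node {
    /// Handle a commit message from `from` for `epoch`.
    ///
    /// Rather than panicking on a violated invariant, the node records an
    /// alarm that can be reported upstream and then stops making progress.
    pub fn on_commit(&mut self, from: u32, epoch: u64) {
        if self.alarm.is_some() {
            // Stuck until remedied: ignore further messages.
            return;
        }
        if epoch < self.committed_epoch {
            self.alarm = Some(Alarm::EpochWentBackwards {
                from,
                got: epoch,
                committed: self.committed_epoch,
            });
        } else {
            self.committed_epoch = epoch;
        }
    }
}
```

Storing the alarm as data rather than only logging it is what makes it possible to surface the condition to an operator later.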


This builds on #8741.

An alarm represents a protocol invariant violation. It's unclear exactly what should be done about these other than recording them and allowing them to be reported upstack, which is what is done in this PR.

An argument could be made for "freezing" the state machine such that trust quorum nodes stop working and the only thing they can do is report alarm status. However, that would block the trust quorum from operating at all, and it's unclear if this should cause an outage on that node. I'm also somewhat hesitant to put the alarms into the persistent state, as that would prevent unlock in the case of a sled/rack reboot.

The flip side of just recording alarms is the possible danger of continuing to operate with an invariant violation. This could be risky, and since we shouldn't ever see these, maybe pausing for a support call is the right thing. TBD, once more work is done on the protocol.

andrewjstone requested review from plotnick and sunshowers July 31, 2025 22:24

andrewjstone mentioned this pull request Aug 1, 2025

TQ: Add support for alarms in the protocol #8753

Merged

andrewjstone added a commit that referenced this pull request Aug 20, 2025

Fix relevant AI code review suggestions

e904c7f

These are for an older PR: #8741 https://gist.github.com/david-crespo/a84474b432090316fa3efcb41335cc24

andrewjstone force-pushed the tq-commit-and-prepare-ack branch from 071f1cf to 4bd87c7 Compare August 26, 2025 14:36

Base automatically changed from tq-commit-and-prepare-ack to main August 27, 2025 15:02

andrewjstone added 2 commits August 27, 2025 15:05

proptest cleanup

6ef670f

andrewjstone force-pushed the tq-reconfigure branch from cf0b76f to 6ef670f Compare August 27, 2025 15:05

andrewjstone enabled auto-merge (squash) August 27, 2025 16:40

andrewjstone merged commit 718ded7 into main Aug 27, 2025
16 checks passed

andrewjstone deleted the tq-reconfigure branch August 27, 2025 19:35
