TQ: Add support for reconfiguration #8741
Merged
Builds upon #8682
This PR implements the ability to reconfigure the trust quorum after a commit. This includes fetching shares for the most recently committed configuration to recompute the rack secret, which is then included in encrypted form in the new configuration for key rotation purposes.
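As a rough illustration of that key-rotation flow only (not the actual trust quorum API): the types `KeyShare` and `RackSecret` and the functions `recompute_rack_secret` and `wrap_old_secret` below are hypothetical names, and the cryptography is stubbed out.

```rust
use std::collections::BTreeMap;

/// A key share held by one node for a given configuration (illustrative type).
#[derive(Clone)]
struct KeyShare(Vec<u8>);

/// The reconstructed rack secret (illustrative type).
struct RackSecret(Vec<u8>);

/// Reconstruct the rack secret once a threshold of shares for the most
/// recently committed configuration has been collected.
fn recompute_rack_secret(
    shares: &BTreeMap<u8, KeyShare>,
    threshold: usize,
) -> Option<RackSecret> {
    if shares.len() < threshold {
        return None; // keep collecting shares from other nodes
    }
    // Placeholder for the real secret-sharing combine step.
    let combined = shares.values().flat_map(|s| s.0.iter().copied()).collect();
    Some(RackSecret(combined))
}

/// Carry the old rack secret forward in the new configuration for key
/// rotation. Real code would seal it with an AEAD keyed from the new
/// secret; here it is simply copied as a placeholder.
fn wrap_old_secret(old: &RackSecret, _new: &RackSecret) -> Vec<u8> {
    old.0.clone()
}
```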
The cluster proptest was enhanced to allow this, and it generates enough races - even without crashing and restarting nodes - that it forced the handling of `CommitAdvance` messages to be implemented. This implementation includes the ability to construct key shares for a new configuration when a node misses a prepare and commit for that configuration. This required adding a `KeyShareComputer`, which collects key shares for the configuration returned in a `CommitAdvance` so that it can construct its own key share and commit the newly learned configuration. Importantly, constructing a key share and coordinating a reconfiguration are mutually exclusive, and so a new invariant was added to the cluster test.
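A minimal sketch of the share-collection idea behind `KeyShareComputer`: the fields, the `on_share` method, and the threshold handling are illustrative assumptions rather than the real implementation, and the derivation of the node's own share is stubbed out.

```rust
use std::collections::BTreeMap;

type Epoch = u64;
type NodeId = String;

#[derive(Clone)]
struct KeyShare(Vec<u8>);

/// Collects key shares for a configuration learned via `CommitAdvance` so
/// that a node which missed the prepare and commit can derive its own share
/// and then commit the newly learned configuration.
struct KeyShareComputer {
    /// Epoch of the configuration we are catching up to (illustrative field).
    epoch: Epoch,
    /// Number of shares needed before we can compute our own.
    threshold: usize,
    collected: BTreeMap<NodeId, KeyShare>,
}

impl KeyShareComputer {
    fn new(epoch: Epoch, threshold: usize) -> Self {
        Self { epoch, threshold, collected: BTreeMap::new() }
    }

    /// Record a share from a peer; returns this node's computed share once
    /// enough shares have been gathered (the computation itself is stubbed).
    fn on_share(&mut self, from: NodeId, share: KeyShare) -> Option<KeyShare> {
        self.collected.insert(from, share);
        if self.collected.len() < self.threshold {
            return None;
        }
        // Placeholder: real code would interpolate its own share from the
        // collected shares before committing the configuration.
        Some(KeyShare(Vec::new()))
    }
}
```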
We also start keeping track of expunged nodes in the cluster test, although we don't yet inform them that they are expunged if they reach out to other nodes.
There are a few places in the code where a runtime invariant is violated and an error message is logged. This always occurs on message receipt, and we don't want to panic at runtime because of an errant message and take down the sled-agent. However, we'd like to be able to report these upstream. The first step is to report when these situations are hit and put the node into an `Alarm` state such that it is stuck until remedied via support. We should never see an `Alarm` in practice, but since the states are reachable, we should handle them appropriately. This will come in a follow-up PR and be similar to what I implemented in #8062.
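A hedged sketch of what that "log and alarm instead of panic" pattern could look like; since the `Alarm` work lands in the follow-up PR, the `NodeState` enum, the `raise_alarm` method, and the logging call here are hypothetical stand-ins.

```rust
/// Reasons a node can enter the stuck-until-support-intervenes state
/// (illustrative; the real variants will come in the follow-up PR).
#[derive(Debug)]
enum Alarm {
    InvariantViolated { details: String },
}

#[derive(Debug)]
enum NodeState {
    Running,
    /// The node stays here until remedied via support; it never panics.
    Alarmed(Alarm),
}

impl NodeState {
    /// Called when message handling detects an impossible situation: record
    /// the alarm and log it instead of taking down the sled-agent.
    fn raise_alarm(&mut self, details: String) {
        eprintln!("invariant violated: {details}"); // real code would log via the node's logger
        *self = NodeState::Alarmed(Alarm::InvariantViolated { details });
    }
}
```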