Skip to content

Commit 7ba1986

Browse files
committed
TQ: Add support for reconfiguration
Builds upon #8682 This PR implements the ability to reconfigure the trust quorum after a commit. This includes the ability to fetch shares for the most recently committed configuration to recompute the rack secret and then include that in an encrypted form in the new configuration for key rotation purposes. The cluster proptest was enhanced to allow this, and it generates enough races - even without crashing and restarting nodes that it forced the handling of `CommitAdvance` messages to be implemented. This implementation includes the ability to construct key shares for a new configuration when a node misses a prepare and commit for that configuration. This required adding a `KeyShareComputer` which collects key shares for the configuration returned in a `CommitAdvance` so that it can construct its own key share and commit the newly learned configuration. Importantly, constructing a key share and coordinating a reconfiguration are mutually exclusive, and so a new invariant was added to the cluster test. We also start keeping track of expunged nodes in the cluster test, although we don't yet inform them that they are expunged if they reach out to other nodes. There are a few places in the code where a runtime invariant is violated and an error message is logged. This always occurs on message receipt and we don't want to panic at runtime because of an errant message and take down the sled-agent. However, we'd like to be able to report these upstream. The first step here is to be able to report when these situations are hit and put the node in an `Alarm` state such that it is stuck until remedied via support. We should *never* see an Alarm in practice, but since the states are possible to reach, we should manage them appropriately. This will come in a follow up PR and be similar to what I implemented in #8062.
1 parent 071f1cf commit 7ba1986

File tree

8 files changed

+1068
-105
lines changed

8 files changed

+1068
-105
lines changed

trust-quorum/src/compute_key_share.rs

Lines changed: 149 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,149 @@
1+
// This Source Code Form is subject to the terms of the Mozilla Public
2+
// License, v. 2.0. If a copy of the MPL was not distributed with this
3+
// file, You can obtain one at https://mozilla.org/MPL/2.0/.
4+
5+
//! Mechanism to track and compute this node's key share for a configuration
6+
//!
7+
//! When a node learns of a committed configuration but does not have a key
8+
//! share for that configuration it must collect a threshold of key shares from
9+
//! other nodes so that it can compute its own key share.
10+
11+
use crate::crypto::Sha3_256Digest;
12+
use crate::{Configuration, Epoch, NodeHandlerCtx, PeerMsgKind, PlatformId};
13+
use gfss::gf256::Gf256;
14+
use gfss::shamir::{self, Share};
15+
use slog::{Logger, error, o, warn};
16+
use std::collections::BTreeMap;
17+
18+
/// In memory state that tracks retrieval of key shares in order to compute
19+
/// this node's key share for a given configuration.
20+
pub struct KeyShareComputer {
21+
log: Logger,
22+
23+
// A copy of the configuration stored in persistent state
24+
config: Configuration,
25+
26+
collected_shares: BTreeMap<PlatformId, Share>,
27+
}
28+
29+
impl KeyShareComputer {
30+
pub fn new(
31+
log: &Logger,
32+
ctx: &mut impl NodeHandlerCtx,
33+
config: Configuration,
34+
) -> KeyShareComputer {
35+
let log = log.new(o!("component" => "tq-key-share-computer"));
36+
37+
for id in config.members.keys() {
38+
if ctx.connected().contains(id) {
39+
ctx.send(id.clone(), PeerMsgKind::GetShare(config.epoch));
40+
}
41+
}
42+
43+
KeyShareComputer { log, config, collected_shares: BTreeMap::new() }
44+
}
45+
46+
pub fn config(&self) -> &Configuration {
47+
&self.config
48+
}
49+
50+
pub fn on_connect(
51+
&mut self,
52+
ctx: &mut impl NodeHandlerCtx,
53+
peer: PlatformId,
54+
) {
55+
if !self.collected_shares.contains_key(&peer) {
56+
ctx.send(peer, PeerMsgKind::GetShare(self.config.epoch));
57+
}
58+
}
59+
60+
/// We received a key share
61+
///
62+
/// Return true if we have computed and saved our key share to the
63+
/// persistent state, false otherwise.
64+
pub fn handle_share(
65+
&mut self,
66+
ctx: &mut impl NodeHandlerCtx,
67+
from: PlatformId,
68+
epoch: Epoch,
69+
share: Share,
70+
) -> bool {
71+
// Are we trying to retrieve shares for `epoch`?
72+
if epoch != self.config.epoch {
73+
warn!(
74+
self.log,
75+
"Received Share from node with wrong epoch";
76+
"received_epoch" => %epoch,
77+
"from" => %from
78+
);
79+
return false;
80+
}
81+
82+
// Is the sender a member of the configuration `epoch`?
83+
// Was the sender a member of the configuration at `old_epoch`?
84+
let Some(expected_digest) = self.config.members.get(&from) else {
85+
warn!(
86+
self.log,
87+
"Received Share from unexpected node";
88+
"epoch" => %epoch,
89+
"from" => %from
90+
);
91+
return false;
92+
};
93+
94+
// Does the share hash match what we expect?
95+
let mut digest = Sha3_256Digest::default();
96+
share.digest::<sha3::Sha3_256>(&mut digest.0);
97+
if digest != *expected_digest {
98+
error!(
99+
self.log,
100+
"Received share with invalid digest";
101+
"epoch" => %epoch,
102+
"from" => %from
103+
);
104+
}
105+
106+
// A valid share was received. Is it new?
107+
if self.collected_shares.insert(from, share).is_some() {
108+
return false;
109+
}
110+
111+
// Do we have enough shares to computer our rack share?
112+
if self.collected_shares.len() < self.config.threshold.0 as usize {
113+
return false;
114+
}
115+
116+
// What index are we in the configuration? This is our "x-coordinate"
117+
// for our key share calculation. We always start indexing from 1, since
118+
// 0 is the rack secret.
119+
let index = self
120+
.config
121+
.members
122+
.keys()
123+
.position(|id| id == ctx.platform_id())
124+
.expect("node exists");
125+
let x_coordinate =
126+
Gf256::new(u8::try_from(index + 1).expect("index fits in u8"));
127+
128+
let shares: Vec<_> = self.collected_shares.values().cloned().collect();
129+
130+
match shamir::compute_share(&shares, x_coordinate) {
131+
Ok(our_share) => {
132+
ctx.update_persistent_state(|ps| {
133+
let inserted_share =
134+
ps.shares.insert(epoch, our_share).is_none();
135+
let inserted_commit = ps.commits.insert(epoch);
136+
inserted_share || inserted_commit
137+
});
138+
true
139+
}
140+
Err(e) => {
141+
// TODO: put the node into into an `Alarm` state similar to
142+
// https://github.com/oxidecomputer/omicron/pull/8062 once we
143+
// have alarms?
144+
error!(self.log, "Failed to compute share: {}", e);
145+
false
146+
}
147+
}
148+
}
149+
}

0 commit comments

Comments
 (0)