Skip to content

Commit bab40c6

Browse files
committed
TQ: Add support for alarms in the protocol
An alarm represents a protocol invariant violation. It's unclear exactly what should be done about these other than recording them and allowing them to be reported upstack, which is what is done in this PR. An argument could be made for "freezing" the state machine such that trust quorum nodes stop working and the only thing they can do is report alarm status. However, that would block the trust quorum from operating at all, and it's unclear if this should cause an outage on that node. I'm also somewhat hesitant to put the alarms into the persistent state as that would prevent unlock in the case of a sled/rack reboot. On the flip side of just recording is the possible danger resulting from operating with an invariant violation. This could potentially be risky, and since we shouldn't ever see these maybe pausing for a support call is the right thing. TBD, once more work is done on the protocol.
1 parent cf0b76f commit bab40c6

File tree

7 files changed

+98
-9
lines changed

7 files changed

+98
-9
lines changed

trust-quorum/gfss/src/shamir.rs

Lines changed: 11 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -23,7 +23,17 @@ pub enum SplitError {
2323
TooFewTotalShares { n: u8, k: u8 },
2424
}
2525

26-
#[derive(Debug, Clone, thiserror::Error, PartialEq, Eq)]
26+
#[derive(
27+
Debug,
28+
Clone,
29+
thiserror::Error,
30+
PartialEq,
31+
Eq,
32+
PartialOrd,
33+
Ord,
34+
Serialize,
35+
Deserialize,
36+
)]
2737
pub enum CombineError {
2838
#[error("must be at least 2 shares to combine")]
2939
TooFewShares,

trust-quorum/src/alarm.rs

Lines changed: 42 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,42 @@
1+
// This Source Code Form is subject to the terms of the Mozilla Public
2+
// License, v. 2.0. If a copy of the MPL was not distributed with this
3+
// file, You can obtain one at https://mozilla.org/MPL/2.0/.
4+
5+
//! Mechanism for reporting protocol invariant violations
6+
7+
use serde::{Deserialize, Serialize};
8+
9+
use crate::{Configuration, Epoch};
10+
11+
#[derive(
12+
Debug, Clone, PartialEq, Eq, PartialOrd, Ord, Serialize, Deserialize,
13+
)]
14+
pub enum Alarm {
15+
/// Different configurations found for the same epoch
16+
///
17+
/// Reason: Nexus creates configurations and stores them in CRDB before
18+
/// sending them to a coordinator of its choosing. Nexus will not send the
19+
/// same reconfiguration request to different coordinators. If it does those
20+
/// coordinators will generate different key shares. However, since Nexus
21+
/// will not tell different nodes to coordinate the same configuration, this
22+
/// state should be impossible to reach.
23+
MismatchedConfigurations { config1: Configuration, config2: Configuration },
24+
25+
/// We received a `CommitAdvance` while coordinating for the same epoch
26+
///
27+
/// Reason: `CommitAdvance` is a reply for a key share request in an
28+
/// old epoch that we don't have the latest committed coordination.
29+
/// However we are actually the coordinator for the configuration in the
30+
/// `CommitAdvance`. While it's possible that another node could learn
31+
/// of the commit from Nexus before the coordinator, the coordinator will
32+
/// never ask for a key share for that epoch. Therefore this state should be
33+
/// impossible to reach.
34+
CommitAdvanceForCoordinatingEpoch { config: Configuration },
35+
36+
/// The `keyShareComputer` could not compute this node's share
37+
///
38+
/// Reason: A threshold of valid key shares were received based on the the
39+
/// share digests in the Configuration. However, computation of the share
40+
/// still failed. This should be impossible.
41+
ShareComputationFailed { epoch: Epoch, err: gfss::shamir::CombineError },
42+
}

trust-quorum/src/compute_key_share.rs

Lines changed: 6 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -9,7 +9,9 @@
99
//! other nodes so that it can compute its own key share.
1010
1111
use crate::crypto::Sha3_256Digest;
12-
use crate::{Configuration, Epoch, NodeHandlerCtx, PeerMsgKind, PlatformId};
12+
use crate::{
13+
Alarm, Configuration, Epoch, NodeHandlerCtx, PeerMsgKind, PlatformId,
14+
};
1315
use gfss::gf256::Gf256;
1416
use gfss::shamir::{self, Share};
1517
use slog::{Logger, error, o, warn};
@@ -137,11 +139,9 @@ impl KeyShareComputer {
137139
});
138140
true
139141
}
140-
Err(e) => {
141-
// TODO: put the node into into an `Alarm` state similar to
142-
// https://github.com/oxidecomputer/omicron/pull/8062 once we
143-
// have alarms?
144-
error!(self.log, "Failed to compute share: {}", e);
142+
Err(err) => {
143+
error!(self.log, "Failed to compute share: {}", err);
144+
ctx.raise_alarm(Alarm::ShareComputationFailed { epoch, err });
145145
false
146146
}
147147
}

trust-quorum/src/configuration.rs

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -30,7 +30,9 @@ pub enum ConfigurationError {
3030
/// The configuration for a given epoch.
3131
///
3232
/// Only valid for non-lrtq configurations
33-
#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)]
33+
#[derive(
34+
Debug, Clone, PartialEq, Eq, PartialOrd, Ord, Serialize, Deserialize,
35+
)]
3436
pub struct Configuration {
3537
/// Unique Id of the rack
3638
pub rack_id: RackUuid,

trust-quorum/src/lib.rs

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -23,7 +23,9 @@ mod persistent_state;
2323
mod validators;
2424
pub use configuration::Configuration;
2525
pub use coordinator_state::{CoordinatorOperation, CoordinatorState};
26+
mod alarm;
2627

28+
pub use alarm::Alarm;
2729
pub use crypto::RackSecret;
2830
pub use messages::*;
2931
pub use node::Node;

trust-quorum/src/node_ctx.rs

Lines changed: 19 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -4,14 +4,17 @@
44

55
//! Parameter to Node API calls that allows interaction with the system at large
66
7-
use crate::{Envelope, PeerMsg, PeerMsgKind, PersistentState, PlatformId};
7+
use crate::{
8+
Alarm, Envelope, PeerMsg, PeerMsgKind, PersistentState, PlatformId,
9+
};
810
use std::collections::BTreeSet;
911

1012
/// An API shared by [`NodeCallerCtx`] and [`NodeHandlerCtx`]
1113
pub trait NodeCommonCtx {
1214
fn platform_id(&self) -> &PlatformId;
1315
fn persistent_state(&self) -> &PersistentState;
1416
fn connected(&self) -> &BTreeSet<PlatformId>;
17+
fn alarms(&self) -> &BTreeSet<Alarm>;
1518
}
1619

1720
/// An API for an [`NodeCtx`] usable from a [`crate::Node`]
@@ -54,6 +57,9 @@ pub trait NodeHandlerCtx: NodeCommonCtx {
5457

5558
/// Remove a peer from the connected set
5659
fn remove_connection(&mut self, id: &PlatformId);
60+
61+
/// Record (in-memory) that an alarm has occurred
62+
fn raise_alarm(&mut self, alarm: Alarm);
5763
}
5864

5965
/// Common parameter to [`crate::Node`] methods
@@ -79,6 +85,9 @@ pub struct NodeCtx {
7985

8086
/// Connected peer nodes
8187
connected: BTreeSet<PlatformId>,
88+
89+
/// Any alarms that have occurred
90+
alarms: BTreeSet<Alarm>,
8291
}
8392

8493
impl NodeCtx {
@@ -89,6 +98,7 @@ impl NodeCtx {
8998
persistent_state_changed: false,
9099
outgoing: Vec::new(),
91100
connected: BTreeSet::new(),
101+
alarms: BTreeSet::new(),
92102
}
93103
}
94104
}
@@ -105,6 +115,10 @@ impl NodeCommonCtx for NodeCtx {
105115
fn connected(&self) -> &BTreeSet<PlatformId> {
106116
&self.connected
107117
}
118+
119+
fn alarms(&self) -> &BTreeSet<Alarm> {
120+
&self.alarms
121+
}
108122
}
109123

110124
impl NodeHandlerCtx for NodeCtx {
@@ -138,6 +152,10 @@ impl NodeHandlerCtx for NodeCtx {
138152
fn remove_connection(&mut self, id: &PlatformId) {
139153
self.connected.remove(id);
140154
}
155+
156+
fn raise_alarm(&mut self, alarm: Alarm) {
157+
self.alarms.insert(alarm);
158+
}
141159
}
142160

143161
impl NodeCallerCtx for NodeCtx {

trust-quorum/tests/cluster.rs

Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -834,6 +834,7 @@ impl TestState {
834834
self.invariant_nodes_have_prepared_if_coordinator_has_acks()?;
835835
self.invariant_nodes_have_committed_if_nexus_has_acks()?;
836836
self.invariant_nodes_not_coordinating_and_computing_key_share_simultaneously()?;
837+
self.invariant_no_alarms()?;
837838
Ok(())
838839
}
839840

@@ -953,6 +954,20 @@ impl TestState {
953954

954955
Ok(())
955956
}
957+
958+
// Ensure there has been no alarm at any node
959+
fn invariant_no_alarms(&self) -> Result<(), TestCaseError> {
960+
for (id, (_, ctx)) in &self.sut.nodes {
961+
let alarms = ctx.alarms();
962+
prop_assert!(
963+
alarms.is_empty(),
964+
"Alarms found for {}: {:#?}",
965+
id,
966+
alarms
967+
);
968+
}
969+
Ok(())
970+
}
956971
}
957972

958973
/// Broken out of `TestState` to alleviate borrow checker woes

0 commit comments

Comments
 (0)