Skip to content

Commit 29c5508

Browse files
authored
TQ: Introduce tqdb (#8801)
This PR builds on #8753 This is a hefty PR, but it's not as bad as it looks. Over 4k lines of it is in the example log file in the second commit. There's also some moved and unmodified code that I'll point out. This PR introduces a new test tool for the trust-quorum protocol: tqdb. tqdb is a repl that takes event traces produced by the `cluster` proptest and uses them for deterministic replay of actions against test state. The test state includes a "universe" of real protocol nodes, a fake nexus, and fake networks. The proptest and debugging state is shared and contained in the `trust-quorum-test-utils`. The debugger allows a variety of functionality including stepping through individual events, setting breakpoints, snapshotting and diffing states and viewing the event log itself. The purpose of tqdb is twofold: 1. Allow for debugging of failed proptests. This is non-trivial in some cases, even with shrunken tests, because the generated actions are high-level and are all generated up front. The actual operations such as reconfigurations are derived from these high level random generations in conjunction with the current state of the system. Therefore the set of failing generated actions doesn't really tell you much. You have to look at the logs, and the assertion that fired and reason about it with incomplete information. Now, for each concrete action taken, we record the event in a log. In the case of a failure an event log can be loaded into tqdb, with a breakpoint set right before the failure. A snapshot of the state can be taken, and then the failing event can be applied. The diff will tell you what changed and allow you to inspect the actual state of the system. Full visibility into your failure is now possible. 2. The trust quorum protocol is non-trivial. Tqdb allows developers to see in detail how the protocol behaves and understand what is happening in certain situations. Event logs can be created by hand (or script) for particularly interesting scenarios and then run through tqdb. In order to get the diff functionality to work as I wanted, I had to implement `Eq` for types that implemented `subtle::ConstantTimeEq` in both `gfss` (our secret sharing library), and `trust-quorum` crates. However the safety in terms of the compiler breaking the constant time guarantees is unknown. Therefore, a feature flag was added such that only `test-utils` and `tqdb` crates are able to use these implementations. They are not used in the production codebase. Feature unification is not at play here because neither `test-utils` or `tqdb` are part of the product.
1 parent 9577096 commit 29c5508

28 files changed

+7175
-739
lines changed

Cargo.lock

Lines changed: 42 additions & 0 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

Cargo.toml

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -141,6 +141,8 @@ members = [
141141
"test-utils",
142142
"trust-quorum",
143143
"trust-quorum/gfss",
144+
"trust-quorum/test-utils",
145+
"trust-quorum/tqdb",
144146
"typed-rng",
145147
"update-common",
146148
"update-engine",
@@ -298,6 +300,8 @@ default-members = [
298300
"sp-sim",
299301
"trust-quorum",
300302
"trust-quorum/gfss",
303+
"trust-quorum/test-utils",
304+
"trust-quorum/tqdb",
301305
"test-utils",
302306
"typed-rng",
303307
"update-common",
@@ -460,6 +464,8 @@ gateway-test-utils = { path = "gateway-test-utils" }
460464
gateway-types = { path = "gateway-types" }
461465
gethostname = "0.5.0"
462466
gfss = { path = "trust-quorum/gfss" }
467+
trust-quorum = { path = "trust-quorum" }
468+
trust-quorum-test-utils = { path = "trust-quorum/test-utils" }
463469
glob = "0.3.2"
464470
guppy = "0.17.20"
465471
headers = "0.4.1"

dev-tools/reconfigurator-cli/src/lib.rs

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -15,7 +15,7 @@ use iddqd::IdOrdMap;
1515
use indent_write::fmt::IndentWriter;
1616
use internal_dns_types::diff::DnsDiff;
1717
use itertools::Itertools;
18-
use log_capture::LogCapture;
18+
pub use log_capture::LogCapture;
1919
use nexus_inventory::CollectionBuilder;
2020
use nexus_reconfigurator_blippy::Blippy;
2121
use nexus_reconfigurator_blippy::BlippyReportSortKey;

dev-tools/repl-utils/src/lib.rs

Lines changed: 14 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -9,6 +9,7 @@ use anyhow::anyhow;
99
use anyhow::bail;
1010
use camino::Utf8Path;
1111
use clap::Parser;
12+
use reedline::Prompt;
1213
use reedline::Reedline;
1314
use reedline::Signal;
1415
use std::fs::File;
@@ -110,13 +111,24 @@ pub fn run_repl_from_file<C: Parser>(
110111
pub fn run_repl_on_stdin<C: Parser>(
111112
run_one: &mut dyn FnMut(C) -> anyhow::Result<Option<String>>,
112113
) -> anyhow::Result<()> {
113-
let mut ed = Reedline::create();
114+
let ed = Reedline::create();
114115
let prompt = reedline::DefaultPrompt::new(
115116
reedline::DefaultPromptSegment::Empty,
116117
reedline::DefaultPromptSegment::Empty,
117118
);
119+
run_repl_on_stdin_customized(ed, &prompt, run_one)
120+
}
121+
122+
/// Runs a REPL using stdin/stdout with a customized `Reedline` and `Prompt`
123+
///
124+
/// See docs for [`run_repl_on_stdin`]
125+
pub fn run_repl_on_stdin_customized<C: Parser>(
126+
mut ed: Reedline,
127+
prompt: &dyn Prompt,
128+
run_one: &mut dyn FnMut(C) -> anyhow::Result<Option<String>>,
129+
) -> anyhow::Result<()> {
118130
loop {
119-
match ed.read_line(&prompt) {
131+
match ed.read_line(prompt) {
120132
Ok(Signal::Success(buffer)) => {
121133
// Strip everything after '#' as a comment.
122134
let entry = match buffer.split_once('#') {

trust-quorum/Cargo.toml

Lines changed: 13 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -8,6 +8,7 @@ license = "MPL-2.0"
88
workspace = true
99

1010
[dependencies]
11+
anyhow.workspace = true
1112
bcs.workspace = true
1213
bootstore.workspace = true
1314
camino.workspace = true
@@ -36,6 +37,18 @@ omicron-workspace-hack.workspace = true
3637

3738
[dev-dependencies]
3839
assert_matches.workspace = true
40+
dropshot.workspace = true
3941
omicron-test-utils.workspace = true
4042
proptest.workspace = true
43+
serde_json.workspace = true
4144
test-strategy.workspace = true
45+
trust-quorum-test-utils.workspace = true
46+
47+
[features]
48+
# Impl `PartialEq` and `Eq` for types implementing `subtle::ConstantTimeEq` when
49+
# this feature is enabled.
50+
#
51+
# This is of unknown risk. The rust compiler may obviate the security of using
52+
# subtle when we do this. On the other hand its very useful for testing and
53+
# debugging outside of production.
54+
danger_partial_eq_ct_wrapper = ["gfss/danger_partial_eq_ct_wrapper"]

trust-quorum/gfss/Cargo.toml

Lines changed: 11 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -21,3 +21,14 @@ omicron-workspace-hack.workspace = true
2121
[dev-dependencies]
2222
proptest.workspace = true
2323
test-strategy.workspace = true
24+
25+
[features]
26+
27+
28+
# Impl `PartialEq` and `Eq` for types implementing `subtle::ConstantTimeEq` when
29+
# this feature is enabled.
30+
#
31+
# This is of unknown risk. The rust compiler may obviate the security of using
32+
# subtle when we do this. On the other hand its very useful for testing and
33+
# debugging outside of production.
34+
danger_partial_eq_ct_wrapper = []

trust-quorum/gfss/src/gf256.rs

Lines changed: 10 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -32,7 +32,7 @@ use zeroize::Zeroize;
3232

3333
/// An element in a finite field of prime power 2^8
3434
///
35-
/// We explicitly don't enable the equality operators to prevent ourselves from
35+
/// We explicitly don't derive the equality operators to prevent ourselves from
3636
/// accidentally using those instead of the constant time ones.
3737
#[repr(transparent)]
3838
#[derive(Debug, Clone, Copy, Zeroize, Serialize, Deserialize)]
@@ -120,6 +120,15 @@ impl ConstantTimeEq for Gf256 {
120120
}
121121
}
122122

123+
#[cfg(feature = "danger_partial_eq_ct_wrapper")]
124+
impl PartialEq for Gf256 {
125+
fn eq(&self, other: &Self) -> bool {
126+
self.ct_eq(&other).into()
127+
}
128+
}
129+
#[cfg(feature = "danger_partial_eq_ct_wrapper")]
130+
impl Eq for Gf256 {}
131+
123132
impl Add for Gf256 {
124133
type Output = Self;
125134

trust-quorum/gfss/src/shamir.rs

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -137,6 +137,16 @@ impl Share {
137137
}
138138
}
139139

140+
#[cfg(feature = "danger_partial_eq_ct_wrapper")]
141+
impl PartialEq for Share {
142+
fn eq(&self, other: &Self) -> bool {
143+
self.x_coordinate == other.x_coordinate
144+
&& self.y_coordinates == other.y_coordinates
145+
}
146+
}
147+
#[cfg(feature = "danger_partial_eq_ct_wrapper")]
148+
impl Eq for Share {}
149+
140150
impl std::fmt::Debug for Share {
141151
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
142152
f.debug_struct("KeyShareGf256").finish()

trust-quorum/src/compute_key_share.rs

Lines changed: 24 additions & 37 deletions
Original file line numberDiff line numberDiff line change
@@ -8,17 +8,17 @@
88
//! share for that configuration it must collect a threshold of key shares from
99
//! other nodes so that it can compute its own key share.
1010
11-
use crate::crypto::Sha3_256Digest;
1211
use crate::{
1312
Alarm, Configuration, Epoch, NodeHandlerCtx, PeerMsgKind, PlatformId,
1413
};
1514
use gfss::gf256::Gf256;
1615
use gfss::shamir::{self, Share};
17-
use slog::{Logger, error, o, warn};
16+
use slog::{Logger, error, o};
1817
use std::collections::BTreeMap;
1918

2019
/// In memory state that tracks retrieval of key shares in order to compute
2120
/// this node's key share for a given configuration.
21+
#[derive(Debug, Clone)]
2222
pub struct KeyShareComputer {
2323
log: Logger,
2424

@@ -28,6 +28,17 @@ pub struct KeyShareComputer {
2828
collected_shares: BTreeMap<PlatformId, Share>,
2929
}
3030

31+
#[cfg(feature = "danger_partial_eq_ct_wrapper")]
32+
impl PartialEq for KeyShareComputer {
33+
fn eq(&self, other: &Self) -> bool {
34+
self.config == other.config
35+
&& self.collected_shares == other.collected_shares
36+
}
37+
}
38+
39+
#[cfg(feature = "danger_partial_eq_ct_wrapper")]
40+
impl Eq for KeyShareComputer {}
41+
3142
impl KeyShareComputer {
3243
pub fn new(
3344
log: &Logger,
@@ -54,7 +65,9 @@ impl KeyShareComputer {
5465
ctx: &mut impl NodeHandlerCtx,
5566
peer: PlatformId,
5667
) {
57-
if !self.collected_shares.contains_key(&peer) {
68+
if self.config.members.contains_key(&peer)
69+
&& !self.collected_shares.contains_key(&peer)
70+
{
5871
ctx.send(peer, PeerMsgKind::GetShare(self.config.epoch));
5972
}
6073
}
@@ -70,55 +83,29 @@ impl KeyShareComputer {
7083
epoch: Epoch,
7184
share: Share,
7285
) -> bool {
73-
// Are we trying to retrieve shares for `epoch`?
74-
if epoch != self.config.epoch {
75-
warn!(
76-
self.log,
77-
"Received Share from node with wrong epoch";
78-
"received_epoch" => %epoch,
79-
"from" => %from
80-
);
81-
return false;
82-
}
83-
84-
// Is the sender a member of the configuration `epoch`?
85-
// Was the sender a member of the configuration at `old_epoch`?
86-
let Some(expected_digest) = self.config.members.get(&from) else {
87-
warn!(
88-
self.log,
89-
"Received Share from unexpected node";
90-
"epoch" => %epoch,
91-
"from" => %from
92-
);
86+
if !crate::validate_share(&self.log, &self.config, &from, epoch, &share)
87+
{
88+
// Logging done inside `validate_share`
9389
return false;
9490
};
9591

96-
// Does the share hash match what we expect?
97-
let mut digest = Sha3_256Digest::default();
98-
share.digest::<sha3::Sha3_256>(&mut digest.0);
99-
if digest != *expected_digest {
100-
error!(
101-
self.log,
102-
"Received share with invalid digest";
103-
"epoch" => %epoch,
104-
"from" => %from
105-
);
106-
return false;
107-
}
108-
10992
// A valid share was received. Is it new?
11093
if self.collected_shares.insert(from, share).is_some() {
11194
return false;
11295
}
11396

114-
// Do we have enough shares to computer our rack share?
97+
// Do we have enough shares to compute our rack share?
11598
if self.collected_shares.len() < self.config.threshold.0 as usize {
11699
return false;
117100
}
118101

102+
// Share indices are assigned according the configuration membership's
103+
// key order, when the configuration is constructed.
104+
//
119105
// What index are we in the configuration? This is our "x-coordinate"
120106
// for our key share calculation. We always start indexing from 1, since
121107
// 0 is the rack secret.
108+
//
122109
let index =
123110
self.config.members.keys().position(|id| id == ctx.platform_id());
124111

trust-quorum/src/configuration.rs

Lines changed: 10 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,7 @@
77
use crate::crypto::{EncryptedRackSecrets, RackSecret, Sha3_256Digest};
88
use crate::validators::ValidatedReconfigureMsg;
99
use crate::{Epoch, PlatformId, Threshold};
10+
use daft::Diffable;
1011
use gfss::shamir::{Share, SplitError};
1112
use iddqd::{IdOrdItem, id_upcast};
1213
use omicron_uuid_kinds::RackUuid;
@@ -31,7 +32,15 @@ pub enum ConfigurationError {
3132
///
3233
/// Only valid for non-lrtq configurations
3334
#[derive(
34-
Debug, Clone, PartialEq, Eq, PartialOrd, Ord, Serialize, Deserialize,
35+
Debug,
36+
Clone,
37+
PartialEq,
38+
Eq,
39+
PartialOrd,
40+
Ord,
41+
Serialize,
42+
Deserialize,
43+
Diffable,
3544
)]
3645
pub struct Configuration {
3746
/// Unique Id of the rack

0 commit comments

Comments
 (0)