Skip to content

Commit 672aed5

Browse files
authored
TQ: cluster proptest Action/Event cleanup (#8899)
This builds on #8874 To make using using tqdb easier, we put all necessary information inside events. This works because of deterministic replay using the same infrastucture as the proptest itself. To ensure we are truly deterministically replaying we must assert that any Event in our log has actually been generated when we go to apply it in tqdb. In the common case case this is equivalent to asserting that the `DeliveredEnvelope` is actually the same as the one pulled off the `bootstrap_network` vec during runs. In order to guarantee this a few changes needed to be made: 1. Determinism exists, except for in the crypto code because we don't seed the various random number generators, and therefore different key shares and rack secrets get generated in different runs. We could seed them, but then this makes it possible that we accidentally seed them in production code. More importantly for our current purposes, though, is that it's tedious and unnecessary. Instead we just implement comparison methods for messages that ignore things like key shares. This works fine because if the shares are not self-consistent with the rack secret and the parameters such as threshold that we don't change the crypto code itself will fail immediately. 2. We had a few actions that generated multiple events. Unfortunately, applying these events would result in mutating the test networks multiple times, resulting in a difference between which message was recorded and which was actually pulled out to be delivered. This was fixed by changing each action to only do a single thing at a time, like deliver one envelope or commit a configuration at one node. This also allows for finer grain interleaving and matches better with our TLA+ spec/model checking. This was observable by looking at the event logs. The change to an action typically generating a single event means, however, that we need to generate more actions per run to get the equivalent functionality. A single action won't result in N commits or N envelopes delivered. Therefore, each test run now needs to have significantly more actions generated. Unfortunately this can make test runs even longer. In order to help alleviate this we made three other changes: 1. We change the invariant checks to only look at nodes that could possibly be mutated by an event application. We call these the "affected nodes". This prevents looping over every node in each check. 2. We reduce the member universe, and hence the size of the state space. 3. We reduce the total number of test cases per run.
1 parent 1ad3d57 commit 672aed5

File tree

10 files changed

+258
-129
lines changed

10 files changed

+258
-129
lines changed

Cargo.toml

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -842,6 +842,11 @@ opt-level = 3
842842
[profile.dev.package.bootstore]
843843
opt-level = 3
844844

845+
[profile.dev.package.trust-quorum]
846+
opt-level = 3
847+
[profile.dev.package.trust-quorum-test-utils]
848+
opt-level = 3
849+
845850
# Crypto stuff always needs optimizations
846851
[profile.dev.package.sha3]
847852
opt-level = 3

trust-quorum/Cargo.toml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -52,3 +52,4 @@ trust-quorum-test-utils.workspace = true
5252
# subtle when we do this. On the other hand its very useful for testing and
5353
# debugging outside of production.
5454
danger_partial_eq_ct_wrapper = ["gfss/danger_partial_eq_ct_wrapper"]
55+
testing = []

trust-quorum/src/configuration.rs

Lines changed: 24 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -13,6 +13,7 @@ use iddqd::{IdOrdItem, id_upcast};
1313
use omicron_uuid_kinds::RackUuid;
1414
use secrecy::ExposeSecret;
1515
use serde::{Deserialize, Serialize};
16+
use serde_with::serde_as;
1617
use slog_error_chain::SlogInlineError;
1718
use std::collections::BTreeMap;
1819

@@ -31,6 +32,7 @@ pub enum ConfigurationError {
3132
/// The configuration for a given epoch.
3233
///
3334
/// Only valid for non-lrtq configurations
35+
#[serde_as]
3436
#[derive(
3537
Debug,
3638
Clone,
@@ -53,6 +55,7 @@ pub struct Configuration {
5355
pub coordinator: PlatformId,
5456

5557
// All members of the current configuration and the hash of their key shares
58+
#[serde_as(as = "Vec<(_, _)>")]
5659
pub members: BTreeMap<PlatformId, Sha3_256Digest>,
5760

5861
/// The number of sleds required to reconstruct the rack secret
@@ -121,4 +124,25 @@ impl Configuration {
121124
shares,
122125
))
123126
}
127+
128+
#[cfg(feature = "testing")]
129+
pub fn equal_except_for_crypto_data(&self, other: &Self) -> bool {
130+
let encrypted_rack_secrets_match =
131+
match (&self.encrypted_rack_secrets, &other.encrypted_rack_secrets)
132+
{
133+
(None, None) => true,
134+
(Some(_), Some(_)) => true,
135+
_ => false,
136+
};
137+
self.rack_id == other.rack_id
138+
&& self.epoch == other.epoch
139+
&& self.coordinator == other.coordinator
140+
&& self
141+
.members
142+
.keys()
143+
.zip(other.members.keys())
144+
.all(|(id1, id2)| id1 == id2)
145+
&& self.threshold == other.threshold
146+
&& encrypted_rack_secrets_match
147+
}
124148
}

trust-quorum/src/lib.rs

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -139,6 +139,15 @@ pub struct Envelope {
139139
pub msg: PeerMsg,
140140
}
141141

142+
#[cfg(feature = "testing")]
143+
impl Envelope {
144+
pub fn equal_except_for_crypto_data(&self, other: &Self) -> bool {
145+
self.to == other.to
146+
&& self.from == other.from
147+
&& self.msg.equal_except_for_crypto_data(&other.msg)
148+
}
149+
}
150+
142151
/// Check if a received share is valid for a given configuration
143152
///
144153
/// Return true if valid, false otherwise.

trust-quorum/src/messages.rs

Lines changed: 32 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -30,6 +30,14 @@ pub struct PeerMsg {
3030
pub kind: PeerMsgKind,
3131
}
3232

33+
impl PeerMsg {
34+
#[cfg(feature = "testing")]
35+
pub fn equal_except_for_crypto_data(&self, other: &Self) -> bool {
36+
self.rack_id == other.rack_id
37+
&& self.kind.equal_except_for_crypto_data(&other.kind)
38+
}
39+
}
40+
3341
#[derive(Debug, Clone, Serialize, Deserialize)]
3442
#[cfg_attr(feature = "danger_partial_eq_ct_wrapper", derive(PartialEq, Eq))]
3543
pub enum PeerMsgKind {
@@ -92,4 +100,28 @@ impl PeerMsgKind {
92100
Self::CommitAdvance(_) => "commit_advance",
93101
}
94102
}
103+
104+
/// This is useful for our replay tests without having to worry about seeding
105+
/// the various random number generators in our production code.
106+
#[cfg(feature = "testing")]
107+
pub fn equal_except_for_crypto_data(&self, other: &Self) -> bool {
108+
match (self, other) {
109+
(
110+
Self::Prepare { config: config1, .. },
111+
Self::Prepare { config: config2, .. },
112+
) => config1.equal_except_for_crypto_data(config2),
113+
(Self::Config(config1), Self::Config(config2)) => {
114+
config1.equal_except_for_crypto_data(config2)
115+
}
116+
(
117+
Self::Share { epoch: epoch1, .. },
118+
Self::Share { epoch: epoch2, .. },
119+
) => epoch1 == epoch2,
120+
(Self::LrtqShare(_), Self::LrtqShare(_)) => true,
121+
(Self::CommitAdvance(config1), Self::CommitAdvance(config2)) => {
122+
config1.equal_except_for_crypto_data(config2)
123+
}
124+
(s, o) => s == o,
125+
}
126+
}
95127
}

trust-quorum/test-utils/Cargo.toml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -16,6 +16,6 @@ omicron-uuid-kinds.workspace = true
1616
serde.workspace = true
1717
serde_json.workspace = true
1818
slog.workspace = true
19-
trust-quorum = { workspace = true, features = ["danger_partial_eq_ct_wrapper"] }
19+
trust-quorum = { workspace = true, features = ["danger_partial_eq_ct_wrapper", "testing"] }
2020

2121
omicron-workspace-hack.workspace = true

trust-quorum/test-utils/src/event.rs

Lines changed: 24 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -7,7 +7,7 @@
77
use crate::nexus::{NexusConfig, NexusReply};
88
use serde::{Deserialize, Serialize};
99
use std::collections::BTreeSet;
10-
use trust_quorum::{Epoch, PlatformId};
10+
use trust_quorum::{Envelope, Epoch, PlatformId};
1111

1212
/// An event that can be fed into our system under test (SUT)
1313
///
@@ -22,12 +22,30 @@ pub enum Event {
2222
},
2323
AbortConfiguration(Epoch),
2424
SendNexusReplyOnUnderlay(NexusReply),
25-
/// Pull an envelope off the bootstrap network and call `Node::handle`
26-
DeliverEnvelope {
27-
destination: PlatformId,
28-
},
25+
/// Call `Node::handle` with the given Envelope.
26+
///
27+
/// Since replay is deterministic, we actually know what this value is,
28+
/// even though a prior event may not have yet sent the message.
29+
DeliverEnvelope(Envelope),
2930
/// Pull a `NexusReply` off the underlay network and update the `NexusState`
30-
DeliverNexusReply,
31+
DeliverNexusReply(NexusReply),
3132
CommitConfiguration(PlatformId),
3233
Reconfigure(NexusConfig),
3334
}
35+
36+
impl Event {
37+
/// Return which nodes the event may have mutated.
38+
pub fn affected_nodes(&self) -> Vec<PlatformId> {
39+
match self {
40+
Self::InitialSetup { config, crashed_nodes, .. } => {
41+
config.members.union(&crashed_nodes).cloned().collect()
42+
}
43+
Self::AbortConfiguration(_) => vec![],
44+
Self::SendNexusReplyOnUnderlay(_) => vec![],
45+
Self::DeliverEnvelope(envelope) => vec![envelope.to.clone()],
46+
Self::DeliverNexusReply(_) => vec![],
47+
Self::CommitConfiguration(id) => vec![id.clone()],
48+
Self::Reconfigure(_) => vec![],
49+
}
50+
}
51+
}

trust-quorum/test-utils/src/nexus.rs

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -113,7 +113,9 @@ pub struct NexusState {
113113
impl NexusState {
114114
#[allow(clippy::new_without_default)]
115115
pub fn new() -> NexusState {
116-
NexusState { rack_id: RackUuid::new_v4(), configs: IdOrdMap::new() }
116+
// We end up replaying events in tqdb, and can't use a random rack
117+
// uuid.
118+
NexusState { rack_id: RackUuid::nil(), configs: IdOrdMap::new() }
117119
}
118120

119121
// Create a `ReconfigureMsg` for the latest nexus config

trust-quorum/test-utils/src/state.rs

Lines changed: 16 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -201,11 +201,11 @@ impl TqState {
201201
Event::SendNexusReplyOnUnderlay(reply) => {
202202
self.apply_event_send_nexus_reply_on_underlay(reply)
203203
}
204-
Event::DeliverEnvelope { destination } => {
205-
self.apply_event_deliver_envelope(destination);
204+
Event::DeliverEnvelope(envelope) => {
205+
self.apply_event_deliver_envelope(envelope);
206206
}
207-
Event::DeliverNexusReply => {
208-
self.apply_event_deliver_nexus_reply();
207+
Event::DeliverNexusReply(reply) => {
208+
self.apply_event_deliver_nexus_reply(reply);
209209
}
210210
Event::CommitConfiguration(dest) => {
211211
self.apply_event_commit(dest);
@@ -273,9 +273,10 @@ impl TqState {
273273
self.underlay_network.push(reply);
274274
}
275275

276-
fn apply_event_deliver_nexus_reply(&mut self) {
276+
fn apply_event_deliver_nexus_reply(&mut self, recorded_reply: NexusReply) {
277277
let mut latest_config = self.nexus.latest_config_mut();
278278
let reply = self.underlay_network.pop().expect("reply exists");
279+
assert_eq!(recorded_reply, reply);
279280
match reply {
280281
NexusReply::AckedPreparesFromCoordinator { epoch, acks } => {
281282
if epoch == latest_config.epoch {
@@ -301,13 +302,21 @@ impl TqState {
301302
latest_config.op = NexusOp::Aborted;
302303
}
303304

304-
fn apply_event_deliver_envelope(&mut self, destination: PlatformId) {
305+
fn apply_event_deliver_envelope(&mut self, recorded_envelope: Envelope) {
305306
let envelope = self
306307
.bootstrap_network
307-
.get_mut(&destination)
308+
.get_mut(&recorded_envelope.to)
308309
.unwrap()
309310
.pop()
310311
.expect("envelope in bootstrap network");
312+
313+
// The recorded envelope must be exactly the same as the one pulled
314+
// off the bootstrap network. We ignore crypto data because we don't
315+
// currently seed and track random number generators. For our purposes,
316+
// validating the other fields is enough, because the test will fail if
317+
// the crypto doesn't work and decrypt to the same plaintext.
318+
assert!(recorded_envelope.equal_except_for_crypto_data(&envelope));
319+
311320
let (node, ctx) =
312321
self.sut.nodes.get_mut(&envelope.to).expect("destination exists");
313322
node.handle(ctx, envelope.from, envelope.msg);

0 commit comments

Comments
 (0)