start updating quiesce for new Nexus handoff #8875
Conversation
I think this is almost ready for review except that I think I'm going to need to have the test suite wait for the saga quiesce determination to be made before running the test (a lot of tests probably assume they can just go ahead and run sagas right away) and I'm working through some issues trying to do that.
Force-pushed from 8235551 to ecd6a00
Force-pushed from 161491a to 127d5a8
As expected, I force-pushed this to sync up with #8863. I gather that branch will not be force-pushed again so I think this one should be stable, too. I'm still working through testing.
@@ -5016,7 +5016,7 @@ async fn cmd_db_dns_diff(
     // Load the added and removed items.
     use nexus_db_schema::schema::dns_name::dsl;

-    let added = dsl::dns_name
+    let mut added = dsl::dns_name
The changes in this file are so that omdb output doesn't change as a result of unspecified sort order (in this case, of DNS records). This became a problem after I added the "second" Nexus zone to the blueprint (see other comment).
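For reference, the shape of that fix is just sorting the fetched rows before diffing them, so omdb's textual output no longer depends on whatever order the database returns them in. A minimal sketch (the row type and field names below are illustrative, not the real nexus_db_schema types):

// Sketch only: make the DNS diff output deterministic by sorting rows by
// name before printing. `DnsNameRow` and its fields are stand-ins.
struct DnsNameRow {
    name: String,
    address: String,
}

fn sorted_for_display(mut added: Vec<DnsNameRow>) -> Vec<DnsNameRow> {
    added.sort_by(|a, b| a.name.cmp(&b.name));
    added
}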
dev-tools/omdb/tests/successes.out (Outdated)
+  test-suite-silo.sys A 127.0.0.1
+  test-suite-silo.sys (records: 2)
+    A 127.0.0.1
+    AAAA 100::1
The extra records here and in other outputs below result from having added a "second" Nexus to the test suite / omicron-dev blueprint (see other comment).
dev-tools/omdb/tests/successes.out (Outdated)
@@ -489,15 +493,21 @@ task: "nat_garbage_collector"

 task: "blueprint_loader"
   configured period: every <REDACTED_DURATION>m <REDACTED_DURATION>s
-  last completed activation: <REDACTED ITERATIONS>, triggered by a periodic timer firing
+  last completed activation: <REDACTED ITERATIONS>, triggered by an explicit signal
A bunch of the background tasks' output changed because previously there was no target blueprint when they had run. The changes in this PR make test suite startup block on the first blueprint having been loaded, which made a lot of these tasks find something to do.
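As a rough sketch of what "block on the first blueprint having been loaded" can look like, assuming the loader publishes its result on a tokio watch channel (the channel and payload type here are assumptions, not the actual test-suite plumbing):

use tokio::sync::watch;

// Sketch only: wait until the blueprint loader has produced some target
// blueprint before letting tests (and saga enablement) proceed.
async fn wait_for_first_blueprint(mut rx: watch::Receiver<Option<String>>) {
    // wait_for resolves as soon as the closure is true for the current or
    // any subsequent value.
    rx.wait_for(|maybe_blueprint| maybe_blueprint.is_some())
        .await
        .expect("blueprint loader task exited");
}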
dev-tools/omdb/tests/successes.out (Outdated)
    started at <REDACTED_TIMESTAMP> (<REDACTED DURATION>s ago) and ran for <REDACTED DURATION>ms
    last completion reported error: no blueprint
warning: unknown background task: "crdb_node_id_collector" (don't know how to interpret details: Object {"errors": Array [Object {"err": String("failed to fetch node ID for zone ..........<REDACTED_UUID>........... at http://[::1]:REDACTED_PORT: Communication Error: error sending request for url (http://[::1]:REDACTED_PORT/node/id): error sending request for url (http://[::1]:REDACTED_PORT/node/id): client error (Connect): tcp connect error: Connection refused (os error 146)"), "zone_id": String("..........<REDACTED_UUID>...........")}], "nsuccess": Number(0)})
This curious line seems to be more fallout from having a target blueprint loaded by the time omdb queried background task status. This presumably always produced this error in the test suite because the test suite doesn't start a cockroach admin server, but we didn't notice it here before because there was no target blueprint loaded yet, so the task didn't try to do anything.
@@ -251,6 +230,27 @@ async fn test_omdb_success_cases(cptestctx: &ControlPlaneTestContext) {
    ],
    // This one should fail because it has no parent.
    &["nexus", "blueprints", "diff", &initial_blueprint_id],
    // chicken switches: show and set
    &["nexus", "chicken-switches", "show", "current"],
These got moved later in the sequence because they actually changed the output of omdb nexus blueprints show, which is also tested in this sequence. That had changed because setting this chicken switch enabled the planner, which then went and made a new blueprint.
for task in &[
    &self.background_tasks.task_internal_dns_config,
    &self.background_tasks.task_internal_dns_servers,
    &self.background_tasks.task_external_dns_config,
    &self.background_tasks.task_external_dns_servers,
    &self.background_tasks.task_external_endpoints,
    &self.background_tasks.task_inventory_collection,
    &self.background_tasks.task_blueprint_loader,
This is appropriate because we just inserted the first blueprint, so it makes sense to activate the blueprint loader so that it loads it.
This is important because now we're blocking saga enablement, and thus Nexus startup, on having loaded and started executing the first blueprint. Without this, it could take quite a while for Nexus to notice there was a blueprint, enable sagas, and complete startup.
// - each Nexus created for testing gets its own id so they don't see each
//   others sagas and try to recover them
I think this comment is ancient and (I hope) wrong. I think we always create a fresh context with a fresh database and don't need to generate a new id every time.
Fixing this was important to ensure that the sort order for omdb nexus blueprints show was consistent when run in the context of the test suite. Otherwise, now that there are two Nexus zones, one of which has a fixed uuid, they could flip-flop in their order based on the uuid that got assigned here.
// Besides the Nexus that we just started, add an entry in the blueprint
// for the Nexus that developers can start using
// nexus/examples/config-second.toml.
This is the "second" Nexus instance that I mentioned in other comments here.
We have a useful dev flow for running a second Nexus instance against omicron-dev run-all. I use this a lot, including for this change. It's broken by this PR without this change because the new Nexus doesn't find itself in the blueprint and doesn't know if it should come up quiesced or not.
This fix is janky, but I think not more so than the rest of the test suite's startup behavior.
Besides CI, I've tested that the manual quiesce process works basically the same way that it did before, the same way I tested it at demo day. I also tested that if Nexus can't figure out if it's quiesced, it reports that and disallows sagas. Before I fixed the test suite to include this second Nexus in its blueprint, the new Nexus would report that it couldn't determine its quiesce state, trying to query its quiesce state reported the same, and creating sagas failed as it should.
Another test that would be good to add is:
I'm a little nervous about landing this on main without the followup work to coordinate Nexuses because we do want to test online update off of main. But maybe in tests we're not likely to have all that many sagas anyway, and if we end up with implicitly abandoned ones it's similarly not that big of a deal?
Ok(saga_quiesce
    .reassign_sagas(async || {
        // For any expunged Nexus zones, re-assign in-progress
        // sagas to some other Nexus. If this fails for some
I realize this comment was here before, but can we refine "some other Nexus" to "ourself"?
nexus/src/app/quiesce.rs (Outdated)
    *q = new_state;
    true
}
_ => {
Can we expand _ to list all the cases? I find it hard to reason about "// All other cases are ..." when they aren't listed.
I had it in a previous iteration but I found it less clear. I can put that back and see what it looks like.
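For what it's worth, a sketch of the exhaustive form being discussed might look like the following. Only Allowed and DisallowedUnknown appear in this thread; DisallowedQuiesce is an assumed name for the "already quiescing" variant, and this models the decision as a pure function rather than the send_if_modified closure:

enum SagasAllowed {
    Allowed,
    DisallowedUnknown,
    // Assumed name for the "already quiescing" variant.
    DisallowedQuiesce,
}

// Returns the new state, or None if nothing should change.
fn next_state(current: SagasAllowed, quiescing: bool) -> Option<SagasAllowed> {
    match current {
        // Startup: adopt whatever the blueprint tells us.
        SagasAllowed::DisallowedUnknown => Some(if quiescing {
            SagasAllowed::DisallowedQuiesce
        } else {
            SagasAllowed::Allowed
        }),
        // Running normally and asked to quiesce: stop allowing new sagas.
        SagasAllowed::Allowed if quiescing => Some(SagasAllowed::DisallowedQuiesce),
        // Quiescing is one-way: a request to re-enable sagas is ignored.
        SagasAllowed::DisallowedQuiesce if !quiescing => None,
        // Remaining cases: the state already matches what was requested.
        SagasAllowed::Allowed | SagasAllowed::DisallowedQuiesce => None,
    }
}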
nexus/types/src/quiesce.rs (Outdated)
/// cannot then re-enable sagas.
pub fn set_quiescing(&self, quiescing: bool) {
    self.inner.send_if_modified(|q| {
        let new_state = if quiescing {
Nit - this is only used in the SagasAllowed::DisallowedUnknown arm below - can we move it into that arm?
nexus/types/src/quiesce.rs (Outdated)
        );
        false
    }
    _ => {
I wonder if this match would be clearer if we nested the ifs in the two branches above, which (I think?) would let us drop this _ entirely; e.g.,
SagasAllowed::Allowed => {
// If sagas are currently allowed but we need to quiesce, switch to disallowing them.
if quiescing {
// ... all the stuff from above ...
}
}
I think I know what you mean. I think that would make it clearer what the current _ covers (though I tried to explain that with the comment). But I think it would be less clear what the bigger picture is. I think of this like: there are three interesting cases and they're represented by the non-_ arms here. With your change, two of the "interesting" cases are actually sub-branches of two of the arms, and the other two branches within those arms are the very same "uninteresting" case (quiesce state already matches what we're being asked to do). I don't feel strongly!
I think I just really dislike _ and will take nearly any excuse to drop it. 😂
I think it's even less bad than that. The problem introduced by this PR, if we're testing online update, is that if we get far enough into the update to start the handoff process and, at some point before the handoff completes, there's an expunged Nexus zone with sagas assigned to it, then we'd wind up with implicitly abandoned sagas. For this to happen, we must have expunged a Nexus outside the upgrade process with sagas assigned to it. That seems pretty unlikely in our testing, right?
Ahh yeah great point; thanks. I'm much less concerned now.
@jgallagher I applied the fix that I think you were suggesting for the failing test. Now we only add the "second" Nexus when running via omicron-dev, not in the test suite. We might run into new problems soon? I'm not sure. With #8845 the new Nexus won't find its db_metadata_nexus record. At the very least, this change allows this PR to not break things.
Force-pushed from dc85999 to 6c84ded
I've just sync'd this up and I think this is ready for final review (which I think should be trivial, relative to the previous state). Unfortunately though I've had to force push this branch. That's because this PR was previously based on #8863. Since then, most of what it depended on was split out and landed on "main", while other parts are no longer needed and aren't on "main". To sync this up, I've rebased this change onto "main" and force-pushed this branch. Sorry, I know that sucks. But in terms of mitigating the impact of this, I have diff'd the diffs:
I believe the only diffs here are what we'd expect: changes to "main" adjacent to the changes I was already making. Click below if you want to see the full diff:
$ diff diffs_{before,after}.txt
184c184
< index 82577f26f..a5cdb0c79 100644
---
> index f0a22c50a..2bd070b5d 100644
187,189c187
< @@ -489,15 +489,21 @@ task: "nat_garbage_collector"
<
< task: "blueprint_loader"
---
> @@ -491,13 +491,19 @@ task: "blueprint_loader"
191,192c189
< - last completed activation: <REDACTED ITERATIONS>, triggered by a periodic timer firing
< + last completed activation: <REDACTED ITERATIONS>, triggered by an explicit signal
---
> last completed activation: <REDACTED ITERATIONS>, triggered by <TRIGGERED_BY_REDACTED>
202,203c199
< - last completed activation: <REDACTED ITERATIONS>, triggered by a periodic timer firing
< + last completed activation: <REDACTED ITERATIONS>, triggered by a dependent task completing
---
> last completed activation: <REDACTED ITERATIONS>, triggered by <TRIGGERED_BY_REDACTED>
215c211
< last completed activation: <REDACTED ITERATIONS>, triggered by a dependent task completing
---
> last completed activation: <REDACTED ITERATIONS>, triggered by <TRIGGERED_BY_REDACTED>
233,235c229
< @@ -541,9 +558,9 @@ warning: unknown background task: "chicken_switches_watcher" (don't know how to
<
< task: "crdb_node_id_collector"
---
> @@ -543,7 +560,7 @@ task: "crdb_node_id_collector"
237,238c231
< - last completed activation: <REDACTED ITERATIONS>, triggered by a periodic timer firing
< + last completed activation: <REDACTED ITERATIONS>, triggered by a dependent task completing
---
> last completed activation: <REDACTED ITERATIONS>, triggered by <TRIGGERED_BY_REDACTED>
245,246c238
< @@ -844,16 +861,22 @@ stdout:
< task: "blueprint_loader"
---
> @@ -844,13 +861,19 @@ task: "blueprint_loader"
248,250c240
< currently executing: no
< - last completed activation: <REDACTED ITERATIONS>, triggered by a periodic timer firing
< + last completed activation: <REDACTED ITERATIONS>, triggered by an explicit signal
---
> last completed activation: <REDACTED ITERATIONS>, triggered by <TRIGGERED_BY_REDACTED>
260,262c250
< currently executing: no
< - last completed activation: <REDACTED ITERATIONS>, triggered by a periodic timer firing
< + last completed activation: <REDACTED ITERATIONS>, triggered by a dependent task completing
---
> last completed activation: <REDACTED ITERATIONS>, triggered by <TRIGGERED_BY_REDACTED>
272,273c260
< @@ -1001,16 +1024,22 @@ task: "nat_garbage_collector"
< task: "blueprint_loader"
---
> @@ -986,13 +1009,19 @@ task: "blueprint_loader"
275,277c262
< currently executing: no
< - last completed activation: <REDACTED ITERATIONS>, triggered by a periodic timer firing
< + last completed activation: <REDACTED ITERATIONS>, triggered by an explicit signal
---
> last completed activation: <REDACTED ITERATIONS>, triggered by <TRIGGERED_BY_REDACTED>
287,289c272
< currently executing: no
< - last completed activation: <REDACTED ITERATIONS>, triggered by a periodic timer firing
< + last completed activation: <REDACTED ITERATIONS>, triggered by a dependent task completing
---
> last completed activation: <REDACTED ITERATIONS>, triggered by <TRIGGERED_BY_REDACTED>
299,301c282,284
< @@ -1049,7 +1078,18 @@ task: "blueprint_rendezvous"
< currently executing: no
< last completed activation: <REDACTED ITERATIONS>, triggered by a dependent task completing
---
> @@ -1026,7 +1055,18 @@ task: "blueprint_rendezvous"
> configured period: every <REDACTED_DURATION>m
> last completed activation: <REDACTED ITERATIONS>, triggered by <TRIGGERED_BY_REDACTED>
319,320c302
< @@ -1061,9 +1101,9 @@ warning: unknown background task: "chicken_switches_watcher" (don't know how to
< task: "crdb_node_id_collector"
---
> @@ -1038,7 +1078,7 @@ task: "crdb_node_id_collector"
322,324c304
< currently executing: no
< - last completed activation: <REDACTED ITERATIONS>, triggered by a periodic timer firing
< + last completed activation: <REDACTED ITERATIONS>, triggered by a dependent task completing
---
> last completed activation: <REDACTED ITERATIONS>, triggered by <TRIGGERED_BY_REDACTED>
331c311
< @@ -1355,53 +1395,6 @@ task: "webhook_deliverator"
---
> @@ -1303,53 +1343,6 @@ task: "webhook_deliverator"
385c365
< @@ -1739,6 +1732,53 @@ stderr:
---
> @@ -1687,6 +1680,53 @@ stderr:
440c420
< index a46f21847..11db63156 100644
---
> index 8dcadc1cc..03b2f79a3 100644
443,446c423,426
< @@ -208,27 +208,6 @@ async fn test_omdb_success_cases(cptestctx: &ControlPlaneTestContext) {
< &["nexus", "background-tasks", "show", "dns_internal"],
< &["nexus", "background-tasks", "show", "dns_external"],
< &["nexus", "background-tasks", "show", "all"],
---
> @@ -224,27 +224,6 @@ async fn test_omdb_success_cases(cptestctx: &ControlPlaneTestContext) {
> "--no-executing-info",
> ],
> &["nexus", "background-tasks", "show", "all", "--no-executing-info"],
471c451
< @@ -251,6 +230,27 @@ async fn test_omdb_success_cases(cptestctx: &ControlPlaneTestContext) {
---
> @@ -267,6 +246,27 @@ async fn test_omdb_success_cases(cptestctx: &ControlPlaneTestContext) {
499c479
< @@ -264,7 +264,9 @@ async fn test_omdb_success_cases(cptestctx: &ControlPlaneTestContext) {
---
> @@ -280,7 +280,9 @@ async fn test_omdb_success_cases(cptestctx: &ControlPlaneTestContext) {
511c491
< index 7eb569ba3..22e84ae3f 100644
---
> index a72074887..fb62f1dd4 100644
526c506
< index 43c094855..a4f802b3c 100644
---
> index adfda2e59..e099f0c46 100644
529c509
< @@ -611,50 +611,37 @@ fn register_reassign_sagas_step<'a>(
---
> @@ -646,50 +646,37 @@ fn register_reassign_sagas_step<'a>(
611c591
< index 78f1ef566..495930b0b 100644
---
> index 51b6d1d86..afaf57f82 100644
631c611
< @@ -1029,6 +1030,8 @@ pub struct BackgroundTasksData {
---
> @@ -1028,6 +1029,8 @@ pub struct BackgroundTasksData {
784c764
< index 9ae27a227..752632e47 100644
---
> index d6519fdad..c33766e1e 100644
787c767
< @@ -278,18 +278,15 @@ impl BackgroundTask for BlueprintPlanner {
---
> @@ -273,18 +273,15 @@ impl BackgroundTask for BlueprintPlanner {
810c790
< @@ -429,7 +426,7 @@ mod test {
---
> @@ -423,7 +420,7 @@ mod test {
1282c1262
< index f84367946..be8d4762e 100644
---
> index 1a5a2847c..84b2bce45 100644
1307c1287
< @@ -840,47 +836,101 @@ impl<'a, N: NexusServer> ControlPlaneTestContextBuilder<'a, N> {
---
> @@ -835,47 +831,101 @@ impl<'a, N: NexusServer> ControlPlaneTestContextBuilder<'a, N> {
1427c1407
< @@ -891,11 +941,6 @@ impl<'a, N: NexusServer> ControlPlaneTestContextBuilder<'a, N> {
---
> @@ -886,11 +936,6 @@ impl<'a, N: NexusServer> ControlPlaneTestContextBuilder<'a, N> {
1439,1440c1419
< @@ -1621,6 +1666,7 @@ pub async fn omicron_dev_setup_with_config<N: NexusServer>(
< sim::SimMode::Auto,
---
> @@ -1617,6 +1662,7 @@ pub async fn omicron_dev_setup_with_config<N: NexusServer>(
1443d1421
< + true,
1444a1423
> + true,
1447,1448c1426,1427
< @@ -1642,6 +1688,7 @@ pub async fn test_setup_with_config<N: NexusServer>(
< sim_mode,
---
> }
> @@ -1638,6 +1684,7 @@ pub async fn test_setup_with_config<N: NexusServer>(
1451d1429
< + false,
1452a1431
> + false,
1455,1456c1434,1435
< @@ -1653,6 +1700,7 @@ async fn setup_with_config_impl<N: NexusServer>(
< sim_mode: sim::SimMode,
---
> }
> @@ -1649,6 +1696,7 @@ async fn setup_with_config_impl<N: NexusServer>(
1459d1437
< + second_nexus: bool,
1460a1439
> + second_nexus: bool,
1463c1442,1443
< @@ -1796,6 +1844,20 @@ async fn setup_with_config_impl<N: NexusServer>(
---
>
> @@ -1791,6 +1839,20 @@ async fn setup_with_config_impl<N: NexusServer>(
1642c1622
< index 7a2cca518..7f1abbb29 100644
---
> index 2c66a5f85..7d9c40b64 100644
1654c1634
< @@ -384,6 +386,26 @@ impl Blueprint {
---
> @@ -383,6 +385,26 @@ impl Blueprint {
1682c1662
< index f7db6d866..972a0b92d 100644
---
> index 49d744fc7..ab9a5922d 100644
1685c1665
< @@ -739,19 +739,28 @@ pub struct QuiesceStatus {
---
> @@ -996,19 +996,28 @@ pub struct QuiesceStatus {
1717c1697
< @@ -762,58 +771,51 @@ pub struct QuiesceStatus {
---
> @@ -1019,58 +1028,51 @@ pub struct QuiesceStatus {
2545c2525
< index bb98ae7c1..dc31e3d3c 100644
---
> index bc1cd3df5..caac20422 100644
2548c2528
< @@ -7524,8 +7524,23 @@
---
> @@ -7576,8 +7576,23 @@
2573c2553
< @@ -7542,7 +7557,7 @@
---
> @@ -7594,7 +7609,7 @@
2582c2562
< @@ -7560,7 +7575,7 @@
---
> @@ -7612,7 +7627,7 @@
2591c2571
< @@ -7570,13 +7585,13 @@
---
> @@ -7622,13 +7637,13 @@
2607c2587
< @@ -7585,14 +7600,50 @@
---
> @@ -7637,14 +7652,50 @@
2660c2640
< @@ -7608,13 +7659,16 @@
---
> @@ -7660,13 +7711,16 @@
2680c2660
< @@ -7627,9 +7681,10 @@
---
> @@ -7679,9 +7733,10 @@
After rebasing on "main", I had to make two additional changes related to a delta between what was in 8863 and what landed on "main" so far:
/// Given the current value of `nexus_generation`, set the new value for
/// this blueprint.
pub fn set_nexus_generation(&mut self, new_generation: Generation) {
This API would presumably allow a caller to move the generation backwards, correct?
I'm working through adding a similar API to #8936 right now, and wanted to make this hard to mis-use -- is there really any value we want to accept here other than self.nexus_generation.next()?
Also, to confirm - aside from tests, we're still waiting on #8936 to actually change the nexus generations, right?
I'm working through adding a similar API to #8936 right now, and wanted to make this hard to mis-use -- is there really any value we want to accept here other than self.nexus_generation.next()?
A thing that has come up in the past is that reconfigurator-cli or unit tests may want to use the builder to intentionally set things up in unexpected / invalid ways to exercise test paths, and we have some methods that permit changes that could otherwise be restricted, to allow for that. We've also said if the change you want to make is really egregious, it's okay for the builder not to support that (which requires you to muck with blueprints by hand, usually).
I'm not sure which bucket this falls into. Is it worth testing what happens if the planner generation is behind one of its Nexus zones or more than one ahead, or are those impossible enough that they fall into the "really egregious" camp?
Also, to confirm - aside from tests, we're still waiting on #8936 to actually change the nexus generations, right?
Correct, in shipping code. The only thing in this PR that uses this is the new integration test in nexus/tests/integration_tests/quiesce.rs.
This API would presumably allow a caller to move the generation backwards, correct?
That's correct.
I'm working through adding a similar API to #8936 right now, and wanted to make this hard to mis-use -- is there really any value we want to accept here other than self.nexus_generation.next()?
There are a bunch of options here and none seems particularly better to me than the others. The planner is the only caller that we ever expect to exist in shipping code. It will always bump it by exactly one, and only in a very specific situation in which it has probably already checked the current nexus_generation. It'd be fine with any of these interfaces (set that lets you set it to anything; set that constrains it; or bump).
In general, I do think it's essential to be able to construct blueprints in reconfigurator-cli and for testing that do something different than the planner does. In this case, I'm not sure it would ever do something different than bump. I can imagine a support activity in which the handoff has gone badly and we want to disable the autoplanner and install a custom blueprint with the previous Nexus generation so that things can come back up. I wouldn't go out of our way for this case (it's a support activity; we can always do it by hand) but it's an advantage of a simple set.
It's more that I don't really see harm in allowing it to be set to a specific value at this low level. It's hard for me to see this being easily misused. I say this because it's not a required value in a struct, where you might try to fill in some default. You only call this function if you intend to set it to a specific value. I'm not sure what other value someone might think they need to set this to, or where they would have gotten it from. The only other plausible values I can think of are Generation::new() as a default (but then, why would you be calling this function only to supply a default value?) or the one from the parent blueprint (which would be a no-op).
I'm going to leave this for now but I'm fine with evolving this as we get more experience with callers!
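For concreteness, the two shapes being weighed look roughly like this; the builder struct and the bump_nexus_generation name are illustrative, and only set_nexus_generation comes from the diff:

#[derive(Clone, Copy, PartialEq, PartialOrd, Debug)]
struct Generation(u64);

impl Generation {
    fn next(self) -> Generation {
        Generation(self.0 + 1)
    }
}

struct BlueprintBuilder {
    nexus_generation: Generation,
}

impl BlueprintBuilder {
    // Permissive setter: lets reconfigurator-cli and tests construct unusual
    // blueprints, including moving the generation backwards.
    fn set_nexus_generation(&mut self, new_generation: Generation) {
        self.nexus_generation = new_generation;
    }

    // Constrained alternative: the only operation the planner would ever
    // perform in shipping code.
    fn bump_nexus_generation(&mut self) {
        self.nexus_generation = self.nexus_generation.next();
    }
}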
Sounds good to me. I've tweaked #8936 to match this part of the builder. Matching this current interface seems fine.
    .await
    .expect("setting new target");

// Wait for Nexus to quiesce.
This looks like it's waiting for it to start quiescing; is that good enough, or should we be waiting explicitly for QuiesceState::Quiesced?
This test is specifically just for the behavior introduced in this PR that Nexus starts quiescing when the blueprint tells it to. For that, I don't think it's important to wait.
Depends on #8863. Fixes #8855 and #8856.

Note: since this branches from 8863, if that branch gets force-pushed, this one will also need to be force-pushed. This may break your incremental review flow.

This PR makes the initial changes required for Nexus quiesce described by RFD 588:
- The quiesce machinery was pulled into NexusQuiesceHandle. This was so that it can be invoked from a background task.
- Nexus now decides whether to start quiescing in the blueprint_execution background task, based on whether the current target blueprint says we should be handing off. (This is what fixes "Reconfigurator: Start quiescing if target bp nexus_generation is greater than our nexus_generation" #8855 and "Nexus Boot: Start quiescing if target bp nexus_generation is greater than our nexus_generation" #8856.) It does this even if blueprint execution is disabled.
- Some quiesce states were renamed (WaitingForSagas -> DrainingSagas) and there's a new RecordingQuiesce state that will cover the period after we've determined we're quiesced and before we've written the database record saying so.

This PR does not change the quiesce process to coordinate among the Nexus instances (as described in RFD 588) nor update the db_metadata_nexus table. Both of these are blocked on #8845.

The net result is a little awkward:
- there's a RecordingQuiesce state that's basically unused (it is technically used, but we transition through it immediately)

I think this is okay as an intermediate state, even on "main", since this cannot be triggered except by an online update, and we know we're going to fix these as part of shipping that.