quiesce needs to keep track of blueprint ids #8919
base: dap/handoff-quiesce-1
Conversation
Some sample output from the new Initial state:
Enabled blueprint execution:
Created a demo saga:
Start quiescing:
Complete the demo saga:
nexus/types/src/quiesce.rs
Outdated
/// whether a saga recovery operation is ongoing, and if one is:
/// - what `reassignment_generation` was when it started
/// - which blueprint id we'll be fully caught up to upon completion
#[serde(skip)] // XXX-dap
Is the `XXX` here because we don't want to skip this field?
Yes -- sorry I missed that! This is a problem because we don't support a tuple in this context in the OpenAPI spec. I will replace it with a struct.
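For illustration, a minimal sketch of what that replacement could look like. The field types here (`u64`, `uuid::Uuid`) are placeholders, not the PR's actual types, and schemars' uuid support is assumed to be enabled:

```rust
// Hedged sketch, not the PR's actual definition: replacing the tuple
// with a named struct so the OpenAPI generator can represent it.
// Field types are placeholders for illustration.
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};

#[derive(Clone, Debug, Deserialize, Serialize, JsonSchema)]
pub struct PendingRecovery {
    /// what `reassignment_generation` was when recovery started
    reassignment_generation: u64,
    /// which blueprint id we'll be fully caught up to upon completion
    blueprint_id: Option<uuid::Uuid>,
}
```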
nexus/types/src/quiesce.rs
Outdated
}
};

q.latch_drained_blueprint_id();
Is it correct to latch this even if `quiescing` is false?
Yes, the function checks that.

edit: to be precise, it is not correct to latch the value in this case. The function `latch_drained_blueprint_id` is intended to be called at any time; it will only latch the state if appropriate, and it checks for that. Is there a better name for it?
`latch_blueprint_id_if_drained()` maybe?
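For concreteness, a sketch of the semantics under that name; the struct and field names are illustrative assumptions, not the PR's actual code:

```rust
// Hypothetical sketch of the "latch only if appropriate" behavior;
// struct and field names are illustrative, not from the PR.
use uuid::Uuid;

#[derive(Default)]
struct SagaQuiesceInner {
    /// has quiescing been requested?
    quiescing: bool,
    /// number of sagas still running on this Nexus
    sagas_running: usize,
    /// blueprint id that saga re-assignment is caught up to
    caught_up_blueprint_id: Option<Uuid>,
    /// latched exactly once: the blueprint id we were drained up to
    drained_blueprint_id: Option<Uuid>,
}

impl SagaQuiesceInner {
    /// Safe to call at any time: only latches the blueprint id when
    /// we're quiescing, fully drained, and haven't latched it already.
    fn latch_blueprint_id_if_drained(&mut self) {
        if self.quiescing
            && self.sagas_running == 0
            && self.drained_blueprint_id.is_none()
        {
            self.drained_blueprint_id = self.caught_up_blueprint_id;
        }
    }
}
```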
Force-pushed from dc85999 to 6c84ded
Force-pushed from 73fff49 to 6fa9d9d
I've had to force push this for the same reason as in #8875. The diff-of-diffs is virtually empty -- nothing notable changed in the sync-up. @jgallagher do you want to take another look? (I don't think it's necessary but it's up to you!)
assert_eq!(qq.fully_drained_blueprint(), Some(blueprint3_id));
assert!(qq.is_fully_drained());

// Fully drained case 3: quiescing itself causes us to immediately
Suggested change:
- // Fully drained case 3: quiescing itself causes us to immediately
+ // Fully drained case 4: quiescing itself causes us to immediately
Could some of these cases become different, smaller tests?
@@ -353,7 +475,10 @@ impl SagaQuiesceHandle {
            "recovery_start() called twice without intervening \
            recovery_done() (concurrent calls to recover()?)",
        );
-       q.recovery_pending = Some(q.reassignment_generation);
+       q.recovery_pending = Some(PendingRecovery {
(only semi-related to this PR) I'm noticing that `pub async fn recover` really needs that "only-one-caller-at-a-time" property to not break preconditions, trip assertions, etc. But the signature is `&self`, so Rust would be happy to allow concurrent calls.

WDYT about making it a `&mut self` API? Would that be too onerous? It would help ensure that future uses of the API also cannot be concurrent.
Same question for `pub async fn reassign_sagas`.
I like this idea, but I worry it would be pretty onerous. If this ends up being owned by a `&Nexus` or some similar top-level object, it might be pretty painful; those end up cloned and shared, I think? Worth double-checking, though.
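If the handle is indeed cloned and shared, one way to keep the `&self` signature while still enforcing exclusivity would be to serialize callers internally. A minimal sketch, assuming a tokio async mutex; this is an alternative design, not what the PR does:

```rust
// Sketch: keep `&self` but hold an async mutex for the whole pass, so
// concurrent recover() calls queue up instead of violating the
// one-at-a-time precondition. Names are illustrative.
use tokio::sync::Mutex;

pub struct SagaQuiesceHandle {
    recovery_lock: Mutex<()>,
}

impl SagaQuiesceHandle {
    pub async fn recover(&self) {
        let _guard = self.recovery_lock.lock().await;
        // ... perform one recovery pass; `_guard` guarantees no other
        // caller is inside this block concurrently ...
    }
}
```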
    /// Returns whether sagas are fully drained
    ///
    /// This condition is not permanent. New sagas can be re-assigned to this
    /// Nexus.
-   pub fn is_fully_drained(&self) -> bool {
+   fn is_fully_drained(&self) -> bool {
Is it possible this returns true if we have never tried to reassign sagas?
I don't think so; it checks for `first_recovery_complete`, which should be false if we've never tried, right?
But recovery is a distinct step from re-assignment, right? Someone could call `pub async fn recover` without calling `pub async fn reassign_sagas`.
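To make that distinction concrete, here is a sketch in which recovery and re-assignment are tracked by separate fields; all names here are assumptions, not the PR's code:

```rust
// Illustrative only: recovery (first_recovery_complete) and
// re-assignment (reassignment_pending) are tracked separately, which
// is why the question above matters.
struct SagaQuiesceState {
    first_recovery_complete: bool,
    reassignment_pending: bool,
    sagas_running: usize,
}

impl SagaQuiesceState {
    fn is_fully_drained(&self) -> bool {
        self.first_recovery_complete
            && !self.reassignment_pending
            && self.sagas_running == 0
    }
}
```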
This is the first part of #8859. This PR adds the logic to keep track of this value. Once we have db_metadata_nexus records (currently #8845), the last bit of #8859 will be to update those records whenever this value changes.

This is still a work in progress. I need to add some new tests and also put this into `omdb`.