(5/N) Read database access records on boot #8925
base: main
@@ -14,6 +14,8 @@ use nexus_db_model::AllSchemaVersions;
 use nexus_db_model::SCHEMA_VERSION;
 use nexus_db_queries::db;
 use nexus_db_queries::db::DataStore;
+use nexus_db_queries::db::datastore::DatastoreSetupAction;
+use nexus_db_queries::db::datastore::IdentityCheckPolicy;
 use semver::Version;
 use slog::Drain;
 use slog::Level;
@@ -108,11 +110,40 @@ async fn main_impl() -> anyhow::Result<()> {
         }
         Cmd::Upgrade { version } => {

Not necessarily on this PR, but should we do more to discourage the use of this tool? If something goes wrong with handoffs we'll need it, but in general we shouldn't run this anymore (once the whole stack of work lands), right?

Totally agreed. I think the main usage of this is from https://github.com/oxidecomputer/meta/blob/master/engineering/dogfood/overview.adoc, which we can and should patch once we're ready to go to the full "online update" world for Nexus.

             println!("Upgrading to {version}");
-            datastore
-                .ensure_schema(&log, version.clone(), Some(&all_versions))
-                .await
-                .map_err(|e| anyhow!(e))?;
-            println!("Upgrade to {version} complete");
+            let checked_action = datastore
+                .check_schema_and_access(
+                    IdentityCheckPolicy::DontCare,
+                    version.clone(),
+                )
+                .await?;
+
+            match checked_action.action() {
+                DatastoreSetupAction::Ready => {
+                    println!("Already at version {version}")
+                }
+                DatastoreSetupAction::Update => {
+                    datastore
+                        .update_schema(checked_action, Some(&all_versions))
+                        .await
+                        .map_err(|e| anyhow!(e))?;
+                    println!("Update to {version} complete");
+                }
+                DatastoreSetupAction::Refuse => {
+                    println!("Refusing to update to version {version}")
+                }
+                DatastoreSetupAction::TryLater
+                | DatastoreSetupAction::NeedsHandoff { .. } => {
+                    // This case should not happen - we supplied
+                    // IdentityCheckPolicy::DontCare, so we should not be told
+                    // to attempt a takeover by a specific Nexus.
+                    println!(
+                        "Refusing to update to version {version}. \
+                         The schema updater tried to ignore the identity check, \
+                         but got a response indicating handoff is needed. \
+                         This is unexpected, and probably a bug"
+                    )
+                }
+            }
         }
     }
     datastore.terminate().await;

Are all the error cases here transient? Looking at 8932, maybe `NexusInWrongState` is a permanent error?

I think even in this case, I do want us to retry.

Right now, it's actually not possible for us to return from the `NexusInWrongState` case:

- The state would have to be `quiesced` or `not_yet`.
- The check is "!= `not_yet`", therefore it must be `quiesced`.
- If it were `quiesced`, the previous call to `check_schema_and_access` would have returned `Refuse`, rather than `NeedsHandoff`.
- The transition from `active` to `quiesced` shouldn't be racy (without operator intervention) because each Nexus should be responsible for performing this transition for itself.

So, TL;DR:

- We call `check_schema_and_access` again, before re-trying `attempt_handoff`.
- If we are `quiesced`, we'll converge to "locked out of the db".

I think that seeing a weird state in one of these branches, and choosing to "re-evaluate the world again from scratch" seems like a reasonable choice. `check_schema_and_access` should be able to determine a reasonable next step, and it only throws errors if we cannot access the database (which I consider to be transient).
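
To make the "re-evaluate from scratch" argument concrete, here is a minimal sketch of the retry loop as a synchronous, in-memory model. It is not the code from this PR: `FakeDb`, `RecordState`, `Action`, `Outcome`, `boot`, and the mapping from state to action are all hypothetical simplifications chosen for illustration. In particular, it assumes a successful handoff makes the record `active` immediately, and it omits any delay between retries.

```rust
/// Hypothetical stand-in for the per-Nexus record state named in the thread.
#[derive(Clone, Copy, Debug, PartialEq)]
enum RecordState {
    NotYet,
    Active,
    Quiesced,
}

/// Simplified, illustrative subset of the setup actions (an assumption).
#[derive(Debug, PartialEq)]
enum Action {
    Ready,
    Refuse,
    NeedsHandoff,
}

#[derive(Debug, PartialEq)]
enum HandoffError {
    NexusInWrongState,
}

/// A check that could not reach the database; treated as transient.
#[derive(Debug, PartialEq)]
struct DbUnavailable;

#[derive(Debug, PartialEq)]
enum Outcome {
    Ready,
    LockedOut,
}

/// In-memory fake, not the real datastore.
struct FakeDb {
    state: RecordState,
    /// Number of upcoming checks that fail as if the database were unreachable.
    unavailable_checks: u32,
    /// If set, overwrites the state just before the next handoff attempt,
    /// simulating operator intervention between the check and the handoff.
    interference: Option<RecordState>,
}

impl FakeDb {
    /// Assumed mapping: active -> Ready, quiesced -> Refuse, not_yet -> NeedsHandoff.
    fn check(&mut self) -> Result<Action, DbUnavailable> {
        if self.unavailable_checks > 0 {
            self.unavailable_checks -= 1;
            return Err(DbUnavailable);
        }
        Ok(match self.state {
            RecordState::Active => Action::Ready,
            RecordState::Quiesced => Action::Refuse,
            RecordState::NotYet => Action::NeedsHandoff,
        })
    }

    /// Fails with NexusInWrongState unless the record is still not_yet.
    fn attempt_handoff(&mut self) -> Result<(), HandoffError> {
        if let Some(state) = self.interference.take() {
            self.state = state;
        }
        if self.state != RecordState::NotYet {
            return Err(HandoffError::NexusInWrongState);
        }
        self.state = RecordState::Active;
        Ok(())
    }
}

/// Every iteration starts with a fresh check, so a failed handoff is never
/// retried blindly. Returns None if the attempts run out.
fn boot(db: &mut FakeDb, max_attempts: u32) -> Option<Outcome> {
    for _ in 0..max_attempts {
        let action = match db.check() {
            Ok(action) => action,
            // Database unreachable: treat as transient and try again.
            Err(DbUnavailable) => continue,
        };
        match action {
            Action::Ready => return Some(Outcome::Ready),
            Action::Refuse => return Some(Outcome::LockedOut),
            Action::NeedsHandoff => {
                // Success or failure, loop back and re-check before acting.
                let _ = db.attempt_handoff();
            }
        }
    }
    None
}
```

In this model, if the record becomes `quiesced` between the check and the handoff, the handoff fails with `NexusInWrongState`, the next check returns `Refuse`, and the loop ends in `LockedOut` rather than retrying the handoff forever.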