Merged
Changes from 9 commits
1 change: 1 addition & 0 deletions Cargo.lock

Some generated files are not rendered by default.

1 change: 1 addition & 0 deletions clients/gateway-client/Cargo.toml
@@ -21,5 +21,6 @@ serde.workspace = true
serde_json.workspace = true
schemars.workspace = true
slog.workspace = true
thiserror.workspace = true
uuid.workspace = true
omicron-workspace-hack.workspace = true
2 changes: 1 addition & 1 deletion clients/gateway-client/src/lib.rs
@@ -63,7 +63,7 @@ progenitor::generate_api!(
HostPhase2RecoveryImageId = { derives = [PartialEq, Eq, PartialOrd, Ord] },
ImageVersion = { derives = [PartialEq, Eq, PartialOrd, Ord] },
RotImageDetails = { derives = [PartialEq, Eq, PartialOrd, Ord] },
-RotImageError = { derives = [ PartialEq, Eq, PartialOrd, Ord] },
+RotImageError = { derives = [ thiserror::Error, PartialEq, Eq, PartialOrd, Ord] },
RotState = { derives = [PartialEq, Eq, PartialOrd, Ord] },
SpComponentCaboose = { derives = [PartialEq, Eq] },
SpIdentifier = { derives = [Copy, PartialEq, Hash, Eq] },
1 change: 1 addition & 0 deletions nexus/mgs-updates/Cargo.toml
@@ -11,6 +11,7 @@ chrono.workspace = true
futures.workspace = true
gateway-client.workspace = true
gateway-types.workspace = true
gateway-messages.workspace = true
id-map.workspace = true
internal-dns-resolver.workspace = true
internal-dns-types.workspace = true
16 changes: 15 additions & 1 deletion nexus/mgs-updates/src/common_sp_update.rs
@@ -8,6 +8,7 @@
use super::MgsClients;
use super::UpdateProgress;
use futures::future::BoxFuture;
use gateway_client::types::RotImageError;
use gateway_client::types::SpType;
use gateway_client::types::SpUpdateStatus;
use gateway_types::rot::RotSlot;
@@ -267,14 +268,15 @@ pub trait SpComponentUpdateHelper {
log: &'a slog::Logger,
mgs_clients: &'a mut MgsClients,
update: &'a PendingMgsUpdate,
-) -> BoxFuture<'a, Result<(), GatewayClientError>>;
+) -> BoxFuture<'a, Result<(), PostUpdateError>>;
}

/// Describes the live state of the component before the update begins
#[derive(Debug)]
pub enum PrecheckStatus {
UpdateComplete,
ReadyForUpdate,
WaitingForOngoingRotBootloaderUpdate,
}

#[derive(Debug, Error)]
@@ -319,6 +321,18 @@ pub enum PrecheckError {
WrongInactiveVersion { expected: ExpectedVersion, found: FoundVersion },
}

#[derive(Debug, thiserror::Error)]
pub enum PostUpdateError {
#[error("communicating with MGS")]
GatewayClientError(#[from] GatewayClientError),

#[error("communicating with RoT: {message:?}")]
RotCommunicationFailed { message: String },

#[error("invalid RoT bootloader image: {error:?}")]
RotBootloaderImageError { error: RotImageError },
Collaborator

I'm curious for @jgallagher's take on this but it would seem nice to me if the generic parts of this package (this file, the driver, and apply_update) didn't know so much about specific devices. This would preclude this type from including more specific typed errors like RotImageError, but I believe the only thing consumers of this error type care about is that the error is fatal to the update attempt.

So I'd consider renaming RotCommunicationFailed to TransientError and RotBootloaderImageError to FatalError. Both would just contain message: String.

Contributor Author

Yeah, that makes sense. Specifically, RotBootloaderImageError doesn't really mean anything without context. I'll make these more generic

Contributor Author

Done in e06418e
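
For illustration only, here is a minimal sketch of what a device-agnostic version of this error type might look like. The variant names `TransientError` and `FatalError` come from the suggestion above, and `is_transient()` is proposed elsewhere in this review. The `GatewayClient` stand-in variant, its `is_communication_error` flag, and the hand-written `Display` impl (used instead of `thiserror` to keep the sketch dependency-free) are assumptions. The actual change in e06418e may look different.

```rust
use std::error::Error;
use std::fmt;

/// Sketch of a post-update error that knows nothing about specific devices.
#[derive(Debug)]
pub enum PostUpdateError {
    /// Hypothetical stand-in for the variant wrapping `GatewayClientError`.
    GatewayClient { message: String, is_communication_error: bool },
    /// The operation failed but is worth retrying.
    TransientError { message: String },
    /// The failure is fatal to this update attempt.
    FatalError { message: String },
}

impl PostUpdateError {
    /// Returns true if the caller should retry rather than give up.
    pub fn is_transient(&self) -> bool {
        match self {
            PostUpdateError::GatewayClient { is_communication_error, .. } => {
                *is_communication_error
            }
            PostUpdateError::TransientError { .. } => true,
            PostUpdateError::FatalError { .. } => false,
        }
    }
}

impl fmt::Display for PostUpdateError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PostUpdateError::GatewayClient { message, .. } => {
                write!(f, "communicating with MGS: {message}")
            }
            PostUpdateError::TransientError { message } => {
                write!(f, "transient error: {message}")
            }
            PostUpdateError::FatalError { message } => {
                write!(f, "fatal error: {message}")
            }
        }
    }
}

impl Error for PostUpdateError {}
```

With this shape, consumers only need to ask whether an error is transient; device-specific detail travels in the message string.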

}

#[derive(Debug)]
pub enum FoundVersion {
MissingVersion,
66 changes: 57 additions & 9 deletions nexus/mgs-updates/src/driver_update.rs
@@ -4,6 +4,7 @@

//! Concurrent-safe facilities for doing MGS-managed updates

use crate::common_sp_update::PostUpdateError;
use crate::common_sp_update::PrecheckError;
use crate::common_sp_update::PrecheckStatus;
use crate::common_sp_update::STATUS_POLL_INTERVAL;
@@ -32,7 +33,7 @@ use uuid::Uuid;

/// How long may the status remain unchanged without us treating this as a
/// problem?
-pub const PROGRESS_TIMEOUT: Duration = Duration::from_secs(120);
+pub const PROGRESS_TIMEOUT: Duration = Duration::from_secs(180);

/// How long to wait between failed attempts to reset the device
const RESET_DELAY_INTERVAL: Duration = Duration::from_secs(10);
@@ -46,6 +47,14 @@ pub const DEFAULT_RETRY_TIMEOUT: Duration = Duration::from_secs(60);
/// How long to wait after resetting the device before expecting it to come up
const RESET_TIMEOUT: Duration = Duration::from_secs(60);

/// How long to wait for an ongoing RoT bootloader update
const WAIT_FOR_ONGOING_ROT_BOOTLOADER_UPDATE_TIMEOUT: Duration =
Duration::from_secs(180);

/// How long to wait between poll attempts on RoT bootloader update status
const ROT_BOOLOADER_UPDATE_PROGRESS_INTERVAL: Duration =
Duration::from_secs(10);

/// Parameters describing a request to update one SP-managed component
///
/// This is similar in spirit to the `SpComponentUpdater` trait but uses a
@@ -216,7 +225,11 @@ pub(crate) async fn apply_update(
// - if not, then if our required preconditions are met
status.update(UpdateAttemptStatus::Precheck);
match update_helper.precheck(log, &mut mgs_clients, update).await {
-Ok(PrecheckStatus::ReadyForUpdate) => (),
+Ok(PrecheckStatus::ReadyForUpdate) |
+// This is the first time a Nexus instance is attempting to
+// update the RoT bootloader, we don't need to wait for an
+// ongoing update.
+Ok(PrecheckStatus::WaitingForOngoingRotBootloaderUpdate) => (),
Collaborator

I'm not sure this is wrong, but it isn't what I had in mind in #7988 (comment). What I was proposing there was that:

  • precheck() would return:
    • ReadyForUpdate if it looks like no update is in progress (probably: stage0next is valid and matches stage0)
    • WaitForOngoingUpdate (nit: I wouldn't have this be specific to "RoT bootloader") if it looks like an update might be going on (probably: stage0next is invalid or it's valid but doesn't match stage0)
  • If we got WaitForOngoingUpdate here, we'd wait for up to PROGRESS_TIMEOUT for it to instead return ReadyForUpdate. If the timeout elapsed, we'd proceed as though we got ReadyForUpdate (but consider it like the "takeover" case -- log it as a takeover and report how accordingly).

The problem with what's here is that we don't know that there's no update ongoing and we might wind up trying to write to stage0next when some other update is trying to validate it and/or persist it. I think that would actually be fine if it happened once, but I don't see anything to prevent it from continuing to happen -- each Nexus constantly interrupting update attempts by other Nexus instances.

Contributor Author

Done!
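
The control flow proposed in the comment above can be sketched roughly as follows. This is a simplified, synchronous illustration, not the code that landed: the real driver is async and uses `tokio::time::sleep`, and the names `wait_until_ready` and `PrecheckOutcome` are hypothetical. `WaitForOngoingUpdate` is the device-neutral variant name suggested by the reviewer.

```rust
use std::time::{Duration, Instant};

/// Simplified copy of the precheck result, using the reviewer's proposed name.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PrecheckStatus {
    UpdateComplete,
    ReadyForUpdate,
    WaitForOngoingUpdate,
}

/// Hypothetical summary of what the caller should do next.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PrecheckOutcome {
    NoChangesNeeded,
    Proceed,
    /// The other update never finished within the timeout; take over.
    ProceedAsTakeover,
}

/// Polls `precheck` until it stops reporting an ongoing update or until
/// `timeout` elapses, in which case we proceed as a takeover.
pub fn wait_until_ready<F>(
    mut precheck: F,
    timeout: Duration,
    poll_interval: Duration,
) -> PrecheckOutcome
where
    F: FnMut() -> PrecheckStatus,
{
    let start = Instant::now();
    loop {
        match precheck() {
            PrecheckStatus::UpdateComplete => {
                return PrecheckOutcome::NoChangesNeeded;
            }
            PrecheckStatus::ReadyForUpdate => return PrecheckOutcome::Proceed,
            PrecheckStatus::WaitForOngoingUpdate => {
                if start.elapsed() >= timeout {
                    return PrecheckOutcome::ProceedAsTakeover;
                }
                std::thread::sleep(poll_interval);
            }
        }
    }
}
```

Waiting before writing is what keeps one Nexus instance from repeatedly interrupting another's attempt; the takeover only happens after the full timeout has passed.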

Ok(PrecheckStatus::UpdateComplete) => {
return Ok(UpdateCompletedHow::FoundNoChangesNeeded);
}
@@ -349,16 +362,33 @@ pub(crate) async fn apply_update(

if try_reset {
// We retry this until we get some error *other* than a communication
-// error. There is intentionally no timeout here. If we've staged an
-// update but not managed to reset the device, there's no point where
-// we'd want to stop trying to do so.
+// error or an RoT bootloader image error. There is intentionally no
+// timeout here. If we've staged an update but not managed to reset
+// the device, there's no point where we'd want to stop trying to do so.
Collaborator

Suggested change
-// error or an RoT bootloader image error. There is intentionally no
-// timeout here. If we've staged an update but not managed to reset
-// the device, there's no point where we'd want to stop trying to do so.
+// error or some other transient error. There is intentionally no
+// timeout here. If we've staged an update but not managed to reset
+// the device, there's no point where we'd want to stop trying to do so.

Contributor Author

Done!

while let Err(error) =
update_helper.post_update(log, &mut mgs_clients, update).await
{
-if !matches!(error, gateway_client::Error::CommunicationError(_)) {
-let error = InlineErrorChain::new(&error);
-error!(log, "post_update failed"; &error);
-return Err(ApplyUpdateError::SpResetFailed(error.to_string()));
+match error {
+PostUpdateError::GatewayClientError(error) => {
Collaborator

Suggestion: add an is_transient() (or is_fatal()) method to PostUpdateError. Then replace this whole match block with:

if !error.is_transient() {
    let error = InlineErrorChain::new(&error);    
    error!(log, "post_update failed"; &error);
    return Err(ApplyUpdateError::SpResetFailed(
        error.to_string(),
    ));
}
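
To show how that check would sit inside the retry loop, here is a small self-contained sketch. It is synchronous for simplicity (the real loop awaits `post_update` and sleeps with `tokio::time::sleep`), and the `Transient` trait, `retry_until_fatal` function, and `DemoError` type are hypothetical names used only for illustration.

```rust
use std::time::Duration;

/// Hypothetical trait standing in for `PostUpdateError::is_transient()`.
pub trait Transient {
    fn is_transient(&self) -> bool;
}

/// Toy error type used only to exercise the loop.
#[derive(Debug, PartialEq)]
pub enum DemoError {
    Transient,
    Fatal,
}

impl Transient for DemoError {
    fn is_transient(&self) -> bool {
        matches!(self, DemoError::Transient)
    }
}

/// Retries `attempt` with no overall timeout, stopping only on success
/// or on the first error that is not transient.
pub fn retry_until_fatal<E, F>(mut attempt: F, delay: Duration) -> Result<(), E>
where
    E: Transient,
    F: FnMut() -> Result<(), E>,
{
    while let Err(error) = attempt() {
        if !error.is_transient() {
            // Fatal: give up and report the error to the caller.
            return Err(error);
        }
        // Transient: wait a bit and try again.
        std::thread::sleep(delay);
    }
    Ok(())
}
```

The loop body no longer needs to know which variants exist, so adding a new error variant only requires updating `is_transient()`.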

if !matches!(
error,
gateway_client::Error::CommunicationError(_)
) {
let error = InlineErrorChain::new(&error);
error!(log, "post_update failed"; &error);
return Err(ApplyUpdateError::SpResetFailed(
error.to_string(),
));
}
}
PostUpdateError::RotBootloaderImageError { error } => {
let error = InlineErrorChain::new(&error);
error!(log, "post_update failed"; &error);
return Err(ApplyUpdateError::SpResetFailed(
error.to_string(),
));
}
PostUpdateError::RotCommunicationFailed { message: _ } => {}
}

tokio::time::sleep(RESET_DELAY_INTERVAL).await;
@@ -598,6 +628,24 @@ async fn wait_for_update_done(
// Check if we're done.
Ok(PrecheckStatus::UpdateComplete) => return Ok(()),

// We'll loop for 3 minutes to wait for any ongoing RoT bootloader update.
// We need to wait for 2 resets, which have a timeout of 60 seconds each,
// and an attempt to retrieve boot info, which has a timeout of 30 seconds.
// We give an additional 30 seconds as a buffer for the other actions.
Ok(PrecheckStatus::WaitingForOngoingRotBootloaderUpdate) => {
if before.elapsed()
>= WAIT_FOR_ONGOING_ROT_BOOTLOADER_UPDATE_TIMEOUT
Collaborator

Could we use the caller-provided timeout here and bump that one if necessary to match the value you're using here? That seems a lot simpler to me than having multiple timeouts, some caller-provided and some hardcoded, plus special knowledge of which timeouts to use for which devices.
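
As a rough check of the numbers involved, the 3-minute figure in the code comment under review is the sum of several smaller timeouts, and it equals the bumped `PROGRESS_TIMEOUT`. The sketch below spells out that arithmetic. `RESET_TIMEOUT` and `PROGRESS_TIMEOUT` appear in the diff; `BOOT_INFO_TIMEOUT`, `BUFFER`, and `ongoing_update_budget` are hypothetical names for the values described in the comment.

```rust
use std::time::Duration;

/// From the diff: how long to wait after a reset for the device to come up.
const RESET_TIMEOUT: Duration = Duration::from_secs(60);

/// Hypothetical name; the 30-second figure comes from the code comment.
const BOOT_INFO_TIMEOUT: Duration = Duration::from_secs(30);

/// Hypothetical name; extra slack for the other actions, per the comment.
const BUFFER: Duration = Duration::from_secs(30);

/// From the diff: raised from 120 to 180 seconds in this change.
pub const PROGRESS_TIMEOUT: Duration = Duration::from_secs(180);

/// Two resets, one boot-info query, plus a buffer.
pub const fn ongoing_update_budget() -> Duration {
    Duration::from_secs(
        2 * RESET_TIMEOUT.as_secs()
            + BOOT_INFO_TIMEOUT.as_secs()
            + BUFFER.as_secs(),
    )
}
```

Because the budget matches `PROGRESS_TIMEOUT`, a single caller-provided timeout could plausibly cover this case without a separate hardcoded constant.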

{
return Err(UpdateWaitError::Timeout(
WAIT_FOR_ONGOING_ROT_BOOTLOADER_UPDATE_TIMEOUT,
));
}

tokio::time::sleep(ROT_BOOLOADER_UPDATE_PROGRESS_INTERVAL)
.await;
continue;
}
Contributor Author

@davepacheco does this implementation match what you proposed in #7988 (comment)? Or is there something I missed?

Collaborator

I think not. More on this in my comment above.


// An incorrect version in the "inactive" slot, incorrect active slot,
// or non-empty pending_persistent_boot_preference/transient_boot_preference
// are normal during the upgrade. We have no reason to think these won't