-
Notifications
You must be signed in to change notification settings - Fork 63
[reconfigurator] Pre-checks and post_update actions for RoT bootloader update #8325
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 9 commits
2baa799
8072c7d
2de9298
f578e35
fa1ba5e
3ced091
c15ca08
4fb917a
dff0254
c6be847
25b0f40
e06418e
b93303f
0fbdcf1
609122c
0d40a73
92e97e3
69a8a4f
2fd4e71
99bd5da
dc9e787
0e76cce
486b54b
fec4545
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
| Original file line number | Diff line number | Diff line change | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -4,6 +4,7 @@ | |||||||||||||
|
|
||||||||||||||
| //! Concurrent-safe facilities for doing MGS-managed upates | ||||||||||||||
|
|
||||||||||||||
| use crate::common_sp_update::PostUpdateError; | ||||||||||||||
| use crate::common_sp_update::PrecheckError; | ||||||||||||||
| use crate::common_sp_update::PrecheckStatus; | ||||||||||||||
| use crate::common_sp_update::STATUS_POLL_INTERVAL; | ||||||||||||||
|
|
@@ -32,7 +33,7 @@ use uuid::Uuid; | |||||||||||||
|
|
||||||||||||||
| /// How long may the status remain unchanged without us treating this as a | ||||||||||||||
| /// problem? | ||||||||||||||
| pub const PROGRESS_TIMEOUT: Duration = Duration::from_secs(120); | ||||||||||||||
| pub const PROGRESS_TIMEOUT: Duration = Duration::from_secs(180); | ||||||||||||||
karencfv marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||||||||||||||
|
|
||||||||||||||
| /// How long to wait between failed attempts to reset the device | ||||||||||||||
| const RESET_DELAY_INTERVAL: Duration = Duration::from_secs(10); | ||||||||||||||
|
|
@@ -46,6 +47,14 @@ pub const DEFAULT_RETRY_TIMEOUT: Duration = Duration::from_secs(60); | |||||||||||||
| /// How long to wait after resetting the device before expecting it to come up | ||||||||||||||
| const RESET_TIMEOUT: Duration = Duration::from_secs(60); | ||||||||||||||
|
|
||||||||||||||
| /// How long to wait for an ongoing RoT bootloader update | ||||||||||||||
| const WAIT_FOR_ONGOING_ROT_BOOTLOADER_UPDATE_TIMEOUT: Duration = | ||||||||||||||
| Duration::from_secs(180); | ||||||||||||||
|
|
||||||||||||||
| /// How long to wait between poll attempts on RoT bootloader update status | ||||||||||||||
| const ROT_BOOLOADER_UPDATE_PROGRESS_INTERVAL: Duration = | ||||||||||||||
| Duration::from_secs(10); | ||||||||||||||
|
|
||||||||||||||
| /// Parameters describing a request to update one SP-managed component | ||||||||||||||
| /// | ||||||||||||||
| /// This is similar in spirit to the `SpComponentUpdater` trait but uses a | ||||||||||||||
|
|
@@ -216,7 +225,11 @@ pub(crate) async fn apply_update( | |||||||||||||
| // - if not, then if our required preconditions are met | ||||||||||||||
| status.update(UpdateAttemptStatus::Precheck); | ||||||||||||||
| match update_helper.precheck(log, &mut mgs_clients, update).await { | ||||||||||||||
| Ok(PrecheckStatus::ReadyForUpdate) => (), | ||||||||||||||
| Ok(PrecheckStatus::ReadyForUpdate) | | ||||||||||||||
| // This is the first time a Nexus instance is attempting to | ||||||||||||||
| // update the RoT bootloader, we don't need to wait for an | ||||||||||||||
| // ongoing update. | ||||||||||||||
| Ok(PrecheckStatus::WaitingForOngoingRotBootloaderUpdate) => (), | ||||||||||||||
|
||||||||||||||
| Ok(PrecheckStatus::UpdateComplete) => { | ||||||||||||||
| return Ok(UpdateCompletedHow::FoundNoChangesNeeded); | ||||||||||||||
| } | ||||||||||||||
|
|
@@ -349,16 +362,33 @@ pub(crate) async fn apply_update( | |||||||||||||
|
|
||||||||||||||
| if try_reset { | ||||||||||||||
| // We retry this until we get some error *other* than a communication | ||||||||||||||
| // error. There is intentionally no timeout here. If we've staged an | ||||||||||||||
| // update but not managed to reset the device, there's no point where | ||||||||||||||
| // we'd want to stop trying to do so. | ||||||||||||||
| // error or an RoT bootloader image error. There is intentionally no | ||||||||||||||
| // timeout here. If we've staged an update but not managed to reset | ||||||||||||||
| // the device, there's no point where we'd want to stop trying to do so. | ||||||||||||||
|
||||||||||||||
| // error or an RoT bootloader image error. There is intentionally no | |
| // timeout here. If we've staged an update but not managed to reset | |
| // the device, there's no point where we'd want to stop trying to do so. | |
| // error or some other transient error. There is intentionally no | |
| // timeout here. If we've staged an update but not managed to reset | |
| // the device, there's no point where we'd want to stop trying to do so. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: add a is_transient() (or is_fatal()) to PostUpdateError. Then replace this whole match block with:
if !error.is_transient() {
let error = InlineErrorChain::new(&error);
error!(log, "post_update failed"; &error);
return Err(ApplyUpdateError::SpResetFailed(
error.to_string(),
));
}
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we use the caller-provided timeout here and bump that one if necessary to match the value you're using here? That seems a lot simpler to me than having multiple timeouts, some caller-provided and some hardcoded, plus special knowledge of which timeouts to use for which devices.
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@davepacheco is this implementation accurate with #7988 (comment) ? Or is there something I missed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think not. More on this in my comment above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious for @jgallagher's take on this but it would seem nice to me if the generic parts of this package (this file, the driver, and
apply_update) didn't know so much about specific devices. This would preclude this type from including more specific typed errors likeRotImageError, but I believe the only thing consumers of this error type care about is that the error is fatal to the update attempt.So I'd consider renaming
RotCommunicationFailedtoTransientErrorandRotBootloaderImageErrortoFatalError. Both would just containmessage: String.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that makes sense. Specifically,
RotBootloaderImageErrordoesn't really mean anything without context. I'll make these more genericThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in e06418e