# Fix missing load snapshot after manual snapshot update #2955
```diff
@@ -1173,6 +1173,13 @@ impl<Ctx: WorkerCtx + DurableWorkerCtxView<Ctx>> DurableWorkerCtx<Ctx> {
                 OplogEntry::Snapshot {
                     data, mime_type, ..
                 } => (data, mime_type),
+                OplogEntry::PendingUpdate {
+                    description:
+                        UpdateDescription::SnapshotBased {
+                            payload, mime_type, ..
+                        },
+                    ..
+                } => (payload, mime_type),
                 _ => {
                     warn!(
                         "Expected Snapshot entry at oplog index {snapshot_index}, found different entry; falling back to full replay"
```
```diff
@@ -2325,11 +2332,25 @@ impl<Ctx: WorkerCtx + DurableWorkerCtxView<Ctx>> ExternalOperations<Ctx> for Dur
                 UpdateDescription::Automatic {
                     target_revision, ..
                 } => {
-                    // snapshot update will be succeeded as part of the replay.
-                    let result = Self::resume_replay(store, instance, false).await;
-                    record_resume_worker(start.elapsed());
+                    let replay_result = async {
+                        if let SnapshotRecoveryResult::Failed =
+                            Self::try_load_snapshot(store, instance).await
+                        {
+                            return Err(WorkerExecutorError::failed_to_resume_worker(
+                                agent_id.clone(),
+                                WorkerExecutorError::runtime("loading snapshot failed"),
+                            ));
+                        };
+                        // automatic update will be succeeded as part of the replay.
+                        let result = Self::resume_replay(store, instance, false).await?;
+
+                        record_resume_worker(start.elapsed());
+
+                        Ok(result)
+                    }
+                    .await;
 
-                    match result {
+                    match replay_result {
                         Err(error) => {
                             // replay failed. There are two cases here:
                             // 1. We failed before the update has succeeded. In this case we fail the update and retry the replay.
```

A review thread was attached to the `try_load_snapshot` call in this hunk:

> **Contributor:** Maybe I misunderstand this part - but for automatic updates we should never use snapshots, but replay always from the beginning.
>
> **Contributor (Author):** This is only using snapshots created by manual updates, not the automatic snapshots (the logic for that lives in the `WorkerConfig` creation). I don't think we can replay the parts of the oplog that were skipped as part of a manual update with a new component version in general (as manual snapshot updates can freely break oplog backwards compat). So I think only replaying from the last successful manual snapshot update is correct here.
```diff
@@ -2367,18 +2388,13 @@ impl<Ctx: WorkerCtx + DurableWorkerCtxView<Ctx>> ExternalOperations<Ctx> for Dur
                             _ => Err(error),
                         }
                     }
-                    _ => result,
+                    _ => replay_result,
                 }
             }
         }
         None => match Self::try_load_snapshot(store, instance).await {
-            SnapshotRecoveryResult::Success => {
-                let result = Self::resume_replay(store, instance, false).await;
-                record_resume_worker(start.elapsed());
-                result
-            }
-            SnapshotRecoveryResult::NotAttempted => {
+            SnapshotRecoveryResult::Success | SnapshotRecoveryResult::NotAttempted => {
                 let result = Self::resume_replay(store, instance, false).await;
                 record_resume_worker(start.elapsed());
                 result
```
```diff
@@ -327,22 +327,13 @@ impl<Ctx: WorkerCtx> InnerInvocationLoop<'_, Ctx> {
     /// first pending_updates, then pending_invocations
     async fn drain_pending_from_status(&mut self) -> CommandOutcome {
         loop {
-            let status = self.parent.last_known_status.read().await.clone();
+            let status = self.parent.get_non_detached_last_known_status().await;
 
             // First, try to process a pending update
-            if let Some(update) = status.pending_updates.front() {
-                let target_revision = *update.description.target_revision();
-                let mut store = self.store.lock().await;
-                let mut invocation = Invocation {
-                    owned_agent_id: self.owned_agent_id.clone(),
-                    parent: self.parent.clone(),
-                    instance: self.instance,
-                    store: store.deref_mut(),
-                };
-                match invocation.manual_update(target_revision).await {
-                    CommandOutcome::Continue => continue,
-                    other => break other,
-                }
+            if status.pending_updates.front().is_some() {
+                // if the update made it to pending_updates (instead of pending invocations), it is ready
+                // to be processed on next restart. So just restart here and let the recovery logic take over
+                break CommandOutcome::BreakInnerLoop(RetryDecision::Immediate);
             }
 
             // Then, try to process a pending invocation
```

A review thread was attached to this hunk:

> **Contributor:** Are we sure that there is always an `AgentInvocation::ManualUpdate` enqueued and processed for saving the snapshot before we reach here? Probably this logic became a bit obscure through all the refactorings.
>
> **Contributor (Author):** Yes. `pending_updates` is only based on the `pending_update` oplog entry, which for manual snapshot updates is only created along the following path:
>
> 1. agent receives update request via grpc
> 2. agent writes a manual update `pending_agent_invocation` oplog entry
> 3. agent processes the invocation
> 4. agent writes a `pending_update` oplog entry
>
> For automatic updates the `pending_update` oplog entry is written immediately.
> **Comment:** I understand this fixes the main issue, and now that we have the snapshot-based recovery this is probably a nice way to fix it, but I am very confused that there was no machinery for this earlier. I think it was supposed to be something like this: the `OplogEntry::PendingUpdate { SnapshotBased }` entry is reached again during replay, and the recovery code calls `load-snapshot` so our state is restored. So I don't fully understand why this did not work, and whether it is a problem or not that now we have two ways to recover from a snapshot-based update. (As I think you did not touch the old one.)
> **Comment:** I believe this was always broken, or at least broken for a very long time. `PendingUpdate` is a hint entry and I don't see any logic in replay that would replay `load-snapshot`. This is surprising to me too, maybe I missed something.
> **Comment:** My understanding of how it work(ed) on `main`:
>
> 1. `manual_update` invokes the save-snapshot function, adds a pending update entry with the actual snapshot and restarts the agent (https://github.com/golemcloud/golem/blob/main/golem-worker-executor/src/worker/invocation_loop.rs#L718)
> 2. the worker status calculation overrides the skipped region from `CREATE` until the pending update entry. It's an override and not just a "set", because in case of a failed update attempt we undo it (https://github.com/golemcloud/golem/blob/main/golem-worker-executor/src/worker/status.rs#L405-L415)
> 3. recovery goes through `prepare_instance`, that finds that we are executing a pending snapshot-based update, and skips the replay, instead goes to `finalize_pending_snapshot_update` (https://github.com/golemcloud/golem/blob/main/golem-worker-executor/src/durable_host/mod.rs#L2320-L2323)
> 4. which calls `load-snapshot` (the agent sdk is supposed to internally initialize the agent as part of this) (https://github.com/golemcloud/golem/blob/main/golem-worker-executor/src/durable_host/mod.rs#L1048-L1055)
>
> I think the part that's hard to see/understand is the skipped region logic that now 100% lives in the worker status calculation. Previously it was explicitly set in various points of the above update logic, making it easier to follow (but the current way of calculating everything directly from the oplog is definitely the correct way).
> **Comment:** What you are saying is exactly correct, but it's not doing this part in any following replays, as `pending_updates` will not contain the update anymore (as it now has a successful update entry).