Fix missing load snapshot after manual snapshot update #2955
```rust
.last_automatic_snapshot_index
{
    // automatic snapshots are only considered until the first failure
    // if there are updates, ignore the automatic snapshot temporarily to catch issues earlier
```
I assume that is why the `pending_update.is_none()` case was there.
```rust
        CommandOutcome::Continue => continue,
        other => break other,
    }
if status.pending_updates.front().is_some() {
```
Are we sure that there is always an `AgentInvocation::ManualUpdate` enqueued and processed for saving the snapshot before we reach here? This logic probably became a bit obscure through all the refactorings.
What if there are multiple manual updates enqueued? When we perform the first one and restart, what enqueues the thing in the command queue?
Yes. `pending_updates` is only based on the `pending_update` oplog entry, which for manual snapshot updates is only created along the following path:
agent receives update request via gRPC -> agent writes a manual update `pending_agent_invocation` oplog entry -> agent processes the invocation -> agent writes a `pending_update` oplog entry
For automatic updates the `pending_update` oplog entry is written immediately.
```rust
OplogEntry::Snapshot {
    data, mime_type, ..
} => (data, mime_type),
OplogEntry::PendingUpdate {
```
I understand this fixes the main issue, and now that we have snapshot-based recovery this is probably a nice way to fix it, but I am very confused that there was no machinery for this earlier.
I think it was supposed to be something like this:
- once a manual update succeeds, it adds a skipped region from the beginning to the update point so that oplog part is ignored
- so on next recovery, we reach the `OplogEntry::PendingUpdate { SnapshotBased }` again, and the recovery code calls `load-snapshot` so our state is restored
- the rest of the oplog gets replayed
So I don't fully understand why this did not work, and whether it is a problem or not that we now have two ways to recover from a snapshot-based update. (As I think you did not touch the old one.)
I believe this was always broken, or at least broken for a very long time.
`PendingUpdate` is a hint entry and I don't see any logic in replay that would call `load-snapshot`. This is surprising to me too; maybe I missed something.
My understanding of how it work(ed) on main:
- a manual update request arrives and gets enqueued in the invocation queue (because it needs to save a snapshot between invocations) (https://github.com/golemcloud/golem/blob/main/golem-worker-executor/src/grpc/mod.rs#L1130)
- the invocation loop picks it up and runs `manual_update`, which invokes the save-snapshot function, adds a pending update entry with the actual snapshot, and restarts the agent (https://github.com/golemcloud/golem/blob/main/golem-worker-executor/src/worker/invocation_loop.rs#L718)
- the pending update applies a temporary "skipped region override" that skips everything after `CREATE` until the pending update entry. It's an override and not just a "set", because in case of a failed update attempt we undo it (https://github.com/golemcloud/golem/blob/main/golem-worker-executor/src/worker/status.rs#L405-L415)
- because of the pending update entry, the agent reloads with the new component version (https://github.com/golemcloud/golem/blob/main/golem-worker-executor/src/worker/mod.rs#L2102-L2116)
- it gets to `prepare_instance`, which finds that we are executing a pending snapshot-based update and skips the replay, instead going to `finalize_pending_snapshot_update` (https://github.com/golemcloud/golem/blob/main/golem-worker-executor/src/durable_host/mod.rs#L2320-L2323)
- here we call `load-snapshot` (the agent SDK is supposed to internally initialize the agent as part of this) (https://github.com/golemcloud/golem/blob/main/golem-worker-executor/src/durable_host/mod.rs#L1048-L1055)
- if it fails, we mark it as a failed update, the pending update is gone from the queue, and we restart, continuing with the previous version (or picking the next pending update)
- if it succeeds, we record a successful update oplog entry (https://github.com/golemcloud/golem/blob/main/golem-worker-executor/src/durable_host/mod.rs#L1832-L1846) and the agent is considered ready
- the worker status calculation collapses the skipped region (basically making the override final): https://github.com/golemcloud/golem/blob/main/golem-worker-executor/src/worker/status.rs#L417-L422

I think the part that's hard to see/understand is the skipped region logic, which now lives 100% in the worker status calculation. Previously it was explicitly set at various points of the above update logic, making it easier to follow (but the current way of calculating everything directly from the oplog is definitely the correct way).
What you are saying is exactly correct, but it's not doing this part in any following replays, as `pending_updates` will not contain the update anymore (as it now has a successful update entry):

> here we call `load-snapshot` (the agent SDK is supposed to internally initialize the agent as part of this) (https://github.com/golemcloud/golem/blob/main/golem-worker-executor/src/durable_host/mod.rs#L1048-L1055)
```rust
record_resume_worker(start.elapsed());
let replay_result = async {
    if let SnapshotRecoveryResult::Failed =
        Self::try_load_snapshot(store, instance).await
```
Maybe I misunderstand this part, but for automatic updates we should never use snapshots; we should always replay from the beginning.
This is only using snapshots created by manual updates, not the automatic snapshots (the logic for those lives in the `WorkerConfig` creation).
I don't think we can, in general, replay the parts of the oplog that were skipped as part of a manual update with a new component version (as manual snapshot updates can freely break oplog backwards compatibility). So I think replaying only from the last successful manual snapshot update is correct here.
golem-common/src/model/mod.rs
```rust
pub last_snapshot_index: Option<OplogIndex>,
/// Index of the last manual update snapshot index. Agent will call load_snapshot
/// on this payload before starting replay.
pub last_manual_snapshot_index: Option<OplogIndex>,
```
The name is a bit confusing to me without the comments. How about:
- `last_manual_update_snapshot_index`
- `last_snapshot_index` (or we can have "automatic" in it, but what I'd like is that one is for updates and the other has nothing to do with updates - and automatic updates are a thing which has nothing to do with this)
let's do `last_manual_update_snapshot_index` + `last_automatic_snapshot_index` 👍
Will update. Done.
This fixes two bugs:
Duplicate enqueued updates
When the agent had no pending work in the active queue and checked `pending_updates`, it would always perform a manual update whenever a pending update was present. That is incorrect in two ways:
This is a bit difficult to reproduce in our current tests, as it only shows up in a particular interleaving due to a race with the interrupt / restart after an update is enqueued. The bug can be reproduced consistently if that restart is commented out.
Manual updates do not call load-snapshot on replay start
Manual updates did not call load-snapshot on replay start, instead just resuming from the update-succeeded oplog entry. This results in agents failing because they are uninitialized (the original initialize is skipped due to the update, but load-snapshot is never called). The fix is to do the same as for the automatic snapshots and track the last save-snapshot payload that needs to be loaded.