Fix missing load snapshot after manual snapshot update #2955
```rust
.last_automatic_snapshot_index
{
    // automatic snapshots are only considered until the first failure
    // if there are updates, ignore the automatic snapshot temporarily to catch issues earlier
```
I assume that is why the `pending_update.is_none()` case was there.
```rust
        CommandOutcome::Continue => continue,
        other => break other,
    }
if status.pending_updates.front().is_some() {
```
Are we sure that there is always an `AgentInvocation::ManualUpdate` enqueued and processed for saving the snapshot before we reach here? This logic probably became a bit obscure through all the refactorings.
What if there are multiple manual updates enqueued? When we perform the first one and restart, what enqueues the thing in the command queue?
Yes. `pending_updates` is only based on the `pending_update` oplog entry, which for manual snapshot updates is only created along the following path:
agent receives update request via gRPC -> agent writes a manual update `pending_agent_invocation` oplog entry -> agent processes the invocation -> agent writes a `pending_update` oplog entry
For automatic updates the `pending_update` oplog entry is written immediately.
```rust
OplogEntry::Snapshot {
    data, mime_type, ..
} => (data, mime_type),
OplogEntry::PendingUpdate {
```
I understand this fixes the main issue, and now that we have snapshot-based recovery this is probably a nice way to fix it, but I am very confused that there was no machinery for this earlier.
I think it was supposed to be something like this:
- once a manual update succeeds, it adds a skipped region from the beginning to the update point so that oplog part is ignored
- so on next recovery, we reach the `OplogEntry::PendingUpdate { SnapshotBased }` again, and the recovery code calls `load-snapshot` so our state is restored
- the rest of the oplog gets replayed
So I don't fully understand why this did not work, and whether it is a problem or not that we now have two ways to recover from a snapshot-based update. (As I think you did not touch the old one.)
I believe this was always broken, or at least broken for a very long time.
`PendingUpdate` is a hint entry and I don't see any logic in replay that would call `load-snapshot`. This is surprising to me too; maybe I missed something.
My understanding of how it work(ed) on main:
- a manual update request arrives and gets enqueued in the invocation queue (because it needs to save a snapshot between invocations) (https://github.com/golemcloud/golem/blob/main/golem-worker-executor/src/grpc/mod.rs#L1130)
- the invocation loop picks it up and runs `manual_update`, which invokes the save-snapshot function, adds a pending update entry with the actual snapshot, and restarts the agent (https://github.com/golemcloud/golem/blob/main/golem-worker-executor/src/worker/invocation_loop.rs#L718)
- the pending update applies a temporary "skipped region override" that skips everything after `CREATE` until the pending update entry. It's an override and not just a "set", because in case of a failed update attempt we undo it (https://github.com/golemcloud/golem/blob/main/golem-worker-executor/src/worker/status.rs#L405-L415)
- because of the pending update entry, the agent reloads with the new component version (https://github.com/golemcloud/golem/blob/main/golem-worker-executor/src/worker/mod.rs#L2102-L2116)
- it gets to `prepare_instance`, which finds that we are executing a pending snapshot-based update and skips the replay, instead going to `finalize_pending_snapshot_update` (https://github.com/golemcloud/golem/blob/main/golem-worker-executor/src/durable_host/mod.rs#L2320-L2323)
- here we call `load-snapshot` (the agent SDK is supposed to internally initialize the agent as part of this) (https://github.com/golemcloud/golem/blob/main/golem-worker-executor/src/durable_host/mod.rs#L1048-L1055)
- if it fails, we mark it as a failed update, the pending update is gone from the queue, and we restart, continuing with the previous version (or picking the next pending update)
- if it succeeds, we record a successful update oplog entry (https://github.com/golemcloud/golem/blob/main/golem-worker-executor/src/durable_host/mod.rs#L1832-L1846) and the agent is considered ready
- the worker status calculation collapses the skipped region (basically making the override final): https://github.com/golemcloud/golem/blob/main/golem-worker-executor/src/worker/status.rs#L417-L422

I think the part that's hard to see/understand is the skipped region logic, which now lives 100% in the worker status calculation. Previously it was explicitly set at various points of the above update logic, making it easier to follow (but the current way of calculating everything directly from the oplog is definitely the correct way).
What you are saying is exactly correct, but it's not doing this part in any following replays, as `pending_updates` will not contain the update anymore (as it now has a successful update entry):

> here we call `load-snapshot` (the agent SDK is supposed to internally initialize the agent as part of this) (https://github.com/golemcloud/golem/blob/main/golem-worker-executor/src/durable_host/mod.rs#L1048-L1055)
```rust
record_resume_worker(start.elapsed());
let replay_result = async {
    if let SnapshotRecoveryResult::Failed =
        Self::try_load_snapshot(store, instance).await
```
Maybe I misunderstand this part, but for automatic updates we should never use snapshots; we should always replay from the beginning.
This is only using snapshots created by manual updates, not the automatic snapshots (the logic for those lives in the `WorkerConfig` creation).
I don't think we can, in general, replay the parts of the oplog that were skipped as part of a manual update with a new component version (as manual snapshot updates can freely break oplog backwards compatibility). So I think replaying only from the last successful manual snapshot update is correct here.
golem-common/src/model/mod.rs
```rust
pub last_snapshot_index: Option<OplogIndex>,
/// Index of the last manual update snapshot index. Agent will call load_snapshot
/// on this payload before starting replay.
pub last_manual_snapshot_index: Option<OplogIndex>,
```
The name is a bit confusing to me without the comments. How about:
- `last_manual_update_snapshot_index`
- `last_snapshot_index` (or we can have "automatic" in it, but what I'd like is that one is for updates and the other has nothing to do with updates - and automatic updates are a thing which has nothing to do with this)
let's do `last_manual_update_snapshot_index` + `last_automatic_snapshot_index` 👍
Will update. Done.
This fixes two bugs:
Duplicate enqueued updates
When the agent had no pending work in the active queue and checked `pending_updates`, it would always perform a manual update whenever a pending update was present. That is incorrect in two ways:
This is a bit difficult to reproduce in our current tests, as it only shows up in a particular interleaving due to a race with the interrupt / restart after an update is enqueued. The bug can be reproduced consistently if that restart is commented out.
Manual updates do not call load-snapshot on replay start
Manual updates did not call load-snapshot on replay start, instead just resuming from the update-succeeded oplog entry. This results in agents failing because they are uninitialized (the original initialize is skipped due to the update, but load-snapshot is never called). The fix is to do the same as for the automatic snapshots and track the last save-snapshot payload that needs to be loaded.