VM Snapshot Runtime — Replace Event Replay with WASM Snapshot/Restore #1298
TooTallNate
announced in
RFCs
Summary
This RFC proposes an alternative runtime execution model for Workflow DevKit that replaces event-replay with WASM VM snapshotting. Instead of re-executing the workflow from the beginning and replaying the entire event log on every resumption, we:
- [...] quickjs-wasi)
Motivation
The current event-replay model has scaling limitations:
The snapshot approach eliminates these scaling issues:
Prior Art: quickjs-wasi
The quickjs-wasi package was built specifically to explore this technique. It compiles QuickJS-NG to WASM as a WASI reactor and provides:
- [...] serializeSnapshot()/deserializeSnapshot() for persistent storage
- [...] wasi.now) for deterministic Date.now() and Math.random() PRNG seeding
The core snapshot/restore mechanism has been validated:
- [...] quickjs-wasi@0.1.0
- [...] quickjs-emscripten with quickjs-wasi (PR #394)
Proposed Changes
What Changes vs What Stays the Same
Stays the same:
- [...] "use workflow"/"use step" directives, function syntax)
- [...] World interface for everything except snapshots
Changes:
- [...] vm.Context to QuickJS WASM (via quickjs-wasi)
- [...] EventsConsumer) is replaced by snapshot/restore
- [...] world.snapshots storage interface
- [...] quickjs-wasi's WASI options + Math.random override
New World Interface: snapshots
The lastEventId is stored alongside the snapshot data (not inside the VM); it is storage-layer metadata. In world-local, this is a .json sidecar file next to the .bin snapshot. In world-vercel (S3), it would use S3's x-amz-meta- metadata headers.
Execution Flow
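For reference while reading the flow, the snapshots storage interface introduced under New World Interface could be sketched roughly as follows. The RFC names world.snapshots.save(runId, bytes, { lastEventId }), load(runId) returning { data, metadata } or null, and delete(runId); the exact type names, the string type for lastEventId, and the in-memory class are assumptions made only for illustration.

```typescript
// Hypothetical shapes: only save/load/delete and lastEventId come from the RFC.
interface SnapshotMetadata {
  lastEventId: string; // assumed to be a string identifier
}

interface StoredSnapshot {
  data: Uint8Array;
  metadata: SnapshotMetadata;
}

interface SnapshotStorage {
  save(runId: string, data: Uint8Array, metadata: SnapshotMetadata): Promise<void>;
  load(runId: string): Promise<StoredSnapshot | null>;
  delete(runId: string): Promise<void>;
}

// In-memory stand-in that only illustrates the contract; not a real World backend.
class MemorySnapshotStorage implements SnapshotStorage {
  private readonly entries = new Map<string, StoredSnapshot>();

  async save(runId: string, data: Uint8Array, metadata: SnapshotMetadata): Promise<void> {
    // Copy the bytes so later mutation of the caller's buffer cannot change the stored snapshot.
    this.entries.set(runId, { data: data.slice(), metadata: { ...metadata } });
  }

  async load(runId: string): Promise<StoredSnapshot | null> {
    // null signals "no snapshot yet", i.e. the first invocation of a run.
    return this.entries.get(runId) ?? null;
  }

  async delete(runId: string): Promise<void> {
    this.entries.delete(runId);
  }
}
```

Keeping the metadata outside the snapshot bytes means the host can read lastEventId without restoring the VM.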
First Invocation (no snapshot)
- [...] running
- world.snapshots.load(runId) → null (no snapshot yet)
- wasi.now(): fixed timestamp from run.startedAt
- memoryLimit, interruptHandler for safety
- [...] Math.random with seeded PRNG (seed: {runId}:{workflowName}:{startedAt})
- [...] WORKFLOW_USE_STEP, sleep, createHook as host functions on VM globals
- [...] Deferred promise, stores metadata → returns promise handle to QuickJS
- executePendingJobs() → QuickJS suspends on the deferred promise
- [...] WorkflowSuspension)
- vm.snapshot() → QuickJS.serializeSnapshot() → world.snapshots.save(runId, bytes, { lastEventId })
- [...] step_created/wait_created/hook_created events, queue step messages (same as today)
Subsequent Invocations (snapshot exists)
- world.snapshots.load(runId) → { data, metadata }
- [...] metadata.lastEventId (using existing events.list() with cursor)
- QuickJS.deserializeSnapshot(data) → QuickJS.restore(snapshot, options)
- [...] registerHostCallback() for step/hook/sleep functions
- step_completed → find pending Deferred by correlationId → deferred.resolve(result)
- hook_received → find pending Deferred → deferred.resolve(payload)
- wait_completed → find pending Deferred → deferred.resolve()
- step_failed → find pending Deferred → deferred.reject(error)
- vm.executePendingJobs() → workflow code continues from where it left off
- [...] lastEventId, create events, queue steps
- [...] run_completed event, world.snapshots.delete(runId)
- [...] run_failed event, world.snapshots.delete(runId)
Pending Operation Tracking
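The event-to-Deferred dispatch in the subsequent-invocation flow can be sketched host-side as below. This is a minimal illustration, not the proposed runtime: the event field names (eventId, correlationId, result, payload, error), the Deferred class, and the applyEvents helper are assumptions, and removing an entry once its event arrives is a simplification that may not suit hooks that receive several payloads.

```typescript
// Minimal Deferred: a promise plus its externally callable settle functions.
class Deferred<T> {
  promise: Promise<T>;
  resolve!: (value: T) => void;
  reject!: (reason: unknown) => void;

  constructor() {
    this.promise = new Promise<T>((resolve, reject) => {
      this.resolve = resolve;
      this.reject = reject;
    });
  }
}

// Hypothetical event shapes; only the event type names come from the RFC.
type WorkflowEvent =
  | { eventId: string; type: "step_completed"; correlationId: string; result: unknown }
  | { eventId: string; type: "step_failed"; correlationId: string; error: unknown }
  | { eventId: string; type: "hook_received"; correlationId: string; payload: unknown }
  | { eventId: string; type: "wait_completed"; correlationId: string };

// Settles the matching pending operation for each new event and returns the
// id of the last event seen, which would become the next snapshot's lastEventId.
function applyEvents(
  pending: Map<string, Deferred<unknown>>,
  events: WorkflowEvent[],
): string | undefined {
  let lastEventId: string | undefined;
  for (const event of events) {
    lastEventId = event.eventId;
    const deferred = pending.get(event.correlationId);
    if (!deferred) continue; // no pending operation is waiting on this event
    pending.delete(event.correlationId);
    switch (event.type) {
      case "step_completed":
        deferred.resolve(event.result);
        break;
      case "hook_received":
        deferred.resolve(event.payload);
        break;
      case "wait_completed":
        deferred.resolve(undefined);
        break;
      case "step_failed":
        deferred.reject(event.error);
        break;
    }
  }
  return lastEventId;
}
```

After this dispatch, the runtime would call vm.executePendingJobs() so the workflow code observes the settled promises.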
The host maintains a Map<correlationId, Deferred> to track which promises correspond to which steps/hooks/waits. The resolve/reject functions are stored on a QuickJS global object (e.g. globalThis.__resolvers[correlationId]) so they survive in the snapshot. After restore, the host reads __resolvers to rebuild the pending operations map.
Determinism
Determinism is still required even in the snapshot model because:
The deterministic context is provided by:
- quickjs-wasi's wasi.now() option for Date.now()/new Date()
- [...] Math.random replacement (seed: {runId}:{workflowName}:{startedAt})
Step Execution Model
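The deterministic context described under Determinism (a fixed clock plus a Math.random replacement seeded from {runId}:{workflowName}:{startedAt}) can be illustrated with a small standalone sketch. The RFC does not say which hash or PRNG is used, so the FNV-1a-style hash and the mulberry32 generator here are swapped-in assumptions, as are the function names and the choice of epoch milliseconds for startedAt.

```typescript
// FNV-1a-style 32-bit hash over UTF-16 code units: turns the seed string into an integer.
function hashSeed(seed: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < seed.length; i++) {
    h ^= seed.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

// mulberry32: a small PRNG that returns floats in [0, 1), like Math.random.
function mulberry32(state: number): () => number {
  let a = state >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical helper: builds the clock and random source for one run.
function createDeterministicContext(runId: string, workflowName: string, startedAt: number) {
  const random = mulberry32(hashSeed(`${runId}:${workflowName}:${startedAt}`));
  const now = () => startedAt; // fixed timestamp taken from run.startedAt
  return { random, now };
}
```

Two contexts built from the same run identity produce the same sequence of values, which is the property the runtime relies on.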
Step execution does not change. Only the workflow orchestration code (the "use workflow" function) runs in QuickJS. Steps continue to run with full Node.js access in the step handler, exactly as they do today.
In the workflow VM bundle, step function calls are replaced with globalThis[Symbol.for("WORKFLOW_USE_STEP")]("stepId") proxies (same SWC compiler output as today). The only difference is that these proxies are now host functions backed by vm.newFunction() in quickjs-wasi instead of being JS functions in a Node.js vm.Context.
world-local Implementation
Filesystem storage under {dataDir}/snapshots/:
Implementation Phases
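For the world-local layout described above (a .bin snapshot with a .json sidecar under {dataDir}/snapshots/), a rough sketch might look like this. The {runId}.bin and {runId}.json file names, the class name, and the synchronous API are assumptions chosen to keep the example short; the real storage would presumably be asynchronous, and the sketch assumes runId is safe to use as a file name.

```typescript
import { existsSync, mkdirSync, readFileSync, rmSync, writeFileSync } from "node:fs";
import { join } from "node:path";

// Hypothetical metadata shape; only lastEventId is named in the RFC.
interface SnapshotMetadata {
  lastEventId: string;
}

class LocalSnapshotFiles {
  private readonly dir: string;

  constructor(dataDir: string) {
    this.dir = join(dataDir, "snapshots");
    mkdirSync(this.dir, { recursive: true });
  }

  private binPath(runId: string): string {
    return join(this.dir, `${runId}.bin`); // assumed naming scheme
  }

  private metaPath(runId: string): string {
    return join(this.dir, `${runId}.json`); // sidecar next to the .bin file
  }

  save(runId: string, data: Uint8Array, metadata: SnapshotMetadata): void {
    writeFileSync(this.binPath(runId), data);
    writeFileSync(this.metaPath(runId), JSON.stringify(metadata));
  }

  load(runId: string): { data: Uint8Array; metadata: SnapshotMetadata } | null {
    const bin = this.binPath(runId);
    const meta = this.metaPath(runId);
    // Treat a missing file of either kind as "no snapshot".
    if (!existsSync(bin) || !existsSync(meta)) return null;
    const data = new Uint8Array(readFileSync(bin));
    const metadata = JSON.parse(readFileSync(meta, "utf8")) as SnapshotMetadata;
    return { data, metadata };
  }

  delete(runId: string): void {
    rmSync(this.binPath(runId), { force: true });
    rmSync(this.metaPath(runId), { force: true });
  }
}
```

The two writes are not atomic in this sketch; a production implementation would need to decide how to handle a crash between them.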
Phase 1: World Interface + world-local
- [...] snapshots to Storage interface in packages/world/src/interfaces.ts
- [...] SnapshotMetadata type
- [...] packages/world-local/src/storage/snapshots-storage.ts
- [...] createStorage()
Phase 2: Snapshot Runtime
- [...] packages/core/src/runtime/snapshot-runtime.ts
- [...] useStep, sleep, createHook
Phase 3: Wiring
- [...] workflowEntrypoint() to delegate to appropriate runtime
Phase 4: Testing
Phase 5: world-vercel + world-postgres
- [...] snapshots storage in world-vercel (S3 with metadata headers)
- [...] snapshots storage in world-postgres
Open Questions
Feature gating: Should this be opt-in per-workflow, per-project, or a global runtime mode?
Snapshot size: A fresh QuickJS VM snapshot is ~256 KB. With user data it grows with the heap. Users should apply compression (gzip/zstd) for storage — the WASM memory compresses very well (~60-70% reduction). Should the World implementation handle compression transparently?
QuickJS compatibility: The compiled workflow bundle currently targets Node.js vm.runInContext(). For QuickJS, we need to verify that Symbol.for() works correctly and that the compiled output is compatible. May need minor SWC plugin adjustments.
Snapshot lifecycle: When should snapshots be cleaned up? On run_completed, run_failed, and run_cancelled, the same as hook cleanup. Should there be a TTL for orphaned snapshots?
Recovery from snapshot corruption: If a snapshot fails to restore, the runtime should fall back to full event replay. Should this be automatic or explicit?
Stack size limits: QuickJS-NG disables JS_SetMaxStackSize on WASI, so deep recursion causes a WASM trap (not a catchable exception). Is this a concern for real workflow patterns?
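On the snapshot size question above, transparent compression in a World implementation could be as small as a pair of helpers around the stored bytes. This sketch uses gzip from Node's built-in zlib module; the helper names are hypothetical, and whether gzip or zstd is the better choice is left open, as in the RFC.

```typescript
import { gunzipSync, gzipSync } from "node:zlib";

// Hypothetical helper: compress snapshot bytes before handing them to storage.
function compressSnapshot(bytes: Uint8Array): Uint8Array {
  return gzipSync(bytes);
}

// Hypothetical helper: reverse of compressSnapshot, applied after loading.
function decompressSnapshot(bytes: Uint8Array): Uint8Array {
  return gunzipSync(bytes);
}

// Usage: a zero-filled 256 KB buffer stands in for a mostly empty WASM memory.
const rawSnapshot = new Uint8Array(256 * 1024);
const storedBytes = compressSnapshot(rawSnapshot);
const restoredBytes = decompressSnapshot(storedBytes);
console.log(rawSnapshot.length, storedBytes.length, restoredBytes.length);
```

Because the helpers sit entirely in the storage layer, the runtime would keep passing and receiving uncompressed bytes.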