testing fixes #60

Open
breardon2011 wants to merge 4 commits into main from fix/docs-testing-bugs
@breardon2011 breardon2011 commented Mar 11, 2026

  1. Vsock drain delay (snapshot.go)

After agent.Close(), FIN-ACK packets linger in the vsock TX virtqueue. If PauseVM() runs immediately, the vhost-vsock device state captured in the snapshot is corrupted, and
all connections hang on restore. Fix: a 500ms sleep between close and pause in all 4 snapshot paths (doHibernate, doSaveAsTemplate, CreateCheckpoint, PrepareGoldenSnapshot).
Also added diagnostic logging of vsock.sock state in doWake and waitForAgent (manager.go).

  2. Guest clock sync (snapshot.go)

The guest clock freezes at snapshot time. There is no NTP inside the VM, so nothing was correcting it. The old clock_delta_us approach never worked (Firecracker doesn't
support that field), so every sandbox had a stale clock that fell further behind with each hibernate/wake cycle. Fix: replaced the dead clock_delta_us system entirely. Removed
the clockDeltaUs parameter from LoadSnapshot, deleted the snapshotClockDeltaUs helper, and added a new syncGuestClock() that sets the guest clock via date -s through the
agent after every snapshot restore. Called in 7 places covering all paths (wake, cold boot, checkpoint resume, fork, golden create). Verified 0-1s drift across all
paths.

  3. Exec via control plane (exec_session.go)

POST /api/sandboxes/:id/exec/run on the control plane panicked because s.manager is nil in server mode and there was no fallback. Fix: new execRunRemote() that looks up the
sandbox's worker in the DB and forwards the exec over gRPC.

  4. Checkpoint delete FK constraint (store.go)

Deleting a checkpoint that forked sandboxes still reference returned 500 (FK violation). Fix: a transaction that sets based_on_checkpoint_id to NULL on the referencing rows
before deleting the checkpoint.

  5. context.Background()

In createFromGoldenSnapshot, the network reconfig and clock sync were using the HTTP request ctx. If the SDK client disconnects (timeout, user cancels, etc.) before
those steps finish, the context gets cancelled, leaving the sandbox with broken networking even though the VM is running fine.

The fix switches to context.Background() for those post-restore steps so they always complete regardless of what the client does.

vercel bot commented Mar 11, 2026

Project: opensandbox. Deployment: Ready. Actions: Preview, Comment. Updated (UTC): Mar 12, 2026 1:12am.

@breardon2011 breardon2011 marked this pull request as ready for review March 11, 2026 22:49
@breardon2011 breardon2011 marked this pull request as draft March 12, 2026 00:30
@breardon2011 breardon2011 marked this pull request as ready for review March 12, 2026 01:13