Consistent network fixes #44

Closed
breardon2011 wants to merge 2 commits into main from consistent-network-fixes

breardon2011 commented Mar 6, 2026

Sandbox Infrastructure Changes

1. Clock Skew Fix (Golden Snapshots)

Problem:
VMs restored from snapshots resumed with their clocks frozen at snapshot time, so the guest lagged behind real time and apt-get update failed (“release files from the future”).

Root Cause:
Firecracker LoadSnapshot supports clock_delta_us, but we never passed it.

Fix:

  • Added SnapshotedAt time.Time to SnapshotMeta

  • Added snapshotClockDeltaUs() to compute elapsed time since snapshot

    • Falls back to mem file mtime for legacy snapshots
  • Pass clock_delta_us to Firecracker on all restore paths:

    • Hibernate wake
    • Checkpoint restore
    • Warm fork
    • Golden snapshot create

Files:

  • internal/firecracker/snapshot.go
  • internal/firecracker/api.go

2. eth0 Network Fixes (Checkpoint Resume)

Problem:
Preview URLs stopped working after checkpoints because eth0 remained down.

Root Cause:
Checkpoint process brought eth0 down but never restored it.

Fix:

  • Restore network via reconfigureGuestNetwork() after agent reconnect

  • Restore eth0 on error paths:

    • PauseVM failure
    • drive copy failure
  • Consistent network quiesce:

    • ip addr flush dev eth0
    • ip link set eth0 down

File:

  • internal/firecracker/snapshot.go

3. Worker TLS via Cloudflare

Problem:
Workers were exposed over plain HTTP (http://IP:8080).

Solution:
Cloudflare proxy with origin certificates.

Flow:
CLI → https://w-*.workers.opencomputer.dev/
→ worker :443 (Cloudflare TLS)
Control plane → http://private_ip:8080

Changes:

  • Worker serves HTTPS on :443 if origin certs exist
  • EC2 bootstrap registers Cloudflare DNS record
  • Control plane routes using private IP
  • Autoscaler injects cloudflare.env
  • CLI WebSocket sends ping every 30s (avoid Cloudflare idle timeout)

Files:

  • cmd/worker/main.go
  • cmd/server/main.go
  • deploy/ec2/setup-instance.sh
  • internal/compute/ec2.go
  • internal/config/config.go
  • internal/controlplane/redis_registry.go
  • internal/proxy/controlplane_proxy.go
  • internal/worker/http_server.go
  • cmd/oc/internal/commands/shell.go

4. CLI Cleanup

  • Removed connectURL from human-readable CLI output
  • Still available in JSON output for SDK

File:

  • cmd/oc/internal/commands/sandbox.go


breardon2011 marked this pull request as ready for review on March 6, 2026, 21:08
This was referenced Mar 7, 2026
motatoes closed this on Mar 12, 2026