TLS #53

Closed
breardon2011 wants to merge 4 commits into main from tls

breardon2011 (Contributor) commented Mar 10, 2026

Summary

Add self-contained TLS/DNS system using Let's Encrypt wildcard certs and Route53. Each regional deployment manages its own TLS without external proxy dependencies — SDK clients connect directly to workers over HTTPS.

Architecture

  • Server obtains *.{domain} wildcard cert via ACME DNS-01 challenge + Route53, uploads cert/key to S3, renews every 12h
  • Workers download shared cert from S3, serve TLS on :443, keep HTTP on :8080 for VPC-internal traffic
  • DNS: Workers register w-{id}.{domain} A records in Route53 on boot, delete on shutdown
  • Control plane returns https://w-xxx.domain as connectURL — SDK connects direct to worker
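To make the naming scheme concrete, here is a minimal sketch of how a w-{id}.{domain} host and its connectURL could be built, and why a single *.{domain} certificate covers every worker: a wildcard stands in for exactly one leftmost DNS label. The helper names (workerHost, connectURL, coveredByWildcard) and the example domain are hypothetical and not taken from this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// workerHost builds the per-worker DNS name in the w-{id}.{domain} form.
// The helper name is hypothetical.
func workerHost(id, domain string) string {
	return fmt.Sprintf("w-%s.%s", id, domain)
}

// connectURL is the HTTPS URL the control plane would hand to SDK clients.
func connectURL(id, domain string) string {
	return "https://" + workerHost(id, domain)
}

// coveredByWildcard reports whether host matches a "*.{domain}" certificate
// name. The wildcard replaces exactly one leftmost label, so
// "w-abc.example.com" matches "*.example.com" but "a.b.example.com" does not.
func coveredByWildcard(host, domain string) bool {
	suffix := "." + domain
	if !strings.HasSuffix(host, suffix) {
		return false
	}
	label := strings.TrimSuffix(host, suffix)
	return label != "" && !strings.Contains(label, ".")
}

func main() {
	const domain = "sandbox.example.com" // placeholder domain
	host := workerHost("abc123", domain)
	fmt.Println(connectURL("abc123", domain))
	fmt.Println(host, "covered:", coveredByWildcard(host, domain))
	fmt.Println("a.b."+domain, "covered:", coveredByWildcard("a.b."+domain, domain))
}
```

Because worker names stay one label deep, adding workers never requires issuing a new certificate; a nested name such as a.b.{domain} would need its own.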

HA hardening

  • Cert before workers: Server blocks on ObtainOrRenew() before autoscaler starts — cert is in S3 before any worker launches
  • Retry with backoff: Workers retry S3 cert fetch 3x with exponential backoff (handles transient S3 issues)
  • DNS verification: Worker identity script verifies DNS resolution (up to 60s) before worker process starts
  • Health endpoint: Reports tls: ok|expiring_soon|expired|no_cert with expiry time, returns 503 when degraded
  • Stale DNS cleanup: Server-side periodic job (every 5min) removes Route53 A records for workers no longer in Redis registry (handles crash/termination without graceful shutdown)
  • Cert refresh: Workers re-fetch cert from S3 every 12h to pick up renewals
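The health endpoint's four states can be sketched as a pure function of the time left on the certificate. This is an illustration only: the 7-day expiring_soon window and the choice to return 503 only for expired and no_cert are assumptions, since the PR does not say which states count as degraded, and the function names are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// expiringSoonWindow is an assumed threshold; the PR does not state one.
const expiringSoonWindow = 7 * 24 * time.Hour

// tlsStatus maps the time remaining on the certificate to one of the four
// states named in the PR. hasCert is false when no certificate is loaded.
func tlsStatus(hasCert bool, remaining time.Duration) string {
	switch {
	case !hasCert:
		return "no_cert"
	case remaining <= 0:
		return "expired"
	case remaining < expiringSoonWindow:
		return "expiring_soon"
	default:
		return "ok"
	}
}

// httpCode assumes only "expired" and "no_cert" count as degraded.
func httpCode(status string) int {
	switch status {
	case "expired", "no_cert":
		return http.StatusServiceUnavailable
	default:
		return http.StatusOK
	}
}

// report formats one health line including the expiry time.
func report(hasCert bool, notAfter, now time.Time) string {
	status := tlsStatus(hasCert, notAfter.Sub(now))
	return fmt.Sprintf("tls: %s expires=%s http=%d",
		status, notAfter.Format(time.RFC3339), httpCode(status))
}

func main() {
	now := time.Now()
	fmt.Println(report(true, now.Add(60*24*time.Hour), now))
	fmt.Println(report(true, now.Add(2*24*time.Hour), now))
	fmt.Println(report(true, now.Add(-time.Hour), now))
	fmt.Println(report(false, time.Time{}, now))
}
```

Keeping the classification free of I/O makes the thresholds easy to unit test; whether expiring_soon should also fail the health check is a policy choice left open here.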

Configuration required

  • Route53 hosted zone for the sandbox domain
  • Server env vars: OPENSANDBOX_ROUTE53_HOSTED_ZONE_ID, OPENSANDBOX_ACME_EMAIL
  • Security group: open port 443 inbound
  • IAM: route53:ChangeResourceRecordSets, route53:ListResourceRecordSets, route53:GetChange for server; route53:ChangeResourceRecordSets + s3:GetObject for workers
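As a rough illustration of the server-side settings above, the sketch below reads the two listed environment variables and fails fast when either is missing. The struct and function names are hypothetical and not taken from internal/config, and treating both variables as mandatory is an assumption based on the "Configuration required" heading.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// tlsConfig holds the two server settings listed above (illustrative type).
type tlsConfig struct {
	HostedZoneID string
	ACMEEmail    string
}

// loadTLSConfig reads settings through getenv so callers can pass os.Getenv
// in production and a map-backed lookup in tests.
func loadTLSConfig(getenv func(string) string) (tlsConfig, error) {
	cfg := tlsConfig{
		HostedZoneID: strings.TrimSpace(getenv("OPENSANDBOX_ROUTE53_HOSTED_ZONE_ID")),
		ACMEEmail:    strings.TrimSpace(getenv("OPENSANDBOX_ACME_EMAIL")),
	}
	var missing []string
	if cfg.HostedZoneID == "" {
		missing = append(missing, "OPENSANDBOX_ROUTE53_HOSTED_ZONE_ID")
	}
	if cfg.ACMEEmail == "" {
		missing = append(missing, "OPENSANDBOX_ACME_EMAIL")
	}
	if len(missing) > 0 {
		return tlsConfig{}, fmt.Errorf("missing required env vars: %s",
			strings.Join(missing, ", "))
	}
	return cfg, nil
}

// loadTLSConfigFromEnv reads the process environment.
func loadTLSConfigFromEnv() (tlsConfig, error) {
	return loadTLSConfig(os.Getenv)
}

func main() {
	cfg, err := loadTLSConfigFromEnv()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("zone=%s email=%s\n", cfg.HostedZoneID, cfg.ACMEEmail)
}
```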

New packages

  • internal/certmanager/ — server-side ACME manager + worker-side S3 cert fetcher
  • internal/dns/ — Route53 client (A records, TXT records, listing)
  • internal/controlplane/dns_cleaner.go — stale DNS record cleanup
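The core of a stale-record cleaner is a set difference between the worker A records in the zone and the worker IDs in the registry. The sketch below shows only that step, under the assumption that the registry ID is the same {id} used in w-{id}.{domain}. The function names are hypothetical, and the Route53 and Redis calls are left out entirely.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// staleRecords returns the worker record names present in DNS whose worker ID
// is not in the live registry. Names that do not match w-{id}.{domain} are
// skipped so unrelated records in the zone are never selected for deletion.
func staleRecords(dnsNames []string, liveIDs map[string]bool, domain string) []string {
	suffix := "." + domain
	var stale []string
	for _, name := range dnsNames {
		// Route53 returns fully qualified names with a trailing dot.
		name = strings.TrimSuffix(name, ".")
		if !strings.HasPrefix(name, "w-") || !strings.HasSuffix(name, suffix) {
			continue
		}
		id := strings.TrimSuffix(strings.TrimPrefix(name, "w-"), suffix)
		if id == "" || strings.Contains(id, ".") {
			continue
		}
		if !liveIDs[id] {
			stale = append(stale, name)
		}
	}
	sort.Strings(stale)
	return stale
}

// summary renders the result for logging.
func summary(stale []string) string {
	return fmt.Sprintf("%d stale record(s): %s", len(stale), strings.Join(stale, ", "))
}

func main() {
	names := []string{
		"w-a1.sandbox.example.com.",
		"w-b2.sandbox.example.com.",
		"api.sandbox.example.com.",
	}
	live := map[string]bool{"a1": true}
	fmt.Println(summary(staleRecords(names, live, "sandbox.example.com")))
}
```

One thing a real cleaner would likely need on top of this is a grace period, so that a worker that has registered its DNS record but has not yet appeared in the registry is not removed.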

Modified

  • internal/config/ — new env vars: ROUTE53_HOSTED_ZONE_ID, ACME_EMAIL, CERT_S3_BUCKET, CERT_S3_PREFIX
  • internal/compute/ec2.go — passes Route53 config in EC2 user data for worker DNS registration
  • internal/controlplane/redis_registry.go — InternalHTTPAddr() for VPC-internal routing
  • internal/proxy/controlplane_proxy.go — uses VPC-internal HTTP for control-plane→worker traffic
  • internal/worker/http_server.go — removed /caddy/check, added StartTLSWithCert(), cert health reporting
  • cmd/server/main.go — cert manager init, DNS cleaner
  • cmd/worker/main.go — cert fetcher, HTTPS on :443
  • deploy/ec2/setup-instance.sh — removed Caddy, added Route53 DNS registration/cleanup/verification

vercel bot commented Mar 10, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| opensandbox | Ready | Preview, Comment | Mar 10, 2026 4:53pm |

breardon2011 changed the title from "wip tls" to "TLS" on Mar 10, 2026
breardon2011 marked this pull request as ready for review on March 10, 2026 at 16:40
@github-actions

Preview Environment Destroyed

The preview environment dev-pr-53 has been torn down.
All AWS resources for this environment have been cleaned up.
