Problem
The Sept 30 workshop revealed that the private ECR repository can fall out of sync with the upstream ghcr.io/coder/coder-preview
image, leading to image version mismatches between the control plane (us-east-2) and proxy clusters (us-west-2, eu-west-2).
Context
- The platform uses ghcr.io/coder/coder-preview (not stable coder/coder) to access beta AI features
- This image is mirrored to a private AWS ECR repository
- During the workshop, the ECR mirror was out of sync, causing subdomain routing failures
- Manual sync process is error-prone and doesn't scale
Current Manual Process
# Pull from GitHub Container Registry
docker pull ghcr.io/coder/coder-preview:latest
# Tag for ECR
docker tag ghcr.io/coder/coder-preview:latest <aws-account-id>.dkr.ecr.us-east-2.amazonaws.com/coder-preview:latest
# Authenticate with ECR
aws ecr get-login-password --region us-east-2 | docker login --username AWS --password-stdin <aws-account-id>.dkr.ecr.us-east-2.amazonaws.com
# Push to ECR
docker push <aws-account-id>.dkr.ecr.us-east-2.amazonaws.com/coder-preview:latest
# Restart Coder pods in all regions
kubectl rollout restart deployment/coder -n coder --context=us-east-2
kubectl rollout restart deployment/coder -n coder --context=us-west-2
kubectl rollout restart deployment/coder -n coder --context=eu-west-2
Requirements
Automated Image Mirroring
- Implement an automated job to sync ghcr.io/coder/coder-preview to ECR
- Run the sync daily on a schedule, or whenever a new image is pushed
- Use GitHub Actions, AWS Lambda, or similar automation
- Include digest/tag verification to ensure successful sync
- Notify team on sync failures
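The sync-and-verify requirement above could be sketched as follows. This is a minimal illustration, not a finished job: it assumes the `crane` CLI (from go-containerregistry) is installed, which the current manual process does not use, and it swaps `docker pull`/`docker push` for `crane copy` so that a multi-platform image keeps the same digest on both sides. The function names `ecr_login` and `mirror_image` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: assumes `crane` and the AWS CLI are installed and have credentials.

# ecr_login REGION REGISTRY: log crane in to ECR (same idea as the manual docker login).
ecr_login() {
  aws ecr get-login-password --region "$1" \
    | crane auth login "$2" --username AWS --password-stdin
}

# mirror_image SRC DST: copy SRC to DST, then verify both refs resolve to one digest.
# Prints the verified digest on success; returns non-zero on any failure.
mirror_image() {
  local src="$1" dst="$2" src_digest dst_digest
  src_digest="$(crane digest "$src")" || return 1
  crane copy "$src" "$dst" || return 1
  dst_digest="$(crane digest "$dst")" || return 1
  if [ "$src_digest" != "$dst_digest" ]; then
    echo "digest mismatch: source=$src_digest mirror=$dst_digest" >&2
    return 1
  fi
  echo "$src_digest"
}

# Example call (account ID left as a placeholder, as in the manual steps):
#   ecr_login us-east-2 "<aws-account-id>.dkr.ecr.us-east-2.amazonaws.com"
#   mirror_image ghcr.io/coder/coder-preview:latest \
#     "<aws-account-id>.dkr.ecr.us-east-2.amazonaws.com/coder-preview:latest"
```

Because `mirror_image` returns non-zero on a mismatch, whichever scheduler runs it (GitHub Actions, CronJob, or similar) can treat the exit status as the sync result.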
Image Consistency Validation
- Add pre-deployment validation to verify image digests match across:
  - GHCR source
  - Private ECR mirror
  - us-east-2 control plane
  - us-west-2 proxy cluster
  - eu-west-2 proxy cluster
- Block deployments if image inconsistencies are detected
- Add this validation to the pre-workshop checklist (see #4, Create pre-workshop validation checklist and runbook)
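One possible shape for the digest comparison is sketched below. It reads the digest behind a tag in ECR, collects the digests the running Coder pods report in each cluster, and fails if any cluster differs. The label selector `app.kubernetes.io/name=coder` and the function names are assumptions, not taken from the platform's manifests; the namespace and context names follow the manual restart commands. Depending on the container runtime, the digest a pod reports may not be directly comparable to the registry digest for multi-platform images, so treat this as a starting point to verify.

```shell
#!/usr/bin/env bash
# Sketch only: the label selector below is an assumption about how the pods are labeled.

# ecr_digest REPOSITORY TAG REGION: digest that a tag currently points to in ECR.
ecr_digest() {
  aws ecr describe-images --repository-name "$1" --image-ids "imageTag=$2" \
    --region "$3" --query 'imageDetails[0].imageDigest' --output text
}

# running_digests CONTEXT: unique image digests reported by Coder pods in one cluster.
running_digests() {
  kubectl get pods -n coder --context="$1" \
    -l app.kubernetes.io/name=coder \
    -o jsonpath='{range .items[*]}{range .status.containerStatuses[*]}{.imageID}{"\n"}{end}{end}' \
    | sed -n 's/.*@\(sha256:[0-9a-f]*\)$/\1/p' | sort -u
}

# check_consistency EXPECTED_DIGEST CONTEXT...: returns non-zero if any cluster
# runs something other than exactly EXPECTED_DIGEST (including no pods at all).
check_consistency() {
  local expected="$1" context actual status=0
  shift
  for context in "$@"; do
    actual="$(running_digests "$context")"
    if [ "$actual" != "$expected" ]; then
      echo "MISMATCH in $context: running [$(echo $actual)], expected $expected" >&2
      status=1
    fi
  done
  return "$status"
}

# Example call:
#   expected="$(ecr_digest coder-preview latest us-east-2)"
#   check_consistency "$expected" us-east-2 us-west-2 eu-west-2
```

The GHCR source digest is not fetched here; it would need a registry client, and comparing it against the output of `ecr_digest` covers the first two items in the list above.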
Workspace Image Management
- Document which workspace template images are stored in ECR:
  - Build from Scratch w/ Claude
  - Build from Scratch w/ Goose
- Document which use public registries:
  - Real World App w/ Claude (uses codercom/example-universal:ubuntu from Docker Hub)
- Consider mirroring workspace images to ECR for consistency
Rollback Strategy
- Document the rollback procedure for a bad mirrored image
- Implement an image tagging strategy (not just latest)
- Consider using immutable tags or digests in deployments
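As a rough illustration of the tagging and rollback ideas above: each mirrored image could receive a dated tag derived from its digest, and an emergency rollback could pin the deployment to a known-good digest. The tag format, the function names, and the container name `coder` are all assumptions. If the deployment is managed through Terraform or Helm, a direct `kubectl set image` causes drift and would only suit a short-lived emergency fix.

```shell
#!/usr/bin/env bash
# Sketch only: tag format and container name "coder" are illustrative assumptions.

# snapshot_tag DIGEST [YYYYMMDD]: build a tag name from the date and digest prefix.
snapshot_tag() {
  local digest="$1" day="${2:-$(date -u +%Y%m%d)}"
  local hex="${digest#sha256:}"
  # %.12s keeps only the first 12 characters of the digest
  printf '%s-%.12s\n' "$day" "$hex"
}

# pin_deployment CONTEXT REPO DIGEST: point deployment/coder at an exact digest
# and wait for the rollout to finish.
pin_deployment() {
  local context="$1" repo="$2" digest="$3"
  kubectl set image deployment/coder "coder=${repo}@${digest}" \
    -n coder --context="$context" || return 1
  kubectl rollout status deployment/coder -n coder --context="$context" --timeout=5m
}

# Example calls:
#   snapshot_tag "sha256:<digest>"
#   pin_deployment us-west-2 \
#     "<aws-account-id>.dkr.ecr.us-east-2.amazonaws.com/coder-preview" "sha256:<digest>"
```

Pinning by digest rather than by tag means a later push to latest cannot change what a cluster runs, which is the property the immutable-tag requirement is after.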
Success Criteria
- ECR mirror automatically syncs with GHCR without manual intervention
- Image consistency validated before every workshop
- All clusters always run identical image digests
- Zero subdomain routing failures due to image mismatch
- Clear documentation for emergency manual sync if automation fails
Implementation Options
Option 1: GitHub Actions
- Trigger on new release of coder/coder-preview
- Pull, tag, push to ECR
- Create PR to update image references in Terraform
Option 2: AWS Lambda + EventBridge
- Scheduled Lambda function (daily)
- Pull latest from GHCR, push to ECR
- Send SNS notification on failure
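The failure-notification step could look roughly like the wrapper below, shown in shell for consistency with the manual process rather than as Lambda handler code. It runs any sync command and publishes to an SNS topic only when that command fails. The function name `run_with_alert` and the topic ARN are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: assumes the AWS CLI has permission to publish to the given topic.

# run_with_alert TOPIC_ARN COMMAND [ARGS...]: run COMMAND, publish an SNS message
# if it fails, and return COMMAND's exit status either way.
run_with_alert() {
  local topic_arn="$1" status=0
  shift
  "$@" || status=$?
  if [ "$status" -ne 0 ]; then
    aws sns publish --topic-arn "$topic_arn" \
      --subject "ECR mirror sync failed" \
      --message "Command '$*' exited with status $status at $(date -u +%Y-%m-%dT%H:%M:%SZ)" \
      >/dev/null
  fi
  return "$status"
}

# Example call (ARN is a placeholder):
#   run_with_alert "arn:aws:sns:us-east-2:<aws-account-id>:<topic-name>" ./sync.sh
```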
Option 3: Kubernetes CronJob
- Run in us-east-2 cluster
- Use service account with ECR push permissions
- Monitor via existing Kubernetes alerting
Related
Sept 30 Workshop Postmortem
#2 (Image management standardization)
Incident Runbook - Subdomain Routing Failures
Incident Runbook - Image Pull Failures