|
| 1 | +# Fly.io Scale-Down Recovery Plan |
| 2 | + |
| 3 | +Recovery plan for kcd production when LiteFS/deployment issues cause cascading failures. Scale to a single instance, deploy the fix, verify, then scale back up. |
| 4 | + |
| 5 | +## Prerequisites |
| 6 | + |
| 7 | +- Fly CLI logged in (`fly auth whoami`) |
| 8 | +- FLY_MACHINE_ID fix committed and ready to deploy |
| 9 | + |
| 10 | +## Step 1: Scale Down |
| 11 | + |
| 12 | +Destroy all replica machines. **Keep the primary in dfw** (7817602a936548). Confirm the IDs below against `fly machines list -a kcd` first, since they may have changed. |
| 13 | + |
| 14 | +```bash |
| 15 | +fly machine destroy 0802499c556068 -a kcd # dfw replica |
| 16 | +fly machine destroy 48e3624a76dd28 -a kcd # gru |
| 17 | +fly machine destroy 7811e97b0e2978 -a kcd # jnb |
| 18 | +fly machine destroy 7843929b121938 -a kcd # ams |
| 19 | +fly machine destroy 784e4d3fd595e8 -a kcd # sin |
| 20 | +fly machine destroy 865139bee15658 -a kcd # bom |
| 21 | +fly machine destroy d890de3fe99298 -a kcd # syd |
| 22 | +fly machine destroy e822040fee7398 -a kcd # cdg |
| 23 | +``` |
| 24 | + |
| 25 | +**Do NOT destroy 7817602a936548** — that is the dfw primary with the volume. |
| 26 | + |
| 27 | +## Step 2: Deploy the Fix |
| 28 | + |
| 29 | +The build requires `SENTRY_AUTH_TOKEN` as a build secret and `COMMIT_SHA` as a build arg: |
| 30 | + |
| 31 | +```bash |
| 32 | +fly deploy -a kcd \ |
| 33 | + --build-arg COMMIT_SHA="$(git rev-parse HEAD)" \ |
| 34 | + --build-secret SENTRY_AUTH_TOKEN="$SENTRY_AUTH_TOKEN" |
| 35 | +``` |
| 36 | + |
| 37 | +Or push to `main` and let the GitHub Action deploy (it has access to the secret). |
| 38 | + |
| 39 | +This updates the single remaining machine with the new image (including the FLY_MACHINE_ID fix). |
| 40 | + |
| 41 | +## Step 3: Verify |
| 42 | + |
| 43 | +- Check status: `fly status -a kcd` |
| 44 | +- Check health: `fly checks list -a kcd` |
| 45 | +- Test site: `curl -I https://kentcdodds.com/healthcheck` |
| 46 | +- Check logs: `fly logs -a kcd` (the ZodError should no longer appear) |
| 47 | + |
| 48 | +## Step 4: Scale Back Up |
| 49 | + |
| 50 | +Run deploy again to recreate replicas: |
| 51 | + |
| 52 | +```bash |
| 53 | +fly deploy -a kcd |
| 54 | +``` |
| 55 | + |
| 56 | +`fly deploy` may recreate machines based on the previous configuration, but this is not guaranteed. If replicas are not recreated, clone the primary into each missing region with `fly machine clone 7817602a936548 --region <region> -a kcd` or via the Fly dashboard. |
| 57 | + |
| 58 | +## Notes |
| 59 | + |
| 60 | +- **Downtime**: Brief downtime during scale-down and deploy is expected. |
| 61 | +- **Primary**: The primary holds the LiteFS volume; it must stay in dfw (`primary_region` in `fly.toml`). |
| 62 | +- **Machine IDs**: Run `fly machines list -a kcd` to get current IDs before scaling down — they may change between runs. |
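
Since machine IDs can change between runs, the Step 1 commands may be safer to generate than to copy. The following is a minimal sketch, not part of the original plan: the function name `replica_destroy_commands` and its input format (one `machine_id region` pair per line on stdin) are assumptions made for illustration. It only prints the commands so they can be reviewed before anything is destroyed.

```bash
#!/usr/bin/env bash
# Sketch: print (do not run) a destroy command for every machine except the primary.
# Assumed input on stdin: one "machine_id region" pair per line, prepared by hand
# from the output of `fly machines list -a kcd`.

replica_destroy_commands() {
  # $1 = primary machine ID to keep, $2 = app name
  awk -v primary="$1" -v app="$2" '
    # Skip blank or malformed lines and the primary itself
    NF >= 2 && $1 != primary {
      printf "fly machine destroy %s -a %s # %s\n", $1, app, $2
    }
  '
}

# Example (prints one line, for the gru replica only):
#   printf "%s\n" "7817602a936548 dfw" "48e3624a76dd28 gru" \
#     | replica_destroy_commands 7817602a936548 kcd
```

Review the printed commands, confirm the primary ID is absent, and only then run them by hand.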