Commit 562b9f8

fix: simplify healthcheck to avoid Fly timeout under load
- Remove getBlogReadRankings and self-fetch from healthcheck
- Heavy checks were causing 5s Fly healthcheck timeouts
- Add fly-scale-down-recovery-plan.md with COMMIT_SHA build arg

Co-authored-by: Cursor <cursoragent@cursor.com>
1 parent 606569b commit 562b9f8

File tree

2 files changed: +65 −14 lines changed

app/routes/healthcheck.tsx

Lines changed: 3 additions & 14 deletions
```diff
@@ -1,22 +1,11 @@
-import { getBlogReadRankings } from '#app/utils/blog.server.ts'
 import { prisma } from '#app/utils/prisma.server.ts'
 import { type Route } from './+types/healthcheck'
 
 export async function loader({ request }: Route.LoaderArgs) {
-	const host =
-		request.headers.get('X-Forwarded-Host') ?? request.headers.get('host')
-
 	try {
-		await Promise.all([
-			prisma.user.count(),
-			getBlogReadRankings({ request }),
-			fetch(`${new URL(request.url).protocol}${host}`, {
-				method: 'HEAD',
-				headers: { 'x-healthcheck': 'true' },
-			}).then((r) => {
-				if (!r.ok) return Promise.reject(r)
-			}),
-		])
+		// Minimal check: DB connectivity. Heavy checks (getBlogReadRankings, self-fetch)
+		// were causing 5s Fly healthcheck timeouts under load.
+		await prisma.user.count()
 		return new Response('OK')
 	} catch (error: unknown) {
 		console.error(request.url, 'healthcheck ❌', { error })
```
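The slimmed-down endpoint can be probed the same way Fly probes it, with an explicit 5-second budget. A minimal sketch; the `localhost:3000` URL is an assumption for local testing, not from the commit:

```shell
#!/bin/sh
# Sketch: probe a healthcheck URL under the same 5s budget Fly enforces.
# HEALTHCHECK_URL is an assumption; point it at wherever the app is running.
HEALTHCHECK_URL=${HEALTHCHECK_URL:-http://localhost:3000/healthcheck}

check() {
  # --max-time caps the whole request at 5s, mirroring the Fly check timeout;
  # -f makes curl exit non-zero on HTTP error statuses
  if curl -sf --max-time 5 "$1" > /dev/null; then
    echo "healthcheck OK"
  else
    echo "healthcheck FAILED (non-2xx or >5s)"
  fi
}

check "$HEALTHCHECK_URL"
```

If this ever prints FAILED while the database is reachable, the timeout budget itself is the suspect, which is exactly what this commit works around.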
fly-scale-down-recovery-plan.md

Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@

# Fly.io Scale-Down Recovery Plan

Recovery plan for kcd production when LiteFS/deployment issues cause cascading failures. Scale to a single instance, deploy the fix, verify, then scale back up.

## Prerequisites

- Fly CLI logged in (`fly auth whoami`)
- FLY_MACHINE_ID fix committed and ready to deploy

## Step 1: Scale Down

Destroy all replica machines. **Keep the primary in dfw** (7817602a936548).

```bash
fly machine destroy 0802499c556068 -a kcd # dfw replica
fly machine destroy 48e3624a76dd28 -a kcd # gru
fly machine destroy 7811e97b0e2978 -a kcd # jnb
fly machine destroy 7843929b121938 -a kcd # ams
fly machine destroy 784e4d3fd595e8 -a kcd # sin
fly machine destroy 865139bee15658 -a kcd # bom
fly machine destroy d890de3fe99298 -a kcd # syd
fly machine destroy e822040fee7398 -a kcd # cdg
```

**Do NOT destroy 7817602a936548** — that is the dfw primary with the volume.
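The per-machine commands above can be collapsed into a dry-run loop that refuses to touch the primary. A sketch, not part of the commit; the machine IDs are the ones listed above and may be stale, so re-check them and drop the `echo` only when ready to execute:

```shell
#!/bin/sh
# Sketch (dry run): print the destroy command for every machine except the
# primary. IDs are copied from Step 1; refresh with `fly machines list -a kcd`.
PRIMARY=7817602a936548
ALL_MACHINES="0802499c556068 48e3624a76dd28 7811e97b0e2978 7843929b121938 \
784e4d3fd595e8 865139bee15658 d890de3fe99298 e822040fee7398 $PRIMARY"

plan_destroy() {
  for id in $ALL_MACHINES; do
    if [ "$id" = "$PRIMARY" ]; then
      # never emit a destroy for the machine holding the LiteFS volume
      echo "# keeping primary $id"
    else
      echo "fly machine destroy $id -a kcd"
    fi
  done
}

plan_destroy
```

Keeping the destroy behind an `echo` makes the skip-the-primary guard reviewable before anything irreversible runs.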
26+
27+
## Step 2: Deploy the Fix
28+
29+
The build requires `SENTRY_AUTH_TOKEN` as a build secret and `COMMIT_SHA` as a build arg:
30+
31+
```bash
32+
fly deploy -a kcd \
33+
--build-arg COMMIT_SHA=$(git rev-parse HEAD) \
34+
--build-secret SENTRY_AUTH_TOKEN=$SENTRY_AUTH_TOKEN
35+
```
36+
37+
Or push to `main` and let the GitHub Action deploy (it has access to the secret).
38+
39+
This updates the single remaining machine with the new image (including the FLY_MACHINE_ID fix).
## Step 3: Verify

- Check status: `fly status -a kcd`
- Check health: `fly checks list -a kcd`
- Test site: `curl -I https://kentcdodds.com/healthcheck`
- Check logs: `fly logs -a kcd` (look for ZodError — should be gone)
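The single machine may take a moment to come up after the deploy, so the `curl` check above is worth polling rather than running once. A small retry helper, sketched here as an assumption about workflow rather than anything in the commit:

```shell
#!/bin/sh
# Sketch: run a check command until it succeeds or attempts run out.
wait_healthy() {
  cmd=$1
  attempts=${2:-10}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # run the caller-supplied check quietly; success ends the loop
    if sh -c "$cmd" > /dev/null 2>&1; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "still unhealthy after $attempts attempt(s)"
  return 1
}

# Example, using the same curl check as the list above:
#   wait_healthy 'curl -sfI --max-time 5 https://kentcdodds.com/healthcheck' 30
```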
## Step 4: Scale Back Up

Run deploy again to recreate replicas:

```bash
fly deploy -a kcd
```

Fly may recreate machines based on previous configuration. If replicas are not recreated, you may need to clone the primary to other regions via the Fly dashboard or `fly machine clone`.
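If cloning turns out to be necessary, the region list from Step 1 can drive it. A dry-run sketch, not from the commit; confirm the exact `fly machine clone` flags against `fly machine clone --help` before dropping the `echo`:

```shell
#!/bin/sh
# Sketch (dry run): print a clone command for each region that had a replica
# in Step 1. PRIMARY is the dfw machine that was kept.
PRIMARY=7817602a936548
REGIONS="dfw gru jnb ams sin bom syd cdg"

plan_clones() {
  for region in $REGIONS; do
    echo "fly machine clone $PRIMARY -a kcd --region $region"
  done
}

plan_clones
```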
## Notes

- **Downtime**: Brief downtime during scale-down and deploy is expected.
- **Primary**: The primary holds the LiteFS volume; it must stay in dfw (primary_region in fly.toml).
- **Machine IDs**: Run `fly machines list -a kcd` to get current IDs before scaling down — they may change between runs.
