Commit b8fc93d

fix(ci): add maintenance rollback safety net + emergency recovery workflow
- deploy-prod.yml: rollback maintenance on migration failure (if: failure())
- deploy-staging.yml: same rollback safety net
- NEW emergency-clear-maintenance.yml: one-click workflow_dispatch to clear stuck maintenance mode
- CHANGELOG: document 2026-02-28 production outage and fix

1 parent 49b9d19 commit b8fc93d

4 files changed: +93 -0 lines changed

.github/workflows/deploy-prod.yml

Lines changed: 14 additions & 0 deletions
@@ -62,8 +62,22 @@ jobs:
           fi

       - name: Migrate DB
+        id: migrate
         run: TURSO_TOKEN=${{ secrets.PROD_TURSO_TOKEN }} atlas migrate apply --dir "file://migrations" --env prod

+      - name: Rollback maintenance on migration failure
+        if: failure() && steps.migrate.outcome == 'failure'
+        uses: appleboy/ssh-action@v1
+        with:
+          host: 62.210.92.144
+          username: root
+          key: ${{ secrets.PROD_SSH_KEY }}
+          script: |
+            set -xe
+            cd zenao
+            git restore prod.backend.docker-compose.yml
+            docker compose -f prod.backend.docker-compose.yml up -d --build backend
+
       - name: Upgrade backend
         uses: appleboy/ssh-action@v1
         with:

.github/workflows/deploy-staging.yml

Lines changed: 15 additions & 0 deletions
@@ -62,7 +62,22 @@ jobs:
           fi

       - name: Migrate DB
+        id: migrate
         run: TURSO_TOKEN=${{ secrets.STAGING_TURSO_TOKEN }} atlas migrate apply --dir "file://migrations" --env staging
+
+      - name: Rollback maintenance on migration failure
+        if: failure() && steps.migrate.outcome == 'failure'
+        uses: appleboy/ssh-action@v1
+        with:
+          host: 51.159.98.163
+          username: root
+          key: ${{ secrets.STAGING_SSH_KEY }}
+          script: |
+            set -xe
+            cd zenao
+            git restore staging.docker-compose.yml
+            docker compose -f staging.docker-compose.yml up -d --build backend
+
       - name: Upgrade services
         uses: appleboy/ssh-action@v1
         with:

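Both deploy workflows now follow the same ordering: migrate, roll back only when the migration fails, and upgrade only when it succeeds. A minimal sketch of that control flow in plain shell, for reasoning about it outside GitHub Actions. The function names `run_migration`, `rollback_maintenance`, `upgrade_backend`, and `deploy`, and the `MIGRATION_SHOULD_FAIL` variable, are hypothetical placeholders and are not part of this commit.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the "Migrate DB" step.
# MIGRATION_SHOULD_FAIL exists only to simulate a failed migration.
run_migration() {
  [ "${MIGRATION_SHOULD_FAIL:-0}" = "0" ]
}

# Hypothetical stand-in for the rollback step
# (git restore + docker compose up in the real workflow).
rollback_maintenance() {
  echo "rollback: restoring compose file and rebuilding backend"
}

# Hypothetical stand-in for the "Upgrade backend" / "Upgrade services" step.
upgrade_backend() {
  echo "upgrade: deploying new backend"
}

# Mirrors the workflow ordering: the rollback runs only when the migration
# fails, and the upgrade runs only when it succeeds. The overall result is
# still a failure after a rollback, as it is for the workflow run.
deploy() {
  if ! run_migration; then
    rollback_maintenance
    return 1
  fi
  upgrade_backend
}
```

Calling `MIGRATION_SHOULD_FAIL=1 deploy` exercises the rollback path; leaving the variable unset exercises the normal upgrade path.
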
.github/workflows/emergency-clear-maintenance.yml

Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
+name: Emergency Clear Maintenance
+
+on:
+  workflow_dispatch:
+    inputs:
+      target:
+        description: "Target environment"
+        required: true
+        default: "prod"
+        type: choice
+        options:
+          - prod
+          - staging
+
+jobs:
+  clear-maintenance:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Clear maintenance mode (prod)
+        if: inputs.target == 'prod'
+        uses: appleboy/ssh-action@v1
+        with:
+          host: 62.210.92.144
+          username: root
+          key: ${{ secrets.PROD_SSH_KEY }}
+          script: |
+            set -xe
+            cd zenao
+            echo "=== Before restore ==="
+            grep -n "maintenance\|command" prod.backend.docker-compose.yml || true
+            git restore prod.backend.docker-compose.yml
+            echo "=== After restore ==="
+            grep -n "maintenance\|command" prod.backend.docker-compose.yml || true
+            docker compose -f prod.backend.docker-compose.yml up -d --build backend
+            sleep 10
+            docker logs backend --tail 10
+
+      - name: Clear maintenance mode (staging)
+        if: inputs.target == 'staging'
+        uses: appleboy/ssh-action@v1
+        with:
+          host: 51.159.98.163
+          username: root
+          key: ${{ secrets.STAGING_SSH_KEY }}
+          script: |
+            set -xe
+            cd zenao
+            echo "=== Before restore ==="
+            grep -n "maintenance\|command" staging.docker-compose.yml || true
+            git restore staging.docker-compose.yml
+            echo "=== After restore ==="
+            grep -n "maintenance\|command" staging.docker-compose.yml || true
+            docker compose -f staging.docker-compose.yml up -d --build backend
+            sleep 10
+            docker logs backend --tail 10
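
The recovery script prints `grep` output before and after `git restore` but tolerates any result (`|| true`), so confirming that maintenance mode is gone is left to whoever reads the run log. A minimal sketch of an explicit check that could follow the restore, assuming maintenance mode appears in the compose file as the literal string `--maintenance`. The CHANGELOG entry mentions a `--maintenance` mode, but the compose files are not part of this commit, so that form is an assumption. The helper names `maintenance_flag_present` and `assert_maintenance_cleared` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: exits 0 when the given compose file still contains
# the maintenance flag, non-zero otherwise.
# Assumption: maintenance mode shows up as the literal text "--maintenance".
maintenance_flag_present() {
  local compose_file="$1"
  # -q: no output, status only; -F: fixed string, not a regex;
  # "--" ends grep's option parsing so the dash-led pattern is not
  # mistaken for an option.
  grep -qF -- "--maintenance" "$compose_file"
}

# Hypothetical wrapper: returns non-zero (and says why on stderr) when the
# flag is still present, so a script running under `set -e` stops there.
assert_maintenance_cleared() {
  local compose_file="$1"
  if maintenance_flag_present "$compose_file"; then
    echo "maintenance flag still present in $compose_file" >&2
    return 1
  fi
  echo "maintenance flag cleared in $compose_file"
}
```

In the SSH script this would be called as `assert_maintenance_cleared prod.backend.docker-compose.yml` right after `git restore`, which makes the step fail visibly instead of passing with the flag still set.
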

CHANGELOG.md

Lines changed: 9 additions & 0 deletions
@@ -6,6 +6,15 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).

 ---

+## [Unreleased] — Production Incident Fix & Deploy Hardening
+
+### Fixed
+- **Production outage (2026-02-28)**: Backend stuck in `--maintenance` mode after deploy #22523329085 failed at migration step — `git restore` in "Upgrade backend" never ran, leaving all RPC endpoints returning `CodeUnavailable`
+- **Deploy workflow safety net**: Added "Rollback maintenance on migration failure" step to both `deploy-prod.yml` and `deploy-staging.yml` — if migration fails, `git restore` + container rebuild runs automatically via `if: failure()`
+- **Emergency recovery workflow**: New `emergency-clear-maintenance.yml` — one-click workflow_dispatch to clear stuck maintenance mode on prod or staging via SSH
+
+---
+
 ## [Unreleased] — Repository Hygiene Sprint

 ### Security
