Commit 251b2dd

docs(076): AKS node maintenance pod evictions — not a code bug

File: cloud/issues/076-aks-node-maintenance-pod-evictions
# Spike: AKS Node Maintenance Pod Evictions

**Issue:** 076
**Status:** Documented — no code fix needed
**Date:** 2026-03-31
**Reported by:** Isaiah (captions app stopped unexpectedly on US West during testing)

---
## Overview

**What this doc covers:** Investigation of the unexpected pod restarts on March 31, 2026 that caused the captions debug app (and other apps) to stop mid-session. Root cause: Azure AKS scheduled node maintenance, not a code bug.

**Why this doc exists:** The team will occasionally see pods restarting and apps disconnecting without any code changes being pushed. This documents the evidence so we don't waste time investigating a non-bug.

---
## Incident Timeline

**2026-03-31 ~00:28 UTC (5:28 PM Pacific)**

Isaiah was running `com.mentra.captions.debug` on US West (cluster 4965) to test the hot-path allocation hotfix (PR #2389). The captions app suddenly stopped — no error on the glasses; the captions just disappeared.
### Cloud Logs (US West)

```
[00:31:06] [AppManager] Resurrection failed for com.mentra.captions.debug:
           Webhook failed: Webhook failed after 2 attempts: Request failed with status code 503
[00:31:06] [AppManager] Sent app_stopped to mobile after resurrection failure
[00:31:07] [MicrophoneManager] Mic-off holddown complete, still no media subscriptions - turning mic off
```

The cloud detected the `app-ws` disconnect, tried to resurrect the app by sending a webhook to the captions debug server, got a 503 (server unavailable), and reported `app_stopped` to the mobile client.
### Was it a code deploy?

**No.** The most recent deploys to the captions repo:

```
$ gh api repos/Mentra-Community/LiveCaptionsOnSmartGlasses/actions/runs

2026-03-12 [beta]  Deploy captions-beta
2026-03-11 [debug] Deploy captions-debug
2026-03-11 [debug] Deploy captions-debug
2026-01-24 [debug] Deploy captions-debug
```

The last captions-debug deploy was **March 11** — 20 days before this incident. Nobody pushed any code.
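The 20-day gap is easy to sanity-check offline with shell date arithmetic (a quick sketch; assumes GNU `date`):

```shell
# Days between the last captions-debug deploy and the incident (GNU date assumed)
deploy=$(date -u -d 2026-03-11 +%s)
incident=$(date -u -d 2026-03-31 +%s)
echo $(( (incident - deploy) / 86400 ))   # 20
```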
### What actually happened

```
$ porter kubectl --cluster 4689 -- get events -n default --sort-by=.lastTimestamp

2m    Warning  RebootScheduled    node/aks-a4689qbpv-27613717-vmss00001n  Timeout when running plugin check_reboot.sh
2m    Warning  RedeployScheduled  node/aks-a4689qbpv-27613717-vmss00001n  Timeout when running plugin check_redeploy.sh
2m    Normal   Killing            pod/better-stack-collector-5lptb        Stopping container ebpf
2m    Normal   Killing            pod/better-stack-collector-gj24z        Stopping container collector
89s   Normal   Killing            pod/better-stack-collector-pzm7z        Stopping container collector
73s   Warning  BackOff            pod/soga-dev-soga-56b484fd75-82wjc      Back-off restarting failed container
2s    Normal   Killing            pod/camera-photo-dev-...                Stopping container
```

**Azure AKS was doing scheduled node maintenance** on node `vmss00001n`. The `RebootScheduled` and `RedeployScheduled` events confirm Azure was patching/rebooting the underlying VM. ALL pods on that node were evicted — not just captions:

- `better-stack-collector` (3 pods killed)
- `soga-dev` (killed, back-off restart)
- `camera-photo-dev` (killed)
- `captions-debug` (killed, rescheduled)
- `captions-live` (killed, rescheduled)
### Pod replacement evidence

```
$ porter kubectl --cluster 4689 -- get pods | grep captions-debug

captions-debug-live-captions-5bc8d8c765-8tklh   1/1   Running   0   4m36s
```

- Same ReplicaSet (`5bc8d8c765`, created 18 days ago) — NOT a new deployment
- New pod name (`8tklh` replaced the old `sgcbx`) — Kubernetes killed the old pod and created a new one
- Restart count 0 — a fresh pod, not a container restart
- The new pod pulled the same image (`44e30a5a...`) in 13 seconds and started normally
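The ReplicaSet link can be read straight off the pod name: Kubernetes names pods `<replicaset>-<random-suffix>`, and the ReplicaSet name itself ends in the pod-template hash. A quick offline sketch using the pod name from this incident (plain shell string handling; no cluster access needed):

```shell
# Dropping the final dash-separated field of a pod name yields its ReplicaSet
pod="captions-debug-live-captions-5bc8d8c765-8tklh"
rs="${pod%-*}"       # strip the trailing random pod suffix
hash="${rs##*-}"     # the pod-template hash
echo "$rs"           # captions-debug-live-captions-5bc8d8c765
echo "$hash"         # 5bc8d8c765
```

If the hash matches across the old and new pod names, Kubernetes rescheduled the same workload rather than rolling out a new deployment.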
---

## Root Cause

**Azure AKS scheduled node maintenance.** Azure periodically patches and reboots the underlying VMs that host Kubernetes nodes. When a node is rebooted:

1. All pods on that node receive SIGTERM
2. Kubernetes waits `terminationGracePeriodSeconds` (30s for our apps)
3. Pods are killed
4. Kubernetes reschedules them onto other nodes
5. New pods start up (image pull + container start)

During steps 3-5, the app is down. For captions-debug, this was ~2 minutes.
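For reference, the 30s grace period in step 2 is configured per workload in the pod spec. A minimal deployment fragment showing where the field lives (illustrative names only, not our actual manifests):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app                       # illustrative name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      terminationGracePeriodSeconds: 30   # SIGTERM-to-kill window (step 2)
      containers:
        - name: app
          image: example/image:latest     # illustrative
```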
### How often does Azure do this?

| Type | Frequency | Impact |
|------|-----------|--------|
| Security patches (CVEs) | Weekly | Nodes rebooted one at a time (rolling) |
| Platform updates | Monthly | Hypervisor/host OS updates |
| Unplanned hardware | Rare | Failing hardware; live migration or reboot |

The `RebootScheduled` event we saw is most likely part of the weekly security patch cycle.

---
## Cloud Behavior During Eviction

The cloud handled this correctly:

1. ✅ Detected the `app-ws` disconnect when the captions pod was killed
2. ✅ Attempted resurrection via webhook (correct behavior)
3. ✅ Got a 503 because the captions server was still restarting (expected)
4. ✅ Reported `app_stopped` to mobile (correct — the user knows the app stopped)
5. ✅ The user can restart the app once the new pod is ready

**No code fix needed.** The cloud's reconnection and resurrection logic worked as designed.

---
## Recommendations

### 1. AKS Planned Maintenance Window (recommended)

Configure AKS to only perform maintenance during low-traffic hours:

```
az aks maintenanceconfiguration add \
  --resource-group <RG> \
  --cluster-name <CLUSTER> \
  --name aksManagedNodeOSUpgradeSchedule \
  --schedule-type Weekly \
  --day-of-week Sunday \
  --interval-weeks 1 \
  --start-time 02:00 \
  --duration 4
```

This tells Azure: "only reboot nodes between 2 and 6 AM UTC on Sundays." It reduces the impact on active users.

Apply it to all clusters: US Central (4689), France (4696), East Asia (4754), US West (4965), US East (4977).
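To confirm a window was actually registered on each cluster, the configured maintenance windows can be listed afterwards (a sketch; assumes the same `az` CLI access, with `<RG>`/`<CLUSTER>` filled in per region):

```
az aks maintenanceconfiguration list \
  --resource-group <RG> \
  --cluster-name <CLUSTER> \
  --output table
```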
### 2. Pod Disruption Budgets (optional)

Add a PodDisruptionBudget to ensure at least 1 replica stays running during node drains:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: cloud-prod-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: cloud-prod-cloud
```

Only useful if we run multiple replicas per region (we currently run 1).
### 3. Document in runbooks

Add to the pod-crash runbook: "If multiple unrelated pods restart simultaneously on the same cluster, check for AKS node maintenance events before investigating code bugs."

---
## How to Verify This in the Future

If someone reports apps randomly dying and you suspect node maintenance:

```bash
# Check for node events (RebootScheduled, RedeployScheduled, NodeNotReady)
porter kubectl --cluster <CLUSTER_ID> -- get events -n default --sort-by=.lastTimestamp | grep -iE "reboot|redeploy|evict|drain|notready"

# Check whether multiple unrelated pods restarted at the same time
porter kubectl --cluster <CLUSTER_ID> -- get pods -n default --sort-by=.status.startTime | tail -20

# Check which node a pod was on
porter kubectl --cluster <CLUSTER_ID> -- get pods -n default -o wide | grep <POD_NAME>

# Check node status
porter kubectl --cluster <CLUSTER_ID> -- get nodes
porter kubectl --cluster <CLUSTER_ID> -- describe node <NODE_NAME> | grep -A10 "Conditions"
```

If you see `RebootScheduled`/`RedeployScheduled` events and multiple pods killed at the same time, it's Azure maintenance, not a code bug.
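The grep filter in the first command can be sanity-checked offline against event lines captured from this incident (pure `grep`; no cluster access needed):

```shell
# The runbook filter should match maintenance events but not routine event lines
events='2m Warning RebootScheduled node/aks-a4689qbpv-27613717-vmss00001n Timeout when running plugin check_reboot.sh
30s Normal Pulled pod/captions-debug-live-captions-5bc8d8c765-8tklh Container image already present'
printf '%s\n' "$events" | grep -ciE "reboot|redeploy|evict|drain|notready"   # prints 1
```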
---

## Cluster IDs for Reference

| Region | Cluster ID | Node pool prefix |
|--------|-----------|------------------|
| US Central | 4689 | aks-a4689qbpv |
| France | 4696 | aks-a4696* |
| East Asia | 4754 | aks-a4754* |
| US West | 4965 | aks-a4965* |
| US East | 4977 | aks-a4977* |
