You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
fix(monitor): add periodic status reporting and fix silent health checks
Two bugs caused zero output between Galera checks:
1. check_pod_health returned silently when pods were healthy or
when no pods matched a selector — no way to distinguish
2. Hourly alive message used (epoch_seconds % 3600 == 0) which
almost never fires (requires exact second alignment)
Fixes:
- Track pods_checked counter across all selectors per cycle
- Track check_cycle number for log correlation
- Replace broken modulo check with time-delta STATUS_REPORT_INTERVAL
(default 600s / 10 minutes, configurable)
- Print periodic summary: check number, pods scanned, pods tracked
- Always log immediately when issues are found
- Suppress stderr on pod listing to avoid noise for missing selectors
# Sets global pods_checked counter as side-effect for status reporting
64
65
check_pod_health() {
65
66
local selector="$1"
66
67
local error_patterns="$2"
67
68
local restart_enabled="${3:-false}"
68
69
69
-
local pods=$(oc get pods -l "$selector" --field-selector=status.phase=Running -o jsonpath='{.items[*].metadata.name}')
70
+
local pods=$(oc get pods -l "$selector" --field-selector=status.phase=Running -o jsonpath='{.items[*].metadata.name}'2>/dev/null)
70
71
71
72
if [[ -z"$pods" ]];then
72
73
return 0
@@ -102,6 +103,8 @@ check_pod_health() {
102
103
error_counts["$pod"]=0
103
104
fi
104
105
106
+
pods_checked=$((pods_checked +1))
107
+
105
108
# Restart only for restart-eligible pods at error threshold
106
109
if [[ "$restart_enabled"=="true"&&${error_counts["$pod"]:-0}-ge$ERROR_THRESHOLD ]];then
107
110
echo"$(date): Restarting $pod after $ERROR_THRESHOLD consecutive errors"
@@ -118,6 +121,9 @@ check_pod_health() {
118
121
119
122
# Main monitoring loop
120
123
last_galera_check=0
124
+
last_status_report=0
125
+
check_cycle=0
126
+
STATUS_REPORT_INTERVAL=${STATUS_REPORT_INTERVAL:-600}# Status summary every 10 minutes
121
127
122
128
# Send startup notification
123
129
send_notification "MONITORING_START""Pod Health Monitor Started""Continuous monitoring active with ${MONITORING_INTERVAL}s intervals. Galera checks every ${GALERA_CHECK_INTERVAL}s.""white_check_mark""$DEPLOY_NAMESPACE"
@@ -146,6 +152,7 @@ while true; do
146
152
147
153
# Perform health checks — restart-eligible services
0 commit comments