README.md: 2 additions & 0 deletions
@@ -16,6 +16,8 @@ service that favours safety, explicit configuration, and verifiable supply-chain
   timing/exit-code data for diagnostics.
 - **Health gating** – executes an operator-supplied script twice (pre- and post-lock) with rich environment variables for
   node identity, cluster policies, maintenance windows, and optional metrics endpoints.
+- **Cluster-wide health coordination** – persists unhealthy node markers in etcd so peers refuse to reboot while any script is
+  reporting failure, keeps publishing each node's health even when no reboot is pending, applies configured cluster policy thresholds (minimum healthy counts, fractions, fallback protections) before allowing another reboot, and clears the block automatically once the node becomes healthy again.【F:pkg/clusterhealth/etcd.go†L18-L153】【F:pkg/orchestrator/runner.go†L321-L469】
 - **Distributed coordination** – etcd-backed mutex with annotated metadata (`node`, `pid`, `acquired_at`) so operators can
   inspect lock holders during incidents.
 - **Safeguards** – kill switch file, dry-run mode, deny/allow maintenance windows, a configurable cooldown between
 6. **Wire observability and safety toggles** – Define `kill_switch_file` so a
    single touch blocks reboots, and enable the Prometheus listener via
    `metrics.enabled`/`metrics.listen` when metrics are required.【F:examples/config.yaml†L41-L47】【F:examples/config.yaml†L114-L118】【F:cmd/clusterrebootd/main.go†L193-L252】
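
The kill switch described in step 6 can be pictured as a plain file-existence check. The following is a minimal sketch, assuming the daemon only tests whether the configured path exists; the helper name `reboot_blocked` and the file name used here are hypothetical and are not taken from clusterrebootd.

```python
import tempfile
from pathlib import Path


def reboot_blocked(kill_switch_file: Path) -> bool:
    """Return True when the kill switch file exists (assumed semantics)."""
    return kill_switch_file.exists()


# Hypothetical location for illustration only; a real deployment would use
# whatever path is configured under `kill_switch_file`.
with tempfile.TemporaryDirectory() as tmp:
    switch = Path(tmp) / "clusterrebootd.disable"
    before = reboot_blocked(switch)   # file absent: reboots may proceed
    switch.touch()                    # a single touch engages the switch
    after = reboot_blocked(switch)    # file present: reboots are blocked
    switch.unlink()                   # removing the file lifts the block
    cleared = reboot_blocked(switch)
```

Keeping the switch as a file means an operator can engage it with nothing more than `touch`, even when the coordination backend is unreachable.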
@@ -98,6 +100,13 @@ Health scripts are the final safeguard before a reboot. Follow these practices:
   Diagnostics invoked with `status --skip-health` or `--skip-lock` set
   `RC_SKIP_HEALTH`/`RC_SKIP_LOCK` to `true`, allowing scripts to short-circuit
   optional checks when operators intentionally bypass them.【F:cmd/clusterrebootd/main.go†L298-L305】【F:pkg/orchestrator/runner.go†L485-L500】
+- **Expect global gating on failure** – The coordinator now stores an unhealthy
+  marker in etcd whenever the script exits non-zero, runs the script even when
+  no reboot is pending so the cluster view stays current, and evaluates the
+  configured cluster policy thresholds before allowing another reboot. Peers
+  block their own reboots until a later pass succeeds and clears the entry.
+  Use the `status` command with health checks enabled to verify the marker
+  clears after remediation.【F:pkg/clusterhealth/etcd.go†L18-L153】【F:pkg/orchestrator/runner.go†L321-L469】
 - **Return meaningful exit codes** – Exit `0` to allow the reboot, non-zero to
   block it. Write concise status details to stdout/stderr; they are captured in
   the JSON logs and CLI output for incident response.【F:cmd/clusterrebootd/main.go†L482-L517】
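
A health script that follows the practices above might be structured like the sketch below. It assumes the script is written in Python and that `RC_SKIP_HEALTH` arrives as the string `true`; the split into required and optional checks, the `disk_has_space` check, and the function names are hypothetical. Only the `RC_SKIP_HEALTH` variable and the exit-code convention (0 allows, non-zero blocks) come from the README text.

```python
import os
import shutil
import sys


def disk_has_space(path: str = "/", min_free_fraction: float = 0.10) -> bool:
    """Hypothetical check: require a minimum fraction of free disk space."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total >= min_free_fraction


def evaluate(env, required, optional):
    """Return (exit_code, message): 0 allows the reboot, non-zero blocks it."""
    # Assumption: the flag is the literal string "true" when a skip was requested.
    skip_optional = env.get("RC_SKIP_HEALTH", "").lower() == "true"
    checks = list(required) if skip_optional else list(required) + list(optional)
    for name, check in checks:
        if not check():
            return 1, f"check failed: {name}"
    if skip_optional:
        return 0, "required checks passed (optional checks skipped)"
    return 0, "all checks passed"


def main() -> int:
    code, message = evaluate(
        os.environ,
        required=[],                                # site-specific checks go here
        optional=[("disk space", disk_has_space)],
    )
    # Keep the status line short; stdout/stderr are captured for diagnostics.
    print(message, file=sys.stderr if code else sys.stdout)
    return code
```

To use this as an executable script, end the file with `raise SystemExit(main())`. Because any non-zero exit also publishes an unhealthy marker that gates peers, it is worth keeping the failure message specific enough to act on.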
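
The cluster policy thresholds mentioned in this diff (minimum healthy counts and fractions) can be illustrated with a small gate function. This is a sketch under stated assumptions, not the logic in `pkg/orchestrator/runner.go`: the parameter names `min_healthy_count` and `min_healthy_fraction` are hypothetical, the rebooting node is assumed to be healthy and to count as unavailable while it reboots, and the fallback protections are not modeled.

```python
def reboot_allowed(healthy_nodes: int, total_nodes: int,
                   min_healthy_count: int = 0,
                   min_healthy_fraction: float = 0.0) -> bool:
    """Return True if the cluster stays within policy after one node leaves."""
    if total_nodes <= 0 or healthy_nodes <= 0:
        return False
    # Assumption: the node asking to reboot is one of the healthy nodes.
    remaining = healthy_nodes - 1
    if remaining < min_healthy_count:
        return False
    return remaining / total_nodes >= min_healthy_fraction
```

For example, with five healthy nodes out of five and `min_healthy_count=3`, one reboot leaves four healthy nodes and is allowed; with only three healthy nodes the same policy blocks it.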