Commit f95d43a

Health Checks docs (#1237)
Signed-off-by: makhov <[email protected]>
1 parent 272be34 commit f95d43a

2 files changed: +64 -0 lines changed

docs/capi-machine-health-checks.md

Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
# MachineHealthChecks with k0smotron

k0smotron provides built-in support for MachineHealthChecks (MHC), a core Cluster API feature that detects unhealthy machines and triggers their remediation. This page covers MHC for control plane machines.

## Overview

When an MHC marks a control plane machine as unhealthy, k0smotron's control plane controller handles the remediation automatically.

Read more about MachineHealthChecks in the [Cluster API documentation](https://cluster-api.sigs.k8s.io/tasks/automated-machine-management/healthchecking.html#configure-a-machinehealthcheckbb).

## How it works

When a MachineHealthCheck detects an unhealthy control plane machine:

1. The MHC controller marks the machine as unhealthy
2. k0smotron's control plane controller detects this condition
3. The controller safely deletes the unhealthy machine
4. A new machine is automatically created to replace it
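
As a rough illustration of step 1, the fragment below shows what the `status.conditions` of a Machine can look like after the MHC controller has marked it unhealthy. The condition types and reasons follow upstream Cluster API v1beta1 conventions; that k0smotron's controller keys on exactly these conditions is an assumption, not something this page states.

```yaml
# Illustrative Machine status fragment (upstream Cluster API v1beta1 conventions).
# Assumption: k0smotron reacts to these conditions; check the controller source to confirm.
status:
  conditions:
  # Set to "False" by the MHC controller when an unhealthy condition exceeds its timeout
  - type: HealthCheckSucceeded
    status: "False"
    reason: UnhealthyNode
  # Set to "False" by the MHC controller to ask the machine's owner to remediate
  - type: OwnerRemediated
    status: "False"
    reason: WaitingForRemediation
```

You can inspect these conditions with `kubectl get machine <name> -o yaml` in the management cluster.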
19+
20+
## Prerequisites
21+
22+
- A k0smotron control plane with at least 2 replicas (required for safe remediation)
23+
- MachineHealthCheck controller running in your management cluster
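
A minimal sketch of a control plane that satisfies the replica requirement is shown below. The object names, the k0s version, and the `DockerMachineTemplate` infrastructure reference are hypothetical placeholders; substitute the values for your own infrastructure provider.

```yaml
# Sketch only: names, version, and infrastructure template are placeholders.
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: K0sControlPlane
metadata:
  name: my-cluster-cp            # hypothetical name
  namespace: default
spec:
  replicas: 3                    # more than 1 is needed before remediation can run
  version: v1.30.1+k0s.0         # illustrative k0s version
  machineTemplate:
    infrastructureRef:
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
      kind: DockerMachineTemplate      # placeholder provider template
      name: my-cluster-cp-template     # hypothetical name
      namespace: default
```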
24+
25+
## Example Configuration
26+
27+
Here's a simple example of how to set up MachineHealthChecks for a k0smotron control plane:
28+
29+
```yaml
30+
apiVersion: cluster.x-k8s.io/v1beta1
31+
kind: MachineHealthCheck
32+
metadata:
33+
name: k0smotron-controlplane-mhc
34+
namespace: default
35+
spec:
36+
clusterName: my-cluster
37+
selector:
38+
matchLabels:
39+
cluster.x-k8s.io/control-plane: "true"
40+
unhealthyConditions:
41+
- type: Ready
42+
status: Unknown
43+
timeout: 300s
44+
- type: Ready
45+
status: "False"
46+
timeout: 300s
47+
nodeStartupTimeout: 10m
48+
```
49+
50+
## Safety Features
51+
52+
k0smotron includes several safety mechanisms to prevent cluster disruption:
53+
54+
- **Minimum replicas**: Remediation only occurs when there are more than 1 control plane replica
55+
- **No concurrent operations**: Waits for provisioning/deletion operations to complete
56+
- **Graceful deletion**: Properly removes machines from the k0s cluster before deletion
57+
58+
## Best Practices
59+
60+
- Always use at least 3 control plane replicas in production
61+
- Set appropriate timeouts based on your infrastructure
62+
- Monitor remediation events and adjust thresholds as needed
63+
- Test remediation in non-production environments first
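
As one possible way to tune thresholds, the sketch below lengthens the timeouts for slower infrastructure and adds `maxUnhealthy`, a standard MachineHealthCheck field that stops remediation when too many selected machines are unhealthy at once. The specific values are illustrative assumptions, not recommendations from k0smotron, and how `maxUnhealthy` interacts with k0smotron's own safety checks is not covered here.

```yaml
# Sketch only: timeout and threshold values are illustrative, tune them for your environment.
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: k0smotron-controlplane-mhc
  namespace: default
spec:
  clusterName: my-cluster
  # Remediate only while at most 1 selected machine is unhealthy
  maxUnhealthy: 1
  selector:
    matchLabels:
      cluster.x-k8s.io/control-plane: "true"
  unhealthyConditions:
  - type: Ready
    status: Unknown
    timeout: 600s        # longer than the 300s example, for slow-to-recover nodes
  - type: Ready
    status: "False"
    timeout: 600s
  nodeStartupTimeout: 15m
```

Remediation activity can then be followed with `kubectl get events` and `kubectl describe machinehealthcheck k0smotron-controlplane-mhc` in the management cluster.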

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -30,6 +30,7 @@ nav:
  - Worker Node Bootstrap: capi-bootstrap.md
  - Remote Machine Provider: capi-remote.md
  - ClusterClass: capi-clusterclass.md
+ - Health Checks: capi-machine-health-checks.md
  - Examples:
    - Software prerequisites: capi-examples.md
    - AWS (HCP): capi-aws.md
