| 1 | +# MachineHealthChecks with k0smotron |
| 2 | + |
| 3 | +k0smotron provides built-in support for MachineHealthChecks (MHC), a core Cluster API feature that automatically detects and remediates unhealthy control plane machines. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +k0smotron's control plane controller automatically handles the remediation process when machines are marked as unhealthy by MHC. |
| 8 | + |
| 9 | +Read more about MachineHealthChecks in the [Cluster API documentation](https://cluster-api.sigs.k8s.io/tasks/automated-machine-management/healthchecking.html#configure-a-machinehealthcheckbb). |
| 10 | + |
| 11 | +## How it works |
| 12 | + |
| 13 | +When a MachineHealthCheck detects an unhealthy control plane machine: |
| 14 | + |
| 15 | +1. The MHC controller marks the machine as unhealthy |
| 16 | +2. k0smotron's control plane controller detects this condition |
| 17 | +3. The controller safely deletes the unhealthy machine |
| 18 | +4. A new machine is automatically created to replace it |
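
The hand-off between steps 1 and 2 happens through conditions on the Machine object. As a rough illustration, assuming k0smotron follows the upstream Cluster API convention (this document does not confirm which conditions it reads), an unhealthy control plane Machine might carry a status like the fragment below. The condition types come from the upstream v1beta1 API; the reasons shown are examples, not captured output.

```yaml
# Illustrative status fragment of an unhealthy Machine.
# Assumption: k0smotron reacts to the same conditions as upstream Cluster API.
status:
  conditions:
    - type: HealthCheckSucceeded   # set to False by the MHC controller
      status: "False"
      severity: Warning
      reason: UnhealthyNode
    - type: OwnerRemediated        # signals the owning control plane to remediate
      status: "False"
      severity: Warning
      reason: WaitingForRemediation
```

Inspecting these conditions on a Machine is one way to check whether a health check has fired before the replacement starts.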
| 19 | + |
| 20 | +## Prerequisites |
| 21 | + |
| 22 | +- A k0smotron control plane with at least 2 replicas (required for safe remediation) |
| 23 | +- MachineHealthCheck controller running in your management cluster |
| 24 | + |
| 25 | +## Example Configuration |
| 26 | + |
| 27 | +Here's a simple example of how to set up MachineHealthChecks for a k0smotron control plane: |
| 28 | + |
| 29 | +```yaml |
| 30 | +apiVersion: cluster.x-k8s.io/v1beta1 |
| 31 | +kind: MachineHealthCheck |
| 32 | +metadata: |
| 33 | +  name: k0smotron-controlplane-mhc |
| 34 | +  namespace: default |
| 35 | +spec: |
| 36 | +  clusterName: my-cluster |
| 37 | +  selector: |
| 38 | +    matchLabels: |
| 39 | +      cluster.x-k8s.io/control-plane: "true" |
| 40 | +  unhealthyConditions: |
| 41 | +    - type: Ready |
| 42 | +      status: Unknown |
| 43 | +      timeout: 300s |
| 44 | +    - type: Ready |
| 45 | +      status: "False" |
| 46 | +      timeout: 300s |
| 47 | +  nodeStartupTimeout: 10m |
| 48 | +``` |
| 49 | + |
| 50 | +## Safety Features |
| 51 | + |
| 52 | +k0smotron includes several safety mechanisms to prevent cluster disruption: |
| 53 | + |
| 54 | +- **Minimum replicas**: Remediation only occurs when there is more than one control plane replica |
| 55 | +- **No concurrent operations**: Remediation waits for in-progress provisioning or deletion operations to complete |
| 56 | +- **Graceful deletion**: The controller removes the machine from the k0s cluster before deleting it |
| 57 | + |
| 58 | +## Best Practices |
| 59 | + |
| 60 | +- Always use at least 3 control plane replicas in production |
| 61 | +- Set `timeout` and `nodeStartupTimeout` long enough to cover your infrastructure's normal provisioning and boot times |
| 62 | +- Monitor remediation events and adjust thresholds as needed |
| 63 | +- Test remediation in non-production environments first |
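
As a sketch of the timeout and threshold tuning mentioned above, the variant below lengthens the timeouts and adds `maxUnhealthy`, a standard v1beta1 MachineHealthCheck field that blocks remediation when more than the given number of selected machines are unhealthy at once. The resource name, cluster name, and all values are illustrative assumptions, not k0smotron recommendations.

```yaml
# Hypothetical tuning for slower infrastructure; adjust every value to your environment.
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: k0smotron-controlplane-mhc-tuned   # illustrative name
  namespace: default
spec:
  clusterName: my-cluster
  # Remediate only while at most one selected machine is unhealthy.
  # An absolute number is used because percentages are rounded down.
  maxUnhealthy: 1
  selector:
    matchLabels:
      cluster.x-k8s.io/control-plane: "true"
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 600s
    - type: Ready
      status: "False"
      timeout: 600s
  nodeStartupTimeout: 20m
```

With three control plane replicas, a limit of one keeps remediation from running when two machines fail together, which usually points to a wider outage than a single bad machine.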