Commit 7bc8b22

Add documentation about KCP remediation
1 parent 8ecf669 commit 7bc8b22

4 files changed: 54 additions, 3 deletions

controlplane/kubeadm/api/v1beta1/kubeadm_control_plane_types.go

Lines changed: 1 addition & 1 deletion
@@ -193,7 +193,7 @@ type RemediationStrategy struct {
 // M1 become unhealthy; remediation happens, and M1-1 is created as a replacement.
 // If M1-1 (replacement of M1) has problems while bootstrapping it will become unhealthy, and then be
 // remediated; such operation is considered a retry, remediation-retry #1.
-// If M1-2 (replacement of M1-2) becomes unhealthy, remediation-retry #2 will happen, etc.
+// If M1-2 (replacement of M1-1) becomes unhealthy, remediation-retry #2 will happen, etc.
 //
 // A retry could happen only after RetryPeriod from the previous retry.
 // If a machine is marked as unhealthy after MinHealthyPeriod from the previous remediation expired,

controlplane/kubeadm/config/crd/bases/controlplane.cluster.x-k8s.io_kubeadmcontrolplanes.yaml

Lines changed: 1 addition & 1 deletion
Generated file; diff not rendered by default.

controlplane/kubeadm/config/crd/bases/controlplane.cluster.x-k8s.io_kubeadmcontrolplanetemplates.yaml

Lines changed: 1 addition & 1 deletion
Generated file; diff not rendered by default.

docs/book/src/tasks/automated-machine-management/healthchecking.md

Lines changed: 51 additions & 0 deletions
@@ -92,6 +92,50 @@ in order to prevent conflicts or unexpected behaviors when trying to remediate t
 
 </aside>
 
+## Controlling remediation retries
+
+<aside class="note warning">
+
+<h1> Important </h1>
+
+This feature is only available for KubeadmControlPlane.
+
+</aside>
+
+KubeadmControlPlane lets you control how remediation happens by defining an optional `remediationStrategy`;
+this feature can be used to prevent unnecessary load on the infrastructure provider, e.g. in case of quota problems,
+or to give the infrastructure provider time to stabilize in case of temporary problems:
+
+```yaml
+apiVersion: controlplane.cluster.x-k8s.io/v1beta1
+kind: KubeadmControlPlane
+metadata:
+  name: my-control-plane
+spec:
+  ...
+  remediationStrategy:
+    maxRetry: 5
+    retryPeriod: 2m
+    minHealthyPeriod: 2h
+```
+
+`maxRetry` is the maximum number of retries while attempting to remediate an unhealthy machine.
+A retry happens when a machine that was created as a replacement for an unhealthy machine also fails.
+For example, given a control plane with three machines M1, M2, M3:
+
+- M1 becomes unhealthy; remediation happens, and M1-1 is created as a replacement.
+- If M1-1 (replacement of M1) has problems while bootstrapping, it will become unhealthy and then be
+  remediated; this operation is considered a retry, remediation-retry #1.
+- If M1-2 (replacement of M1-1) becomes unhealthy, remediation-retry #2 will happen, etc.
+
+A retry can happen only after `retryPeriod` has elapsed since the previous retry; if `retryPeriod` is not set (default),
+a retry will happen immediately.
+
+If a machine is marked as unhealthy after `minHealthyPeriod` (default 1h) has elapsed since the previous remediation,
+this is no longer considered a retry because the new issue is assumed to be unrelated to the previous one.
+
+If `maxRetry` is not set (default), remediation will be retried indefinitely.
+
 ## Remediation Short-Circuiting
 
 To ensure that MachineHealthChecks only remediate Machines when the cluster is healthy,
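
Reviewer note: the interaction between `maxRetry`, `retryPeriod` and `minHealthyPeriod` described in the hunk above may be easier to follow as a small decision function. The Go sketch below is only an illustration of the documented rules, not the KCP controller's code. The `strategy` and `previous` types, the `decide` and `minutes` helpers, and the choice to treat the period boundaries as inclusive are all assumptions made for this example.

```go
package main

import (
	"fmt"
	"time"
)

// strategy is a hypothetical stand-in for the documented remediationStrategy fields.
type strategy struct {
	MaxRetry         *int          // nil: retry without limit (documented default)
	RetryPeriod      time.Duration // zero: retry immediately (documented default)
	MinHealthyPeriod time.Duration // zero: treated as 1h here (documented default)
}

// previous is hypothetical bookkeeping about the last remediation.
type previous struct {
	Age        time.Duration // time elapsed since the previous remediation
	RetryCount int           // retries already spent on this chain of replacements
}

// minutes is a small helper that keeps the examples readable.
func minutes(n int) time.Duration { return time.Duration(n) * time.Minute }

// decide reports whether remediation may start now and which retry count
// would be recorded if it does.
func decide(s strategy, prev *previous) (allowed bool, retryCount int, reason string) {
	if prev == nil {
		return true, 0, "first remediation, not a retry"
	}
	minHealthy := s.MinHealthyPeriod
	if minHealthy == 0 {
		minHealthy = time.Hour
	}
	// Assumption: reaching minHealthyPeriod exactly already counts as "expired".
	if prev.Age >= minHealthy {
		return true, 0, "minHealthyPeriod expired, failure treated as unrelated"
	}
	if prev.Age < s.RetryPeriod {
		return false, prev.RetryCount, "retryPeriod has not elapsed yet"
	}
	if s.MaxRetry != nil && prev.RetryCount >= *s.MaxRetry {
		return false, prev.RetryCount, "maxRetry reached"
	}
	return true, prev.RetryCount + 1, "retry"
}

func main() {
	limit := 5
	s := strategy{MaxRetry: &limit, RetryPeriod: minutes(2), MinHealthyPeriod: minutes(120)}
	cases := []*previous{
		nil,
		{Age: minutes(1)},
		{Age: minutes(10), RetryCount: 4},
		{Age: minutes(10), RetryCount: 5},
		{Age: minutes(180), RetryCount: 5},
	}
	for _, p := range cases {
		ok, n, why := decide(s, p)
		fmt.Println(ok, n, why)
	}
}
```

The values in `main` mirror the YAML example (`maxRetry: 5`, `retryPeriod: 2m`, `minHealthyPeriod: 2h`). The retry counter resets to zero once `minHealthyPeriod` has passed, which is how the sketch models "not considered a retry anymore".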
@@ -174,6 +218,13 @@ Before deploying a MachineHealthCheck, please familiarise yourself with the foll
 
 - Only Machines owned by a MachineSet or a KubeadmControlPlane can be remediated by a MachineHealthCheck (since a MachineDeployment uses a MachineSet, then this includes Machines that are part of a MachineDeployment)
 - Machines managed by a KubeadmControlPlane are remediated according to [the delete-and-recreate guidelines described in the KubeadmControlPlane proposal](https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20191017-kubeadm-based-control-plane.md#remediation-using-delete-and-recreate)
+- The following rules must be satisfied in order to start remediation of a control plane machine:
+  - One of the following applies:
+    - The cluster MUST NOT be initialized yet (the failure happens before KCP reaches the initialized state).
+    - The cluster MUST have at least two control plane machines, because this is the smallest cluster size that can be remediated.
+  - Previous remediation (delete and re-create) MUST have been completed. This rule prevents KCP from remediating more machines while the replacement for the previous machine is not yet created.
+  - The cluster MUST have no machines with a deletion timestamp. This rule prevents KCP from taking action while the cluster is in a transitional state.
+  - Remediation MUST preserve etcd quorum. This rule ensures that we will not remove a member if doing so would result in etcd losing a majority of members and thus becoming unable to field new requests (note: this rule applies only to control planes that are already initialized and use managed etcd).
 - If the Node for a Machine is removed from the cluster, a MachineHealthCheck will consider this Machine unhealthy and remediate it immediately
 - If no Node joins the cluster for a Machine after the `NodeStartupTimeout`, the Machine will be remediated
 - If a Machine fails for any reason (if the FailureReason is set), the Machine will be remediated immediately
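
Reviewer note: the preconditions added in the hunk above can be read as a sequence of guard checks. The Go sketch below is a hedged illustration of that reading and not the actual KCP implementation. The `state` type, its fields, and the `canRemediate` and `quorumPreserved` functions are hypothetical names introduced here. The quorum arithmetic assumes the machine being remediated hosts one etcd member and that quorum is a strict majority of the members left after its removal.

```go
package main

import "fmt"

// state is a hypothetical snapshot of a control plane; none of these fields
// are real Cluster API types.
type state struct {
	Initialized         bool
	Machines            int  // current number of control plane machines
	RemediationInFlight bool // a previous delete-and-recreate has not completed
	MachinesDeleting    int  // machines that carry a deletion timestamp
	ManagedEtcd         bool
	EtcdMembers         int // etcd members, including the one on the target machine
	OtherUnhealthy      int // unhealthy etcd members other than the target
}

// quorumPreserved checks whether a majority of the remaining members is
// still healthy after the target's member is removed.
func quorumPreserved(members, otherUnhealthy int) bool {
	remaining := members - 1
	quorum := remaining/2 + 1
	healthy := remaining - otherUnhealthy
	return healthy >= quorum
}

// canRemediate applies the documented rules in order and returns the first
// reason that blocks remediation, if any.
func canRemediate(s state) (bool, string) {
	if s.Initialized && s.Machines < 2 {
		return false, "initialized cluster needs at least two control plane machines"
	}
	if s.RemediationInFlight {
		return false, "previous remediation has not completed"
	}
	if s.MachinesDeleting > 0 {
		return false, "a machine has a deletion timestamp"
	}
	// The quorum rule applies only to initialized control planes with managed etcd.
	if s.Initialized && s.ManagedEtcd && !quorumPreserved(s.EtcdMembers, s.OtherUnhealthy) {
		return false, "remediation would break etcd quorum"
	}
	return true, "all preconditions satisfied"
}

func main() {
	healthy := state{Initialized: true, Machines: 3, ManagedEtcd: true, EtcdMembers: 3}
	degraded := healthy
	degraded.OtherUnhealthy = 1
	for _, s := range []state{healthy, degraded} {
		ok, why := canRemediate(s)
		fmt.Println(ok, why)
	}
}
```

Under these assumptions, a three-member cluster can lose the target member only if both remaining members are healthy, since two remaining members need a quorum of two.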
