Commit 87a1c9d

Merge pull request #8327 from fabriziopandini/document-kcp-remediation-changes
📖 Add documentation about KCP remediation
2 parents: 77375ac + 80610e9

4 files changed, with 61 additions and 3 deletions

controlplane/kubeadm/api/v1beta1/kubeadm_control_plane_types.go

Lines changed: 1 addition & 1 deletion
@@ -193,7 +193,7 @@ type RemediationStrategy struct {
 // M1 become unhealthy; remediation happens, and M1-1 is created as a replacement.
 // If M1-1 (replacement of M1) has problems while bootstrapping it will become unhealthy, and then be
 // remediated; such operation is considered a retry, remediation-retry #1.
-// If M1-2 (replacement of M1-2) becomes unhealthy, remediation-retry #2 will happen, etc.
+// If M1-2 (replacement of M1-1) becomes unhealthy, remediation-retry #2 will happen, etc.
 //
 // A retry could happen only after RetryPeriod from the previous retry.
 // If a machine is marked as unhealthy after MinHealthyPeriod from the previous remediation expired,

controlplane/kubeadm/config/crd/bases/controlplane.cluster.x-k8s.io_kubeadmcontrolplanes.yaml

Lines changed: 1 addition & 1 deletion
Generated file; diff not rendered.

controlplane/kubeadm/config/crd/bases/controlplane.cluster.x-k8s.io_kubeadmcontrolplanetemplates.yaml

Lines changed: 1 addition & 1 deletion
Generated file; diff not rendered.

docs/book/src/tasks/automated-machine-management/healthchecking.md

Lines changed: 58 additions & 0 deletions
@@ -92,6 +92,57 @@ in order to prevent conflicts or unexpected behaviors when trying to remediate t
 
 </aside>
 
+## Controlling remediation retries
+
+<aside class="note warning">
+
+<h1> Important </h1>
+
+This feature is only available for KubeadmControlPlane.
+
+</aside>
+
+KubeadmControlPlane lets you control how remediation happens by defining an optional `remediationStrategy`;
+this feature can be used to prevent unnecessary load on the infrastructure provider, e.g. in case of quota problems, or to allow the infrastructure provider to stabilize in case of temporary problems.
+
+```yaml
+apiVersion: controlplane.cluster.x-k8s.io/v1beta1
+kind: KubeadmControlPlane
+metadata:
+  name: my-control-plane
+spec:
+  ...
+  remediationStrategy:
+    maxRetry: 5
+    retryPeriod: 2m
+    minHealthyPeriod: 2h
+```
+
+`maxRetry` is the maximum number of retries while attempting to remediate an unhealthy machine.
+A retry happens when a machine that was created as a replacement for an unhealthy machine also fails.
+For example, given a control plane with three machines M1, M2, M3:
+
+- M1 becomes unhealthy; remediation happens, and M1-1 is created as a replacement.
+- If M1-1 (replacement of M1) has problems while bootstrapping, it will become unhealthy and then be
+  remediated. This operation is considered a retry: remediation-retry #1.
+- If M1-2 (replacement of M1-1) becomes unhealthy, remediation-retry #2 will happen, etc.
+
+A retry will only happen after the `retryPeriod` from the previous retry has elapsed. If `retryPeriod` is not set (default), a retry will happen immediately.
+
+If a machine is marked as unhealthy after `minHealthyPeriod` (default 1h) has passed since the previous remediation, this is no longer considered a retry because the new issue is assumed to be unrelated to the previous one.
+
+If `maxRetry` is not set (default), remediation will be retried indefinitely.
+
+<aside class="note">
+
+<h1> Retry again once maxRetry is exhausted</h1>
+
+If for some reason you want to remediate once `maxRetry` is exhausted, there are two options:
+- Temporarily increase `maxRetry` (recommended)
+- Remove the `controlplane.cluster.x-k8s.io/remediation-for` annotation from the unhealthy machine or decrease `retryCount` in the annotation value.
+
+</aside>
+
 ## Remediation Short-Circuiting
 
 To ensure that MachineHealthChecks only remediate Machines when the cluster is healthy,
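
Review note: the retry rules added in the hunk above can be sketched in Go as follows. This is a minimal illustration, not the KubeadmControlPlane controller code. The `Strategy`, `Previous`, and `Decision` types, the `Decide` function, the `docExample` variable, and the handling of boundary cases (a retry is allowed once exactly `retryPeriod` has elapsed) are assumptions made for this example. Only the three field names, the documented defaults, and the sample values (`maxRetry: 5`, `retryPeriod: 2m`, `minHealthyPeriod: 2h`) come from the documentation text.

```go
package main

import (
	"fmt"
	"time"
)

// Strategy mirrors the three documented remediationStrategy fields.
// A nil MaxRetry means "not set", i.e. retry without limit.
type Strategy struct {
	MaxRetry         *int
	RetryPeriod      time.Duration // zero: a retry may happen immediately
	MinHealthyPeriod time.Duration // zero: fall back to the documented 1h default
}

// Previous is a hypothetical record of the last remediation in the chain
// M1 -> M1-1 -> M1-2 ...
type Previous struct {
	RetryCount int           // 0 for the first remediation, 1 for remediation-retry #1, ...
	Age        time.Duration // time elapsed since that remediation
}

// Decision is the outcome of evaluating the retry rules.
type Decision struct {
	Allowed    bool
	RetryCount int // retry count the new remediation would carry if allowed
	Reason     string
}

// Decide applies the documented rules in order: minHealthyPeriod, maxRetry, retryPeriod.
func Decide(s Strategy, prev *Previous) Decision {
	if prev == nil {
		return Decision{true, 0, "first remediation"}
	}
	minHealthy := s.MinHealthyPeriod
	if minHealthy == 0 {
		minHealthy = time.Hour // documented default
	}
	if prev.Age >= minHealthy {
		// The new failure is assumed unrelated to the previous one.
		return Decision{true, 0, "not a retry: minHealthyPeriod has passed"}
	}
	next := prev.RetryCount + 1
	if s.MaxRetry != nil && next > *s.MaxRetry {
		return Decision{false, prev.RetryCount, "maxRetry exhausted"}
	}
	if prev.Age < s.RetryPeriod {
		return Decision{false, prev.RetryCount, "retryPeriod has not elapsed"}
	}
	return Decision{true, next, fmt.Sprintf("remediation-retry #%d", next)}
}

func intPtr(i int) *int { return &i }

// docExample uses the values from the YAML snippet in the documentation.
var docExample = Strategy{
	MaxRetry:         intPtr(5),
	RetryPeriod:      2 * time.Minute,
	MinHealthyPeriod: 2 * time.Hour,
}

func main() {
	cases := []*Previous{
		nil,                                    // no earlier remediation
		{RetryCount: 0, Age: 30 * time.Second}, // too soon after the last one
		{RetryCount: 0, Age: 5 * time.Minute},  // becomes retry #1
		{RetryCount: 5, Age: 5 * time.Minute},  // would exceed maxRetry
		{RetryCount: 5, Age: 3 * time.Hour},    // counted as a fresh remediation
	}
	for _, prev := range cases {
		fmt.Printf("%+v\n", Decide(docExample, prev))
	}
}
```

In this sketch the `minHealthyPeriod` check comes first, so a machine that stayed healthy long enough starts a fresh remediation chain even when `maxRetry` was exhausted earlier.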
@@ -174,6 +225,13 @@ Before deploying a MachineHealthCheck, please familiarise yourself with the foll
 
 - Only Machines owned by a MachineSet or a KubeadmControlPlane can be remediated by a MachineHealthCheck (since a MachineDeployment uses a MachineSet, then this includes Machines that are part of a MachineDeployment)
 - Machines managed by a KubeadmControlPlane are remediated according to [the delete-and-recreate guidelines described in the KubeadmControlPlane proposal](https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20191017-kubeadm-based-control-plane.md#remediation-using-delete-and-recreate)
+  - The following rules should be satisfied in order to start remediation of a control plane machine:
+    - One of the following applies:
+      - The cluster MUST NOT be initialized yet (the failure happens before KCP reaches the initialized state)
+      - The cluster MUST have at least two control plane machines, because this is the smallest cluster size that can be remediated.
+    - Previous remediation (delete and re-create) MUST have been completed. This rule prevents KCP from remediating more machines while the replacement for the previous machine is not yet created.
+    - The cluster MUST have no machines with a deletion timestamp. This rule prevents KCP from taking actions while the cluster is in a transitional state.
+    - Remediation MUST preserve etcd quorum. This rule ensures that we will not remove a member if doing so would result in etcd losing a majority of members and thus becoming unable to serve new requests (note: this rule applies only to control planes that are already initialized and use managed etcd)
 - If the Node for a Machine is removed from the cluster, a MachineHealthCheck will consider this Machine unhealthy and remediate it immediately
 - If no Node joins the cluster for a Machine after the `NodeStartupTimeout`, the Machine will be remediated
 - If a Machine fails for any reason (if the FailureReason is set), the Machine will be remediated immediately
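
Review note: the control plane remediation preconditions added in the hunk above can be read as a sequence of guard checks. The Go sketch below illustrates that reading under stated assumptions and is not the KubeadmControlPlane implementation. `ControlPlaneState`, its fields, and `CanRemediate` are hypothetical names. The quorum check assumes a simple majority (`n/2 + 1`) computed over the etcd membership that would remain after the unhealthy member is removed.

```go
package main

import "fmt"

// ControlPlaneState is a hypothetical snapshot of what the rules inspect.
type ControlPlaneState struct {
	Initialized          bool
	Machines             int  // control plane machines, including the unhealthy one
	RemediationInFlight  bool // replacement for the previous remediation not yet created
	MachinesBeingDeleted int  // machines that carry a deletion timestamp
	ManagedEtcd          bool
	EtcdMembers          int // current etcd members, including the one to remove
	HealthyEtcdMembers   int // healthy members, excluding the one to remove
}

// quorum returns the majority size for an etcd cluster with the given member count.
func quorum(members int) int { return members/2 + 1 }

// CanRemediate evaluates the documented preconditions in order and reports
// the first rule that blocks remediation, if any.
func CanRemediate(s ControlPlaneState) (bool, string) {
	// Either the cluster is not initialized yet, or it has at least two machines.
	if s.Initialized && s.Machines < 2 {
		return false, "initialized cluster needs at least two control plane machines"
	}
	if s.RemediationInFlight {
		return false, "previous remediation has not completed"
	}
	if s.MachinesBeingDeleted > 0 {
		return false, "a machine has a deletion timestamp"
	}
	// The quorum rule applies only to initialized control planes with managed etcd.
	if s.Initialized && s.ManagedEtcd {
		membersAfter := s.EtcdMembers - 1
		if s.HealthyEtcdMembers < quorum(membersAfter) {
			return false, "removing the member would break etcd quorum"
		}
	}
	return true, "all preconditions satisfied"
}

func main() {
	states := []ControlPlaneState{
		{Initialized: true, Machines: 3, ManagedEtcd: true, EtcdMembers: 3, HealthyEtcdMembers: 2},
		{Initialized: true, Machines: 3, ManagedEtcd: true, EtcdMembers: 3, HealthyEtcdMembers: 1},
		{Initialized: true, Machines: 1, ManagedEtcd: true, EtcdMembers: 1, HealthyEtcdMembers: 0},
		{Initialized: false, Machines: 1},
	}
	for _, s := range states {
		ok, reason := CanRemediate(s)
		fmt.Printf("remediate=%v: %s\n", ok, reason)
	}
}
```

With three members and only the target unhealthy, two healthy members remain against a quorum of two, so removal is safe. With a second unhealthy member, only one healthy member remains and the sketch blocks remediation.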
