File: docs/book/src/tasks/automated-machine-management/healthchecking.md (58 additions, 0 deletions)
@@ -92,6 +92,57 @@ in order to prevent conflicts or unexpected behaviors when trying to remediate t
</aside>

## Controlling remediation retries

<aside class="note warning">

<h1> Important </h1>

This feature is only available for KubeadmControlPlane.

</aside>

KubeadmControlPlane lets you control how remediation happens by defining an optional `remediationStrategy`.
This can be used to prevent unnecessary load on the infrastructure provider, e.g. in case of quota problems, or to give the infrastructure provider time to stabilize in case of temporary problems.

```yaml
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: my-control-plane
spec:
  ...
  remediationStrategy:
    maxRetry: 5
    retryPeriod: 2m
    minHealthyPeriod: 2h
```

`maxRetry` is the maximum number of retries while attempting to remediate an unhealthy machine.
A retry happens when a machine that was created as a replacement for an unhealthy machine also fails.
For example, given a control plane with three machines M1, M2, M3:

- M1 becomes unhealthy; remediation happens, and M1-1 is created as a replacement.
- If M1-1 (replacement of M1) has problems while bootstrapping, it will become unhealthy and then be
  remediated. This operation is considered a retry: remediation-retry #1.
- If M1-2 (replacement of M1-1) becomes unhealthy, remediation-retry #2 will happen, etc.

A retry will only happen after the `retryPeriod` from the previous retry has elapsed. If `retryPeriod` is not set (default), a retry will happen immediately.

If a machine is marked as unhealthy after `minHealthyPeriod` (default 1h) has passed since the previous remediation, this is no longer considered a retry, because the new issue is assumed to be unrelated to the previous one.

If `maxRetry` is not set (default), remediation will be retried indefinitely.

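The interaction between `maxRetry`, `retryPeriod`, and `minHealthyPeriod` can be sketched as a small decision function. This is only an illustration of the rules described above, not the KubeadmControlPlane controller code: the `RemediationStrategy` and `LastRemediation` classes, the `decide` function, the action names, and the order in which the checks run are all assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Tuple


@dataclass
class RemediationStrategy:
    # Fields mirror the remediationStrategy keys; the class itself is hypothetical.
    max_retry: Optional[int] = None                      # not set: retry without limit
    retry_period: timedelta = timedelta(0)               # not set: retry immediately
    min_healthy_period: timedelta = timedelta(hours=1)   # documented default of 1h


@dataclass
class LastRemediation:
    # Hypothetical record of the previous remediation in the same chain (M1 -> M1-1 -> ...).
    timestamp: datetime
    retry_count: int


def decide(strategy: RemediationStrategy,
           last: Optional[LastRemediation],
           now: datetime) -> Tuple[str, int]:
    """Return (action, retry_count) for a machine that was just marked unhealthy."""
    if last is None:
        return "remediate", 0                 # first remediation, not a retry
    elapsed = now - last.timestamp
    if elapsed >= strategy.min_healthy_period:
        return "remediate", 0                 # assumed unrelated issue: counting starts over
    if strategy.max_retry is not None and last.retry_count >= strategy.max_retry:
        return "exhausted", last.retry_count  # no further retries allowed
    if elapsed < strategy.retry_period:
        return "wait", last.retry_count       # retryPeriod has not elapsed yet
    return "remediate", last.retry_count + 1  # this remediation counts as a retry
```

With the values from the YAML example (`maxRetry: 5`, `retryPeriod: 2m`, `minHealthyPeriod: 2h`), a replacement that fails one minute after the previous remediation has to wait, one that fails five minutes later is remediated as retry #1, and one that fails three hours later is treated as a new, unrelated remediation.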
<aside class="note">

<h1> Retry again once maxRetry is exhausted</h1>

If for some reason you want to remediate again once `maxRetry` is exhausted, there are two options:

- Temporarily increase `maxRetry` (recommended)
- Remove the `controlplane.cluster.x-k8s.io/remediation-for` annotation from the unhealthy machine or decrease `retryCount` in the annotation value.

</aside>

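The second option can be illustrated on a plain dictionary of annotations. The sketch below assumes that the annotation value is a JSON object containing a `retryCount` key; only the annotation name and `retryCount` come from the text above, while the helper names and the other keys in the sample value are hypothetical. It does not talk to a cluster; it only shows the shape of the change you would apply with your usual tooling.

```python
import json

REMEDIATION_FOR = "controlplane.cluster.x-k8s.io/remediation-for"


def lower_retry_count(annotations: dict, new_count: int) -> dict:
    """Return a copy of `annotations` with retryCount lowered to at most `new_count`.

    Assumes the annotation value is JSON; keys other than retryCount are kept as-is.
    """
    updated = dict(annotations)
    raw = updated.get(REMEDIATION_FOR)
    if raw is None:
        return updated  # nothing recorded, nothing to change
    data = json.loads(raw)
    data["retryCount"] = min(int(data.get("retryCount", 0)), new_count)
    updated[REMEDIATION_FOR] = json.dumps(data)
    return updated


def clear_remediation_history(annotations: dict) -> dict:
    """Return a copy of `annotations` without the remediation-for annotation."""
    return {k: v for k, v in annotations.items() if k != REMEDIATION_FOR}


if __name__ == "__main__":
    # Hypothetical annotation value; the "machine" key is illustrative only.
    sample = {REMEDIATION_FOR: json.dumps({"machine": "m1-2", "retryCount": 5})}
    print(lower_retry_count(sample, 3))
    print(clear_remediation_history(sample))
```

Both helpers return new dictionaries and leave the input untouched, which makes it easy to inspect the change before applying it.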
## Remediation Short-Circuiting

To ensure that MachineHealthChecks only remediate Machines when the cluster is healthy,
@@ -174,6 +225,13 @@ Before deploying a MachineHealthCheck, please familiarise yourself with the foll
- Only Machines owned by a MachineSet or a KubeadmControlPlane can be remediated by a MachineHealthCheck (since a MachineDeployment uses a MachineSet, this includes Machines that are part of a MachineDeployment)
- Machines managed by a KubeadmControlPlane are remediated according to [the delete-and-recreate guidelines described in the KubeadmControlPlane proposal](https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20191017-kubeadm-based-control-plane.md#remediation-using-delete-and-recreate)
  - The following rules should be satisfied in order to start remediation of a control plane machine:
    - One of the following applies:
      - The cluster MUST NOT be initialized yet (the failure happens before KCP reaches the initialized state)
      - The cluster MUST have at least two control plane machines, because this is the smallest cluster size that can be remediated.
    - Previous remediation (delete and re-create) MUST have been completed. This rule prevents KCP from remediating more machines while the replacement for the previous machine is not yet created.
    - The cluster MUST have no machines with a deletion timestamp. This rule prevents KCP from taking actions while the cluster is in a transitional state.
    - Remediation MUST preserve etcd quorum. This rule ensures that we will not remove a member if doing so would result in etcd losing a majority of members and thus becoming unable to field new requests (note: this rule applies only to control planes that are already initialized and use managed etcd)
- If the Node for a Machine is removed from the cluster, a MachineHealthCheck will consider this Machine unhealthy and remediate it immediately
- If no Node joins the cluster for a Machine after the `NodeStartupTimeout`, the Machine will be remediated
- If a Machine fails for any reason (if the FailureReason is set), the Machine will be remediated immediately
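
The rules for starting remediation of a control plane machine, listed above, can be read as a sequence of checks. The following is a minimal sketch under stated assumptions, not the KCP implementation: the `ControlPlaneState` fields, the `can_remediate` function, and the reason strings are hypothetical, and the quorum check assumes that quorum is evaluated against the membership that remains after the unhealthy member is removed.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ControlPlaneState:
    """Hypothetical snapshot of a control plane; every field is an assumption."""
    initialized: bool
    machine_count: int                     # control plane machines in the cluster
    previous_remediation_completed: bool   # replacement for the last remediated machine exists
    machines_with_deletion_timestamp: int
    managed_etcd: bool
    etcd_members: int                      # members before removing the unhealthy one
    healthy_members_after_removal: int     # healthy members left once it is removed


def quorum(members: int) -> int:
    """Majority of an etcd cluster with `members` voting members."""
    return members // 2 + 1


def can_remediate(s: ControlPlaneState) -> Tuple[bool, str]:
    # Rule 1: either not initialized yet, or at least two control plane machines.
    if s.initialized and s.machine_count < 2:
        return False, "initialized cluster needs at least two control plane machines"
    # Rule 2: the previous delete-and-recreate must be finished.
    if not s.previous_remediation_completed:
        return False, "previous remediation still in progress"
    # Rule 3: no machine may be in the middle of deletion.
    if s.machines_with_deletion_timestamp > 0:
        return False, "a machine is being deleted"
    # Rule 4: only for initialized control planes with managed etcd.
    if s.initialized and s.managed_etcd:
        remaining = s.etcd_members - 1
        if s.healthy_members_after_removal < quorum(remaining):
            return False, "removing the member would break etcd quorum"
    return True, "ok"
```

For a three-member control plane with one unhealthy machine, two healthy members remain after removal and the remaining two-member cluster needs two for a majority, so remediation can proceed. If a second member is also unhealthy, only one healthy member would remain and the check fails.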