docs/book/src/tasks/automated-machine-management/healthchecking.md
18 additions & 11 deletions
@@ -103,8 +103,7 @@ This feature is only available for KubeadmControlPlane.
</aside>

KubeadmControlPlane allows you to control how remediation happens by defining an optional `remediationStrategy`;
-this feature can be used for preventing unnecessary load on infrastructure provider e.g. in case of quota problems,
-or for allowing the infrastructure provider to stabilize in case of temporary problems:
+this feature can be used for preventing unnecessary load on infrastructure provider e.g. in case of quota problems, or for allowing the infrastructure provider to stabilize in case of temporary problems.
```yaml
apiVersion: cluster.x-k8s.io/v1beta1
@@ -119,22 +118,30 @@ spec:
minHealthyPeriod: 2h
```
-`maxRetry`is the Max number of retries while attempting to remediate an unhealthy machine.
+`maxRetry` is the maximum number of retries while attempting to remediate an unhealthy machine.
A retry happens when a machine that was created as a replacement for an unhealthy machine also fails.
For example, given a control plane with three machines M1, M2, M3:
- M1 becomes unhealthy; remediation happens, and M1-1 is created as a replacement.
- If M1-1 (replacement of M1) has problems while bootstrapping it will become unhealthy, and then be
-remediated; such operation is considered a retry, remediation-retry #1.
+remediated. This operation is considered a retry - remediation-retry #1.
- If M1-2 (replacement of M1-1) becomes unhealthy, remediation-retry #2 will happen, etc.
-A retry could happen only after `retryPeriod` from the previous retry; if `retryPeriod` is not set (default),
-a retry will happen immediately.
+A retry will only happen after the `retryPeriod` from the previous retry has elapsed. If `retryPeriod` is not set (default), a retry will happen immediately.
-If a machine is marked as unhealthy after `minHealthyPeriod` (default 1h) from the previous remediation expired,
-this is not considered a retry anymore because the new issue is assumed unrelated from the previous one.
+If a machine is marked as unhealthy after `minHealthyPeriod` (default 1h) has passed since the previous remediation, this is no longer considered a retry because the new issue is assumed to be unrelated to the previous one.
-If `maxRetry` is not set (default), the remedation will be retried infinitely.
+If `maxRetry` is not set (default), remediation will be retried infinitely.
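The retry rules described above can be read as a small decision function. The following is an illustrative Python sketch, not the KubeadmControlPlane controller's code: the `RemediationStrategy` class, the `decide` function, and the exact boundary comparisons are assumptions made for this example; only the meaning of the three fields comes from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Tuple


@dataclass
class RemediationStrategy:
    """Hypothetical stand-in for the `remediationStrategy` fields."""
    max_retry: Optional[int] = None                      # not set: retry without limit
    retry_period: timedelta = timedelta(0)               # not set: retry immediately
    min_healthy_period: timedelta = timedelta(hours=1)   # default 1h


def decide(strategy: RemediationStrategy, retry_count: int,
           last_remediation: datetime, now: datetime) -> Tuple[bool, int]:
    """Return (remediate_now, retry_count_after) for a machine marked unhealthy."""
    elapsed = now - last_remediation
    # Assumption: "after minHealthyPeriod" is modelled as elapsed >= period.
    if elapsed >= strategy.min_healthy_period:
        return True, 0  # treated as a new, unrelated issue: not a retry
    if strategy.max_retry is not None and retry_count >= strategy.max_retry:
        return False, retry_count  # retries exhausted
    if elapsed < strategy.retry_period:
        return False, retry_count  # too early: wait for retryPeriod to elapse
    return True, retry_count + 1
```

With `max_retry` left unset, the function never takes the "retries exhausted" branch, which corresponds to remediation being retried without limit.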
+
+<aside class="note">
+
+<h1> Retry again once maxRetry is exhausted</h1>
+
+If for some reason you want to remediate once maxRetry is exhausted, there are two options:
+- Temporarily increase `maxRetry` (recommended)
+- Remove the `controlplane.cluster.x-k8s.io/remediation-for` annotation from the unhealthy machine or decrease `retryCount` in the annotation value.
+
+</aside>
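For the second option, lowering `retryCount` amounts to rewriting the annotation value. The Python sketch below assumes the value is a JSON object containing a `retryCount` key; that encoding, the sample field names in the usage note, and the `lower_retry_count` helper are assumptions for illustration, so inspect the actual annotation on your machine before editing it.

```python
import json


def lower_retry_count(annotation_value: str, by: int = 1) -> str:
    """Return the annotation value with `retryCount` reduced by `by` (never below 0).

    Assumes the annotation value is a JSON object; other keys are kept as-is.
    """
    data = json.loads(annotation_value)
    data["retryCount"] = max(0, int(data.get("retryCount", 0)) - by)
    # Compact separators keep the value free of extra whitespace.
    return json.dumps(data, separators=(",", ":"))
```

For example, passing a hypothetical value such as `{"machine":"m1-1","retryCount":3}` would yield the same object with `retryCount` set to 2, which could then be written back to the annotation.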
## Remediation Short-Circuiting
@@ -218,11 +225,11 @@ Before deploying a MachineHealthCheck, please familiarise yourself with the foll
- Only Machines owned by a MachineSet or a KubeadmControlPlane can be remediated by a MachineHealthCheck (because a MachineDeployment uses a MachineSet, this includes Machines that are part of a MachineDeployment)
- Machines managed by a KubeadmControlPlane are remediated according to [the delete-and-recreate guidelines described in the KubeadmControlPlane proposal](https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20191017-kubeadm-based-control-plane.md#remediation-using-delete-and-recreate)
-- Following rules should be satisfied in order to start remediation of a control plane machine:
+- The following rules should be satisfied in order to start remediation of a control plane machine:
- One of the following applies:
- The cluster MUST not be initialized yet (the failure happens before KCP reaches the initialized state)
- The cluster MUST have at least two control plane machines, because this is the smallest cluster size that can be remediated.
-- Previous remediation (delete and re-create) MUST have been completed. This rule prevents KCP to remediate more machines while the replacement for the previous machine is not yet created.
+- Previous remediation (delete and re-create) MUST have been completed. This rule prevents KCP from remediating more machines while the replacement for the previous machine is not yet created.
- The cluster MUST have no machines with a deletion timestamp. This rule prevents KCP taking actions while the cluster is in a transitional state.
- Remediation MUST preserve etcd quorum. This rule ensures that we will not remove a member that would result in etcd losing a majority of members and thus become unable to field new requests (note: this rule applies only to CP already initialized and with managed etcd)
- If the Node for a Machine is removed from the cluster, a MachineHealthCheck will consider this Machine unhealthy and remediate it immediately
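
The control plane rules in the list above can be read as a series of guard checks. The following Python sketch is one interpretation, not KCP's implementation: `ControlPlaneState`, `preserves_quorum`, and `can_remediate` are hypothetical names, and the quorum check assumes the machine being remediated hosts one of the unhealthy etcd members.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ControlPlaneState:
    """Hypothetical snapshot of what the rules look at."""
    initialized: bool
    machines: int
    previous_remediation_done: bool
    machines_with_deletion_timestamp: int
    etcd_members: int
    unhealthy_etcd_members: int   # includes the machine to remediate
    managed_etcd: bool = True


def preserves_quorum(members: int, unhealthy: int) -> bool:
    """True if, after removing one unhealthy member, a majority of the rest is healthy."""
    remaining = members - 1
    healthy_remaining = remaining - (unhealthy - 1)
    return healthy_remaining >= remaining // 2 + 1


def can_remediate(s: ControlPlaneState) -> Tuple[bool, str]:
    """Apply the rules in order and report the first one that blocks remediation."""
    if s.initialized and s.machines < 2:
        return False, "initialized cluster needs at least two control plane machines"
    if not s.previous_remediation_done:
        return False, "previous remediation still in progress"
    if s.machines_with_deletion_timestamp > 0:
        return False, "a machine is being deleted"
    # The quorum rule applies only to initialized control planes with managed etcd.
    if s.initialized and s.managed_etcd and not preserves_quorum(
            s.etcd_members, s.unhealthy_etcd_members):
        return False, "remediation would break etcd quorum"
    return True, "ok"
```

Under these assumptions, a three-member control plane with one unhealthy member passes every check, while the same control plane with two unhealthy members is blocked by the quorum rule.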