Commit 80610e9

address comments
1 parent 7bc8b22 commit 80610e9

1 file changed

+18
-11
lines changed

docs/book/src/tasks/automated-machine-management/healthchecking.md

Lines changed: 18 additions & 11 deletions
@@ -103,8 +103,7 @@ This feature is only available for KubeadmControlPlane.
103103
</aside>
104104

105105
KubeadmControlPlane lets you control how remediation happens by defining an optional `remediationStrategy`;
106-
this feature can be used for preventing unnecessary load on infrastructure provider e.g. in case of quota problems,
107-
or for allowing the infrastructure provider to stabilize in case of temporary problems:
106+
this feature can be used to prevent unnecessary load on the infrastructure provider, e.g. in case of quota problems, or to allow the infrastructure provider to stabilize in case of temporary problems.
108107

109108
```yaml
110109
apiVersion: cluster.x-k8s.io/v1beta1
@@ -119,22 +118,30 @@ spec:
119118
minHealthyPeriod: 2h
120119
```
121120

122-
`maxRetry` is the Max number of retries while attempting to remediate an unhealthy machine.
121+
`maxRetry` is the maximum number of retries while attempting to remediate an unhealthy machine.
123122
A retry happens when a machine that was created as a replacement for an unhealthy machine also fails.
124123
For example, given a control plane with three machines M1, M2, M3:
125124

126125
- M1 becomes unhealthy; remediation happens, and M1-1 is created as a replacement.
127126
- If M1-1 (replacement of M1) has problems while bootstrapping it will become unhealthy, and then be
128-
remediated; such operation is considered a retry, remediation-retry #1.
127+
remediated. This operation is considered a retry - remediation-retry #1.
129128
- If M1-2 (replacement of M1-1) becomes unhealthy, remediation-retry #2 will happen, etc.
130129

131-
A retry could happen only after `retryPeriod` from the previous retry; if `retryPeriod` is not set (default),
132-
a retry will happen immediately.
130+
A retry will only happen after the `retryPeriod` from the previous retry has elapsed. If `retryPeriod` is not set (default), a retry will happen immediately.
133131

134-
If a machine is marked as unhealthy after `minHealthyPeriod` (default 1h) from the previous remediation expired,
135-
this is not considered a retry anymore because the new issue is assumed unrelated from the previous one.
132+
If a machine is marked as unhealthy after `minHealthyPeriod` (default 1h) has passed since the previous remediation, this is no longer considered a retry because the new issue is assumed to be unrelated to the previous one.
136133

137-
If `maxRetry` is not set (default), the remedation will be retried infinitely.
134+
If `maxRetry` is not set (default), remediation will be retried infinitely.
135+
136+
<aside class="note">
137+
138+
<h1> Retry again once maxRetry is exhausted</h1>
139+
140+
If for some reason you want to remediate once `maxRetry` is exhausted, there are two options:
141+
- Temporarily increase `maxRetry` (recommended)
142+
- Remove the `controlplane.cluster.x-k8s.io/remediation-for` annotation from the unhealthy machine or decrease `retryCount` in the annotation value.
143+
144+
</aside>
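
The retry rules described in this hunk (`maxRetry`, `retryPeriod`, `minHealthyPeriod`) can be read as a small decision function. The following is a minimal Python sketch of that reading, not the KubeadmControlPlane controller code: the class `RemediationStrategy`, the function `next_retry_count`, and their parameters are hypothetical names introduced only for illustration, and the ordering of the checks is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class RemediationStrategy:
    """Illustrative mirror of the three documented fields (not the real API type)."""
    max_retry: Optional[int] = None                      # unset: retry indefinitely
    retry_period: timedelta = timedelta(0)               # unset: retry immediately
    min_healthy_period: timedelta = timedelta(hours=1)   # documented default


def next_retry_count(
    strategy: RemediationStrategy,
    last_remediation: Optional[datetime],
    retry_count: int,
    now: datetime,
) -> Optional[int]:
    """Return the retry count to record if remediation may start now, else None."""
    if last_remediation is None:
        return 0  # no earlier remediation: this is not a retry
    elapsed = now - last_remediation
    if elapsed >= strategy.min_healthy_period:
        return 0  # assumed unrelated to the previous failure: counting starts over
    if elapsed < strategy.retry_period:
        return None  # too soon since the previous attempt
    if strategy.max_retry is not None and retry_count >= strategy.max_retry:
        return None  # retries exhausted
    return retry_count + 1
```

Under this sketch, a `None` result caused by an exhausted `maxRetry` changes to a number either when `max_retry` is raised or when the stored retry count is lowered, which corresponds to the two options listed in the note above.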
138145

139146
## Remediation Short-Circuiting
140147

@@ -218,11 +225,11 @@ Before deploying a MachineHealthCheck, please familiarise yourself with the foll
218225

219226
- Only Machines owned by a MachineSet or a KubeadmControlPlane can be remediated by a MachineHealthCheck (since a MachineDeployment uses a MachineSet, then this includes Machines that are part of a MachineDeployment)
220227
- Machines managed by a KubeadmControlPlane are remediated according to [the delete-and-recreate guidelines described in the KubeadmControlPlane proposal](https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20191017-kubeadm-based-control-plane.md#remediation-using-delete-and-recreate)
221-
- Following rules should be satisfied in order to start remediation of a control plane machine:
228+
- The following rules should be satisfied in order to start remediation of a control plane machine:
222229
- One of the following apply:
223230
- The cluster MUST not be initialized yet (the failure happens before KCP reaches the initialized state)
224231
- The cluster MUST have at least two control plane machines, because this is the smallest cluster size that can be remediated.
225-
- Previous remediation (delete and re-create) MUST have been completed. This rule prevents KCP to remediate more machines while the replacement for the previous machine is not yet created.
232+
- Previous remediation (delete and re-create) MUST have been completed. This rule prevents KCP from remediating more machines while the replacement for the previous machine is not yet created.
226233
- The cluster MUST have no machines with a deletion timestamp. This rule prevents KCP from taking actions while the cluster is in a transitional state.
227234
- Remediation MUST preserve etcd quorum. This rule ensures that we will not remove a member that would result in etcd losing a majority of members and thus become unable to field new requests (note: this rule applies only to CP already initialized and with managed etcd)
228235
- If the Node for a Machine is removed from the cluster, a MachineHealthCheck will consider this Machine unhealthy and remediate it immediately
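
The preconditions listed above for remediating a control plane machine can be sketched as a single check. This is a minimal Python illustration under stated assumptions, not the KCP implementation: `ControlPlaneState`, `preserves_quorum`, and `can_remediate` are hypothetical names, and the quorum arithmetic (a majority of the members that remain after removal must be healthy) is one plausible reading of the etcd rule.

```python
from dataclasses import dataclass


@dataclass
class ControlPlaneState:
    """Hypothetical snapshot of the facts the listed rules refer to."""
    initialized: bool
    machines: int
    remediation_in_progress: bool
    machines_being_deleted: int
    managed_etcd: bool
    etcd_members: int
    unhealthy_etcd_members: int  # includes the member about to be removed


def preserves_quorum(total_members: int, unhealthy_members: int) -> bool:
    """Check that a majority of the remaining members is healthy after removal."""
    target_total = total_members - 1
    target_quorum = target_total // 2 + 1
    target_unhealthy = unhealthy_members - 1  # the removed member no longer counts
    return target_total - target_unhealthy >= target_quorum


def can_remediate(state: ControlPlaneState) -> bool:
    # Either the cluster is not initialized yet, or it has at least two machines.
    if state.initialized and state.machines < 2:
        return False
    # The previous delete-and-recreate must have completed.
    if state.remediation_in_progress:
        return False
    # No machine may carry a deletion timestamp.
    if state.machines_being_deleted > 0:
        return False
    # The quorum rule applies only to initialized control planes with managed etcd.
    if state.initialized and state.managed_etcd:
        return preserves_quorum(state.etcd_members, state.unhealthy_etcd_members)
    return True
```

With three etcd members and one unhealthy member, the two remaining members are both healthy and the check passes; with two unhealthy members out of three, only one of the two remaining members is healthy and the check fails.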
