Commit 8a30fc1

Merge pull request #7855 from fabriziopandini/improve-kcp-remediation
📖 Amend KCP proposal with remediation while provisioning the CP
2 parents 2961840 + ce3e77f commit 8a30fc1

1 file changed: +14 -5 lines changed

docs/proposals/20191017-kubeadm-based-control-plane.md

Lines changed: 14 additions & 5 deletions
@@ -472,12 +472,20 @@ When `MaxSurge` is set to 0 the rollout algorithm is as follows:
 for additional details. When there are multiple machines that are marked for remediation, the oldest one will be remediated first.
 
 - Following rules should be satisfied in order to start remediation
-- The cluster MUST have at least two control plane machines, because this is the smallest cluster size that can be remediated.
-- The number of replicas MUST be equal to or greater than the desired replicas. This rule ensures that when the cluster
-is missing replicas, we skip remediation and instead perform regular scale up/rollout operations first.
+- One of the following apply:
+- The cluster MUST not be initialized yet (the failure happens before KCP reaches the initialized state)
+- The cluster MUST have at least two control plane machines, because this is the smallest cluster size that can be remediated.
+- Previous remediation (delete and re-create) MUST have been completed. This rule prevents KCP to remediate more machines while the
+replacement for the previous machine is not yet created.
 - The cluster MUST have no machines with a deletion timestamp. This rule prevents KCP taking actions while the cluster is in a transitional state.
 - Remediation MUST preserve etcd quorum. This rule ensures that we will not remove a member that would result in etcd
-losing a majority of members and thus become unable to field new requests.
+losing a majority of members and thus become unable to field new requests (note: this rule applies only to CP with at least replicas)
+
+- Additionally following opt-in safeguards will be put in place:
+- If we are remediating the same machine (delete, re-create, replacement machine gets unhealthy), it will be possible
+to define a maximum number of retries, thus preventing unnecessary load on infrastructure provider e.g. in case of quota problems.
+- If we are remediating the same machine (delete, re-create, replacement machine gets unhealthy), it will be possible
+to define a delay between each retry, thus allowing the infrastructure provider to stabilize in case of temporary problems.
 
 - When all the conditions for starting remediation are satisfied, KCP temporarily suspend any operation in progress
 in order to perform remediation.
@@ -634,4 +642,5 @@ For the purposes of designing upgrades, two existing lifecycle managers were exa
 - [x] 12/04/2019: Initial stubbed KubeadmControlPlane controller added [#1826](https://github.com/kubernetes-sigs/cluster-api/pull/1826)
 - [x] 07/09/2020: Document updated to reflect changes up to v0.3.9 release
 - [x] 22/09/2020: KCP remediation added
-- [x] XX/XX/2020: KCP rollout strategies added
+- [x] 10/05/2021: Support for remediation of failures while upgrading 1 node CP
+- [x] 05/01/2022: Support for remediation while provisioning the CP (both first CP and CP machines while current replica < desired replica); Allow control of remediation retry behavior.
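
The amended rules for starting remediation can be read as a single precondition check. The Go sketch below is illustrative only and is not taken from the KCP controller: the `State` and `RetryPolicy` types, their fields, and `CanRemediate` are hypothetical names. The proposal text does not state the replica threshold for the etcd quorum rule, so the sketch assumes the quorum check runs for any initialized control plane. It also assumes that a zero value disables each opt-in safeguard.

```go
package main

import (
	"fmt"
	"time"
)

// State is a hypothetical snapshot of the control plane; these field
// names are assumptions for illustration, not KCP API fields.
type State struct {
	Initialized        bool // KCP has reached the initialized state
	Machines           int  // current control plane machines
	DeletingMachines   int  // machines with a deletion timestamp
	PreviousInProgress bool // replacement for the last remediated machine not yet created
	EtcdMembers        int  // etcd members, including the remediation target
	OtherUnhealthy     int  // unhealthy etcd members other than the target
}

// RetryPolicy models the opt-in safeguards. Zero values disable them
// (an assumption of this sketch).
type RetryPolicy struct {
	MaxRetries int
	RetryDelay time.Duration
}

// CanRemediate reports whether remediation may start and, if not, why.
// retries counts earlier remediations of the same machine; sinceLast is
// the time elapsed since the most recent one.
func CanRemediate(s State, p RetryPolicy, retries int, sinceLast time.Duration) (bool, string) {
	// Either the cluster is not initialized yet, or it has at least two machines.
	if s.Initialized && s.Machines < 2 {
		return false, "initialized cluster needs at least two control plane machines"
	}
	if s.PreviousInProgress {
		return false, "previous remediation has not completed"
	}
	if s.DeletingMachines > 0 {
		return false, "a machine has a deletion timestamp"
	}
	// Quorum check (assumed here to apply to every initialized cluster):
	// after removing the target, healthy members must still form a majority.
	if s.Initialized {
		target := s.EtcdMembers - 1
		quorum := target/2 + 1
		if target-s.OtherUnhealthy < quorum {
			return false, "removing the member would break etcd quorum"
		}
	}
	if p.MaxRetries > 0 && retries >= p.MaxRetries {
		return false, "maximum number of retries reached"
	}
	if p.RetryDelay > 0 && retries > 0 && sinceLast < p.RetryDelay {
		return false, "retry delay has not elapsed"
	}
	return true, ""
}

func main() {
	// First control plane machine fails before the cluster is initialized.
	ok, reason := CanRemediate(State{Machines: 1}, RetryPolicy{}, 0, 0)
	fmt.Println(ok, reason)
}
```

Returning a reason string alongside the boolean is a design choice of the sketch; it makes it easy to surface why remediation was skipped, for example in a condition message.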
