Commit 34dcc4e

Merge pull request #55175 from jeana-redhat/OSDOCS-4789-machine-deletion-hooks
[OSDOCS-4789]: Machine deletion hooks
2 parents 6f3b55c + 7962d98 commit 34dcc4e

15 files changed

+302
-26
lines changed

backup_and_restore/control_plane_backup_and_restore/replacing-unhealthy-etcd-member.adoc

Lines changed: 7 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -39,11 +39,17 @@ Depending on the state of your unhealthy etcd member, use one of the following p
3939

4040
// Replacing an unhealthy etcd member whose machine is not running or whose node is not ready
4141
include::modules/restore-replace-stopped-etcd-member.adoc[leveloffset=+2]
42+
[role="_additional-resources"]
43+
.Additional resources
44+
* xref:../../machine_management/control_plane_machine_management/cpmso-troubleshooting.adoc#cpmso-ts-etcd-degraded_cpmso-troubleshooting[Recovering a degraded etcd Operator]
4245

4346
// Replacing an unhealthy etcd member whose etcd pod is crashlooping
4447
include::modules/restore-replace-crashlooping-etcd-member.adoc[leveloffset=+2]
4548

4649
// Replacing an unhealthy baremetal stopped etcd member
4750
include::modules/restore-replace-stopped-baremetal-etcd-member.adoc[leveloffset=+2]
4851

49-
52+
[role="_additional-resources"]
53+
[id="additional-resources_replacing-unhealthy-etcd-member"]
54+
== Additional resources
55+
* xref:../../machine_management/deleting-machine.adoc#machine-lifecycle-hook-deletion-etcd_deleting-machine[Quorum protection with machine lifecycle hooks]

backup_and_restore/index.adoc

Lines changed: 4 additions & 0 deletions
@@ -29,6 +29,10 @@ You might run into several situations where {product-title} does not work as ex
2929

3030
You can always recover from a disaster situation by xref:../backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.adoc#dr-restoring-cluster-state[restoring your cluster to its previous state] using the saved etcd snapshots.
3131

32+
[role="_additional-resources"]
33+
.Additional resources
34+
* xref:../machine_management/deleting-machine.adoc#machine-lifecycle-hook-deletion-etcd_deleting-machine[Quorum protection with machine lifecycle hooks]
35+
3236
[id="application-backup-restore-operations-overview"]
3337
== Application backup and restore operations
3438

machine_management/control_plane_machine_management/cpmso-resiliency.adoc

Lines changed: 6 additions & 4 deletions
@@ -15,10 +15,8 @@ When possible, the control plane machine set spreads the control plane machines
1515

1616
//Failure domain platform support and configuration
1717
include::modules/cpmso-failure-domains-provider.adoc[leveloffset=+2]
18-
1918
[role="_additional-resources"]
2019
.Additional resources
21-
2220
* xref:../../machine_management/control_plane_machine_management/cpmso-configuration.adoc#cpmso-yaml-failure-domain-aws_cpmso-configuration[Sample Amazon Web Services failure domain configuration]
2321

2422
* xref:../../machine_management/control_plane_machine_management/cpmso-configuration.adoc#cpmso-yaml-failure-domain-gcp_cpmso-configuration[Sample Google Cloud Platform failure domain configuration]
@@ -30,8 +28,12 @@ include::modules/cpmso-failure-domains-balancing.adoc[leveloffset=+2]
3028

3129
//Recovery of the failed control plane machines
3230
include::modules/cpmso-control-plane-recovery.adoc[leveloffset=+1]
33-
3431
[role="_additional-resources"]
3532
.Additional resources
33+
* xref:../../machine_management/deploying-machine-health-checks.adoc#deploying-machine-health-checks[Deploying machine health checks]
3634

37-
* xref:../../machine_management/deploying-machine-health-checks.adoc#deploying-machine-health-checks[Deploying machine health checks]
35+
//Quorum protection with machine lifecycle hooks
36+
include::modules/machine-lifecycle-hook-deletion-etcd.adoc[leveloffset=+1]
37+
[role="_additional-resources"]
38+
.Additional resources
39+
* xref:../../machine_management/deleting-machine.adoc#machine-lifecycle-hook-deletion_deleting-machine[Lifecycle hooks for the machine deletion phase]

machine_management/deleting-machine.adoc

Lines changed: 14 additions & 0 deletions
@@ -8,8 +8,22 @@ toc::[]
88

99
You can delete a specific machine.
1010

11+
//Deleting a specific machine
1112
include::modules/machine-delete.adoc[leveloffset=+1]
1213

14+
//Lifecycle hooks for the machine deletion phase
15+
include::modules/machine-lifecycle-hook-deletion.adoc[leveloffset=+1]
16+
17+
//Deletion lifecycle hook configuration
18+
include::modules/machine-lifecycle-hook-deletion-format.adoc[leveloffset=+2]
19+
20+
//Machine deletion lifecycle hook examples for Operator developers
21+
include::modules/machine-lifecycle-hook-deletion-uses.adoc[leveloffset=+2]
22+
23+
//Quorum protection with machine lifecycle hooks
24+
include::modules/machine-lifecycle-hook-deletion-etcd.adoc[leveloffset=+2]
25+
26+
1327
[role="_additional-resources"]
1428
[id="additional-resources_unhealthy-etcd-member"]
1529
== Additional resources

machine_management/manually-scaling-machineset.adoc

Lines changed: 5 additions & 0 deletions
@@ -22,3 +22,8 @@ include::modules/machine-user-provisioned-limitations.adoc[leveloffset=+1]
2222
include::modules/machineset-manually-scaling.adoc[leveloffset=+1]
2323

2424
include::modules/machineset-delete-policy.adoc[leveloffset=+1]
25+
26+
[role="_additional-resources"]
27+
[id="additional-resources_manually-scaling-machineset"]
28+
== Additional resources
29+
* xref:../machine_management/deleting-machine.adoc#machine-lifecycle-hook-deletion_deleting-machine[Lifecycle hooks for the machine deletion phase]

machine_management/modifying-machineset.adoc

Lines changed: 4 additions & 0 deletions
@@ -17,6 +17,10 @@ If you need to scale a compute machine set without making other changes, see xre
1717

1818
include::modules/machineset-modifying.adoc[leveloffset=+1]
1919

20+
[role="_additional-resources"]
21+
.Additional resources
22+
* xref:../machine_management/deleting-machine.adoc#machine-lifecycle-hook-deletion_deleting-machine[Lifecycle hooks for the machine deletion phase]
23+
2024
[id="migrating-nodes-to-a-different-storage-domain-rhv_{context}"]
2125
== Migrating nodes to a different storage domain on {rh-virtualization}
2226

modules/machine-delete.adoc

Lines changed: 12 additions & 7 deletions
@@ -7,12 +7,12 @@
77
[id="machine-delete_{context}"]
88
= Deleting a specific machine
99

10-
You can delete a specific machine.
10+
You can delete a specific machine.
1111

1212
[IMPORTANT]
1313
====
1414
Do not delete a control plane machine unless your cluster uses a control plane machine set.
15-
====
15+
====
1616

1717
.Prerequisites
1818

@@ -22,24 +22,29 @@ Do not delete a control plane machine unless your cluster uses a control plane m
2222
2323
.Procedure
2424

25-
. View the machines that are in the cluster and identify the one to delete:
25+
. View the machines that are in the cluster by running the following command:
2626
+
2727
[source,terminal]
2828
----
2929
$ oc get machine -n openshift-machine-api
3030
----
3131
+
32-
The command output contains a list of machines in the `<clusterid>-worker-<cloud_region>` format.
32+
The command output contains a list of machines in the `<clusterid>-<role>-<cloud_region>` format.
33+
34+
. Identify the machine that you want to delete.
3335

34-
. Delete the machine:
36+
. Delete the machine by running the following command:
3537
+
3638
[source,terminal]
3739
----
3840
$ oc delete machine <machine> -n openshift-machine-api
3941
----
40-
4142
+
4243
[IMPORTANT]
4344
====
44-
By default, the machine controller tries to drain the node that is backed by the machine until it succeeds. In some situations, such as with a misconfigured pod disruption budget, the drain operation might not be able to succeed in preventing the machine from being deleted. You can skip draining the node by annotating "machine.openshift.io/exclude-node-draining" in a specific machine. If the machine being deleted belongs to a compute machine set, a new machine is immediately created to satisfy the specified number of replicas.
45+
By default, the machine controller tries to drain the node that is backed by the machine until it succeeds. In some situations, such as with a misconfigured pod disruption budget, the drain operation might not be able to succeed. If the drain operation fails, the machine controller cannot proceed with removing the machine.
46+
47+
You can skip draining the node by annotating `machine.openshift.io/exclude-node-draining` in a specific machine.
4548
====
49+
+
50+
If the machine that you delete belongs to a machine set, a new machine is immediately created to satisfy the specified number of replicas.

modules/machine-lifecycle-hook-deletion-etcd.adoc

Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * machine_management/deleting-machine.adoc
4+
5+
:_content-type: CONCEPT
6+
[id="machine-lifecycle-hook-deletion-etcd_{context}"]
7+
= Quorum protection with machine lifecycle hooks
8+
9+
For {product-title} clusters that use the Machine API Operator, the etcd Operator uses lifecycle hooks for the machine deletion phase to implement a quorum protection mechanism.
10+
11+
By using a `preDrain` lifecycle hook, the etcd Operator can control when the pods on a control plane machine are drained and removed. To protect etcd quorum, the etcd Operator prevents the removal of an etcd member until it migrates that member onto a new node within the cluster.
12+
13+
This mechanism allows the etcd Operator precise control over the members of the etcd quorum and allows the Machine API Operator to safely create and remove control plane machines without specific operational knowledge of the etcd cluster.
14+
15+
[id="machine-lifecycle-hook-deletion-etcd-order_{context}"]
16+
== Control plane deletion with quorum protection processing order
17+
18+
When a control plane machine is replaced on a cluster that uses a control plane machine set, the cluster temporarily has four control plane machines. When the fourth control plane node joins the cluster, the etcd Operator starts a new etcd member on the replacement node. When the etcd Operator observes that the old control plane machine is marked for deletion, it stops the etcd member on the old node and promotes the replacement etcd member to join the quorum of the cluster.
19+
20+
The control plane machine `Deleting` phase proceeds in the following order:
21+
22+
. A control plane machine is slated for deletion.
23+
. The control plane machine enters the `Deleting` phase.
24+
. To satisfy the `preDrain` lifecycle hook, the etcd Operator takes the following actions:
25+
+
26+
--
27+
.. The etcd Operator waits until a fourth control plane machine is added to the cluster as an etcd member. This new etcd member has a state of `Running` but not `ready` until it receives the full database update from the etcd leader.
28+
.. When the new etcd member receives the full database update, the etcd Operator promotes the new etcd member to a voting member and removes the old etcd member from the cluster.
29+
--
30+
After this transition is complete, it is safe for the old etcd pod and its data to be removed, so the `preDrain` lifecycle hook is removed.
31+
. The control plane machine status condition `Drainable` is set to `True`.
32+
. The machine controller attempts to drain the node that is backed by the control plane machine.
33+
** If draining fails, `Drained` is set to `False` and the machine controller attempts to drain the node again.
34+
** If draining succeeds, `Drained` is set to `True`.
35+
. The control plane machine status condition `Drained` is set to `True`.
36+
. If no other Operators have added a `preTerminate` lifecycle hook, the control plane machine status condition `Terminable` is set to `True`.
37+
. The machine controller removes the instance from the infrastructure provider.
38+
. The machine controller deletes the `Node` object.
39+
40+
.YAML snippet demonstrating the etcd quorum protection `preDrain` lifecycle hook
41+
[source,yaml]
42+
----
43+
apiVersion: machine.openshift.io/v1beta1
44+
kind: Machine
45+
metadata:
46+
  ...
47+
spec:
48+
  lifecycleHooks:
49+
    preDrain:
50+
    - name: EtcdQuorumOperator <1>
51+
      owner: clusteroperator/etcd <2>
52+
  ...
53+
----
54+
<1> The name of the `preDrain` lifecycle hook.
55+
<2> The hook-implementing controller that manages the `preDrain` lifecycle hook.
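
The processing order above can be sketched as a small decision function. This is an illustrative model only, not the machine controller's implementation: the `DeletingMachine` class and the `next_step` function are hypothetical names, and the model assumes that a stage is blocked for as long as any hook of that stage remains on the machine.

```python
from dataclasses import dataclass, field


@dataclass
class DeletingMachine:
    """Toy model of a machine in the Deleting phase; not the real Machine API type."""
    pre_drain: list = field(default_factory=list)       # remaining preDrain hooks
    pre_terminate: list = field(default_factory=list)   # remaining preTerminate hooks
    drained: bool = False                                # set after a successful drain


def next_step(machine: DeletingMachine) -> str:
    """Return what the deletion flow is waiting on, following the order above."""
    if machine.pre_drain:
        return "wait for preDrain hooks"      # Drainable is not yet True
    if not machine.drained:
        return "drain node"                   # retried until draining succeeds
    if machine.pre_terminate:
        return "wait for preTerminate hooks"  # Terminable is not yet True
    return "remove instance and Node object"


etcd_hook = {"name": "EtcdQuorumOperator", "owner": "clusteroperator/etcd"}
machine = DeletingMachine(pre_drain=[etcd_hook])

steps = [next_step(machine)]
machine.pre_drain.remove(etcd_hook)   # etcd Operator removes its hook after promotion
steps.append(next_step(machine))
machine.drained = True                # drain succeeded
steps.append(next_step(machine))
print(steps)
```

In this sketch, the machine stays blocked until the `EtcdQuorumOperator` hook is removed, which mirrors how the etcd Operator delays draining until the replacement member is promoted.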

modules/machine-lifecycle-hook-deletion-format.adoc

Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * machine_management/deleting-machine.adoc
4+
5+
:_content-type: REFERENCE
6+
[id="machine-lifecycle-hook-deletion-format_{context}"]
7+
= Deletion lifecycle hook configuration
8+
9+
The following YAML snippets demonstrate the format and placement of deletion lifecycle hook configurations within a machine set:
10+
11+
.YAML snippet demonstrating a `preDrain` lifecycle hook
12+
[source,yaml]
13+
----
14+
apiVersion: machine.openshift.io/v1beta1
15+
kind: Machine
16+
metadata:
17+
  ...
18+
spec:
19+
  lifecycleHooks:
20+
    preDrain:
21+
    - name: <hook-name> <1>
22+
      owner: <hook-owner> <2>
23+
  ...
24+
----
25+
<1> The name of the `preDrain` lifecycle hook.
26+
<2> The hook-implementing controller that manages the `preDrain` lifecycle hook.
27+
28+
.YAML snippet demonstrating a `preTerminate` lifecycle hook
29+
[source,yaml]
30+
----
31+
apiVersion: machine.openshift.io/v1beta1
32+
kind: Machine
33+
metadata:
34+
  ...
35+
spec:
36+
  lifecycleHooks:
37+
    preTerminate:
38+
    - name: <hook-name> <1>
39+
      owner: <hook-owner> <2>
40+
  ...
41+
----
42+
<1> The name of the `preTerminate` lifecycle hook.
43+
<2> The hook-implementing controller that manages the `preTerminate` lifecycle hook.
44+
45+
[discrete]
46+
[id="machine-lifecycle-hook-deletion-example_{context}"]
47+
== Example lifecycle hook configuration
48+
49+
The following example demonstrates the implementation of multiple fictional lifecycle hooks that interrupt the machine deletion process:
50+
51+
.Example configuration for lifecycle hooks
52+
[source,yaml]
53+
----
54+
apiVersion: machine.openshift.io/v1beta1
55+
kind: Machine
56+
metadata:
57+
  ...
58+
spec:
59+
  lifecycleHooks:
60+
    preDrain: <1>
61+
    - name: MigrateImportantApp
62+
      owner: my-app-migration-controller
63+
    preTerminate: <2>
64+
    - name: BackupFileSystem
65+
      owner: my-backup-controller
66+
    - name: CloudProviderSpecialCase
67+
      owner: my-custom-storage-detach-controller <3>
68+
    - name: WaitForStorageDetach
69+
      owner: my-custom-storage-detach-controller
70+
  ...
71+
----
72+
<1> A `preDrain` lifecycle hook stanza that contains a single lifecycle hook.
73+
<2> A `preTerminate` lifecycle hook stanza that contains three lifecycle hooks.
74+
<3> A hook-implementing controller that manages two `preTerminate` lifecycle hooks: `CloudProviderSpecialCase` and `WaitForStorageDetach`.
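
To make the relationship between hooks and their owning controllers easier to see, the example configuration can be expressed as plain data and grouped by owner. This is a minimal sketch: the `lifecycle_hooks` dictionary mirrors the fictional hooks from the example, and `hooks_by_owner` is a hypothetical helper, not part of any OpenShift tooling.

```python
from collections import defaultdict

# The fictional lifecycleHooks stanza from the example, as a Python dictionary.
lifecycle_hooks = {
    "preDrain": [
        {"name": "MigrateImportantApp", "owner": "my-app-migration-controller"},
    ],
    "preTerminate": [
        {"name": "BackupFileSystem", "owner": "my-backup-controller"},
        {"name": "CloudProviderSpecialCase", "owner": "my-custom-storage-detach-controller"},
        {"name": "WaitForStorageDetach", "owner": "my-custom-storage-detach-controller"},
    ],
}


def hooks_by_owner(hooks):
    """Group (stage, hook name) pairs under the controller that owns them."""
    grouped = defaultdict(list)
    for stage, entries in hooks.items():
        for entry in entries:
            grouped[entry["owner"]].append((stage, entry["name"]))
    return dict(grouped)


owners = hooks_by_owner(lifecycle_hooks)
for owner, owned in owners.items():
    print(owner, owned)
```

Grouping this way shows that `my-custom-storage-detach-controller` is responsible for removing two separate `preTerminate` hooks before the machine can be terminated.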
