
Commit 5b70eda

Merge pull request #65094 from nalhadef/OSBUGS-9229
OCPBUGS#9229: Gracefully shutting down (3rd attempt)
2 parents: f4d1028 + 8bebf64 · commit 5b70eda

File tree

1 file changed: modules/graceful-shutdown.adoc (+27 -46 lines)
@@ -20,83 +20,64 @@ You can shut down a cluster until a year from the installation date and expect i
 
 .Procedure
 
-. If you plan to shut down the cluster for an extended period of time, determine the date that cluster certificates expire.
-+
-You must restart the cluster prior to the date that certificates expire. As the cluster restarts, the process might require you to manually approve the pending certificate signing requests (CSRs) to recover kubelet certificates.
-
-.. Check the expiration date for the `kube-apiserver-to-kubelet-signer` CA certificate:
+. If you are shutting the cluster down for an extended period, determine the date on which certificates expire and run the following command:
 +
 [source,terminal]
 ----
-$ oc -n openshift-kube-apiserver-operator get secret kube-apiserver-to-kubelet-signer -o jsonpath='{.metadata.annotations.auth\.openshift\.io/certificate-not-after}{"\n"}'
+$ oc -n openshift-kube-apiserver-operator get secret kube-apiserver-to-kubelet-signer -o jsonpath='{.metadata.annotations.auth\.openshift\.io/certificate-not-after}'
 ----
 +
 .Example output
 [source,terminal]
 ----
-2023-08-05T14:37:50Z
-----
-
-.. Check the expiration date for the kubelet certificates:
-
-... Start a debug session for a control plane node by running the following command:
-+
-[source,terminal]
-----
-$ oc debug node/<node_name>
+2022-08-05T14:37:50Zuser@user:~ $ <1>
 ----
+<1> To ensure that the cluster can restart gracefully, plan to restart it on or before the specified date. As the cluster restarts, the process might require you to manually approve the pending certificate signing requests (CSRs) to recover kubelet certificates.
 
-... Change your root directory to `/host` by running the following command:
+. Mark all the nodes in the cluster as unschedulable. You can do this from your cloud provider's web console, or by running the following loop:
 +
 [source,terminal]
 ----
-sh-4.4# chroot /host
-----
-
-... Check the kubelet client certificate expiration date by running the following command:
-+
-[source,terminal]
-----
-sh-5.1# openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -enddate
+$ for node in $(oc get nodes -o jsonpath='{.items[*].metadata.name}'); do echo ${node} ; oc adm cordon ${node} ; done
 ----
 +
 .Example output
 [source,terminal]
 ----
-notAfter=Jun 6 10:50:07 2023 GMT
-----
-
-... Check the kubelet server certificate expiration date by running the following command:
-+
+ci-ln-mgdnf4b-72292-n547t-master-0
+node/ci-ln-mgdnf4b-72292-n547t-master-0 cordoned
+ci-ln-mgdnf4b-72292-n547t-master-1
+node/ci-ln-mgdnf4b-72292-n547t-master-1 cordoned
+ci-ln-mgdnf4b-72292-n547t-master-2
+node/ci-ln-mgdnf4b-72292-n547t-master-2 cordoned
+ci-ln-mgdnf4b-72292-n547t-worker-a-s7ntl
+node/ci-ln-mgdnf4b-72292-n547t-worker-a-s7ntl cordoned
+ci-ln-mgdnf4b-72292-n547t-worker-b-cmc9k
+node/ci-ln-mgdnf4b-72292-n547t-worker-b-cmc9k cordoned
+ci-ln-mgdnf4b-72292-n547t-worker-c-vcmtn
+node/ci-ln-mgdnf4b-72292-n547t-worker-c-vcmtn cordoned
+----
+
+. Evacuate the pods using the following method:
 [source,terminal]
-----
-sh-5.1# openssl x509 -in /var/lib/kubelet/pki/kubelet-server-current.pem -noout -enddate
-----
 +
-.Example output
-[source,terminal]
 ----
-notAfter=Jun 6 10:50:07 2023 GMT
+$ for node in $(oc get nodes -l node-role.kubernetes.io/worker -o jsonpath='{.items[*].metadata.name}'); do echo ${node} ; oc adm drain ${node} --delete-emptydir-data --ignore-daemonsets=true --timeout=15s ; done
 ----
 
-... Exit the debug session.
-
-... Repeat these steps to check certificate expiration dates on all control plane nodes. To ensure that the cluster can restart gracefully, plan to restart it before the earliest certificate expiration date.
-
-. Shut down all of the nodes in the cluster. You can do this from your cloud provider's web console, or run the following loop:
+. Shut down all of the nodes in the cluster. You can do this from your cloud provider's web console, or by running the following loop:
 +
 [source,terminal]
 ----
-$ for node in $(oc get nodes -o jsonpath='{.items[*].metadata.name}'); do oc debug node/${node} -- chroot /host shutdown -h 1; done <1>
+$ for node in $(oc get nodes -o jsonpath='{.items[*].metadata.name}'); do oc debug node/${node} -- chroot /host shutdown -h 1; done
 ----
-<1> `-h 1` indicates how long, in minutes, this process lasts before the control-plane nodes are shut down. For large-scale clusters with 10 nodes or more, set to 10 minutes or longer to make sure all the compute nodes have time to shut down first.
 +
 .Example output
+[source,terminal]
 ----
 Starting pod/ip-10-0-130-169us-east-2computeinternal-debug ...
 To use host binaries, run `chroot /host`
 Shutdown scheduled for Mon 2021-09-13 09:36:17 UTC, use 'shutdown -c' to cancel.
-
 Removing debug pod ...
 Starting pod/ip-10-0-150-116us-east-2computeinternal-debug ...
 To use host binaries, run `chroot /host`
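In the revised first step above, the command prints the raw `certificate-not-after` timestamp without a trailing newline, which is why the example output runs straight into the shell prompt. A minimal sketch of how that timestamp could be compared against a planned restart date, assuming GNU `date` and an active `oc` login; the 30-day window is only an illustrative value, not something the documentation specifies:

[source,terminal]
----
# Illustrative sketch only (not part of the commit): warn if the signer
# certificate expires within 30 days. Assumes GNU date and a logged-in oc session.
expiry=$(oc -n openshift-kube-apiserver-operator get secret kube-apiserver-to-kubelet-signer \
  -o jsonpath='{.metadata.annotations.auth\.openshift\.io/certificate-not-after}')
echo "kube-apiserver-to-kubelet-signer expires: ${expiry}"
if [ "$(date -d "${expiry}" +%s)" -lt "$(date -d '+30 days' +%s)" ]; then
  echo "Certificate expires within 30 days; restart the cluster before ${expiry}."
fi
----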
@@ -108,6 +89,7 @@ Shutting down the nodes using one of these methods allows pods to terminate grac
 [NOTE]
 ====
 Adjust the shut down time to be longer for large-scale clusters:
+
 [source,terminal]
 ----
 $ for node in $(oc get nodes -o jsonpath='{.items[*].metadata.name}'); do oc debug node/${node} -- chroot /host shutdown -h 10; done
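As a side note on the delay value shown in this hunk, the same loop can take the delay from a shell variable so that larger clusters pass a longer value; `SHUTDOWN_DELAY_MINUTES` is a hypothetical name used only for this sketch, not an option defined by the documentation:

[source,terminal]
----
# Illustrative sketch only: the shutdown loop with the delay as a variable.
# SHUTDOWN_DELAY_MINUTES is a hypothetical name, defaulting to 10 minutes here.
delay="${SHUTDOWN_DELAY_MINUTES:-10}"
for node in $(oc get nodes -o jsonpath='{.items[*].metadata.name}'); do
  oc debug node/${node} -- chroot /host shutdown -h "${delay}"
done
----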
@@ -117,7 +99,6 @@ $ for node in $(oc get nodes -o jsonpath='{.items[*].metadata.name}'); do oc deb
 [NOTE]
 ====
 It is not necessary to drain control plane nodes of the standard pods that ship with {product-title} prior to shutdown.
-
 Cluster administrators are responsible for ensuring a clean restart of their own workloads after the cluster is restarted. If you drained control plane nodes prior to shutdown because of custom workloads, you must mark the control plane nodes as schedulable before the cluster will be functional again after restart.
 ====
 
@@ -126,4 +107,4 @@ Cluster administrators are responsible for ensuring a clean restart of their own
 [IMPORTANT]
 ====
 If you deployed your cluster on a cloud-provider platform, do not shut down, suspend, or delete the associated cloud resources. If you delete the cloud resources of a suspended virtual machine, {product-title} might not restore successfully.
-====
+====
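Taken together, the revised procedure cordons every node, drains the worker nodes, and then schedules the shutdown. A minimal sketch that chains those loops into one script, assuming an active `oc` login with cluster-admin privileges; this is an illustration of the diff above, not content from the repository:

[source,terminal]
----
#!/bin/bash
# Illustrative sketch only: the revised procedure's three loops in sequence.
# Assumes an active oc login with cluster-admin privileges.

# 1. Mark every node as unschedulable.
for node in $(oc get nodes -o jsonpath='{.items[*].metadata.name}'); do
  echo "${node}"
  oc adm cordon "${node}"
done

# 2. Evacuate pods from the worker nodes (best effort, 15-second timeout per node).
for node in $(oc get nodes -l node-role.kubernetes.io/worker -o jsonpath='{.items[*].metadata.name}'); do
  echo "${node}"
  oc adm drain "${node}" --delete-emptydir-data --ignore-daemonsets=true --timeout=15s
done

# 3. Shut down every node after a one-minute delay; use a longer delay
#    (for example 10 minutes) on large clusters so compute nodes finish first.
for node in $(oc get nodes -o jsonpath='{.items[*].metadata.name}'); do
  oc debug node/"${node}" -- chroot /host shutdown -h 1
done
----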
