Commit baf8896

committed
Address comments
1 parent 7057665 commit baf8896

2 files changed

+80
-78
lines changed

content/en/blog/_posts/2022-12-06-non-graceful-node-shutdown-to-beta.md

Lines changed: 0 additions & 78 deletions
This file was deleted.
Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
1+
---
2+
layout: blog
3+
title: "Kubernetes 1.26: Non-Graceful Node Shutdown Moves to Beta"
4+
date: 2022-12-16T10:00:00-08:00
5+
slug: kubernetes-1-26-non-graceful-node-shutdown-beta
6+
---
7+
8+
**Authors:** Xing Yang and Ashutosh Kumar (VMware)
9+
10+
Kubernetes v1.24 introduced [alpha support](https://kubernetes.io/blog/2022/05/20/kubernetes-1-24-non-graceful-node-shutdown-alpha/) for [Non-Graceful Node Shutdown](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/2268-non-graceful-shutdown).
11+
In Kubernetes v1.26, this feature moves to beta. It allows stateful workloads to fail over to a different node after the original node is shut down or is in a non-recoverable state, such as a hardware failure or a broken OS.
12+
13+
## What is a Node Shutdown in Kubernetes
14+
15+
In a Kubernetes cluster, a node can shut down either in a planned way or unexpectedly. You may need to reboot a node to apply a security patch or a kernel upgrade, or the node may shut down because its VM instance was preempted. A node may also shut down due to a hardware failure or a software problem.
16+
17+
To trigger a node shutdown, you could run a shutdown or poweroff command or physically press a button to power off a machine.
18+
19+
A node shutdown could lead to workload failure if the node is not drained before the shutdown.
20+
21+
The following sections describe what a graceful node shutdown is and what a non-graceful node shutdown is.
22+
23+
## What is a Graceful Node Shutdown
24+
25+
The [Graceful Node Shutdown](https://kubernetes.io/blog/2021/04/21/graceful-node-shutdown-beta/) feature allows the kubelet to detect a node shutdown event, properly terminate the pods, and release resources before the actual shutdown. [Critical pods](/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/#marking-pod-as-critical) are terminated after all the regular pods are terminated, so that the essential functions of an application can keep working for as long as possible.
26+
27+
## What is a Non-Graceful Node Shutdown
28+
29+
A node shutdown can be graceful only if the kubelet's Node Shutdown Manager can detect the node shutdown action. However, there are cases where the Node Shutdown Manager may not detect it. This can happen because the shutdown command does not trigger the [Inhibitor Locks](https://www.freedesktop.org/wiki/Software/systemd/inhibit) mechanism used by the kubelet, or because of a user error, for example when `ShutdownGracePeriod` and `ShutdownGracePeriodCriticalPods` are not configured correctly.
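
One way to catch the configuration error described above is to check that both settings appear in the kubelet configuration file. The following is only a minimal sketch under stated assumptions: the helper name `check_graceful_shutdown_config` is hypothetical, the location of the configuration file depends on how your kubelet is started, and the check only confirms that the `shutdownGracePeriod` and `shutdownGracePeriodCriticalPods` keys (the YAML spellings of the two fields) are present at the top level. It does not validate their values.

```shell
# Hypothetical helper: report whether both graceful-shutdown keys are present
# in the kubelet configuration file passed as the first argument.
check_graceful_shutdown_config() {
  config_file="$1"
  for field in shutdownGracePeriod shutdownGracePeriodCriticalPods; do
    # The trailing colon keeps shutdownGracePeriod from matching the longer key.
    if ! grep -Eq "^${field}:" "$config_file"; then
      echo "missing: ${field}"
      return 1
    fi
  done
  echo "ok"
}
```

You would call it as `check_graceful_shutdown_config <path-to-kubelet-config>`, substituting the path used in your environment.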
30+
31+
When a node is shut down but the shutdown is not detected by the kubelet's Node Shutdown Manager, it becomes a non-graceful node shutdown. Non-graceful node shutdown is a problem for stateful apps. If a node containing a pod that is part of a StatefulSet is shut down in a non-graceful way, the pod will be stuck in `Terminating` status and cannot move to a new running node. Similarly, pods that ReplicaSets create as part of a Deployment will be stuck in `Terminating` status on the shutdown node forever. However, they will come up on another running node and be stuck in the `ContainerCreating` status on the new node. As a result, the application will not function properly. If the original shutdown node comes back up, the old pod will be deleted by the kubelet, and the new pod will be created successfully on a different running node.
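
To see which pods are affected, you can list the pods bound to the shutdown node and look at their status. This is a small sketch rather than a prescribed procedure; the function name `list_pods_on_node` is hypothetical, and it assumes `kubectl` is configured for the cluster.

```shell
# Hypothetical helper: list every pod scheduled to the given node, across all
# namespaces, so that pods stuck in Terminating are easy to spot.
list_pods_on_node() {
  node="$1"
  kubectl get pods --all-namespaces \
    --field-selector "spec.nodeName=${node}" -o wide
}
```

Running `list_pods_on_node <node-name>` prints the pods together with their `STATUS` column.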
32+
33+
## What’s new in Beta
34+
35+
With the promotion of the Non-Graceful Node Shutdown feature to beta, the feature gate `NodeOutOfServiceVolumeDetach` is enabled by default on `kube-controller-manager` instead of being opt-in.
36+
37+
Two new metrics are also added to the Pod GC Controller: `force_delete_pods_total`, the number of pods forcefully deleted since the Pod GC Controller started, and `force_delete_pod_errors_total`, the number of errors encountered when forcefully deleting pods since the Pod GC Controller started.
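
If you already have a dump of the controller's metrics in Prometheus text format, a filter like the one below can pull out these two counters. It is a sketch only: the function name `filter_force_delete_metrics` is hypothetical, how you obtain the metrics text is specific to your environment and not shown here, and the pattern matches the metric names as substrings in case they are exposed with a prefix.

```shell
# Hypothetical helper: read Prometheus-format metrics on stdin and print only
# the sample lines for the two force-delete counters (comment lines dropped).
filter_force_delete_metrics() {
  grep -v '^#' | grep -E 'force_delete_(pods|pod_errors)_total'
}
```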
38+
39+
## How does it work
40+
41+
In the case of a node shutdown, if a graceful shutdown is not working or the node is in a non-recoverable state due to a hardware failure or a broken OS, you can manually add an `out-of-service` taint on the Node. For example, this can be `node.kubernetes.io/out-of-service=nodeshutdown:NoExecute` or `node.kubernetes.io/out-of-service=nodeshutdown:NoSchedule`. This triggers pods on the node to be forcefully deleted if there are no matching tolerations on the pods. Persistent volumes attached to the shutdown node are detached, and new pods are created successfully on a different running node.
42+
43+
```
44+
kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
45+
```
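
After applying the taint, you may want to confirm that it is present on the node and watch the volume attachments change. The helpers below are an illustrative sketch; the function names are hypothetical, and the output columns are just one way to display `VolumeAttachment` objects.

```shell
# Hypothetical helper: print the taints currently set on a node.
show_node_taints() {
  node="$1"
  kubectl get node "$node" -o jsonpath='{.spec.taints}'
}

# Hypothetical helper: list VolumeAttachment objects with the node they
# reference and whether they are currently attached.
list_volume_attachments() {
  kubectl get volumeattachments \
    -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,ATTACHED:.status.attached
}
```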
46+
47+
Note: Before applying the out-of-service taint, you must verify that the node is already in a shutdown or powered-off state (not in the middle of restarting), either because the user intentionally shut it down or because the node is down due to a hardware failure, an OS issue, or a similar problem.
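
As a first check, you can read the node's `Ready` condition from the API server. This is a sketch with a hypothetical function name, and it is not sufficient on its own: a status other than `True` only tells you the control plane has lost contact with the kubelet, so you still need to confirm out of band (for example, through your infrastructure provider or hardware console) that the machine is really powered off.

```shell
# Hypothetical helper: print the status of the node's Ready condition
# (True, False, or Unknown) as reported by the API server.
node_ready_status() {
  node="$1"
  kubectl get node "$node" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
}
```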
48+
49+
Once all the workload pods that were linked to the out-of-service node have moved to a new running node, and the shutdown node has been recovered, you should remove the taint from the affected node.
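
Removing a taint uses the same `kubectl taint` command with a trailing `-` after the taint. The wrapper below is a minimal sketch; the function name `remove_out_of_service_taint` is hypothetical, and it assumes the taint was added with the `nodeshutdown` value and `NoExecute` effect shown earlier.

```shell
# Hypothetical helper: remove the out-of-service taint from a recovered node.
# The trailing "-" tells kubectl to delete the taint instead of adding it.
remove_out_of_service_taint() {
  node="$1"
  kubectl taint nodes "$node" \
    node.kubernetes.io/out-of-service=nodeshutdown:NoExecute-
}
```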
50+
51+
## What’s next?
52+
53+
Depending on feedback and adoption, the Kubernetes team plans to push the Non-Graceful Node Shutdown implementation to GA in either 1.27 or 1.28.
54+
55+
This feature requires a user to manually add a taint to the node to trigger the failover of workloads and remove the taint after the node is recovered. In the future, we plan to find ways to automatically detect and fence nodes that are shut down or in a non-recoverable state and fail their workloads over to another node.
56+
57+
## How can I learn more?
58+
59+
Check out the additional [documentation on non-graceful node shutdown](https://kubernetes.io/docs/concepts/architecture/nodes/#non-graceful-node-shutdown).
60+
61+
## How to get involved?
62+
63+
We offer a huge thank you to all the contributors who helped with design, implementation, and review of this feature:
64+
65+
* Michelle Au ([msau42](https://github.com/msau42))
66+
* Derek Carr ([derekwaynecarr](https://github.com/derekwaynecarr))
67+
* Danielle Endocrimes ([endocrimes](https://github.com/endocrimes))
68+
* Tim Hockin ([thockin](https://github.com/thockin))
69+
* Ashutosh Kumar ([sonasingh46](https://github.com/sonasingh46))
70+
* Hemant Kumar ([gnufied](https://github.com/gnufied))
71+
* Yuiko Mouri ([YuikoTakada](https://github.com/YuikoTakada))
72+
* Mrunal Patel ([mrunalp](https://github.com/mrunalp))
73+
* David Porter ([bobbypage](https://github.com/bobbypage))
74+
* Yassine Tijani ([yastij](https://github.com/yastij))
75+
* Jing Xu ([jingxu97](https://github.com/jingxu97))
76+
* Xing Yang ([xing-yang](https://github.com/xing-yang))
77+
78+
Many people have helped review the design and implementation along the way. We want to thank everyone who has contributed to this effort, including the roughly 30 people who have reviewed the [KEP](https://github.com/kubernetes/enhancements/pull/1116) and implementation over the last couple of years.
79+
80+
This feature is a collaboration between SIG Storage and SIG Node. For those interested in getting involved with the design and development of any part of the Kubernetes Storage system, join the Kubernetes Storage Special Interest Group (SIG). For those interested in getting involved with the design and development of the components that support the controlled interactions between pods and host resources, join the Kubernetes Node SIG.
