
Commit 3005445

document non graceful node shutdown feature (#32406)

Signed-off-by: Ashutosh Kumar <[email protected]>
1 parent 15e978d commit 3005445

File tree

2 files changed: +65 −0 lines

  • content/en/docs

content/en/docs/concepts/architecture/nodes.md

Lines changed: 50 additions & 0 deletions
@@ -450,6 +450,56 @@ Reason: Terminated
Message: Pod was terminated in response to imminent node shutdown.
```

{{< /note >}}

## Non-graceful node shutdown {#non-graceful-node-shutdown}

{{< feature-state state="alpha" for_k8s_version="v1.24" >}}

A node shutdown action may not be detected by kubelet's Node Shutdown Manager,
either because the command does not trigger the inhibitor locks mechanism used by
the kubelet, or because of a user error, i.e., `ShutdownGracePeriod` and
`ShutdownGracePeriodCriticalPods` are not configured properly. Refer to the
[Graceful Node Shutdown](#graceful-node-shutdown) section above for more details.
464+
465+
When a node is shutdown but not detected by kubelet's Node Shutdown Manager, the pods
466+
that are part of a StatefulSet will be stuck in terminating status on
467+
the shutdown node and cannot move to a new running node. This is because kubelet on
468+
the shutdown node is not available to delete the pods so the StatefulSet cannot
469+
create a new pod with the same name. If there are volumes used by the pods, the
470+
VolumeAttachments will not be deleted from the original shutdown node so the volumes
471+
used by these pods cannot be attached to a new running node. As a result, the
472+
application running on the StatefulSet cannot function properly. If the original
473+
shutdown node comes up, the pods will be deleted by kubelet and new pods will be
474+
created on a different running node. If the original shutdown node does not come up,
475+
these pods will be stuck in terminating status on the shutdown node forever.
476+
To mitigate the above situation, a user can manually add the taint
`node.kubernetes.io/out-of-service` with either a `NoExecute` or `NoSchedule`
effect to a Node, marking it out-of-service.
If the `NodeOutOfServiceVolumeDetach`
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/) is
enabled on `kube-controller-manager`, and a Node is marked out-of-service with this
taint, the pods on the node will be forcefully deleted if there are no matching
tolerations on them, and volume detach operations for the pods terminating on the
node will happen immediately. This allows the Pods on the out-of-service node to
recover quickly on a different node.
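As a sketch of the steps above (not part of the original commit): assuming the
alpha feature gate is enabled on the control plane and `node-1` is a hypothetical
node that has already been powered off, the taint could be applied like this:

```shell
# Enable the alpha feature gate on kube-controller-manager (flag sketch;
# how the flag is passed depends on how your control plane is deployed):
#   kube-controller-manager --feature-gates=NodeOutOfServiceVolumeDetach=true

# Mark the already shut-down node out-of-service. "node-1" and the
# "nodeshutdown" taint value are placeholders for illustration:
kubectl taint nodes node-1 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
```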

During a non-graceful shutdown, Pods are terminated in two phases:

1. Force delete the Pods that do not have a matching `out-of-service` toleration.
2. Immediately perform a detach volume operation for such pods.
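The two phases above can be observed from the client side, for example (again
using the hypothetical node name `node-1`):

```shell
# Phase 1: pods without a matching toleration are force deleted and
# disappear from the shut-down node:
kubectl get pods --field-selector spec.nodeName=node-1

# Phase 2: the VolumeAttachment objects for volumes attached to that
# node are removed, freeing the volumes for reattachment elsewhere:
kubectl get volumeattachments
```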

{{< note >}}
- Before adding the taint `node.kubernetes.io/out-of-service`, verify that the node
  is already in a shutdown or power-off state (not in the middle of restarting).
- The user is required to manually remove the out-of-service taint after the pods
  have moved to a new node and the user has checked that the shut-down node has
  been recovered, since the user was the one who originally added the taint.
{{< /note >}}
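Removing the taint after recovery, as described in the note above, uses kubectl's
trailing-dash syntax (`node-1` is an illustrative placeholder):

```shell
# Remove all out-of-service taints from the node once its pods have moved
# and the node has been confirmed recovered:
kubectl taint nodes node-1 node.kubernetes.io/out-of-service-
```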

### Pod Priority based graceful node shutdown {#pod-priority-graceful-node-shutdown}

content/en/docs/reference/labels-annotations-taints/_index.md

Lines changed: 15 additions & 0 deletions
@@ -381,6 +381,21 @@ Example: `node.kubernetes.io/pid-pressure:NoSchedule`

The kubelet checks the D-value of the size of `/proc/sys/kernel/pid_max` and the PIDs consumed by Kubernetes on a node to get the number of available PIDs, referred to as the `pid.available` metric. The metric is then compared to the corresponding threshold that can be set on the kubelet to determine whether the node condition and taint should be added or removed.

384+
### node.kubernetes.io/out-of-service
385+
386+
Example: `node.kubernetes.io/out-of-service:NoExecute`
387+
388+
A user can manually add the taint to a Node marking it out-of-service. If the `NodeOutOfServiceVolumeDetach`
389+
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/) is enabled on
390+
`kube-controller-manager`, and a Node is marked out-of-service with this taint, the pods on the node will be forcefully deleted if there are no matching tolerations on it and volume detach operations for the pods terminating on the node will happen immediately. This allows the Pods on the out-of-service node to recover quickly on a different node.
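As a sketch of what a "matching toleration" looks like, the following hypothetical
pod (names and image chosen for illustration, applied via a heredoc) would not be
force deleted by the out-of-service handling:

```shell
# Hypothetical pod that tolerates the out-of-service taint and is
# therefore exempt from force deletion:
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: out-of-service-tolerant
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
  tolerations:
  - key: node.kubernetes.io/out-of-service
    operator: Exists
    effect: NoExecute
EOF
```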

{{< caution >}}
Refer to
[Non-graceful node shutdown](/docs/concepts/architecture/nodes/#non-graceful-node-shutdown)
for further details about when and how to use this taint.
{{< /caution >}}

### node.cloudprovider.kubernetes.io/uninitialized

Example: `node.cloudprovider.kubernetes.io/uninitialized:NoSchedule`

0 commit comments