Commit 94e62c4 (parent e211769)

docs: Pod priority based graceful node shutdown

Signed-off-by: David Porter <[email protected]>

1 file changed: content/en/docs/concepts/architecture/nodes.md (+91 −7 lines)

@@ -424,20 +424,104 @@ for gracefully terminating normal pods, and the last 10 seconds would be
 reserved for terminating [critical pods](/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/#marking-pod-as-critical).

 {{< note >}}
-When pods were evicted during the graceful node shutdown, they are marked as failed.
-Running `kubectl get pods` shows the status of the the evicted pods as `Shutdown`.
+When pods are evicted during the graceful node shutdown, they are marked as shutdown.
+Running `kubectl get pods` shows the status of the evicted pods as `Terminated`.
 And `kubectl describe pod` indicates that the pod was evicted because of node shutdown:

 ```
-Status: Failed
-Reason: Shutdown
-Message: Node is shutting, evicting pods
+Reason: Terminated
+Message: Pod was terminated in response to imminent node shutdown.
 ```

-Failed pod objects will be preserved until explicitly deleted or [cleaned up by the GC](/docs/concepts/workloads/pods/pod-lifecycle/#pod-garbage-collection).
-This is a change of behavior compared to abrupt node termination.
 {{< /note >}}

+### Pod Priority based graceful node shutdown {#pod-priority-graceful-node-shutdown}
+
+{{< feature-state state="alpha" for_k8s_version="v1.23" >}}
+
+To provide more flexibility during graceful node shutdown around the ordering
+of pods during shutdown, graceful node shutdown honors the PriorityClass for
+Pods, provided that you enabled this feature in your cluster. The feature
+allows cluster administrators to explicitly define the ordering of pods
+during graceful node shutdown based on [priority
+classes](/docs/concepts/scheduling-eviction/pod-priority-preemption/#priorityclass).
+
+The [Graceful Node Shutdown](#graceful-node-shutdown) feature, as described
+above, shuts down pods in two phases: non-critical pods, followed by critical
+pods. If additional flexibility is needed to explicitly define the ordering of
+pods during shutdown in a more granular way, pod priority based graceful
+shutdown can be used.
+
+When graceful node shutdown honors pod priorities, it becomes possible to do
+graceful node shutdown in multiple phases, each phase shutting down a
+particular priority class of pods. The kubelet can be configured with the exact
+phases and shutdown time per phase.
+
+Assuming the following custom pod [priority
+classes](/docs/concepts/scheduling-eviction/pod-priority-preemption/#priorityclass)
+in a cluster,
+
+|Pod priority class name|Pod priority class value|
+|-----------------------|------------------------|
+|`custom-class-a`       | 100000                 |
+|`custom-class-b`       | 10000                  |
+|`custom-class-c`       | 1000                   |
+|`regular/unset`        | 0                      |
+
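For context, a class such as `custom-class-b` would be created with a standard PriorityClass manifest. The manifest below is an illustrative sketch, not part of this commit; the name and value mirror the table above, and the `description` text is invented:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: custom-class-b
value: 10000                # matches the "Pod priority class value" column above
globalDefault: false
description: "Example class used to group pods for graceful node shutdown."
```

Pods reference the class via `spec.priorityClassName`, and the scheduler resolves it to the integer `value` that the shutdown buckets below are matched against.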
+Within the [kubelet configuration](/docs/reference/config-api/kubelet-config.v1beta1/#kubelet-config-k8s-io-v1beta1-KubeletConfiguration)
+the settings for `shutdownGracePeriodByPodPriority` could look like:
+
+|Pod priority class value|Shutdown period|
+|------------------------|---------------|
+| 100000                 |10 seconds     |
+| 10000                  |180 seconds    |
+| 1000                   |120 seconds    |
+| 0                      |60 seconds     |
+
+The corresponding kubelet config YAML configuration would be:
+
+```yaml
+shutdownGracePeriodByPodPriority:
+  - priority: 100000
+    shutdownGracePeriodSeconds: 10
+  - priority: 10000
+    shutdownGracePeriodSeconds: 180
+  - priority: 1000
+    shutdownGracePeriodSeconds: 120
+  - priority: 0
+    shutdownGracePeriodSeconds: 60
+```
+
+The above table implies that any pod with priority value >= 100000 will get
+just 10 seconds to stop, any pod with value >= 10000 and < 100000 will get 180
+seconds to stop, and any pod with value >= 1000 and < 10000 will get 120 seconds to stop.
+Finally, all other pods will get 60 seconds to stop.
+
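The bucketing rule described above can be sketched as a small standalone function. This is an illustration of the matching logic, not kubelet source; the function name and `PERIODS` list are invented, with entries taken from the table above:

```python
# Each entry is (priority threshold, shutdownGracePeriodSeconds),
# mirroring the shutdownGracePeriodByPodPriority config above.
PERIODS = [(100000, 10), (10000, 180), (1000, 120), (0, 60)]

def grace_period_for(pod_priority: int) -> int:
    """Return the shutdown period for the highest bucket the pod falls into."""
    # Walk thresholds from highest to lowest; a pod belongs to the first
    # bucket whose priority threshold it meets or exceeds.
    for threshold, seconds in sorted(PERIODS, reverse=True):
        if pod_priority >= threshold:
            return seconds
    return 0  # priority below every configured threshold

print(grace_period_for(100000))  # → 10   (custom-class-a)
print(grace_period_for(99999))   # → 180  (falls into the 10000 bucket)
print(grace_period_for(0))       # → 60   (regular/unset pods)
```

Note that a pod with priority 99999 lands in the 10000 bucket, not the 100000 one: matching is by "greater than or equal to the threshold", so values just under a threshold drop to the next bucket down.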
+One doesn't have to specify values corresponding to all of the classes. For
+example, you could instead use these settings:
+
+|Pod priority class value|Shutdown period|
+|------------------------|---------------|
+| 100000                 |300 seconds    |
+| 1000                   |120 seconds    |
+| 0                      |60 seconds     |
+
+In the above case, the pods with `custom-class-b` will go into the same bucket
+as `custom-class-c` for shutdown.
+
+If there are no pods in a particular range, then the kubelet does not wait
+for pods in that priority range. Instead, the kubelet immediately skips to the
+next priority class value range.
+
+If this feature is enabled and no configuration is provided, then no ordering
+action will be taken.
+
+Using this feature requires enabling the
+`GracefulNodeShutdownBasedOnPodPriority` feature gate, and setting the kubelet
+config's `shutdownGracePeriodByPodPriority` to the desired configuration
+containing the pod priority class values and their respective shutdown periods.
+
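Putting the gate and the periods together, a complete kubelet configuration fragment might look like the sketch below. It is assembled from the fields named in this section; verify the exact field names against the KubeletConfiguration reference for your Kubernetes version:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  # Alpha gate from this section; off by default in v1.23.
  GracefulNodeShutdownBasedOnPodPriority: true
shutdownGracePeriodByPodPriority:
  - priority: 100000
    shutdownGracePeriodSeconds: 10
  - priority: 10000
    shutdownGracePeriodSeconds: 180
  - priority: 1000
    shutdownGracePeriodSeconds: 120
  - priority: 0
    shutdownGracePeriodSeconds: 60
```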
 ## Swap memory management {#swap-memory}

 {{< feature-state state="alpha" for_k8s_version="v1.22" >}}
