Commit 94e62c4 (parent e211769)

docs: Pod priority based graceful node shutdown

Signed-off-by: David Porter <[email protected]>

1 file changed: content/en/docs/concepts/architecture/nodes.md (+91 −7 lines)

@@ -424,20 +424,104 @@ for gracefully terminating normal pods, and the last 10 seconds would be
 reserved for terminating [critical pods](/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/#marking-pod-as-critical).

 {{< note >}}
-When pods were evicted during the graceful node shutdown, they are marked as failed.
-Running `kubectl get pods` shows the status of the the evicted pods as `Shutdown`.
+When pods are evicted during the graceful node shutdown, they are marked as shutdown.
+Running `kubectl get pods` shows the status of the evicted pods as `Terminated`.
 And `kubectl describe pod` indicates that the pod was evicted because of node shutdown:

 ```
-Status: Failed
-Reason: Shutdown
-Message: Node is shutting, evicting pods
+Reason: Terminated
+Message: Pod was terminated in response to imminent node shutdown.
 ```

-Failed pod objects will be preserved until explicitly deleted or [cleaned up by the GC](/docs/concepts/workloads/pods/pod-lifecycle/#pod-garbage-collection).
-This is a change of behavior compared to abrupt node termination.
 {{< /note >}}

+### Pod Priority based graceful node shutdown {#pod-priority-graceful-node-shutdown}
+
+{{< feature-state state="alpha" for_k8s_version="v1.23" >}}
+
+To provide more flexibility during graceful node shutdown around the ordering
+of pods during shutdown, graceful node shutdown honors the PriorityClass for
+Pods, provided that you enabled this feature in your cluster. The feature
+allows cluster administrators to explicitly define the ordering of pods
+during graceful node shutdown based on [priority
+classes](/docs/concepts/scheduling-eviction/pod-priority-preemption/#priorityclass).
+
+The [Graceful Node Shutdown](#graceful-node-shutdown) feature, as described
+above, shuts down pods in two phases: non-critical pods, followed by critical
+pods. If additional flexibility is needed to explicitly define the ordering of
+pods during shutdown in a more granular way, pod priority based graceful
+shutdown can be used.
+
+When graceful node shutdown honors pod priorities, it becomes possible to do
+graceful node shutdown in multiple phases, each phase shutting down a
+particular priority class of pods. The kubelet can be configured with the exact
+phases and shutdown time per phase.
+
+Assuming the following custom pod [priority
+classes](/docs/concepts/scheduling-eviction/pod-priority-preemption/#priorityclass)
+in a cluster,
+
+|Pod priority class name|Pod priority class value|
+|-----------------------|------------------------|
+|`custom-class-a`       | 100000                 |
+|`custom-class-b`       | 10000                  |
+|`custom-class-c`       | 1000                   |
+|`regular/unset`        | 0                      |
+
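For context, a class such as `custom-class-b` would be created with a standard PriorityClass manifest. The manifest below is an illustrative sketch, not part of this commit; the name and value mirror the table above, and the `description` text is invented:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: custom-class-b
value: 10000                # matches the "Pod priority class value" column above
globalDefault: false
description: "Example class used to group pods for graceful node shutdown."
```

Pods reference the class via `spec.priorityClassName`, and the scheduler resolves it to the integer `value` that the shutdown buckets below are matched against.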
+Within the [kubelet configuration](/docs/reference/config-api/kubelet-config.v1beta1/#kubelet-config-k8s-io-v1beta1-KubeletConfiguration)
+the settings for `shutdownGracePeriodByPodPriority` could look like:
+
+|Pod priority class value|Shutdown period|
+|------------------------|---------------|
+| 100000                 |10 seconds     |
+| 10000                  |180 seconds    |
+| 1000                   |120 seconds    |
+| 0                      |60 seconds     |
+
+The corresponding kubelet config YAML configuration would be:
+
+```yaml
+shutdownGracePeriodByPodPriority:
+  - priority: 100000
+    shutdownGracePeriodSeconds: 10
+  - priority: 10000
+    shutdownGracePeriodSeconds: 180
+  - priority: 1000
+    shutdownGracePeriodSeconds: 120
+  - priority: 0
+    shutdownGracePeriodSeconds: 60
+```
+
+The above table implies that any pod with priority value >= 100000 will get
+just 10 seconds to stop, any pod with value >= 10000 and < 100000 will get 180
+seconds to stop, and any pod with value >= 1000 and < 10000 will get 120 seconds to stop.
+Finally, all other pods will get 60 seconds to stop.
+
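The bucketing rule described above can be sketched as a small standalone function. This is an illustration of the matching logic, not kubelet source; the function name and `PERIODS` list are invented, with entries taken from the table above:

```python
# Each entry is (priority threshold, shutdownGracePeriodSeconds),
# mirroring the shutdownGracePeriodByPodPriority config above.
PERIODS = [(100000, 10), (10000, 180), (1000, 120), (0, 60)]

def grace_period_for(pod_priority: int) -> int:
    """Return the shutdown period for the highest bucket the pod falls into."""
    # Walk thresholds from highest to lowest; a pod belongs to the first
    # bucket whose priority threshold it meets or exceeds.
    for threshold, seconds in sorted(PERIODS, reverse=True):
        if pod_priority >= threshold:
            return seconds
    return 0  # priority below every configured threshold

print(grace_period_for(100000))  # → 10   (custom-class-a)
print(grace_period_for(99999))   # → 180  (falls into the 10000 bucket)
print(grace_period_for(0))       # → 60   (regular/unset pods)
```

Note that a pod with priority 99999 lands in the 10000 bucket, not the 100000 one: matching is by "greater than or equal to the threshold", so values just under a threshold drop to the next bucket down.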
+One doesn't have to specify values corresponding to all of the classes. For
+example, you could instead use these settings:
+
+|Pod priority class value|Shutdown period|
+|------------------------|---------------|
+| 100000                 |300 seconds    |
+| 1000                   |120 seconds    |
+| 0                      |60 seconds     |
+
+In the above case, the pods with `custom-class-b` will go into the same bucket
+as `custom-class-c` for shutdown.
+
+If there are no pods in a particular range, then the kubelet does not wait
+for pods in that priority range. Instead, the kubelet immediately skips to the
+next priority class value range.
+
+If this feature is enabled and no configuration is provided, then no ordering
+action will be taken.
+
+Using this feature requires enabling the
+`GracefulNodeShutdownBasedOnPodPriority` feature gate, and setting the kubelet
+config's `shutdownGracePeriodByPodPriority` to the desired configuration
+containing the pod priority class values and their respective shutdown periods.
+
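Putting the gate and the periods together, a complete kubelet configuration fragment might look like the sketch below. It is assembled from the fields named in this section; verify the exact field names against the KubeletConfiguration reference for your Kubernetes version:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  # Alpha gate from this section; off by default in v1.23.
  GracefulNodeShutdownBasedOnPodPriority: true
shutdownGracePeriodByPodPriority:
  - priority: 100000
    shutdownGracePeriodSeconds: 10
  - priority: 10000
    shutdownGracePeriodSeconds: 180
  - priority: 1000
    shutdownGracePeriodSeconds: 120
  - priority: 0
    shutdownGracePeriodSeconds: 60
```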
 ## Swap memory management {#swap-memory}

 {{< feature-state state="alpha" for_k8s_version="v1.22" >}}
