
Commit f98ccc6
Merge pull request #40448 from windsonsea/tweaky
tweak line wrappings in blog: protect-pods-priorityclass
2 parents 639eee2 + 4cd6cf5

File tree: 1 file changed, +103 −38 lines

content/en/blog/_posts/2023-01-12-protect-mission-critical-pods-priorityclass/index.md
@@ -6,56 +6,92 @@ slug: protect-mission-critical-pods-priorityclass
description: "Pod priority and preemption help to make sure that mission-critical pods are up in the event of a resource crunch by deciding order of scheduling and eviction."
---

**Author:** Sunny Bhambhani (InfraCloud Technologies)

Kubernetes has been widely adopted, and many organizations use it as their de facto orchestration engine for running workloads that need to be created and deleted frequently.

Therefore, proper scheduling of the pods is key to ensuring that application pods are up and running within the Kubernetes cluster without any issues. This article delves into the use cases around resource management by leveraging the [PriorityClass](/docs/concepts/scheduling-eviction/pod-priority-preemption/#priorityclass) object to protect mission-critical or high-priority pods from getting evicted and making sure that the application pods are up, running, and serving traffic.

## Resource management in Kubernetes

The control plane consists of multiple components, of which the scheduler (usually the built-in [kube-scheduler](/docs/concepts/scheduling-eviction/kube-scheduler/)) is the component responsible for assigning a node to a pod.

Whenever a pod is created, it enters a "pending" state, after which the scheduler determines which node is best suited for the placement of the new pod.

In the background, the scheduler runs as an infinite loop looking for pods without a `nodeName` set that are [ready for scheduling](/docs/concepts/scheduling-eviction/pod-scheduling-readiness/). For each Pod that needs scheduling, the scheduler tries to decide which node should run that Pod.
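
You can observe such not-yet-scheduled pods yourself; a field selector query like the one below (an illustrative command, not from the post) lists every pod still in the Pending phase:

```console
$ kubectl get pods --all-namespaces --field-selector=status.phase=Pending
```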

If the scheduler cannot find any node, the pod remains in the pending state, which is not ideal.

{{< note >}}
To name a few, `nodeSelector`, `taints and tolerations`, `nodeAffinity`, the rank of nodes based on available resources (for example, CPU and memory), and several other criteria are used to determine the pod's placement.
{{< /note >}}
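
To illustrate just one of those criteria, a `nodeSelector` pins a pod onto nodes carrying a matching label; the label key and value here are illustrative assumptions:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  nodeSelector:
    disktype: ssd   # only nodes labeled disktype=ssd are considered
  containers:
  - name: nginx
    image: nginx
```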

The diagram below, from point 1 through 4, explains the request flow:

{{< figure src=kube-scheduler.svg alt="A diagram showing the scheduling of three Pods that a client has directly created." title="Scheduling in Kubernetes">}}

## Typical use cases

Below are some real-life scenarios where control over the scheduling and eviction of pods may be required.

1. Let's say the pod you plan to deploy is critical, and you have some resource constraints. An example would be the DaemonSet of an infrastructure component like Grafana Loki. On every node, the Loki pods must run before other pods can. In such cases, you could ensure resource availability by manually identifying and deleting the pods that are not required, or by adding a new node to the cluster. Both these approaches are unsuitable, since the former would be tedious to execute and the latter could involve an expenditure of time and money.

2. Another use case could be a single cluster that holds the pods for the below environments with associated priorities:
   - Production (`prod`): top priority
   - Preproduction (`preprod`): intermediate priority
   - Development (`dev`): least priority

   In the event of high resource consumption in the cluster, there is competition for CPU and memory resources on the nodes. While cluster-level autoscaling _may_ add more nodes, it takes time. In the interim, if there are no further nodes to scale the cluster, some Pods could remain in a Pending state, or the service could be degraded as they compete for resources. If the kubelet does evict a Pod from the node, that eviction would be random because the kubelet doesn’t have any special information about which Pods to evict and which to keep.

3. A third example could be a microservice backed by a queuing application or a database running into a resource crunch and the queue or database getting evicted. In such a case, all the other services would be rendered useless until the database can serve traffic again.

There can also be other scenarios where you want to control the order of scheduling or order of eviction of pods.

## PriorityClasses in Kubernetes

PriorityClass is a cluster-wide API object in Kubernetes and part of the `scheduling.k8s.io/v1` API group. It contains a mapping of the PriorityClass name (defined in `.metadata.name`) and an integer value (defined in `.value`). This represents the value that the scheduler uses to determine a Pod's relative priority.

Additionally, when you create a cluster using kubeadm or a managed Kubernetes service (for example, Azure Kubernetes Service), Kubernetes uses PriorityClasses to safeguard the pods that are hosted on the control plane nodes. This ensures that critical cluster components such as CoreDNS and kube-proxy can run even if resources are constrained.

This availability of pods is achieved through the use of a special PriorityClass that ensures the pods are up and running and that the overall cluster is not affected.

```console
$ kubectl get priorityclass
...
system-cluster-critical   2000000000   false   82m
system-node-critical      2000001000   false   82m
```

The diagram below shows exactly how it works with the help of an example, which will be detailed in the upcoming section.

{{< figure src="decision-tree.svg" alt="A flow chart that illustrates how the kube-scheduler prioritizes new Pods and potentially preempts existing Pods" title="Pod scheduling and preemption">}}

### Pod priority and preemption

[Pod preemption](/docs/concepts/scheduling-eviction/pod-priority-preemption/#preemption) is a Kubernetes feature that allows the cluster to preempt pods (removing an existing Pod in favor of a new Pod) on the basis of priority. [Pod priority](/docs/concepts/scheduling-eviction/pod-priority-preemption/#pod-priority) indicates the importance of a pod relative to other pods while scheduling. If there aren't enough resources to run all the current pods, the scheduler tries to evict lower-priority pods over high-priority ones.

Also, when a healthy cluster experiences a node failure, typically, lower-priority pods get preempted to create room for higher-priority pods on the available node. This happens even if the cluster can bring up a new node automatically since pod creation is usually much faster than bringing up a new node.
76122

77123
### PriorityClass requirements
78124

79125
Before you set up PriorityClasses, there are a few things to consider.
80126

81-
1. Decide which PriorityClasses are needed. For instance, based on environment, type of pods, type of applications, etc.
82-
2. The default PriorityClass resource for your cluster. The pods without a `priorityClassName` will be treated as priority 0.
127+
1. Decide which PriorityClasses are needed. For instance, based on environment,
128+
type of pods, type of applications, etc.
129+
2. The default PriorityClass resource for your cluster. The pods without a
130+
`priorityClassName` will be treated as priority 0.
83131
3. Use a consistent naming convention for all PriorityClasses.
4. Make sure that the pods for your workloads are running with the right PriorityClass.
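
For that last point, one quick way to audit which PriorityClass each pod is actually using (an illustrative command, not from the post) is a custom-columns query:

```console
$ kubectl get pods -A -o custom-columns=NAME:.metadata.name,PRIORITYCLASS:.spec.priorityClassName
```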

## PriorityClass hands-on example

Let’s say there are 3 application pods: one for prod, one for preprod, and one for development. Below are three sample YAML manifest files for each of those.

```yaml
---
...
```
@@ -167,6 +216,7 @@ prod-nginx 0/1 Pending 0 55s env=prod
Bad news. The pod for the Production environment is still Pending and isn't serving any traffic.

Let's see why this is happening:

```console
$ kubectl get events
...
```
@@ -176,11 +226,13 @@ $ kubectl get events

In this example, there is only one worker node, and that node has a resource crunch.

Now, let's look at how PriorityClass can help in this situation since prod should be given higher priority than the other environments.

## PriorityClass API

Before creating PriorityClasses based on these requirements, let's see what a basic manifest for a PriorityClass looks like and outline some prerequisites:

```yaml
apiVersion: scheduling.k8s.io/v1
...
```
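
The rest of that manifest is trimmed in this diff; a complete PriorityClass along the same lines might look like the sketch below, where the value and description are illustrative assumptions rather than the post's originals:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: prod-pc
value: 1000000            # larger integer value = higher priority
globalDefault: false      # this class is not the cluster-wide default
description: "Use this PriorityClass only for prod pods."
```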
@@ -208,11 +260,14 @@ Below are some prerequisites for PriorityClasses:
- There are two optional fields:
  - `globalDefault`: When true, this PriorityClass is used for pods where a `priorityClassName` is not specified.
    Only one PriorityClass with `globalDefault` set to true can exist in a cluster.
    If there is no PriorityClass defined with `globalDefault` set to true, all the pods with no `priorityClassName` defined will be treated with 0 priority (i.e. the least priority).
  - `description`: A string with a meaningful value so that people know when to use this PriorityClass.
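
A hedged sketch of what a cluster-wide default could look like (the name and value are illustrative):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: default-pc
value: 1000
globalDefault: true   # applies to new pods that do not set priorityClassName
description: "Fallback priority for pods without an explicit PriorityClass."
```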

{{< note >}}
Adding a PriorityClass with `globalDefault` set to `true` does not mean it applies to the existing pods that are already running. It applies only to pods created after the PriorityClass was created.
{{< /note >}}

### PriorityClass in action
@@ -264,9 +319,13 @@ system-cluster-critical 2000000000 false 82m
```console
...
system-node-critical   2000001000   false   82m
```

The new PriorityClasses are in place now. A small change is needed in the pod manifest or pod template (in a ReplicaSet or Deployment). In other words, you need to specify the priority class name at `.spec.priorityClassName` (which is a string value).
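
For instance, a minimal sketch of the updated prod pod could look like this, assuming the `prod-pc` PriorityClass used in this example (the container details are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: prod-nginx
  labels:
    env: prod
spec:
  priorityClassName: prod-pc   # string value referencing an existing PriorityClass
  containers:
  - name: prod-nginx
    image: nginx
```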

First update the previous production pod manifest file to have a PriorityClass assigned, then delete the Production pod and recreate it. You can't edit the priority class for a Pod that already exists.

In my cluster, when I tried this, here's what happened.
First, that change seems successful; the status of pods has been updated:
@@ -279,7 +338,8 @@ preprod-nginx 1/1 Running 0 55s env=preprod
```console
...
prod-nginx   0/1   Pending   0   55s   env=prod
```

The dev-nginx pod is getting terminated. Once that is successfully terminated and there are enough resources for the prod pod, the control plane can schedule the prod pod:

```console
Warning FailedScheduling pod/prod-nginx 0/2 nodes are available: 1 Insufficient cpu, 2 Insufficient memory.
...
```
@@ -300,7 +360,9 @@ set any PriorityClass at all.
However, you can use other Kubernetes features to make sure that the priorities you wanted are actually applied.

As an alpha feature, you can define a [ValidatingAdmissionPolicy](/blog/2022/12/20/validating-admission-policies-alpha/) and a ValidatingAdmissionPolicyBinding so that, for example, Pods that go into the `prod` namespace must use the `prod-pc` PriorityClass. With another ValidatingAdmissionPolicyBinding you ensure that the `preprod` namespace uses the `preprod-pc` PriorityClass, and so on.
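
A rough sketch of such a policy pair, using the `v1alpha1` API this alpha feature shipped with (the names and the CEL expression are illustrative assumptions, not from the post):

```yaml
apiVersion: admissionregistration.k8s.io/v1alpha1
kind: ValidatingAdmissionPolicy
metadata:
  name: prod-priorityclass-policy
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups: [""]
      apiVersions: ["v1"]
      operations: ["CREATE"]
      resources: ["pods"]
  validations:
  - expression: "has(object.spec.priorityClassName) && object.spec.priorityClassName == 'prod-pc'"
    message: "Pods in the prod namespace must use the prod-pc PriorityClass."
---
apiVersion: admissionregistration.k8s.io/v1alpha1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: prod-priorityclass-binding
spec:
  policyName: prod-priorityclass-policy
  matchResources:
    namespaceSelector:
      matchLabels:
        kubernetes.io/metadata.name: prod   # bind the policy to the prod namespace
```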
@@ -315,15 +377,18 @@ users when they pick an unsuitable option.

## Summary

The above example and its events show you what this feature of Kubernetes brings to the table, along with several scenarios where you can use this feature. To reiterate, this helps ensure that mission-critical pods are up and available to serve the traffic and, in the case of a resource crunch, determines cluster behavior.

It gives you some power to decide the order of scheduling and order of [preemption](/docs/concepts/scheduling-eviction/pod-priority-preemption/#preemption) for Pods. Therefore, you need to define the PriorityClasses sensibly.
For example, if you have a cluster autoscaler to add nodes on demand, make sure to run it with the `system-cluster-critical` PriorityClass. You don't want to get in a situation where the autoscaler has been preempted and there are no new nodes coming online.

If you have any queries or feedback, feel free to reach out to me on [LinkedIn](http://www.linkedin.com/in/sunnybhambhani).
