Commit fe22087

Merge pull request #42178 from sftim/20230724_revise_node_pressure_eviction_concept
Revise node pressure eviction concept
2 parents: 3613a65 + a556074

content/en/docs/concepts/scheduling-eviction/node-pressure-eviction.md

Lines changed: 95 additions & 70 deletions
@@ -12,28 +12,45 @@ When one or more of these resources reach specific consumption levels, the
 kubelet can proactively fail one or more pods on the node to reclaim resources
 and prevent starvation.
 
-During a node-pressure eviction, the kubelet sets the `PodPhase` for the
-selected pods to `Failed`. This terminates the pods.
+During a node-pressure eviction, the kubelet sets the [phase](/docs/concepts/workloads/pods/pod-lifecycle/#pod-phase) for the
+selected pods to `Failed`, and terminates the Pod.
 
 Node-pressure eviction is not the same as
 [API-initiated eviction](/docs/concepts/scheduling-eviction/api-eviction/).
 
-The kubelet does not respect your configured `PodDisruptionBudget` or the pod's
+The kubelet does not respect your configured {{<glossary_tooltip term_id="pod-disruption-budget" text="PodDisruptionBudget">}}
+or the pod's
 `terminationGracePeriodSeconds`. If you use [soft eviction thresholds](#soft-eviction-thresholds),
 the kubelet respects your configured `eviction-max-pod-grace-period`. If you use
-[hard eviction thresholds](#hard-eviction-thresholds), it uses a `0s` grace period for termination.
+[hard eviction thresholds](#hard-eviction-thresholds), the kubelet uses a `0s` grace period (immediate shutdown) for termination.
 
-If the pods are managed by a {{< glossary_tooltip text="workload" term_id="workload" >}}
-resource (such as {{< glossary_tooltip text="StatefulSet" term_id="statefulset" >}}
-or {{< glossary_tooltip text="Deployment" term_id="deployment" >}}) that
-replaces failed pods, the control plane or `kube-controller-manager` creates new
-pods in place of the evicted pods.
+## Self healing behavior
 
-{{<note>}}
 The kubelet attempts to [reclaim node-level resources](#reclaim-node-resources)
 before it terminates end-user pods. For example, it removes unused container
 images when disk resources are starved.
-{{</note>}}
+
+If the pods are managed by a {{< glossary_tooltip text="workload" term_id="workload" >}}
+management object (such as {{< glossary_tooltip text="StatefulSet" term_id="statefulset" >}}
+or {{< glossary_tooltip text="Deployment" term_id="deployment" >}}) that
+replaces failed pods, the control plane (`kube-controller-manager`) creates new
+pods in place of the evicted pods.
+
+### Self healing for static pods
+
+If you are running a [static pod](/docs/concepts/workloads/pods/#static-pods)
+on a node that is under resource pressure, the kubelet may evict that static
+Pod. The kubelet then tries to create a replacement, because static Pods always
+represent an intent to run a Pod on that node.
+
+The kubelet takes the _priority_ of the static pod into account when creating
+a replacement. If the static pod manifest specifies a low priority, and there
+are higher-priority Pods defined within the cluster's control plane, and the
+node is under resource pressure, the kubelet may not be able to make room for
+that static pod. The kubelet continues to attempt to run all static pods even
+when there is resource pressure on a node.
+
+## Eviction signals and thresholds
 
 The kubelet uses various parameters to make eviction decisions, like the following:
 
@@ -48,7 +65,7 @@ point in time. Kubelet uses eviction signals to make eviction decisions by
 comparing the signals to eviction thresholds, which are the minimum amount of
 the resource that should be available on the node.
 
-Kubelet uses the following eviction signals:
+On Linux, the kubelet uses the following eviction signals:
 
 | Eviction Signal      | Description |
 |----------------------|---------------------------------------------------------------------------------------|
@@ -59,7 +76,7 @@ Kubelet uses the following eviction signals:
 | `imagefs.inodesFree` | `imagefs.inodesFree` := `node.stats.runtime.imagefs.inodesFree` |
 | `pid.available` | `pid.available` := `node.stats.rlimit.maxpid` - `node.stats.rlimit.curproc` |
 
-In this table, the `Description` column shows how kubelet gets the value of the
+In this table, the **Description** column shows how kubelet gets the value of the
 signal. Each signal supports either a percentage or a literal value. Kubelet
 calculates the percentage value relative to the total capacity associated with
 the signal.
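
As a worked illustration of the percentage form: on a node whose root
filesystem has a total capacity of 100GiB (an illustrative figure), the
threshold `nodefs.available<10%` is crossed once less than 10GiB of that
filesystem remains available, just as `nodefs.available<10Gi` would be.
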
@@ -71,18 +88,19 @@ feature, out of resource decisions
 are made local to the end user Pod part of the cgroup hierarchy as well as the
 root node. This [script](/examples/admin/resource/memory-available.sh)
 reproduces the same set of steps that the kubelet performs to calculate
-`memory.available`. The kubelet excludes inactive_file (i.e. # of bytes of
-file-backed memory on inactive LRU list) from its calculation as it assumes that
+`memory.available`. The kubelet excludes inactive_file (the number of bytes of
+file-backed memory on the inactive LRU list) from its calculation, as it assumes that
 memory is reclaimable under pressure.
 
-The kubelet supports the following filesystem partitions:
+The kubelet recognizes two specific filesystem identifiers:
 
-1. `nodefs`: The node's main filesystem, used for local disk volumes, emptyDir,
-   log storage, and more. For example, `nodefs` contains `/var/lib/kubelet/`.
+1. `nodefs`: The node's main filesystem, used for local disk volumes, emptyDir
+   volumes not backed by memory, log storage, and more.
+   For example, `nodefs` contains `/var/lib/kubelet/`.
 1. `imagefs`: An optional filesystem that container runtimes use to store container
    images and container writable layers.
 
-Kubelet auto-discovers these filesystems and ignores other filesystems. Kubelet
+Kubelet auto-discovers these filesystems and ignores other node local filesystems. Kubelet
 does not support other configurations.
 
 Some kubelet garbage collection features are deprecated in favor of eviction:
@@ -98,7 +116,8 @@ Some kubelet garbage collection features are deprecated in favor of eviction:
 ### Eviction thresholds
 
 You can specify custom eviction thresholds for the kubelet to use when it makes
-eviction decisions.
+eviction decisions. You can configure [soft](#soft-eviction-thresholds) and
+[hard](#hard-eviction-thresholds) eviction thresholds.
 
 Eviction thresholds have the form `[eviction-signal][operator][quantity]`, where:
 
@@ -109,18 +128,16 @@ Eviction thresholds have the form `[eviction-signal][operator][quantity]`, where
 must match the quantity representation used by Kubernetes. You can use either
 literal values or percentages (`%`).
 
-For example, if a node has `10Gi` of total memory and you want trigger eviction if
-the available memory falls below `1Gi`, you can define the eviction threshold as
-either `memory.available<10%` or `memory.available<1Gi`. You cannot use both.
-
-You can configure soft and hard eviction thresholds.
+For example, if a node has 10GiB of total memory and you want to trigger eviction if
+the available memory falls below 1GiB, you can define the eviction threshold as
+either `memory.available<10%` or `memory.available<1Gi` (you cannot use both).
 
 #### Soft eviction thresholds {#soft-eviction-thresholds}
 
 A soft eviction threshold pairs an eviction threshold with a required
 administrator-specified grace period. The kubelet does not evict pods until the
-grace period is exceeded. The kubelet returns an error on startup if there is no
-specified grace period.
+grace period is exceeded. The kubelet returns an error on startup if you do
+not specify a grace period.
 
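
The soft and hard threshold settings described above can be sketched together
in a kubelet config file. A minimal example, assuming the `KubeletConfiguration`
v1beta1 API; the quantities are illustrative, not recommendations:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Hard thresholds: crossing one triggers immediate eviction (0s grace period).
evictionHard:
  memory.available: "500Mi"
  nodefs.available: "10%"
# Soft thresholds: each entry must be paired with a grace period below.
evictionSoft:
  memory.available: "1.5Gi"
evictionSoftGracePeriod:
  memory.available: "1m30s"
# Cap on the pod termination grace period used for soft evictions (seconds).
evictionMaxPodGracePeriod: 60
```
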
 You can specify both a soft eviction threshold grace period and a maximum
 allowed pod termination grace period for kubelet to use during evictions. If you
@@ -160,16 +177,16 @@ then the values of other parameters will not be inherited as the default
 values and will be set to zero. In order to provide custom values, you
 should provide all the thresholds respectively.
 
-### Eviction monitoring interval
+## Eviction monitoring interval
 
-The kubelet evaluates eviction thresholds based on its configured `housekeeping-interval`
+The kubelet evaluates eviction thresholds based on its configured `housekeeping-interval`,
 which defaults to `10s`.
 
-### Node conditions {#node-conditions}
+## Node conditions {#node-conditions}
 
-The kubelet reports node conditions to reflect that the node is under pressure
-because hard or soft eviction threshold is met, independent of configured grace
-periods.
+The kubelet reports [node conditions](/docs/concepts/architecture/nodes/#condition)
+to reflect that the node is under pressure because a hard or soft eviction
+threshold is met, independent of configured grace periods.
 
 The kubelet maps eviction signals to node conditions as follows:
 
@@ -179,10 +196,13 @@ The kubelet maps eviction signals to node conditions as follows:
 | `DiskPressure` | `nodefs.available`, `nodefs.inodesFree`, `imagefs.available`, or `imagefs.inodesFree` | Available disk space and inodes on either the node's root filesystem or image filesystem have satisfied an eviction threshold |
 | `PIDPressure` | `pid.available` | Available process identifiers on the (Linux) node have fallen below an eviction threshold |
 
+The control plane also [maps](/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-nodes-by-condition)
+these node conditions to taints.
+
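
Because of that mapping, a Pod that must still schedule onto nodes reporting
memory pressure can tolerate the corresponding taint. A sketch; the Pod name
and image are hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pressure-tolerant-pod   # hypothetical name
spec:
  tolerations:
  # Matches the taint the control plane adds for the MemoryPressure condition.
  - key: "node.kubernetes.io/memory-pressure"
    operator: "Exists"
    effect: "NoSchedule"
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
```
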
 The kubelet updates the node conditions based on the configured
 `--node-status-update-frequency`, which defaults to `10s`.
 
-#### Node condition oscillation
+### Node condition oscillation
 
 In some cases, nodes oscillate above and below soft eviction thresholds without
 holding for the defined grace periods. This causes the reported node condition
@@ -237,9 +257,9 @@ As a result, kubelet ranks and evicts pods in the following order:
 are evicted last, based on their Priority.
 
 {{<note>}}
-The kubelet does not use the pod's QoS class to determine the eviction order.
+The kubelet does not use the pod's [QoS class](/docs/concepts/workloads/pods/pod-qos/) to determine the eviction order.
 You can use the QoS class to estimate the most likely pod eviction order when
-reclaiming resources like memory. QoS does not apply to EphemeralStorage requests,
+reclaiming resources like memory. QoS classification does not apply to EphemeralStorage requests,
 so the above scenario will not apply if the node is, for example, under `DiskPressure`.
 {{</note>}}
 
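
To make that estimate concrete, here is an illustrative Burstable Pod; if its
actual memory usage rises above the 1Gi request, it sorts ahead of pods that
stay within their requests when the kubelet reclaims memory (name, image, and
quantities are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: burstable-example   # hypothetical name
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
    resources:
      requests:
        memory: "1Gi"   # usage above this request raises eviction likelihood
      limits:
        memory: "2Gi"   # limits above requests make this Pod Burstable
```
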
@@ -253,8 +273,13 @@ then the kubelet must choose to evict one of these pods to preserve node stabili
 and to limit the impact of resource starvation on other pods. In this case, it
 will choose to evict pods of lowest Priority first.
 
-When the kubelet evicts pods in response to `inode` or `PID` starvation, it uses
-the Priority to determine the eviction order, because `inodes` and `PIDs` have no
+If you are running a [static pod](/docs/concepts/workloads/pods/#static-pods)
+and want to avoid having it evicted under resource pressure, set the
+`priority` field for that Pod directly. Static pods do not support the
+`priorityClassName` field.
+
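
A sketch of that advice, as a hypothetical static pod manifest (for example,
placed in `/etc/kubernetes/manifests/`; the name, image, and value are
illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: important-static-app   # hypothetical name
spec:
  priority: 100000   # set directly; static pods do not support priorityClassName
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
```
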
+When the kubelet evicts pods in response to inode or process ID starvation, it uses
+the Pods' relative priority to determine the eviction order, because inodes and PIDs have no
 requests.
 
 The kubelet sorts pods differently based on whether the node has a dedicated
@@ -300,31 +325,31 @@ evictionMinimumReclaim:
 ```
 
 In this example, if the `nodefs.available` signal meets the eviction threshold,
-the kubelet reclaims the resource until the signal reaches the threshold of `1Gi`,
-and then continues to reclaim the minimum amount of `500Mi` it until the signal
-reaches `1.5Gi`.
+the kubelet reclaims the resource until the signal reaches the threshold of 1GiB,
+and then continues to reclaim the minimum amount of 500MiB, until the available nodefs storage value reaches 1.5GiB.
 
-Similarly, the kubelet reclaims the `imagefs` resource until the `imagefs.available`
-signal reaches `102Gi`.
+Similarly, the kubelet tries to reclaim the `imagefs` resource until the `imagefs.available`
+value reaches `102Gi`, representing 102 GiB of available container image storage. If the amount
+of storage that the kubelet could reclaim is less than 2GiB, the kubelet doesn't reclaim anything.
 
 The default `eviction-minimum-reclaim` is `0` for all resources.
 
-### Node out of memory behavior
+## Node out of memory behavior
 
-If the node experiences an out of memory (OOM) event prior to the kubelet
+If the node experiences an _out of memory_ (OOM) event prior to the kubelet
 being able to reclaim memory, the node depends on the [oom_killer](https://lwn.net/Articles/391222/)
 to respond.
 
 The kubelet sets an `oom_score_adj` value for each container based on the QoS for the pod.
 
-| Quality of Service | oom_score_adj |
+| Quality of Service | `oom_score_adj` |
 |--------------------|-----------------------------------------------------------------------------------|
 | `Guaranteed`       | -997 |
 | `BestEffort`       | 1000 |
-| `Burstable`        | min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999) |
+| `Burstable`        | _min(max(2, 1000 - (1000 × memoryRequestBytes) / machineMemoryCapacityBytes), 999)_ |
 
 {{<note>}}
-The kubelet also sets an `oom_score_adj` value of `-997` for containers in Pods that have
+The kubelet also sets an `oom_score_adj` value of `-997` for any containers in Pods that have
 `system-node-critical` {{<glossary_tooltip text="Priority" term_id="pod-priority">}}.
 {{</note>}}
 
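
As a worked example of the `Burstable` formula: a container in a pod that
requests 4GiB of memory on a node with 32GiB of capacity gets
_min(max(2, 1000 - (1000 × 4GiB) / 32GiB), 999)_ = _min(max(2, 875), 999)_ = 875
(the figures are illustrative).
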
@@ -336,56 +361,56 @@ for each container. It then kills the container with the highest score.
 This means that containers in low QoS pods that consume a large amount of memory
 relative to their scheduling requests are killed first.
 
-Unlike pod eviction, if a container is OOM killed, the `kubelet` can restart it
-based on its `RestartPolicy`.
+Unlike pod eviction, if a container is OOM killed, the kubelet can restart it
+based on its `restartPolicy`.
 
-### Best practices {#node-pressure-eviction-good-practices}
+## Good practices {#node-pressure-eviction-good-practices}
 
-The following sections describe best practices for eviction configuration.
+The following sections describe good practice for eviction configuration.
 
-#### Schedulable resources and eviction policies
+### Schedulable resources and eviction policies
 
 When you configure the kubelet with an eviction policy, you should make sure that
 the scheduler will not schedule pods if they will trigger eviction because they
 immediately induce memory pressure.
 
 Consider the following scenario:
 
-- Node memory capacity: `10Gi`
+- Node memory capacity: 10GiB
 - Operator wants to reserve 10% of memory capacity for system daemons (kernel, `kubelet`, etc.)
 - Operator wants to evict Pods at 95% memory utilization to reduce incidence of system OOM.
 
 For this to work, the kubelet is launched as follows:
 
-```
+```none
 --eviction-hard=memory.available<500Mi
 --system-reserved=memory=1.5Gi
 ```
 
-In this configuration, the `--system-reserved` flag reserves `1.5Gi` of memory
+In this configuration, the `--system-reserved` flag reserves 1.5GiB of memory
 for the system, which is `10% of the total memory + the eviction threshold amount`.
 
 The node can reach the eviction threshold if a pod is using more than its request,
-or if the system is using more than `1Gi` of memory, which makes the `memory.available`
-signal fall below `500Mi` and triggers the threshold.
+or if the system is using more than 1GiB of memory, which makes the `memory.available`
+signal fall below 500MiB and triggers the threshold.
 
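
The same launch settings can be expressed in a kubelet config file. A sketch,
assuming the `KubeletConfiguration` v1beta1 API, equivalent to the flags above:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "500Mi"   # evict once less than 500MiB is available
systemReserved:
  memory: "1.5Gi"             # reserved for system daemons
```
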
-#### DaemonSet
+### DaemonSets and node-pressure eviction {#daemonset}
 
-Pod Priority is a major factor in making eviction decisions. If you do not want
-the kubelet to evict pods that belong to a `DaemonSet`, give those pods a high
-enough `priorityClass` in the pod spec. You can also use a lower `priorityClass`
-or the default to only allow `DaemonSet` pods to run when there are enough
-resources.
+Pod priority is a major factor in making eviction decisions. If you do not want
+the kubelet to evict pods that belong to a DaemonSet, give those pods a high
+enough priority by specifying a suitable `priorityClassName` in the pod spec.
+You can also use a lower priority, or the default, to only allow pods from that
+DaemonSet to run when there are enough resources.
 
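
A sketch of that advice (the PriorityClass name, DaemonSet name, image, and
value are all hypothetical):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: important-daemons   # hypothetical name
value: 100000               # higher values make eviction less likely
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent          # hypothetical name
spec:
  selector:
    matchLabels:
      name: node-agent
  template:
    metadata:
      labels:
        name: node-agent
    spec:
      priorityClassName: important-daemons
      containers:
      - name: agent
        image: registry.k8s.io/pause:3.9
```
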
-### Known issues
+## Known issues
 
 The following sections describe known issues related to out of resource handling.
 
-#### kubelet may not observe memory pressure right away
+### kubelet may not observe memory pressure right away
 
-By default, the kubelet polls `cAdvisor` to collect memory usage stats at a
+By default, the kubelet polls cAdvisor to collect memory usage stats at a
 regular interval. If memory usage increases within that window rapidly, the
-kubelet may not observe `MemoryPressure` fast enough, and the `OOMKiller`
+kubelet may not observe `MemoryPressure` fast enough, and the OOM killer
 will still be invoked.
 
 You can use the `--kernel-memcg-notification` flag to enable the `memcg`
@@ -396,10 +421,10 @@ If you are not trying to achieve extreme utilization, but a sensible measure of
 overcommit, a viable workaround for this issue is to use the `--kube-reserved`
 and `--system-reserved` flags to allocate memory for the system.
 
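
Both mitigations can also be sketched in a kubelet config file; this assumes
the `KubeletConfiguration` v1beta1 fields `kernelMemcgNotification`,
`kubeReserved`, and `systemReserved` (quantities illustrative):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# React to kernel memcg pressure notifications instead of waiting for the next poll.
kernelMemcgNotification: true
# Reserve headroom so that sudden memory spikes hit reserved capacity first.
kubeReserved:
  memory: "1Gi"
systemReserved:
  memory: "1Gi"
```
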
-#### active_file memory is not considered as available memory
+### active_file memory is not considered as available memory
 
 On Linux, the kernel tracks the number of bytes of file-backed memory on active
-LRU list as the `active_file` statistic. The kubelet treats `active_file` memory
+least recently used (LRU) list as the `active_file` statistic. The kubelet treats `active_file` memory
 areas as not reclaimable. For workloads that make intensive use of block-backed
 local storage, including ephemeral local storage, kernel-level caches of file
 and block data mean that many recently accessed cache pages are likely to be
 