
Commit 43e61bd

Tim Bannister and windsonsea committed
Revise node-pressure eviction concept
Co-authored-by: Michael <[email protected]>
1 parent d1b4ef8 commit 43e61bd

File tree: 1 file changed (+69, -66 lines)


content/en/docs/concepts/scheduling-eviction/node-pressure-eviction.md

@@ -18,22 +18,25 @@ selected pods to `Failed`. This terminates the pods.
 Node-pressure eviction is not the same as
 [API-initiated eviction](/docs/concepts/scheduling-eviction/api-eviction/).
 
-The kubelet does not respect your configured `PodDisruptionBudget` or the pod's
+The kubelet does not respect your configured {{<glossary_tooltip term_id="pod-disruption-budget" text="PodDisruptionBudget">}}
+or the pod's
 `terminationGracePeriodSeconds`. If you use [soft eviction thresholds](#soft-eviction-thresholds),
 the kubelet respects your configured `eviction-max-pod-grace-period`. If you use
-[hard eviction thresholds](#hard-eviction-thresholds), it uses a `0s` grace period for termination.
+[hard eviction thresholds](#hard-eviction-thresholds), the kubelet uses a `0s` grace period (immediate shutdown) for termination.
+
+## Self healing behavior
+
+The kubelet attempts to [reclaim node-level resources](#reclaim-node-resources)
+before it terminates end-user pods. For example, it removes unused container
+images when disk resources are starved.
 
 If the pods are managed by a {{< glossary_tooltip text="workload" term_id="workload" >}}
 resource (such as {{< glossary_tooltip text="StatefulSet" term_id="statefulset" >}}
 or {{< glossary_tooltip text="Deployment" term_id="deployment" >}}) that
 replaces failed pods, the control plane or `kube-controller-manager` creates new
 pods in place of the evicted pods.
 
-{{<note>}}
-The kubelet attempts to [reclaim node-level resources](#reclaim-node-resources)
-before it terminates end-user pods. For example, it removes unused container
-images when disk resources are starved.
-{{</note>}}
+## Eviction signals and thresholds
 
 The kubelet uses various parameters to make eviction decisions, like the following:
 
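As an illustration of the grace-period behaviour described above (hard thresholds evict with a `0s` grace period, soft thresholds honour `eviction-max-pod-grace-period`), the corresponding kubelet configuration fields look roughly like the sketch below. This snippet is not part of this commit; the field names come from the `KubeletConfiguration` API (`kubelet.config.k8s.io/v1beta1`) and the values are arbitrary.

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:                  # pods are terminated immediately (0s grace period)
  memory.available: "100Mi"
evictionSoft:                  # pods get a grace period before termination
  memory.available: "300Mi"
evictionSoftGracePeriod:       # how long the soft threshold must be exceeded before eviction
  memory.available: "1m30s"
evictionMaxPodGracePeriod: 60  # maximum grace period (seconds) used for soft evictions
```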
@@ -48,7 +51,7 @@ point in time. Kubelet uses eviction signals to make eviction decisions by
 comparing the signals to eviction thresholds, which are the minimum amount of
 the resource that should be available on the node.
 
-Kubelet uses the following eviction signals:
+On Linux, the kubelet uses the following eviction signals:
 
 | Eviction Signal | Description |
 |----------------------|---------------------------------------------------------------------------------------|
@@ -59,7 +62,7 @@ Kubelet uses the following eviction signals:
 | `imagefs.inodesFree` | `imagefs.inodesFree` := `node.stats.runtime.imagefs.inodesFree` |
 | `pid.available` | `pid.available` := `node.stats.rlimit.maxpid` - `node.stats.rlimit.curproc` |
 
-In this table, the `Description` column shows how kubelet gets the value of the
+In this table, the **Description** column shows how kubelet gets the value of the
 signal. Each signal supports either a percentage or a literal value. Kubelet
 calculates the percentage value relative to the total capacity associated with
 the signal.
@@ -71,18 +74,19 @@ feature, out of resource decisions
 are made local to the end user Pod part of the cgroup hierarchy as well as the
 root node. This [script](/examples/admin/resource/memory-available.sh)
 reproduces the same set of steps that the kubelet performs to calculate
-`memory.available`. The kubelet excludes inactive_file (i.e. # of bytes of
-file-backed memory on inactive LRU list) from its calculation as it assumes that
+`memory.available`. The kubelet excludes inactive_file (the number of bytes of
+file-backed memory on the inactive LRU list) from its calculation, as it assumes that
 memory is reclaimable under pressure.
 
-The kubelet supports the following filesystem partitions:
+The kubelet recognizes two specific filesystem identifiers:
 
-1. `nodefs`: The node's main filesystem, used for local disk volumes, emptyDir,
-   log storage, and more. For example, `nodefs` contains `/var/lib/kubelet/`.
+1. `nodefs`: The node's main filesystem, used for local disk volumes, emptyDir
+   volumes not backed by memory, log storage, and more.
+   For example, `nodefs` contains `/var/lib/kubelet/`.
 1. `imagefs`: An optional filesystem that container runtimes use to store container
    images and container writable layers.
 
-Kubelet auto-discovers these filesystems and ignores other filesystems. Kubelet
+Kubelet auto-discovers these filesystems and ignores other node local filesystems. Kubelet
 does not support other configurations.
 
 Some kubelet garbage collection features are deprecated in favor of eviction:
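For a rough picture of how the `nodefs` and `imagefs` identifiers are used, eviction thresholds can name signals from either filesystem. The snippet below is an illustrative sketch with made-up values, not part of this commit:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  nodefs.available: "10%"    # node's main filesystem (for example /var/lib/kubelet/)
  nodefs.inodesFree: "5%"
  imagefs.available: "15%"   # optional runtime filesystem for images and writable layers
```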
@@ -98,7 +102,8 @@ Some kubelet garbage collection features are deprecated in favor of eviction:
 ### Eviction thresholds
 
 You can specify custom eviction thresholds for the kubelet to use when it makes
-eviction decisions.
+eviction decisions. You can configure [soft](#soft-eviction-thresholds) and
+[hard](#hard-eviction-thresholds) eviction thresholds.
 
 Eviction thresholds have the form `[eviction-signal][operator][quantity]`, where:
 
@@ -109,18 +114,16 @@ Eviction thresholds have the form `[eviction-signal][operator][quantity]`, where
 must match the quantity representation used by Kubernetes. You can use either
 literal values or percentages (`%`).
 
-For example, if a node has `10Gi` of total memory and you want trigger eviction if
-the available memory falls below `1Gi`, you can define the eviction threshold as
-either `memory.available<10%` or `memory.available<1Gi`. You cannot use both.
-
-You can configure soft and hard eviction thresholds.
+For example, if a node has 10GiB of total memory and you want to trigger eviction if
+the available memory falls below 1GiB, you can define the eviction threshold as
+either `memory.available<10%` or `memory.available<1Gi` (you cannot use both).
 
 #### Soft eviction thresholds {#soft-eviction-thresholds}
 
 A soft eviction threshold pairs an eviction threshold with a required
 administrator-specified grace period. The kubelet does not evict pods until the
-grace period is exceeded. The kubelet returns an error on startup if there is no
-specified grace period.
+grace period is exceeded. The kubelet returns an error on startup if you do
+not specify a grace period.
 
 You can specify both a soft eviction threshold grace period and a maximum
 allowed pod termination grace period for kubelet to use during evictions. If you
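The `memory.available<1Gi` / `memory.available<10%` example above corresponds to threshold entries like the following in a kubelet configuration file. This is a sketch, not part of this commit; use one representation per signal.

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "1Gi"   # literal quantity; alternatively "10%" on a 10GiB node, but not both
```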
@@ -160,16 +163,16 @@ then the values of other parameters will not be inherited as the default
 values and will be set to zero. In order to provide custom values, you
 should provide all the thresholds respectively.
 
-### Eviction monitoring interval
+## Eviction monitoring interval
 
-The kubelet evaluates eviction thresholds based on its configured `housekeeping-interval`
+The kubelet evaluates eviction thresholds based on its configured `housekeeping-interval`,
 which defaults to `10s`.
 
-### Node conditions {#node-conditions}
+## Node conditions {#node-conditions}
 
-The kubelet reports node conditions to reflect that the node is under pressure
-because hard or soft eviction threshold is met, independent of configured grace
-periods.
+The kubelet reports [node conditions](/docs/concepts/architecture/nodes/#condition)
+to reflect that the node is under pressure because a hard or soft eviction
+threshold is met, independent of configured grace periods.
 
 The kubelet maps eviction signals to node conditions as follows:
 
@@ -182,7 +185,7 @@ The kubelet maps eviction signals to node conditions as follows:
 The kubelet updates the node conditions based on the configured
 `--node-status-update-frequency`, which defaults to `10s`.
 
-#### Node condition oscillation
+### Node condition oscillation
 
 In some cases, nodes oscillate above and below soft eviction thresholds without
 holding for the defined grace periods. This causes the reported node condition
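Both the status update frequency mentioned above and the condition transition period that damps this oscillation are configurable. A sketch with default-like values, not part of this commit, using `KubeletConfiguration` fields:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# How often the kubelet posts node status (and therefore node conditions).
nodeStatusUpdateFrequency: "10s"
# How long the kubelet waits before transitioning out of an eviction pressure
# condition, which reduces oscillation around soft eviction thresholds.
evictionPressureTransitionPeriod: "5m"
```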
@@ -237,9 +240,9 @@ As a result, kubelet ranks and evicts pods in the following order:
 are evicted last, based on their Priority.
 
 {{<note>}}
-The kubelet does not use the pod's QoS class to determine the eviction order.
+The kubelet does not use the pod's [QoS class](/docs/concepts/workloads/pods/pod-qos/) to determine the eviction order.
 You can use the QoS class to estimate the most likely pod eviction order when
-reclaiming resources like memory. QoS does not apply to EphemeralStorage requests,
+reclaiming resources like memory. QoS classification does not apply to EphemeralStorage requests,
 so the above scenario will not apply if the node is, for example, under `DiskPressure`.
 {{</note>}}
 
@@ -253,8 +256,8 @@ then the kubelet must choose to evict one of these pods to preserve node stabili
 and to limit the impact of resource starvation on other pods. In this case, it
 will choose to evict pods of lowest Priority first.
 
-When the kubelet evicts pods in response to `inode` or `PID` starvation, it uses
-the Priority to determine the eviction order, because `inodes` and `PIDs` have no
+When the kubelet evicts pods in response to inode or process ID starvation, it uses
+the Pods' relative priority to determine the eviction order, because inodes and PIDs have no
 requests.
 
 The kubelet sorts pods differently based on whether the node has a dedicated
@@ -300,31 +303,31 @@ evictionMinimumReclaim:
 ```
 
 In this example, if the `nodefs.available` signal meets the eviction threshold,
-the kubelet reclaims the resource until the signal reaches the threshold of `1Gi`,
-and then continues to reclaim the minimum amount of `500Mi` it until the signal
-reaches `1.5Gi`.
+the kubelet reclaims the resource until the signal reaches the threshold of 1GiB,
+and then continues to reclaim the minimum amount of 500MiB, until the available nodefs storage value reaches 1.5GiB.
 
-Similarly, the kubelet reclaims the `imagefs` resource until the `imagefs.available`
-signal reaches `102Gi`.
+Similarly, the kubelet tries to reclaim the `imagefs` resource until the `imagefs.available`
+value reaches `102Gi`, representing 102 GiB of available container image storage. If the amount
+of storage that the kubelet could reclaim is less than 2GiB, the kubelet doesn't reclaim anything.
 
 The default `eviction-minimum-reclaim` is `0` for all resources.
 
-### Node out of memory behavior
+## Node out of memory behavior
 
-If the node experiences an out of memory (OOM) event prior to the kubelet
+If the node experiences an _out of memory_ (OOM) event prior to the kubelet
 being able to reclaim memory, the node depends on the [oom_killer](https://lwn.net/Articles/391222/)
 to respond.
 
 The kubelet sets an `oom_score_adj` value for each container based on the QoS for the pod.
 
-| Quality of Service | oom_score_adj |
+| Quality of Service | `oom_score_adj` |
 |--------------------|-----------------------------------------------------------------------------------|
 | `Guaranteed` | -997 |
 | `BestEffort` | 1000 |
-| `Burstable` | min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999) |
+| `Burstable` | _min(max(2, 1000 - (1000 × memoryRequestBytes) / machineMemoryCapacityBytes), 999)_ |
 
 {{<note>}}
-The kubelet also sets an `oom_score_adj` value of `-997` for containers in Pods that have
+The kubelet also sets an `oom_score_adj` value of `-997` for any containers in Pods that have
 `system-node-critical` {{<glossary_tooltip text="Priority" term_id="pod-priority">}}.
 {{</note>}}
 
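To make the `Burstable` formula concrete, with illustrative numbers that are not part of this commit: a Burstable pod requesting 4GiB of memory on a node with 16GiB of capacity would get `oom_score_adj` = min(max(2, 1000 - (1000 × 4GiB) / 16GiB), 999) = min(max(2, 750), 999) = 750, so under memory pressure its containers are killed before those of `Guaranteed` pods (-997) but after those of `BestEffort` pods (1000).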
@@ -336,56 +339,56 @@ for each container. It then kills the container with the highest score.
 This means that containers in low QoS pods that consume a large amount of memory
 relative to their scheduling requests are killed first.
 
-Unlike pod eviction, if a container is OOM killed, the `kubelet` can restart it
-based on its `RestartPolicy`.
+Unlike pod eviction, if a container is OOM killed, the kubelet can restart it
+based on its `restartPolicy`.
 
-### Best practices {#node-pressure-eviction-good-practices}
+## Good practices {#node-pressure-eviction-good-practices}
 
-The following sections describe best practices for eviction configuration.
+The following sections describe good practice for eviction configuration.
 
-#### Schedulable resources and eviction policies
+### Schedulable resources and eviction policies
 
 When you configure the kubelet with an eviction policy, you should make sure that
 the scheduler will not schedule pods if they will trigger eviction because they
 immediately induce memory pressure.
 
 Consider the following scenario:
 
-- Node memory capacity: `10Gi`
+- Node memory capacity: 10GiB
 - Operator wants to reserve 10% of memory capacity for system daemons (kernel, `kubelet`, etc.)
 - Operator wants to evict Pods at 95% memory utilization to reduce incidence of system OOM.
 
 For this to work, the kubelet is launched as follows:
 
-```
+```none
 --eviction-hard=memory.available<500Mi
 --system-reserved=memory=1.5Gi
 ```
 
-In this configuration, the `--system-reserved` flag reserves `1.5Gi` of memory
+In this configuration, the `--system-reserved` flag reserves 1.5GiB of memory
 for the system, which is `10% of the total memory + the eviction threshold amount`.
 
 The node can reach the eviction threshold if a pod is using more than its request,
-or if the system is using more than `1Gi` of memory, which makes the `memory.available`
-signal fall below `500Mi` and triggers the threshold.
+or if the system is using more than 1GiB of memory, which makes the `memory.available`
+signal fall below 500MiB and triggers the threshold.
 
-#### DaemonSet
+### DaemonSets and node-pressure eviction {#daemonset}
 
-Pod Priority is a major factor in making eviction decisions. If you do not want
-the kubelet to evict pods that belong to a `DaemonSet`, give those pods a high
-enough `priorityClass` in the pod spec. You can also use a lower `priorityClass`
-or the default to only allow `DaemonSet` pods to run when there are enough
-resources.
+Pod priority is a major factor in making eviction decisions. If you do not want
+the kubelet to evict pods that belong to a DaemonSet, give those pods a high
+enough priority by specifying a suitable `priorityClassName` in the pod spec.
+You can also use a lower priority, or the default, to only allow pods from that
+DaemonSet to run when there are enough resources.
 
-### Known issues
+## Known issues
 
 The following sections describe known issues related to out of resource handling.
 
-#### kubelet may not observe memory pressure right away
+### kubelet may not observe memory pressure right away
 
-By default, the kubelet polls `cAdvisor` to collect memory usage stats at a
+By default, the kubelet polls cAdvisor to collect memory usage stats at a
 regular interval. If memory usage increases within that window rapidly, the
-kubelet may not observe `MemoryPressure` fast enough, and the `OOMKiller`
+kubelet may not observe `MemoryPressure` fast enough, and the OOM killer
 will still be invoked.
 
 You can use the `--kernel-memcg-notification` flag to enable the `memcg`
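The reservation shown with command-line flags above can also be expressed in a kubelet configuration file. A sketch (not part of this commit) using the equivalent `KubeletConfiguration` fields:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "500Mi"   # same as --eviction-hard=memory.available<500Mi
systemReserved:
  memory: "1.5Gi"             # same as --system-reserved=memory=1.5Gi
```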
@@ -396,10 +399,10 @@ If you are not trying to achieve extreme utilization, but a sensible measure of
 overcommit, a viable workaround for this issue is to use the `--kube-reserved`
 and `--system-reserved` flags to allocate memory for the system.
 
-#### active_file memory is not considered as available memory
+### active_file memory is not considered as available memory
 
 On Linux, the kernel tracks the number of bytes of file-backed memory on active
-LRU list as the `active_file` statistic. The kubelet treats `active_file` memory
+least recently used (LRU) list as the `active_file` statistic. The kubelet treats `active_file` memory
 areas as not reclaimable. For workloads that make intensive use of block-backed
 local storage, including ephemeral local storage, kernel-level caches of file
 and block data means that many recently accessed cache pages are likely to be
