@@ -18,22 +18,25 @@ selected pods to `Failed`. This terminates the pods.
Node-pressure eviction is not the same as
[API-initiated eviction](/docs/concepts/scheduling-eviction/api-eviction/).

- The kubelet does not respect your configured `PodDisruptionBudget` or the pod's
+ The kubelet does not respect your configured {{<glossary_tooltip term_id="pod-disruption-budget" text="PodDisruptionBudget">}}
+ or the pod's
`terminationGracePeriodSeconds`. If you use [soft eviction thresholds](#soft-eviction-thresholds),
the kubelet respects your configured `eviction-max-pod-grace-period`. If you use
- [hard eviction thresholds](#hard-eviction-thresholds), it uses a `0s` grace period for termination.
+ [hard eviction thresholds](#hard-eviction-thresholds), the kubelet uses a `0s` grace period (immediate shutdown) for termination.
+
+ ## Self healing behavior
+
+ The kubelet attempts to [reclaim node-level resources](#reclaim-node-resources)
+ before it terminates end-user pods. For example, it removes unused container
+ images when disk resources are starved.

If the pods are managed by a {{< glossary_tooltip text="workload" term_id="workload" >}}
resource (such as {{< glossary_tooltip text="StatefulSet" term_id="statefulset" >}}
or {{< glossary_tooltip text="Deployment" term_id="deployment" >}}) that
replaces failed pods, the control plane or `kube-controller-manager` creates new
pods in place of the evicted pods.

- {{<note>}}
- The kubelet attempts to [reclaim node-level resources](#reclaim-node-resources)
- before it terminates end-user pods. For example, it removes unused container
- images when disk resources are starved.
- {{</note>}}
+ ## Eviction signals and thresholds

The kubelet uses various parameters to make eviction decisions, like the following:
@@ -48,7 +51,7 @@ point in time. Kubelet uses eviction signals to make eviction decisions by
comparing the signals to eviction thresholds, which are the minimum amount of
the resource that should be available on the node.

- Kubelet uses the following eviction signals:
+ On Linux, the kubelet uses the following eviction signals:

| Eviction Signal      | Description                                                                 |
|----------------------|-----------------------------------------------------------------------------|
@@ -59,7 +62,7 @@ Kubelet uses the following eviction signals:
| `imagefs.inodesFree` | `imagefs.inodesFree` := `node.stats.runtime.imagefs.inodesFree`             |
| `pid.available`      | `pid.available` := `node.stats.rlimit.maxpid` - `node.stats.rlimit.curproc` |

- In this table, the `Description` column shows how kubelet gets the value of the
+ In this table, the **Description** column shows how kubelet gets the value of the
signal. Each signal supports either a percentage or a literal value. Kubelet
calculates the percentage value relative to the total capacity associated with
the signal.
@@ -71,18 +74,19 @@ feature, out of resource decisions
are made local to the end user Pod part of the cgroup hierarchy as well as the
root node. This [script](/examples/admin/resource/memory-available.sh)
reproduces the same set of steps that the kubelet performs to calculate
- `memory.available`. The kubelet excludes inactive_file (i.e. # of bytes of
- file-backed memory on inactive LRU list) from its calculation as it assumes that
+ `memory.available`. The kubelet excludes inactive_file (the number of bytes of
+ file-backed memory on the inactive LRU list) from its calculation, as it assumes that
memory is reclaimable under pressure.
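In essence, the calculation subtracts the working set from total capacity, where the working set excludes inactive_file. The sketch below is illustrative only, with made-up numbers; the kubelet's real steps read cgroup statistics, as shown in the linked script:

```python
# Illustrative sketch of the memory.available calculation; the real
# implementation reads cgroup statistics (see the linked script).
def memory_available(capacity_bytes, usage_bytes, inactive_file_bytes):
    # Working set = total usage minus file-backed memory on the inactive
    # LRU list, which the kernel can reclaim under memory pressure.
    working_set = usage_bytes - inactive_file_bytes
    return capacity_bytes - working_set

# Example: 10 GiB node, 8 GiB in use, 2 GiB of that is inactive file cache.
gib = 1024 ** 3
print(memory_available(10 * gib, 8 * gib, 2 * gib) // gib)  # -> 4
```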

- The kubelet supports the following filesystem partitions:
+ The kubelet recognizes two specific filesystem identifiers:

- 1. `nodefs`: The node's main filesystem, used for local disk volumes, emptyDir,
-    log storage, and more. For example, `nodefs` contains `/var/lib/kubelet/`.
+ 1. `nodefs`: The node's main filesystem, used for local disk volumes, emptyDir
+    volumes not backed by memory, log storage, and more.
+    For example, `nodefs` contains `/var/lib/kubelet/`.
1. `imagefs`: An optional filesystem that container runtimes use to store container
   images and container writable layers.

- Kubelet auto-discovers these filesystems and ignores other filesystems. Kubelet
+ Kubelet auto-discovers these filesystems and ignores other node local filesystems. Kubelet
does not support other configurations.

Some kubelet garbage collection features are deprecated in favor of eviction:
@@ -98,7 +102,8 @@ Some kubelet garbage collection features are deprecated in favor of eviction:
### Eviction thresholds

You can specify custom eviction thresholds for the kubelet to use when it makes
- eviction decisions.
+ eviction decisions. You can configure [soft](#soft-eviction-thresholds) and
+ [hard](#hard-eviction-thresholds) eviction thresholds.

Eviction thresholds have the form `[eviction-signal][operator][quantity]`, where:

@@ -109,18 +114,16 @@ Eviction thresholds have the form `[eviction-signal][operator][quantity]`, where:
must match the quantity representation used by Kubernetes. You can use either
literal values or percentages (`%`).

- For example, if a node has `10Gi` of total memory and you want trigger eviction if
- the available memory falls below `1Gi`, you can define the eviction threshold as
- either `memory.available<10%` or `memory.available<1Gi`. You cannot use both.
-
- You can configure soft and hard eviction thresholds.
+ For example, if a node has 10GiB of total memory and you want to trigger eviction if
+ the available memory falls below 1GiB, you can define the eviction threshold as
+ either `memory.available<10%` or `memory.available<1Gi` (you cannot use both).
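For illustration, such thresholds can also be expressed in a kubelet configuration file. This is a sketch with example values; the field names come from the `KubeletConfiguration` API:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "1Gi"
  nodefs.available: "10%"
```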

#### Soft eviction thresholds {#soft-eviction-thresholds}

A soft eviction threshold pairs an eviction threshold with a required
administrator-specified grace period. The kubelet does not evict pods until the
- grace period is exceeded. The kubelet returns an error on startup if there is no
- specified grace period.
+ grace period is exceeded. The kubelet returns an error on startup if you do
+ not specify a grace period.
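For example (illustrative values, not recommendations), a soft threshold is always paired with a matching grace period:

```none
--eviction-soft=memory.available<1.5Gi
--eviction-soft-grace-period=memory.available=1m30s
```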

You can specify both a soft eviction threshold grace period and a maximum
allowed pod termination grace period for kubelet to use during evictions. If you
@@ -160,16 +163,16 @@ then the values of other parameters will not be inherited as the default
values and will be set to zero. In order to provide custom values, you
should provide all the thresholds respectively.

- ### Eviction monitoring interval
+ ## Eviction monitoring interval

- The kubelet evaluates eviction thresholds based on its configured `housekeeping-interval`
+ The kubelet evaluates eviction thresholds based on its configured `housekeeping-interval`,
which defaults to `10s`.

- ### Node conditions {#node-conditions}
+ ## Node conditions {#node-conditions}

- The kubelet reports node conditions to reflect that the node is under pressure
- because hard or soft eviction threshold is met, independent of configured grace
- periods.
+ The kubelet reports [node conditions](/docs/concepts/architecture/nodes/#condition)
+ to reflect that the node is under pressure because a hard or soft eviction
+ threshold is met, independent of configured grace periods.

The kubelet maps eviction signals to node conditions as follows:

@@ -182,7 +185,7 @@ The kubelet maps eviction signals to node conditions as follows:
The kubelet updates the node conditions based on the configured
`--node-status-update-frequency`, which defaults to `10s`.

- #### Node condition oscillation
+ ### Node condition oscillation

In some cases, nodes oscillate above and below soft eviction thresholds without
holding for the defined grace periods. This causes the reported node condition
@@ -237,9 +240,9 @@ As a result, kubelet ranks and evicts pods in the following order:
are evicted last, based on their Priority.

{{<note>}}
- The kubelet does not use the pod's QoS class to determine the eviction order.
+ The kubelet does not use the pod's [QoS class](/docs/concepts/workloads/pods/pod-qos/) to determine the eviction order.
You can use the QoS class to estimate the most likely pod eviction order when
- reclaiming resources like memory. QoS does not apply to EphemeralStorage requests,
+ reclaiming resources like memory. QoS classification does not apply to EphemeralStorage requests,
so the above scenario will not apply if the node is, for example, under `DiskPressure`.
{{</note>}}

@@ -253,8 +256,8 @@ then the kubelet must choose to evict one of these pods to preserve node stability
and to limit the impact of resource starvation on other pods. In this case, it
will choose to evict pods of lowest Priority first.

- When the kubelet evicts pods in response to `inode` or `PID` starvation, it uses
- the Priority to determine the eviction order, because `inodes` and `PIDs` have no
+ When the kubelet evicts pods in response to inode or process ID starvation, it uses
+ the pods' relative priority to determine the eviction order, because inodes and PIDs have no
requests.

The kubelet sorts pods differently based on whether the node has a dedicated
@@ -300,31 +303,31 @@ evictionMinimumReclaim:
```

In this example, if the `nodefs.available` signal meets the eviction threshold,
- the kubelet reclaims the resource until the signal reaches the threshold of `1Gi`,
- and then continues to reclaim the minimum amount of `500Mi` it until the signal
- reaches `1.5Gi`.
+ the kubelet reclaims the resource until the signal reaches the threshold of 1GiB,
+ and then continues to reclaim the minimum amount of 500MiB, until the available nodefs storage value reaches 1.5GiB.

- Similarly, the kubelet reclaims the `imagefs` resource until the `imagefs.available`
- signal reaches `102Gi`.
+ Similarly, the kubelet tries to reclaim the `imagefs` resource until the `imagefs.available`
+ value reaches `102Gi`, representing 102 GiB of available container image storage. If the amount
+ of storage that the kubelet could reclaim is less than 2GiB, the kubelet doesn't reclaim anything.

The default `eviction-minimum-reclaim` is `0` for all resources.

- ### Node out of memory behavior
+ ## Node out of memory behavior

- If the node experiences an out of memory (OOM) event prior to the kubelet
+ If the node experiences an _out of memory_ (OOM) event prior to the kubelet
being able to reclaim memory, the node depends on the [oom_killer](https://lwn.net/Articles/391222/)
to respond.

The kubelet sets an `oom_score_adj` value for each container based on the QoS for the pod.

- | Quality of Service | oom_score_adj                                                                       |
+ | Quality of Service | `oom_score_adj`                                                                     |
|--------------------|--------------------------------------------------------------------------------------|
| `Guaranteed`       | -997                                                                                 |
| `BestEffort`       | 1000                                                                                 |
- | `Burstable`      | min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999)    |
+ | `Burstable`      | _min(max(2, 1000 - (1000 × memoryRequestBytes) / machineMemoryCapacityBytes), 999)_  |
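The `Burstable` formula in the table can be sketched in code. This is an illustrative restatement of the formula only, not the kubelet's actual implementation:

```python
# Illustrative sketch of the Burstable oom_score_adj formula shown above.
def burstable_oom_score_adj(memory_request_bytes, machine_memory_capacity_bytes):
    score = 1000 - (1000 * memory_request_bytes) // machine_memory_capacity_bytes
    # Clamp into [2, 999] so Burstable containers always score higher
    # (more killable) than Guaranteed containers at -997.
    return min(max(2, score), 999)

gib = 1024 ** 3
# A container requesting 1 GiB of memory on a 10 GiB node:
print(burstable_oom_score_adj(1 * gib, 10 * gib))  # -> 900
```

Containers that request a larger share of node memory get a lower score, making them less likely to be OOM killed.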

{{<note>}}
- The kubelet also sets an `oom_score_adj` value of `-997` for containers in Pods that have
+ The kubelet also sets an `oom_score_adj` value of `-997` for any containers in Pods that have
`system-node-critical` {{<glossary_tooltip text="Priority" term_id="pod-priority">}}.
{{</note>}}

@@ -336,56 +339,56 @@ for each container. It then kills the container with the highest score.
This means that containers in low QoS pods that consume a large amount of memory
relative to their scheduling requests are killed first.

- Unlike pod eviction, if a container is OOM killed, the `kubelet` can restart it
- based on its `RestartPolicy`.
+ Unlike pod eviction, if a container is OOM killed, the kubelet can restart it
+ based on its `restartPolicy`.

- ### Best practices {#node-pressure-eviction-good-practices}
+ ## Good practices {#node-pressure-eviction-good-practices}

- The following sections describe best practices for eviction configuration.
+ The following sections describe good practices for eviction configuration.

- #### Schedulable resources and eviction policies
+ ### Schedulable resources and eviction policies

When you configure the kubelet with an eviction policy, you should make sure that
the scheduler will not schedule pods if they will trigger eviction because they
immediately induce memory pressure.

Consider the following scenario:

- - Node memory capacity: `10Gi`
+ - Node memory capacity: 10GiB
- Operator wants to reserve 10% of memory capacity for system daemons (kernel, `kubelet`, etc.)
- Operator wants to evict Pods at 95% memory utilization to reduce incidence of system OOM.

For this to work, the kubelet is launched as follows:

- ```
+ ```none
--eviction-hard=memory.available<500Mi
--system-reserved=memory=1.5Gi
```

- In this configuration, the `--system-reserved` flag reserves `1.5Gi` of memory
+ In this configuration, the `--system-reserved` flag reserves 1.5GiB of memory
for the system, which is `10% of the total memory + the eviction threshold amount`.

The node can reach the eviction threshold if a pod is using more than its request,
- or if the system is using more than `1Gi` of memory, which makes the `memory.available`
- signal fall below `500Mi` and triggers the threshold.
+ or if the system is using more than 1GiB of memory, which makes the `memory.available`
+ signal fall below 500MiB and triggers the threshold.

- #### DaemonSet
+ ### DaemonSets and node-pressure eviction {#daemonset}

- Pod Priority is a major factor in making eviction decisions. If you do not want
- the kubelet to evict pods that belong to a `DaemonSet`, give those pods a high
- enough `priorityClass` in the pod spec. You can also use a lower `priorityClass`
- or the default to only allow `DaemonSet` pods to run when there are enough
- resources.
+ Pod priority is a major factor in making eviction decisions. If you do not want
+ the kubelet to evict pods that belong to a DaemonSet, give those pods a high
+ enough priority by specifying a suitable `priorityClassName` in the pod spec.
+ You can also use a lower priority, or the default, to only allow pods from that
+ DaemonSet to run when there are enough resources.

- ### Known issues
+ ## Known issues

The following sections describe known issues related to out of resource handling.

- #### kubelet may not observe memory pressure right away
+ ### kubelet may not observe memory pressure right away

- By default, the kubelet polls `cAdvisor` to collect memory usage stats at a
+ By default, the kubelet polls cAdvisor to collect memory usage stats at a
regular interval. If memory usage increases rapidly within that window, the
- kubelet may not observe `MemoryPressure` fast enough, and the `OOMKiller`
+ kubelet may not observe `MemoryPressure` fast enough, and the OOM killer
will still be invoked.

You can use the `--kernel-memcg-notification` flag to enable the `memcg`
@@ -396,10 +399,10 @@ If you are not trying to achieve extreme utilization, but a sensible measure of
overcommit, a viable workaround for this issue is to use the `--kube-reserved`
and `--system-reserved` flags to allocate memory for the system.

- #### active_file memory is not considered as available memory
+ ### active_file memory is not considered as available memory

On Linux, the kernel tracks the number of bytes of file-backed memory on the active
- LRU list as the `active_file` statistic. The kubelet treats `active_file` memory
+ least recently used (LRU) list as the `active_file` statistic. The kubelet treats `active_file` memory
areas as not reclaimable. For workloads that make intensive use of block-backed
local storage, including ephemeral local storage, kernel-level caches of file
and block data mean that many recently accessed cache pages are likely to be