This page is an overview of Kubernetes' policy for eviction.
<!-- body -->
## Eviction Policy
The {{< glossary_tooltip text="Kubelet" term_id="kubelet" >}} can proactively monitor for and prevent total starvation of a
compute resource. In those cases, the `kubelet` can reclaim the starved
resource by proactively failing one or more Pods. When the `kubelet` fails
a Pod, it terminates all of its containers and transitions its `PodPhase` to `Failed`.
If the evicted Pod is managed by a Deployment, the Deployment will create another Pod
to be scheduled by Kubernetes.
## {{% heading "whatsnext" %}}
- Read [Configure out of resource handling](/docs/tasks/administer-cluster/out-of-resource/) to learn more about eviction signals, thresholds, and handling.
content/en/docs/concepts/scheduling-eviction/kube-scheduler.md
kube-scheduler is designed so that, if you want and need to, you can
write your own scheduling component and use that instead.
For every newly created pod or other unscheduled pods, kube-scheduler
selects an optimal node for them to run on. However, every container in
a Pod has different resource requirements, and every Pod also has
different requirements. Therefore, existing nodes need to be filtered
according to the specific scheduling requirements.
If there is more than one node with equal scores, kube-scheduler selects one of these at random.

There are two supported ways to configure the filtering and scoring behavior
of the scheduler:
1. [Scheduling Policies](/docs/reference/scheduling/policies) allow you to configure _Predicates_ for filtering and _Priorities_ for scoring.
1. [Scheduling Profiles](/docs/reference/scheduling/profiles) allow you to configure Plugins that implement different scheduling stages, including: `QueueSort`, `Filter`, `Score`, `Bind`, `Reserve`, `Permit`, and others. You can also configure the kube-scheduler to run different profiles (see the example configuration below).
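
For illustration, here is a minimal sketch of a profile-based configuration. The `apiVersion` varies by Kubernetes release, and the second profile's name and disabled plugins are assumptions made up for this example, not something defined on this page:

```yaml
# Hedged sketch: a KubeSchedulerConfiguration that runs two profiles.
# "no-scoring-scheduler" is an assumed example name; use the apiVersion
# that matches your Kubernetes version.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
  - schedulerName: no-scoring-scheduler
    plugins:
      preScore:
        disabled:
          - name: '*'
      score:
        disabled:
          - name: '*'
```

A Pod opts into a particular profile by setting `.spec.schedulerName` to that profile's `schedulerName`.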
* `eviction-signal` is an eviction signal token as defined in the previous table.
* `operator` is the desired relational operator, such as `<` (less than).
* `quantity` is the eviction threshold quantity, such as `1Gi`. These tokens must match the quantity representation used by Kubernetes. An eviction threshold can also be expressed as a percentage using the `%` token.
For example, if a node has `10Gi` of total memory and you want to trigger eviction if
the available memory falls below `1Gi`, you can define the eviction threshold as either
`memory.available<10%` or `memory.available<1Gi`.
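
As a hedged illustration (the values are examples, not recommendations), either form of that threshold could be passed to a flag such as `--eviction-hard`:

```shell
# Illustrative only: trigger eviction when available memory drops below 1Gi ...
kubelet --eviction-hard="memory.available<1Gi"
# ... or, equivalently on a 10Gi node, when it drops below 10% of capacity.
kubelet --eviction-hard="memory.available<10%"
```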
To configure soft eviction thresholds, the following flags are supported:
* `eviction-soft` describes a set of eviction thresholds (e.g. `memory.available<1.5Gi`) that, if met over a corresponding grace period, would trigger a Pod eviction.
* `eviction-soft-grace-period` describes a set of eviction grace periods (e.g. `memory.available=1m30s`) that correspond to how long a soft eviction threshold must hold before triggering a Pod eviction.
* `eviction-max-pod-grace-period` describes the maximum allowed grace period (in seconds) to use when terminating pods in response to a soft eviction threshold being met (see the combined example below).
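
A minimal sketch combining the three flags above (the quantities are illustrative assumptions modeled on the examples in this section, not recommendations):

```shell
kubelet --eviction-soft="memory.available<1.5Gi" \
        --eviction-soft-grace-period="memory.available=1m30s" \
        --eviction-max-pod-grace-period=60
```

With this configuration, the `kubelet` would evict Pods only if `memory.available` stays below `1.5Gi` for 1 minute 30 seconds, and would cap the graceful termination period of evicted Pods at 60 seconds.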
#### Hard Eviction Thresholds
To configure hard eviction thresholds, the following flag is supported:
* `eviction-hard` describes a set of eviction thresholds (e.g. `memory.available<1Gi`) that, if met, would trigger a Pod eviction, as in the example below.
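
For example (an illustrative sketch; multiple signals can be combined in a single flag, and the quantities are assumptions rather than recommendations):

```shell
kubelet --eviction-hard="memory.available<1Gi,nodefs.available<10%"
```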
The `kubelet` has the following default hard eviction threshold:
The following node conditions are defined that correspond to the specified eviction signal.

| Node Condition | Eviction Signal | Description |
|----------------|-----------------|-------------|
|`MemoryPressure`|`memory.available`| Available memory on the node has satisfied an eviction threshold |
|`DiskPressure`|`nodefs.available`, `nodefs.inodesFree`, `imagefs.available`, or `imagefs.inodesFree`| Available disk space and inodes on either the node's root filesystem or image filesystem has satisfied an eviction threshold |
The `kubelet` continues to report node status updates at the frequency specified by
`--node-status-update-frequency`, which defaults to `10s`.
To protect against this oscillation, the following flag is defined to control how
long the `kubelet` must wait before transitioning out of a pressure condition.
* `eviction-pressure-transition-period` is the duration for which the `kubelet` has to wait before transitioning out of an eviction pressure condition.
The `kubelet` would ensure that it has not observed an eviction threshold being met
for the specified pressure condition for the period specified before toggling the condition back to `false`.
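
A minimal sketch of setting this flag (the value is an illustrative assumption):

```shell
kubelet --eviction-pressure-transition-period=5m0s
```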
As a result, `kubelet` ranks and evicts Pods in the following order:
* `BestEffort` or `Burstable` Pods whose usage of a starved resource exceeds its request. Such pods are ranked by Priority, and then usage above request.
* `Guaranteed` pods and `Burstable` pods whose usage is beneath requests are evicted last. `Guaranteed` Pods are guaranteed only when requests and limits are specified for all the containers and they are equal. Such pods are guaranteed to never be evicted because of another Pod's resource consumption. If a system daemon (such as `kubelet`, `docker`, and `journald`) is consuming more resources than were reserved via `system-reserved` or `kube-reserved` allocations, and the node only has `Guaranteed` or `Burstable` Pods using less than requests remaining, then the node must choose to evict such a Pod in order to preserve node stability and to limit the impact of the unexpected consumption to other Pods. In this case, it will choose to evict pods of Lowest Priority first.
If necessary, `kubelet` evicts Pods one at a time to reclaim disk when `DiskPressure`
is encountered. If the `kubelet` is responding to `inode` starvation, it reclaims `inodes` by evicting Pods with the lowest quality of service first. If the `kubelet` is responding to lack of disk, it ranks Pods within a quality of service that consumes the largest amount of disk and kills those first.

#### With `imagefs`
If `nodefs` is triggering evictions, `kubelet` sorts Pods based on the usage on `nodefs` - local volumes + logs of all its containers.
If `imagefs` is triggering evictions, `kubelet` sorts Pods based on the writable layer usage of all its containers.
#### Without `imagefs`
If `nodefs` is triggering evictions, `kubelet` sorts Pods based on their total disk usage - local volumes + logs & writable layer of all its containers.
### Minimum eviction reclaim
In certain scenarios, eviction of Pods could result in reclamation of a small amount of resources. This can result in
`kubelet` hitting eviction thresholds in repeated succession. In addition to that, eviction of resources like `disk` is time consuming.
To mitigate these issues, `kubelet` can have a per-resource `minimum-reclaim`. Whenever `kubelet` observes
resource pressure, `kubelet` attempts to reclaim at least `minimum-reclaim` amount of resource below the configured eviction threshold.
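
As a hedged sketch (the quantities are illustrative assumptions), pairing hard thresholds with a per-resource minimum reclaim could look like this:

```shell
kubelet --eviction-hard="memory.available<500Mi,nodefs.available<1Gi,imagefs.available<100Gi" \
        --eviction-minimum-reclaim="memory.available=0Mi,nodefs.available=500Mi,imagefs.available=2Gi"
```

With this configuration, once `nodefs.available` falls below `1Gi`, the `kubelet` keeps reclaiming until it observes at least `1.5Gi` available again: the threshold plus the minimum reclaim.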
The node reports a condition when a compute resource is under pressure. The
scheduler views that condition as a signal to dissuade placing additional Pods on the node.

If the `kubelet` is unable to reclaim memory prior to a node experiencing system OOM, the `oom_killer` calculates
an `oom_score` based on the percentage of memory it's using on the node, and then adds the `oom_score_adj` to get an effective `oom_score` for each container; it then kills the container with the highest score.
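
As a rough worked example (the QoS-based `oom_score_adj` values used here are the commonly documented ones and are an assumption on this page, not taken from it): a `Burstable` Pod requesting `4Gi` of memory on a `16Gi` node would get `oom_score_adj = 1000 - (1000 * 4) / 16 = 750`. If one of its containers is then using about 30% of the node's memory, its base `oom_score` is roughly 300, so its effective score of about 1050 makes it a much likelier OOM-kill target than a `Guaranteed` Pod (with an `oom_score_adj` of `-998`) using the same amount of memory.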
### DaemonSet
As `Priority` is a key factor in the eviction strategy, if you do not want pods belonging to a `DaemonSet` to be evicted, specify a sufficiently high priorityClass in the pod spec template. If you want pods belonging to a `DaemonSet` to run only if there are sufficient resources, specify a lower or default priorityClass.
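
For illustration, a minimal sketch of the high-priority case; the PriorityClass name, its value, and the DaemonSet details are assumptions made up for this example:

```yaml
# Hedged sketch: a high PriorityClass plus a DaemonSet that references it.
# All names, the priority value, and the image are example assumptions.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: daemonset-high
value: 1000000        # high value so these pods are ranked for eviction last
globalDefault: false
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent
spec:
  selector:
    matchLabels:
      app: node-agent
  template:
    metadata:
      labels:
        app: node-agent
    spec:
      priorityClassName: daemonset-high   # ties the DaemonSet's pods to the PriorityClass
      containers:
        - name: agent
          image: registry.k8s.io/pause:3.9   # placeholder image for the sketch
```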
## Deprecation of existing feature flags to reclaim disk
As disk based eviction matures, the following `kubelet` flags are marked for deprecation
in favor of the simpler configuration supported around eviction.

| Existing Flag | New Flag |
| ------------- | -------- |
|`--image-gc-high-threshold`|`--eviction-hard` or `--eviction-soft`|