
Commit b330bb0

gm7y8 authored and Tim Bannister committed
Move eviction-policy from tasks to concepts
add what's next to eviction policy
1 parent f0a32c7 commit b330bb0

File tree

5 files changed: 74 additions & 78 deletions
Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+---
+title: Eviction Policy
+content_template: templates/concept
+weight: 60
+---
+
+<!-- overview -->
+
+This page is an overview of Kubernetes' policy for eviction.
+
+<!-- body -->
+
+## Eviction Policy
+
+The {{< glossary_tooltip text="Kubelet" term_id="kubelet" >}} can proactively monitor for and prevent total starvation of a
+compute resource. In those cases, the `kubelet` can reclaim the starved
+resource by proactively failing one or more Pods. When the `kubelet` fails
+a Pod, it terminates all of its containers and transitions its `PodPhase` to `Failed`.
+If the evicted Pod is managed by a Deployment, the Deployment will create another Pod
+to be scheduled by Kubernetes.
+
+## {{% heading "whatsnext" %}}
+- Read [Configure out of resource handling](/docs/tasks/administer-cluster/out-of-resource/) to learn more about eviction signals, thresholds, and handling.
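As a quick illustration of the behaviour described on this new page, here is a hedged sketch of how you might list Pods that the kubelet has transitioned to the `Failed` phase; an evicted Pod typically shows an `Evicted` reason when described:

```bash
# List Pods whose phase is Failed across the cluster; evicted Pods land here.
kubectl get pods --all-namespaces --field-selector=status.phase=Failed
```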

content/en/docs/concepts/scheduling-eviction/kube-scheduler.md

Lines changed: 4 additions & 7 deletions
@@ -33,7 +33,7 @@ kube-scheduler is designed so that, if you want and need to, you can
 write your own scheduling component and use that instead.

 For every newly created pod or other unscheduled pods, kube-scheduler
-selects an optimal node for them to run on. However, every container in
+selects an optimal node for them to run on. However, every container in
 pods has different requirements for resources and every pod also has
 different requirements. Therefore, existing nodes need to be filtered
 according to the specific scheduling requirements.
@@ -77,12 +77,9 @@ one of these at random.
 There are two supported ways to configure the filtering and scoring behavior
 of the scheduler:

-1. [Scheduling Policies](/docs/reference/scheduling/policies) allow you to
-configure _Predicates_ for filtering and _Priorities_ for scoring.
-1. [Scheduling Profiles](/docs/reference/scheduling/config/#profiles) allow you
-to configure Plugins that implement different scheduling stages, including:
-`QueueSort`, `Filter`, `Score`, `Bind`, `Reserve`, `Permit`, and others. You
-can also configure the kube-scheduler to run different profiles.
+
+1. [Scheduling Policies](/docs/reference/scheduling/policies) allow you to configure _Predicates_ for filtering and _Priorities_ for scoring.
+1. [Scheduling Profiles](/docs/reference/scheduling/profiles) allow you to configure Plugins that implement different scheduling stages, including: `QueueSort`, `Filter`, `Score`, `Bind`, `Reserve`, `Permit`, and others. You can also configure the kube-scheduler to run different profiles.
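For illustration, a rough sketch of how each approach is typically wired up on the kube-scheduler command line; the file paths are placeholders and flag availability depends on the kube-scheduler version in use:

```bash
# Scheduling Policies: load Predicates and Priorities from a policy file.
kube-scheduler --policy-config-file=/etc/kubernetes/scheduler-policy.json

# Scheduling Profiles: load a KubeSchedulerConfiguration whose profiles
# configure plugins for stages such as QueueSort, Filter, Score, and Bind.
kube-scheduler --config=/etc/kubernetes/scheduler-config.yaml
```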


 ## {{% heading "whatsnext" %}}

content/en/docs/concepts/scheduling-eviction/scheduler-perf-tuning.md

Lines changed: 4 additions & 1 deletion
@@ -3,7 +3,7 @@ reviewers:
 - bsalamat
 title: Scheduler Performance Tuning
 content_type: concept
-weight: 70
+weight: 80
 ---

 <!-- overview -->
@@ -48,10 +48,13 @@ To change the value, edit the kube-scheduler configuration file (this is likely
 to be `/etc/kubernetes/config/kube-scheduler.yaml`), then restart the scheduler.

 After you have made this change, you can run
+
 ```bash
 kubectl get componentstatuses
 ```
+
 to verify that the kube-scheduler component is healthy. The output is similar to:
+
 ```
 NAME                 STATUS    MESSAGE   ERROR
 controller-manager   Healthy   ok

content/en/docs/concepts/scheduling-eviction/scheduling-framework.md

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@ reviewers:
 - ahg-g
 title: Scheduling Framework
 content_type: concept
-weight: 60
+weight: 70
 ---

 <!-- overview -->

content/en/docs/tasks/administer-cluster/out-of-resource.md

Lines changed: 42 additions & 69 deletions
@@ -18,28 +18,19 @@ nodes become unstable.

 <!-- body -->

-## Eviction Policy
-
-The `kubelet` can proactively monitor for and prevent total starvation of a
-compute resource. In those cases, the `kubelet` can reclaim the starved
-resource by proactively failing one or more Pods. When the `kubelet` fails
-a Pod, it terminates all of its containers and transitions its `PodPhase` to `Failed`.
-If the evicted Pod is managed by a Deployment, the Deployment will create another Pod
-to be scheduled by Kubernetes.
-
 ### Eviction Signals

 The `kubelet` supports eviction decisions based on the signals described in the following
 table. The value of each signal is described in the Description column, which is based on
 the `kubelet` summary API.

-| Eviction Signal | Description |
-|----------------------------|-----------------------------------------------------------------------|
-| `memory.available` | `memory.available` := `node.status.capacity[memory]` - `node.stats.memory.workingSet` |
-| `nodefs.available` | `nodefs.available` := `node.stats.fs.available` |
-| `nodefs.inodesFree` | `nodefs.inodesFree` := `node.stats.fs.inodesFree` |
-| `imagefs.available` | `imagefs.available` := `node.stats.runtime.imagefs.available` |
-| `imagefs.inodesFree` | `imagefs.inodesFree` := `node.stats.runtime.imagefs.inodesFree` |
+| Eviction Signal      | Description                                                                            |
+|----------------------|----------------------------------------------------------------------------------------|
+| `memory.available`   | `memory.available` := `node.status.capacity[memory]` - `node.stats.memory.workingSet`  |
+| `nodefs.available`   | `nodefs.available` := `node.stats.fs.available`                                        |
+| `nodefs.inodesFree`  | `nodefs.inodesFree` := `node.stats.fs.inodesFree`                                      |
+| `imagefs.available`  | `imagefs.available` := `node.stats.runtime.imagefs.available`                          |
+| `imagefs.inodesFree` | `imagefs.inodesFree` := `node.stats.runtime.imagefs.inodesFree`                        |
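These signals come from the `kubelet` summary API mentioned above. A hedged sketch of inspecting that data for one node, assuming the API server can proxy to the node (the node name is a placeholder):

```bash
NODE_NAME=node-1   # placeholder
# The memory and filesystem figures in this JSON back the signals in the table above.
kubectl get --raw "/api/v1/nodes/${NODE_NAME}/proxy/stats/summary"
```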

 Each of the above signals supports either a literal or percentage based value.
 The percentage based value is calculated relative to the total capacity
@@ -65,7 +56,7 @@ memory is reclaimable under pressure.
 `imagefs` is optional. `kubelet` auto-discovers these filesystems using
 cAdvisor. `kubelet` does not care about any other filesystems. Any other types
 of configurations are not currently supported by the kubelet. For example, it is
-*not OK* to store volumes and logs in a dedicated `filesystem`.
+_not OK_ to store volumes and logs in a dedicated `filesystem`.

 In future releases, the `kubelet` will deprecate the existing [garbage
 collection](/docs/concepts/cluster-administration/kubelet-garbage-collection/)
@@ -83,9 +74,7 @@ where:

 * `eviction-signal` is an eviction signal token as defined in the previous table.
 * `operator` is the desired relational operator, such as `<` (less than).
-* `quantity` is the eviction threshold quantity, such as `1Gi`. These tokens must
-match the quantity representation used by Kubernetes. An eviction threshold can also
-be expressed as a percentage using the `%` token.
+* `quantity` is the eviction threshold quantity, such as `1Gi`. These tokens must match the quantity representation used by Kubernetes. An eviction threshold can also be expressed as a percentage using the `%` token.

 For example, if a node has `10Gi` of total memory and you want trigger eviction if
 the available memory falls below `1Gi`, you can define the eviction threshold as
@@ -108,12 +97,9 @@ termination.

 To configure soft eviction thresholds, the following flags are supported:

-* `eviction-soft` describes a set of eviction thresholds (e.g. `memory.available<1.5Gi`) that if met over a
-corresponding grace period would trigger a Pod eviction.
-* `eviction-soft-grace-period` describes a set of eviction grace periods (e.g. `memory.available=1m30s`) that
-correspond to how long a soft eviction threshold must hold before triggering a Pod eviction.
-* `eviction-max-pod-grace-period` describes the maximum allowed grace period (in seconds) to use when terminating
-pods in response to a soft eviction threshold being met.
+* `eviction-soft` describes a set of eviction thresholds (e.g. `memory.available<1.5Gi`) that if met over a corresponding grace period would trigger a Pod eviction.
+* `eviction-soft-grace-period` describes a set of eviction grace periods (e.g. `memory.available=1m30s`) that correspond to how long a soft eviction threshold must hold before triggering a Pod eviction.
+* `eviction-max-pod-grace-period` describes the maximum allowed grace period (in seconds) to use when terminating pods in response to a soft eviction threshold being met.
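For illustration, a hedged sketch of combining these flags on a `kubelet` command line; the values simply reuse the examples above and are not recommendations:

```bash
# Quoting keeps the shell from treating `<` as a redirection.
kubelet \
  --eviction-soft="memory.available<1.5Gi" \
  --eviction-soft-grace-period="memory.available=1m30s" \
  --eviction-max-pod-grace-period=60
```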

 #### Hard Eviction Thresholds

@@ -124,8 +110,7 @@ with no graceful termination.

 To configure hard eviction thresholds, the following flag is supported:

-* `eviction-hard` describes a set of eviction thresholds (e.g. `memory.available<1Gi`) that if met
-would trigger a Pod eviction.
+* `eviction-hard` describes a set of eviction thresholds (e.g. `memory.available<1Gi`) that if met would trigger a Pod eviction.
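Multiple hard thresholds can be supplied as a comma-separated list. A hedged sketch with illustrative values (these are not the kubelet defaults referred to next):

```bash
kubelet --eviction-hard="memory.available<1Gi,nodefs.available<10%"
```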

 The `kubelet` has the following default hard eviction threshold:

@@ -150,10 +135,10 @@ reflects the node is under pressure.

 The following node conditions are defined that correspond to the specified eviction signal.

-| Node Condition | Eviction Signal | Description |
-|-------------------------|-------------------------------|--------------------------------------------|
-| `MemoryPressure` | `memory.available` | Available memory on the node has satisfied an eviction threshold |
-| `DiskPressure` | `nodefs.available`, `nodefs.inodesFree`, `imagefs.available`, or `imagefs.inodesFree` | Available disk space and inodes on either the node's root filesystem or image filesystem has satisfied an eviction threshold |
+| Node Condition    | Eviction Signal                                                                         | Description                                                                                                                   |
+|-------------------|-----------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------|
+| `MemoryPressure`  | `memory.available`                                                                      | Available memory on the node has satisfied an eviction threshold                                                              |
+| `DiskPressure`    | `nodefs.available`, `nodefs.inodesFree`, `imagefs.available`, or `imagefs.inodesFree`   | Available disk space and inodes on either the node's root filesystem or image filesystem has satisfied an eviction threshold  |

 The `kubelet` continues to report node status updates at the frequency specified by
 `--node-status-update-frequency` which defaults to `10s`.
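One hedged way to check whether nodes are currently reporting these conditions, using only standard node status fields (the node name is a placeholder):

```bash
# Condition lines for a single node.
kubectl describe node node-1 | grep -iE 'memorypressure|diskpressure'

# MemoryPressure status for every node.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="MemoryPressure")].status}{"\n"}{end}'
```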
@@ -168,8 +153,7 @@ as a consequence.
 To protect against this oscillation, the following flag is defined to control how
 long the `kubelet` must wait before transitioning out of a pressure condition.

-* `eviction-pressure-transition-period` is the duration for which the `kubelet` has
-to wait before transitioning out of an eviction pressure condition.
+* `eviction-pressure-transition-period` is the duration for which the `kubelet` has to wait before transitioning out of an eviction pressure condition.
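A hedged sketch; the `5m` value is only an example:

```bash
# Require five minutes below the threshold before the pressure condition clears.
kubelet --eviction-pressure-transition-period=5m
```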

 The `kubelet` would ensure that it has not observed an eviction threshold being met
 for the specified pressure condition for the period specified before toggling the
@@ -207,17 +191,8 @@ then by [Priority](/docs/concepts/configuration/pod-priority-preemption/), and t

 As a result, `kubelet` ranks and evicts Pods in the following order:

-* `BestEffort` or `Burstable` Pods whose usage of a starved resource exceeds its request.
-Such pods are ranked by Priority, and then usage above request.
-* `Guaranteed` pods and `Burstable` pods whose usage is beneath requests are evicted last.
-`Guaranteed` Pods are guaranteed only when requests and limits are specified for all
-the containers and they are equal. Such pods are guaranteed to never be evicted because
-of another Pod's resource consumption. If a system daemon (such as `kubelet`, `docker`,
-and `journald`) is consuming more resources than were reserved via `system-reserved` or
-`kube-reserved` allocations, and the node only has `Guaranteed` or `Burstable` Pods using
-less than requests remaining, then the node must choose to evict such a Pod in order to
-preserve node stability and to limit the impact of the unexpected consumption to other Pods.
-In this case, it will choose to evict pods of Lowest Priority first.
+* `BestEffort` or `Burstable` Pods whose usage of a starved resource exceeds its request. Such pods are ranked by Priority, and then usage above request.
+* `Guaranteed` pods and `Burstable` pods whose usage is beneath requests are evicted last. `Guaranteed` Pods are guaranteed only when requests and limits are specified for all the containers and they are equal. Such pods are guaranteed to never be evicted because of another Pod's resource consumption. If a system daemon (such as `kubelet`, `docker`, and `journald`) is consuming more resources than were reserved via `system-reserved` or `kube-reserved` allocations, and the node only has `Guaranteed` or `Burstable` Pods using less than requests remaining, then the node must choose to evict such a Pod in order to preserve node stability and to limit the impact of the unexpected consumption to other Pods. In this case, it will choose to evict pods of Lowest Priority first.
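For illustration, a hedged sketch of listing the QoS class and priority that this ranking takes into account (the column names are arbitrary):

```bash
kubectl get pods --all-namespaces \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,QOS:.status.qosClass,PRIORITY:.spec.priority'
```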

 If necessary, `kubelet` evicts Pods one at a time to reclaim disk when `DiskPressure`
 is encountered. If the `kubelet` is responding to `inode` starvation, it reclaims
@@ -228,20 +203,21 @@ that consumes the largest amount of disk and kills those first.
 #### With `imagefs`

 If `nodefs` is triggering evictions, `kubelet` sorts Pods based on the usage on `nodefs`
+
 - local volumes + logs of all its containers.

 If `imagefs` is triggering evictions, `kubelet` sorts Pods based on the writable layer usage of all its containers.

 #### Without `imagefs`

 If `nodefs` is triggering evictions, `kubelet` sorts Pods based on their total disk usage
+
 - local volumes + logs & writable layer of all its containers.

 ### Minimum eviction reclaim

 In certain scenarios, eviction of Pods could result in reclamation of small amount of resources. This can result in
-`kubelet` hitting eviction thresholds in repeated successions. In addition to that, eviction of resources like `disk`,
-is time consuming.
+`kubelet` hitting eviction thresholds in repeated successions. In addition to that, eviction of resources like `disk`, is time consuming.

 To mitigate these issues, `kubelet` can have a per-resource `minimum-reclaim`. Whenever `kubelet` observes
 resource pressure, `kubelet` attempts to reclaim at least `minimum-reclaim` amount of resource below
@@ -268,10 +244,10 @@ The node reports a condition when a compute resource is under pressure. The
 scheduler views that condition as a signal to dissuade placing additional
 pods on the node.

-| Node Condition | Scheduler Behavior |
-| ---------------- | ------------------------------------------------ |
-| `MemoryPressure` | No new `BestEffort` Pods are scheduled to the node. |
-| `DiskPressure` | No new Pods are scheduled to the node. |
+| Node Condition    | Scheduler Behavior                                   |
+| ------------------| ----------------------------------------------------|
+| `MemoryPressure`  | No new `BestEffort` Pods are scheduled to the node.  |
+| `DiskPressure`    | No new Pods are scheduled to the node.               |

 ## Node OOM Behavior

@@ -280,11 +256,11 @@ the node depends on the [oom_killer](https://lwn.net/Articles/391222/) to respon

 The `kubelet` sets a `oom_score_adj` value for each container based on the quality of service for the Pod.

-| Quality of Service | oom_score_adj |
-|----------------------------|-----------------------------------------------------------------------|
-| `Guaranteed` | -998 |
-| `BestEffort` | 1000 |
-| `Burstable` | min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999) |
+| Quality of Service | oom_score_adj                                                                       |
+|--------------------|-------------------------------------------------------------------------------------|
+| `Guaranteed`       | -998                                                                                |
+| `BestEffort`       | 1000                                                                                |
+| `Burstable`        | min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999)  |
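As a hedged worked example of the `Burstable` formula: a container requesting 4Gi of memory on a node with 16Gi of capacity (both numbers are made up) scores 1000 - (1000 * 4Gi) / 16Gi = 750, which already lies inside the clamp range:

```bash
memoryRequestBytes=$((4 * 1024 * 1024 * 1024))
machineMemoryCapacityBytes=$((16 * 1024 * 1024 * 1024))
score=$((1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes))
# Clamp to [2, 999], i.e. min(max(2, ...), 999).
(( score < 2 )) && score=2
(( score > 999 )) && score=999
echo "$score"   # 750 - larger requests relative to the node yield lower scores
```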

 If the `kubelet` is unable to reclaim memory prior to a node experiencing system OOM, the `oom_killer` calculates
 an `oom_score` based on the percentage of memory it's using on the node, and then add the `oom_score_adj` to get an
@@ -325,10 +301,7 @@ and trigger eviction assuming those Pods use less than their configured request.

 ### DaemonSet

-As `Priority` is a key factor in the eviction strategy, if you do not want
-pods belonging to a `DaemonSet` to be evicted, specify a sufficiently high priorityClass
-in the pod spec template. If you want pods belonging to a `DaemonSet` to run only if
-there are sufficient resources, specify a lower or default priorityClass.
+As `Priority` is a key factor in the eviction strategy, if you do not want pods belonging to a `DaemonSet` to be evicted, specify a sufficiently high priorityClass in the pod spec template. If you want pods belonging to a `DaemonSet` to run only if there are sufficient resources, specify a lower or default priorityClass.
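For illustration, a hedged sketch of creating a high-value PriorityClass that a DaemonSet's pod template could then reference through `priorityClassName`; the name and value are invented for the example:

```bash
kubectl create priorityclass daemonset-high-priority \
  --value=1000000 \
  --description="Keeps DaemonSet pods near the end of the eviction order"
```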


 ## Deprecation of existing feature flags to reclaim disk
@@ -338,15 +311,15 @@ there are sufficient resources, specify a lower or default priorityClass.
 As disk based eviction matures, the following `kubelet` flags are marked for deprecation
 in favor of the simpler configuration supported around eviction.

-| Existing Flag | New Flag |
-| ------------- | -------- |
-| `--image-gc-high-threshold` | `--eviction-hard` or `eviction-soft` |
-| `--image-gc-low-threshold` | `--eviction-minimum-reclaim` |
-| `--maximum-dead-containers` | deprecated |
-| `--maximum-dead-containers-per-container` | deprecated |
-| `--minimum-container-ttl-duration` | deprecated |
-| `--low-diskspace-threshold-mb` | `--eviction-hard` or `eviction-soft` |
-| `--outofdisk-transition-frequency` | `--eviction-pressure-transition-period` |
+| Existing Flag                              | New Flag                                 |
+| ------------------------------------------ | ----------------------------------------|
+| `--image-gc-high-threshold`                | `--eviction-hard` or `eviction-soft`     |
+| `--image-gc-low-threshold`                 | `--eviction-minimum-reclaim`             |
+| `--maximum-dead-containers`                | deprecated                               |
+| `--maximum-dead-containers-per-container`  | deprecated                               |
+| `--minimum-container-ttl-duration`         | deprecated                               |
+| `--low-diskspace-threshold-mb`             | `--eviction-hard` or `eviction-soft`     |
+| `--outofdisk-transition-frequency`         | `--eviction-pressure-transition-period`  |

 ## Known issues

