Commit 086bbc8

Author: Krzysztof Wilczyński (committed)
KEP-4191: Split Image Filesystem add documentation
Signed-off-by: Krzysztof Wilczyński <[email protected]>
1 parent 918877e commit 086bbc8

File tree: 1 file changed (+123 -40 lines)

content/en/docs/concepts/scheduling-eviction/node-pressure-eviction.md

Lines changed: 123 additions & 40 deletions
@@ -6,6 +6,17 @@ weight: 100
 
 {{<glossary_definition term_id="node-pressure-eviction" length="short">}}</br>
 
+{{< feature-state feature_gate_name="KubeletSeparateDiskGC" >}}
+
+{{<note>}}
+The _split image filesystem_ feature, which enables support for the `containerfs`
+filesystem, adds several new eviction signals, thresholds, and metrics. To use
+`containerfs`, Kubernetes v{{< skew currentVersion >}} requires the
+`KubeletSeparateDiskGC` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
+to be enabled. Currently, only CRI-O (v1.29 or higher) offers support for the
+`containerfs` filesystem.
+{{</note>}}
+
 The {{<glossary_tooltip term_id="kubelet" text="kubelet">}} monitors resources
 like memory, disk space, and filesystem inodes on your cluster's nodes.
 When one or more of these resources reach specific consumption levels, the
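
As context for the note above, here is a minimal sketch of how the `KubeletSeparateDiskGC` feature gate could be enabled through the kubelet configuration file; it assumes the standard `featureGates` field of `KubeletConfiguration`, with every other setting omitted:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Illustrative only: turns on the gate that split image filesystem
# (containerfs) support depends on.
featureGates:
  KubeletSeparateDiskGC: true
```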
@@ -61,23 +72,25 @@ The kubelet uses various parameters to make eviction decisions, like the followi
 ### Eviction signals {#eviction-signals}
 
 Eviction signals are the current state of a particular resource at a specific
-point in time. Kubelet uses eviction signals to make eviction decisions by
+point in time. The kubelet uses eviction signals to make eviction decisions by
 comparing the signals to eviction thresholds, which are the minimum amount of
 the resource that should be available on the node.
 
 On Linux, the kubelet uses the following eviction signals:
 
-| Eviction Signal      | Description                                                                            |
-|----------------------|----------------------------------------------------------------------------------------|
-| `memory.available`   | `memory.available` := `node.status.capacity[memory]` - `node.stats.memory.workingSet` |
-| `nodefs.available`   | `nodefs.available` := `node.stats.fs.available`                                        |
-| `nodefs.inodesFree`  | `nodefs.inodesFree` := `node.stats.fs.inodesFree`                                      |
-| `imagefs.available`  | `imagefs.available` := `node.stats.runtime.imagefs.available`                          |
-| `imagefs.inodesFree` | `imagefs.inodesFree` := `node.stats.runtime.imagefs.inodesFree`                        |
-| `pid.available`      | `pid.available` := `node.stats.rlimit.maxpid` - `node.stats.rlimit.curproc`            |
+| Eviction Signal          | Description                                                                            |
+|--------------------------|----------------------------------------------------------------------------------------|
+| `memory.available`       | `memory.available` := `node.status.capacity[memory]` - `node.stats.memory.workingSet` |
+| `nodefs.available`       | `nodefs.available` := `node.stats.fs.available`                                        |
+| `nodefs.inodesFree`      | `nodefs.inodesFree` := `node.stats.fs.inodesFree`                                      |
+| `imagefs.available`      | `imagefs.available` := `node.stats.runtime.imagefs.available`                          |
+| `imagefs.inodesFree`     | `imagefs.inodesFree` := `node.stats.runtime.imagefs.inodesFree`                        |
+| `containerfs.available`  | `containerfs.available` := `node.stats.runtime.containerfs.available`                  |
+| `containerfs.inodesFree` | `containerfs.inodesFree` := `node.stats.runtime.containerfs.inodesFree`                |
+| `pid.available`          | `pid.available` := `node.stats.rlimit.maxpid` - `node.stats.rlimit.curproc`            |
 
 In this table, the **Description** column shows how kubelet gets the value of the
-signal. Each signal supports either a percentage or a literal value. Kubelet
+signal. Each signal supports either a percentage or a literal value. The kubelet
 calculates the percentage value relative to the total capacity associated with
 the signal.
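
To illustrate the "percentage or a literal value" point above, here is a hedged sketch of hard eviction thresholds in a `KubeletConfiguration` file; the signal names come from the table, while the specific values are illustrative assumptions rather than recommended defaults:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "100Mi"    # literal quantity
  nodefs.available: "10%"      # percentage of the signal's total capacity
  nodefs.inodesFree: "5%"
  imagefs.available: "15%"
```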

@@ -93,16 +106,43 @@ reproduces the same set of steps that the kubelet performs to calculate
 file-backed memory on the inactive LRU list) from its calculation, as it assumes that
 memory is reclaimable under pressure.
 
-The kubelet recognizes two specific filesystem identifiers:
+The kubelet recognizes three specific filesystem identifiers:
+
+1. `nodefs`: The node's main filesystem, used for local disk volumes,
+   emptyDir volumes not backed by memory, log storage, ephemeral storage,
+   and more. For example, `nodefs` contains `/var/lib/kubelet`.
+
+1. `imagefs`: An optional filesystem that container runtimes can use to store
+   container images (which are the read-only layers) and container writable
+   layers.
+
+1. `containerfs`: An optional filesystem that the container runtime can use to
+   store the writable layers. Similar to the main filesystem (see `nodefs`),
+   it's used to store local disk volumes, emptyDir volumes not backed by memory,
+   log storage, and ephemeral storage, except for the container images. When
+   `containerfs` is used, the `imagefs` filesystem can be split to only store
+   images (read-only layers) and nothing else.
+
+As such, the kubelet generally allows three options for container filesystems:
+
+- Everything is on the single `nodefs`, also referred to as "rootfs" or
+  simply "root", and there is no dedicated image filesystem.
 
-1. `nodefs`: The node's main filesystem, used for local disk volumes, emptyDir
-   volumes not backed by memory, log storage, and more.
-   For example, `nodefs` contains `/var/lib/kubelet/`.
-1. `imagefs`: An optional filesystem that container runtimes use to store container
-   images and container writable layers.
+- Container storage (see `nodefs`) is on a dedicated disk, and `imagefs`
+  (writable and read-only layers) is separate from the root filesystem.
+  This is often referred to as a "split disk" (or "separate disk") filesystem.
 
-Kubelet auto-discovers these filesystems and ignores other node local filesystems. Kubelet
-does not support other configurations.
+- The container filesystem `containerfs` (same as `nodefs` plus writable
+  layers) is on root, and the container images (read-only layers) are
+  stored on a separate `imagefs`. This is often referred to as a "split image"
+  filesystem.
+
+The kubelet will attempt to auto-discover these filesystems with their current
+configuration directly from the underlying container runtime and will ignore
+other local node filesystems.
+
+The kubelet does not support other container filesystems or storage configurations,
+and it does not currently support multiple filesystems for images and containers.
 
 Some kubelet garbage collection features are deprecated in favor of eviction:
 
@@ -177,6 +217,19 @@ then the values of other parameters will not be inherited as the default
 values and will be set to zero. In order to provide custom values, you
 should provide all the thresholds respectively.
 
+The `containerfs.available` and `containerfs.inodesFree` (Linux nodes) default
+eviction thresholds will be set as follows:
+
+- If a single filesystem is used for everything, then the `containerfs`
+  thresholds are set the same as `nodefs`.
+
+- If separate filesystems are configured for both images and containers,
+  then the `containerfs` thresholds are set the same as `imagefs`.
+
+Setting custom overrides for thresholds related to `containerfs` is currently
+not supported, and a warning will be issued if an attempt to do so is made; any
+provided custom values will, as such, be ignored.
+
 ## Eviction monitoring interval
 
 The kubelet evaluates eviction thresholds based on its configured `housekeeping-interval`,
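
Relating to the `containerfs` threshold behavior described above, here is a sketch (assumed values, not an authoritative configuration) of soft thresholds with grace periods; note the absence of `containerfs.*` keys, which the kubelet derives on its own:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionSoft:
  nodefs.available: "15%"
  imagefs.available: "20%"
evictionSoftGracePeriod:
  nodefs.available: "1m30s"
  imagefs.available: "2m"
# No containerfs.* entries: the kubelet copies the containerfs thresholds
# from nodefs (single filesystem) or imagefs (split image filesystem) and
# warns if custom containerfs overrides are attempted.
```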
@@ -190,11 +243,11 @@ threshold is met, independent of configured grace periods.
190243

191244
The kubelet maps eviction signals to node conditions as follows:
192245

193-
| Node Condition | Eviction Signal | Description |
194-
|-------------------|---------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------|
195-
| `MemoryPressure` | `memory.available` | Available memory on the node has satisfied an eviction threshold |
196-
| `DiskPressure` | `nodefs.available`, `nodefs.inodesFree`, `imagefs.available`, or `imagefs.inodesFree` | Available disk space and inodes on either the node's root filesystem or image filesystem has satisfied an eviction threshold |
197-
| `PIDPressure` | `pid.available` | Available processes identifiers on the (Linux) node has fallen below an eviction threshold |
246+
| Node Condition | Eviction Signal | Description |
247+
|-------------------|---------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------|
248+
| `MemoryPressure` | `memory.available` | Available memory on the node has satisfied an eviction threshold |
249+
| `DiskPressure` | `nodefs.available`, `nodefs.inodesFree`, `imagefs.available`, `imagefs.inodesFree`, `containerfs.available`, or `containerfs.inodesFree` | Available disk space and inodes on either the node's root filesystem, image filesystem, or container filesystem has satisfied an eviction threshold |
250+
| `PIDPressure` | `pid.available` | Available processes identifiers on the (Linux) node has fallen below an eviction threshold |
198251

199252
The control plane also [maps](/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-nodes-by-condition)
200253
these node conditions to taints.
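
As a rough illustration of the taint mapping mentioned above, a Pod that must keep running on a node under disk pressure could tolerate the corresponding taint as sketched below; the Pod name and image are placeholders, and tolerating pressure taints is rarely appropriate for ordinary workloads:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: disk-pressure-tolerant    # hypothetical example name
spec:
  tolerations:
  # Taint applied by the control plane when the DiskPressure condition is set
  - key: "node.kubernetes.io/disk-pressure"
    operator: "Exists"
    effect: "NoSchedule"
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
```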
@@ -219,23 +272,36 @@ The kubelet tries to reclaim node-level resources before it evicts end-user pods
 When a `DiskPressure` node condition is reported, the kubelet reclaims node-level
 resources based on the filesystems on the node.
 
+#### Without `imagefs` or `containerfs`
+
+If the node only has a `nodefs` filesystem that meets eviction thresholds,
+the kubelet frees up disk space in the following order:
+
+1. Garbage collect dead pods and containers.
+1. Delete unused images.
+
 #### With `imagefs`
 
 If the node has a dedicated `imagefs` filesystem for container runtimes to use,
 the kubelet does the following:
 
-- If the `nodefs` filesystem meets the eviction thresholds, the kubelet garbage collects
-  dead pods and containers.
+- If the `nodefs` filesystem meets the eviction thresholds, the kubelet garbage
+  collects dead pods and containers.
+
 - If the `imagefs` filesystem meets the eviction thresholds, the kubelet
   deletes all unused images.
 
-#### Without `imagefs`
+#### With `imagefs` and `containerfs`
 
-If the node only has a `nodefs` filesystem that meets eviction thresholds,
-the kubelet frees up disk space in the following order:
+If the node has a dedicated `containerfs` alongside the `imagefs` filesystem
+configured for the container runtimes to use, then the kubelet will attempt to
+reclaim resources as follows:
+
+- If the `containerfs` filesystem meets the eviction thresholds, the kubelet
+  garbage collects dead pods and containers.
 
-1. Garbage collect dead pods and containers
-1. Delete unused images
+- If the `imagefs` filesystem meets the eviction thresholds, the kubelet
+  deletes all unused images.
 
 ### Pod selection for kubelet eviction
 
@@ -253,6 +319,7 @@ As a result, kubelet ranks and evicts pods in the following order:
 1. `BestEffort` or `Burstable` pods where the usage exceeds requests. These pods
    are evicted based on their Priority and then by how much their usage level
    exceeds the request.
+
 1. `Guaranteed` pods and `Burstable` pods where the usage is less than requests
    are evicted last, based on their Priority.
 
@@ -283,23 +350,38 @@ the Pods' relative priority to determine the eviction order, because inodes and
 requests.
 
 The kubelet sorts pods differently based on whether the node has a dedicated
-`imagefs` filesystem:
+`imagefs` or `containerfs` filesystem:
 
-#### With `imagefs`
+#### Without `imagefs` or `containerfs` (`nodefs` and `imagefs` use the same filesystem) {#without-imagefs}
+
+- If `nodefs` triggers evictions, the kubelet sorts pods based on their
+  total disk usage (`local volumes + logs and a writable layer of all containers`).
+
+#### With `imagefs` (`nodefs` and `imagefs` filesystems are separate) {#with-imagefs}
 
-If `nodefs` is triggering evictions, the kubelet sorts pods based on `nodefs`
-usage (`local volumes + logs of all containers`).
+- If `nodefs` triggers evictions, the kubelet sorts pods based on `nodefs`
+  usage (`local volumes + logs of all containers`).
 
-If `imagefs` is triggering evictions, the kubelet sorts pods based on the
-writable layer usage of all containers.
+- If `imagefs` triggers evictions, the kubelet sorts pods based on the
+  writable layer usage of all containers.
 
-#### Without `imagefs`
+#### With `imagefs` and `containerfs` (`imagefs` and `containerfs` have been split) {#with-containersfs}
 
-If `nodefs` is triggering evictions, the kubelet sorts pods based on their total
-disk usage (`local volumes + logs & writable layer of all containers`)
+- If `containerfs` triggers evictions, the kubelet sorts pods based on
+  `containerfs` usage (`local volumes + logs and a writable layer of all containers`).
+
+- If `imagefs` triggers evictions, the kubelet sorts pods based on the
+  `storage of images` rank, which represents the disk usage of a given image.
 
 ### Minimum eviction reclaim
 
+{{<note>}}
+As of Kubernetes v{{< skew currentVersion >}}, you cannot set a custom value
+for the `containerfs.available` metric. The configuration for this specific
+metric will be set automatically to reflect values set for either the `nodefs`
+or `imagefs`, depending on the configuration.
+{{</note>}}
+
 In some cases, pod eviction only reclaims a small amount of the starved resource.
 This can lead to the kubelet repeatedly hitting the configured eviction thresholds
 and triggering multiple evictions.
@@ -326,7 +408,8 @@ evictionMinimumReclaim:
 
 In this example, if the `nodefs.available` signal meets the eviction threshold,
 the kubelet reclaims the resource until the signal reaches the threshold of 1GiB,
-and then continues to reclaim the minimum amount of 500MiB, until the available nodefs storage value reaches 1.5GiB.
+and then continues to reclaim the minimum amount of 500MiB, until the available
+nodefs storage value reaches 1.5GiB.
 
 Similarly, the kubelet tries to reclaim the `imagefs` resource until the `imagefs.available`
 value reaches `102Gi`, representing 102 GiB of available container image storage. If the amount
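
For reference, the kind of configuration the example above discusses might look like the following sketch; the values are assumptions chosen only so that they line up with the figures quoted in the text (1Gi plus a 500Mi minimum reclaim gives 1.5GiB of available `nodefs`, and 100Gi plus 2Gi gives 102Gi of available `imagefs`):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "500Mi"
  nodefs.available: "1Gi"
  imagefs.available: "100Gi"
evictionMinimumReclaim:
  memory.available: "0Mi"
  nodefs.available: "500Mi"
  imagefs.available: "2Gi"
```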
