---
layout: blog
title: 'Ephemeral volumes with storage capacity tracking: EmptyDir on steroids'
date: 2020-09-01
slug: ephemeral-volumes-with-storage-capacity-tracking
---

**Author:** Patrick Ohly (Intel)

Some applications need additional storage but don't care whether that
data is stored persistently across restarts. For example, caching
services are often limited by memory size and can move infrequently
used data into storage that is slower than memory with little impact
on overall performance. Other applications expect some read-only input
data to be present in files, like configuration data or secret keys.

Kubernetes already supports several kinds of such [ephemeral
volumes](/docs/concepts/storage/ephemeral-volumes), but the
functionality of those is limited to what is implemented inside
Kubernetes.

[CSI ephemeral volumes](https://kubernetes.io/blog/2020/01/21/csi-ephemeral-inline-volumes/)
made it possible to extend Kubernetes with CSI
drivers that provide light-weight, local volumes. These [*inject
arbitrary states, such as configuration, secrets, identity, variables
or similar
information*](https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/20190122-csi-inline-volumes.md#motivation).
CSI drivers must be modified to support this Kubernetes feature,
i.e. normal, standard-compliant CSI drivers will not work, and
by design such volumes are supposed to be usable on whatever node
is chosen for a pod.
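
For reference, this is roughly what a CSI ephemeral inline volume looks like inside a Pod spec. This is a hand-written sketch, not one of the new features: the driver name `some.csi.example.com` and the `volumeAttributes` are placeholders for whatever a particular CSI driver expects.

```yaml
# Sketch of a CSI ephemeral inline volume.
# The driver name and volumeAttributes are placeholders.
kind: Pod
apiVersion: v1
metadata:
  name: some-app
spec:
  containers:
    - name: some-container
      image: busybox
      command: [ "sleep", "100000" ]
      volumeMounts:
        - mountPath: "/data"
          name: inline-volume
  volumes:
    - name: inline-volume
      csi:
        driver: some.csi.example.com
        volumeAttributes:
          size: 1Gi
```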

This approach is problematic for volumes which consume significant resources on
a node or for special storage that is only available on some nodes.
Therefore, Kubernetes 1.19 introduces two new alpha features for
volumes that are conceptually more like the `EmptyDir` volumes:
- [*generic* ephemeral volumes](/docs/concepts/storage/ephemeral-volumes#generic-ephemeral-volumes) and
- [CSI storage capacity tracking](/docs/concepts/storage/storage-capacity).

The advantages of the new approach are:
- Storage can be local or network-attached.
- Volumes can have a fixed size that applications are never able to exceed.
- Works with any CSI driver that supports provisioning of persistent
  volumes and (for capacity tracking) implements the CSI `GetCapacity` call.
- Volumes may have some initial data, depending on the driver and
  parameters.
- All of the typical volume operations (snapshotting,
  resizing, the future storage capacity tracking, etc.)
  are supported.
- The volumes are usable with any app controller that accepts
  a Pod or volume specification.
- The Kubernetes scheduler itself picks suitable nodes, i.e. there is
  no longer any need to implement and configure scheduler extenders and
  mutating webhooks.

This makes generic ephemeral volumes a suitable solution for several
use cases:

# Use cases

## Persistent Memory as DRAM replacement for memcached

Recent releases of memcached added [support for using Persistent
Memory](https://memcached.org/blog/persistent-memory/) (PMEM) instead
of standard DRAM. When deploying memcached through one of the app
controllers, generic ephemeral volumes make it possible to request a PMEM volume
of a certain size from a CSI driver like
[PMEM-CSI](https://intel.github.io/pmem-csi/).

## Local LVM storage as scratch space

Applications working with data sets that exceed the RAM size can
request local storage with performance characteristics or size that is
not met by the normal Kubernetes `EmptyDir` volumes. For example,
[TopoLVM](https://github.com/cybozu-go/topolvm) was written for that
purpose.

## Read-only access to volumes with data

Provisioning a volume might result in a non-empty volume:
- [restoring a snapshot](/docs/concepts/storage/persistent-volumes/#volume-snapshot-and-restore-volume-from-snapshot-support)
- [cloning a volume](/docs/concepts/storage/volume-pvc-datasource)
- [generic data populators](https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/20200120-generic-data-populators.md)

Such volumes can be mounted read-only.

# How it works

## Generic ephemeral volumes

The key idea behind generic ephemeral volumes is that a new volume
source, the so-called
[`EphemeralVolumeSource`](/docs/reference/generated/kubernetes-api/#ephemeralvolumesource-v1alpha1-core),
contains all fields that are needed to create a volume claim
(historically called persistent volume claim, PVC). A new controller
in the `kube-controller-manager` waits for Pods which embed such a
volume source and then creates a PVC for that pod. To a CSI driver
deployment, that PVC looks like any other, so no special support is
needed.
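
A minimal sketch of such a volume source embedded in a Pod spec might look like this; the storage class name is a placeholder, and a complete, working example follows later in this post:

```yaml
# Sketch of a generic ephemeral volume entry in a Pod's "volumes" list.
# "some-storage-class" is a placeholder.
volumes:
  - name: scratch-volume
    ephemeral:
      volumeClaimTemplate:
        spec:
          accessModes: [ "ReadWriteOnce" ]
          storageClassName: some-storage-class
          resources:
            requests:
              storage: 1Gi
```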

As long as these PVCs exist, they can be used like any other volume claim. In
particular, they can be referenced as a data source in volume cloning or
snapshotting. The PVC object also holds the current status of the
volume.

Naming of the automatically created PVCs is deterministic: the name is
a combination of Pod name and volume name, with a hyphen (`-`) in the
middle. This deterministic naming makes it easier to
interact with the PVC because one does not have to search for it once
the Pod name and volume name are known. The downside is that the name might
already be in use; Kubernetes detects such a conflict and then blocks Pod
startup.

To ensure that the volume gets deleted together with the pod, the
controller makes the Pod the owner of the volume claim. When the Pod
gets deleted, the normal garbage-collection mechanism also removes the
claim and thus the volume.

Claims select the storage driver through the normal storage class
mechanism. Although storage classes with both immediate and late
binding (aka `WaitForFirstConsumer`) are supported, for ephemeral
volumes it makes more sense to use `WaitForFirstConsumer` (see the
sketch below): then Pod scheduling can take into account both node
utilization and availability of storage when choosing a node. This is
where the other new feature comes in.
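
As a concrete reference for late binding, a storage class using `WaitForFirstConsumer` might look like this minimal sketch; the class name is made up and the provisioner name is a placeholder for a real CSI driver:

```yaml
# Sketch of a storage class with late binding.
# The provisioner name is a placeholder for a real CSI driver.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: example-sc-late-binding
provisioner: some.csi.example.com
volumeBindingMode: WaitForFirstConsumer
```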

## Storage capacity tracking

Normally, the Kubernetes scheduler has no information about where a
CSI driver might be able to create a volume. It also has no way of
talking directly to a CSI driver to retrieve that information. It
therefore tries different nodes until it finds one where all volumes
can be made available (late binding) or leaves it entirely to the
driver to choose a location (immediate binding).

The new [`CSIStorageCapacity` alpha
API](/docs/reference/generated/kubernetes-api/v1.19/#csistoragecapacity-v1alpha1-storage-k8s-io)
allows storing the necessary information in etcd where it is available to the
scheduler. In contrast to support for generic ephemeral volumes,
storage capacity tracking must be [enabled when deploying a CSI
driver](https://github.com/kubernetes-csi/external-provisioner/blob/master/README.md#capacity-support):
the `external-provisioner` must be told to publish capacity
information that it then retrieves from the CSI driver through the normal
`GetCapacity` call.
<!-- TODO: update the link with a revision once https://github.com/kubernetes-csi/external-provisioner/pull/450 is merged -->
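
Conceptually, each object published by the external-provisioner looks roughly like this. This is a hand-written sketch with made-up names and values; real objects get a generated name and an owner reference, as shown in the example further below:

```yaml
# Sketch of a CSIStorageCapacity object (names and values are made up).
apiVersion: storage.k8s.io/v1alpha1
kind: CSIStorageCapacity
metadata:
  name: csisc-example
  namespace: some-driver-namespace
storageClassName: some-storage-class
capacity: 30716Mi
nodeTopology:
  matchLabels:
    some.csi.example.com/node: worker-1
```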

When the Kubernetes scheduler needs to choose a node for a Pod with an
unbound volume that uses late binding and the CSI driver deployment
has opted into the feature by setting the [`CSIDriver.storageCapacity`
flag](/docs/reference/generated/kubernetes-api/v1.19/#csidriver-v1beta1-storage-k8s-io),
the scheduler automatically filters out nodes that do not have
access to enough storage capacity. This works for generic ephemeral
and persistent volumes but *not* for CSI ephemeral volumes because the
parameters of those are opaque for Kubernetes.
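
Opting in happens in the CSIDriver object that gets installed together with the driver. A minimal sketch, with a placeholder driver name:

```yaml
# Sketch of a CSIDriver object that opts into storage capacity tracking.
# "some.csi.example.com" is a placeholder for the real driver name.
apiVersion: storage.k8s.io/v1beta1
kind: CSIDriver
metadata:
  name: some.csi.example.com
spec:
  storageCapacity: true
```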

As usual, volumes with immediate binding get created before scheduling
pods, with their location chosen by the storage driver. Therefore, the
external-provisioner's default configuration skips storage
classes with immediate binding as the information wouldn't be used anyway.

Because the Kubernetes scheduler must act on potentially outdated
information, there is no guarantee that the capacity is still available
when a volume is to be created. Still, the chances that it can be created
without retries should be higher.

# Security

## CSIStorageCapacity

CSIStorageCapacity objects are namespaced. When deploying each CSI
driver in its own namespace and, as recommended, limiting the RBAC
permissions for CSIStorageCapacity to that namespace, it is
always obvious where the data came from. However, Kubernetes does
not check that, and typically drivers get installed in the same
namespace anyway, so ultimately drivers are *expected to behave* and
not publish incorrect data.
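
Such a restriction could be expressed with a namespaced Role for the driver's external-provisioner, roughly along these lines. This is only a sketch with placeholder names; the actual RBAC rules ship with the external-provisioner deployment files:

```yaml
# Sketch of a namespaced Role that limits CSIStorageCapacity access
# to the driver's own namespace (names are placeholders).
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: external-provisioner-capacity
  namespace: some-driver-namespace
rules:
  - apiGroups: ["storage.k8s.io"]
    resources: ["csistoragecapacities"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
```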

## Generic ephemeral volumes

If users have permission to create a Pod (directly or indirectly),
then they can also create generic ephemeral volumes even when they do
not have permission to create a volume claim. That's because RBAC
permission checks are applied to the controller which creates the
PVC, not the original user. This is a fundamental change that must be
[taken into
account](/docs/concepts/storage/ephemeral-volumes#security) before
enabling the feature in clusters where untrusted users are not
supposed to have permission to create volumes.

# Example

A [special branch](https://github.com/intel/pmem-csi/commits/kubernetes-1-19-blog-post)
in PMEM-CSI contains all the necessary changes to bring up a
Kubernetes 1.19 cluster inside QEMU VMs with both alpha features
enabled. The PMEM-CSI driver code is used unchanged, only the
deployment was updated.

On a suitable machine (Linux, non-root user can use Docker - see the
[QEMU and
Kubernetes](https://intel.github.io/pmem-csi/0.7/docs/autotest.html#qemu-and-kubernetes)
section in the PMEM-CSI documentation), the following commands bring
up a cluster and install the PMEM-CSI driver:

```console
git clone --branch=kubernetes-1-19-blog-post https://github.com/intel/pmem-csi.git
cd pmem-csi
export TEST_KUBERNETES_VERSION=1.19 TEST_FEATURE_GATES=CSIStorageCapacity=true,GenericEphemeralVolume=true TEST_PMEM_REGISTRY=intel
make start && echo && test/setup-deployment.sh
```

If all goes well, the output contains the following usage
instructions:

```
The test cluster is ready. Log in with [...]/pmem-csi/_work/pmem-govm/ssh.0, run
kubectl once logged in. Alternatively, use kubectl directly with the
following env variable:
   KUBECONFIG=[...]/pmem-csi/_work/pmem-govm/kube.config

secret/pmem-csi-registry-secrets created
secret/pmem-csi-node-secrets created
serviceaccount/pmem-csi-controller created
...
To try out the pmem-csi driver ephemeral volumes:
   cat deploy/kubernetes-1.19/pmem-app-ephemeral.yaml |
   [...]/pmem-csi/_work/pmem-govm/ssh.0 kubectl create -f -
```

The CSIStorageCapacity objects are not meant to be human-readable, so
some post-processing is needed. The following Golang template filters
all objects by the storage class that the example uses and prints the
name, topology and capacity:

```console
kubectl get \
        -o go-template='{{range .items}}{{if eq .storageClassName "pmem-csi-sc-late-binding"}}{{.metadata.name}} {{.nodeTopology.matchLabels}} {{.capacity}}
{{end}}{{end}}' \
        csistoragecapacities
```

```
csisc-2js6n map[pmem-csi.intel.com/node:pmem-csi-pmem-govm-worker2] 30716Mi
csisc-sqdnt map[pmem-csi.intel.com/node:pmem-csi-pmem-govm-worker1] 30716Mi
csisc-ws4bv map[pmem-csi.intel.com/node:pmem-csi-pmem-govm-worker3] 30716Mi
```

One individual object has the following content:

```console
kubectl describe csistoragecapacities/csisc-sqdnt
```

```
Name:         csisc-sqdnt
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  storage.k8s.io/v1alpha1
Capacity:     30716Mi
Kind:         CSIStorageCapacity
Metadata:
  Creation Timestamp:  2020-08-11T15:41:03Z
  Generate Name:       csisc-
  Managed Fields:
    ...
  Owner References:
    API Version:     apps/v1
    Controller:      true
    Kind:            StatefulSet
    Name:            pmem-csi-controller
    UID:             590237f9-1eb4-4208-b37b-5f7eab4597d1
  Resource Version:  2994
  Self Link:         /apis/storage.k8s.io/v1alpha1/namespaces/default/csistoragecapacities/csisc-sqdnt
  UID:               da36215b-3b9d-404a-a4c7-3f1c3502ab13
Node Topology:
  Match Labels:
    pmem-csi.intel.com/node:  pmem-csi-pmem-govm-worker1
Storage Class Name:  pmem-csi-sc-late-binding
Events:              <none>
```

Now let's create the example app with one generic ephemeral
volume. The `pmem-app-ephemeral.yaml` file contains:

```yaml
# This example Pod definition demonstrates
# how to use generic ephemeral inline volumes
# with a PMEM-CSI storage class.
kind: Pod
apiVersion: v1
metadata:
  name: my-csi-app-inline-volume
spec:
  containers:
    - name: my-frontend
      image: intel/pmem-csi-driver-test:v0.7.14
      command: [ "sleep", "100000" ]
      volumeMounts:
        - mountPath: "/data"
          name: my-csi-volume
  volumes:
    - name: my-csi-volume
      ephemeral:
        volumeClaimTemplate:
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 4Gi
            storageClassName: pmem-csi-sc-late-binding
```

After creating that as shown in the usage instructions above, we have one additional Pod and PVC:

```console
kubectl get pods/my-csi-app-inline-volume -o wide
```

```
NAME                       READY   STATUS    RESTARTS   AGE     IP          NODE                         NOMINATED NODE   READINESS GATES
my-csi-app-inline-volume   1/1     Running   0          6m58s   10.36.0.2   pmem-csi-pmem-govm-worker1   <none>           <none>
```

```console
kubectl get pvc/my-csi-app-inline-volume-my-csi-volume
```

```
NAME                                     STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS               AGE
my-csi-app-inline-volume-my-csi-volume   Bound    pvc-c11eb7ab-a4fa-46fe-b515-b366be908823   4Gi        RWO            pmem-csi-sc-late-binding   9m21s
```

That PVC is owned by the Pod:

```console
kubectl get -o yaml pvc/my-csi-app-inline-volume-my-csi-volume
```

```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: pmem-csi.intel.com
    volume.kubernetes.io/selected-node: pmem-csi-pmem-govm-worker1
  creationTimestamp: "2020-08-11T15:44:57Z"
  finalizers:
  - kubernetes.io/pvc-protection
  managedFields:
    ...
  name: my-csi-app-inline-volume-my-csi-volume
  namespace: default
  ownerReferences:
  - apiVersion: v1
    blockOwnerDeletion: true
    controller: true
    kind: Pod
    name: my-csi-app-inline-volume
    uid: 75c925bf-ca8e-441a-ac67-f190b7a2265f
...
```

Eventually, the storage capacity information for `pmem-csi-pmem-govm-worker1` also gets updated:

```
csisc-2js6n map[pmem-csi.intel.com/node:pmem-csi-pmem-govm-worker2] 30716Mi
csisc-sqdnt map[pmem-csi.intel.com/node:pmem-csi-pmem-govm-worker1] 26620Mi
csisc-ws4bv map[pmem-csi.intel.com/node:pmem-csi-pmem-govm-worker3] 30716Mi
```

If another app needs more than 26620Mi, the Kubernetes
scheduler will not pick `pmem-csi-pmem-govm-worker1` anymore.

# Next steps

Both features are under development. Several open questions were
already raised during the alpha review process. The two enhancement
proposals document the work that will be needed for migration to beta and what
alternatives were already considered and rejected:

* [KEP-1698: generic ephemeral inline
  volumes](https://github.com/kubernetes/enhancements/blob/9d7a75d/keps/sig-storage/1698-generic-ephemeral-volumes/README.md)
* [KEP-1472: Storage Capacity
  Tracking](https://github.com/kubernetes/enhancements/tree/9d7a75d/keps/sig-storage/1472-storage-capacity-tracking)

Your feedback is crucial for driving that development. SIG-Storage
[meets
regularly](https://github.com/kubernetes/community/tree/master/sig-storage#meetings)
and can be reached via [Slack and a mailing
list](https://github.com/kubernetes/community/tree/master/sig-storage#contact).
