---
layout: blog
title: 'Ephemeral volumes with storage capacity tracking: EmptyDir on steroids'
date: 2020-09-01
slug: ephemeral-volumes-with-storage-capacity-tracking
---

**Author:** Patrick Ohly (Intel)

Some applications need additional storage but don't care whether that
data is stored persistently across restarts. For example, caching
services are often limited by memory size and can move infrequently
used data into storage that is slower than memory with little impact
on overall performance. Other applications expect some read-only input
data to be present in files, like configuration data or secret keys.

Kubernetes already supports several kinds of such [ephemeral
volumes](/docs/concepts/storage/ephemeral-volumes), but the
functionality of those is limited to what is implemented inside
Kubernetes.

[CSI ephemeral volumes](https://kubernetes.io/blog/2020/01/21/csi-ephemeral-inline-volumes/)
made it possible to extend Kubernetes with CSI
drivers that provide light-weight, local volumes. These [*inject
arbitrary states, such as configuration, secrets, identity, variables
or similar
information*](https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/20190122-csi-inline-volumes.md#motivation).
CSI drivers must be modified to support this Kubernetes feature,
i.e. normal, standard-compliant CSI drivers will not work, and
by design such volumes are supposed to be usable on whatever node
is chosen for a pod.
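
For comparison, this is roughly what a CSI ephemeral volume looks
like: it is defined inline in the Pod spec. The driver name and the
volume attributes in this sketch are hypothetical and depend entirely
on the CSI driver that is used:

```yaml
kind: Pod
apiVersion: v1
metadata:
  name: some-pod
spec:
  containers:
    - name: app
      image: busybox
      command: [ "sleep", "100000" ]
      volumeMounts:
        - mountPath: "/data"
          name: my-inline-volume
  volumes:
    - name: my-inline-volume
      csi:
        # Hypothetical driver and attributes: only drivers that were
        # extended for the ephemeral inline mode understand this.
        driver: hypothetical.example.com
        volumeAttributes:
          size: 1Gi
```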

This is problematic for volumes which consume significant resources on
a node or for special storage that is only available on some nodes.
Therefore, Kubernetes 1.19 introduces two new alpha features for
volumes that are conceptually more like the `EmptyDir` volumes:
- [*generic* ephemeral volumes](/docs/concepts/storage/ephemeral-volumes#generic-ephemeral-volumes) and
- [CSI storage capacity tracking](/docs/concepts/storage/storage-capacity).

The advantages of the new approach are:
- Storage can be local or network-attached.
- Volumes can have a fixed size that applications are never able to exceed.
- Works with any CSI driver that supports provisioning of persistent
  volumes and (for capacity tracking) implements the CSI `GetCapacity` call.
- Volumes may have some initial data, depending on the driver and
  parameters.
- All of the typical volume operations (snapshotting,
  resizing, the future storage capacity tracking, etc.)
  are supported.
- The volumes are usable with any app controller that accepts
  a Pod or volume specification.
- The Kubernetes scheduler itself picks suitable nodes, i.e. there is
  no need anymore to implement and configure scheduler extenders and
  mutating webhooks.

This makes generic ephemeral volumes a suitable solution for several
use cases:

# Use cases

## Persistent Memory as DRAM replacement for memcached

Recent releases of memcached added [support for using Persistent
Memory](https://memcached.org/blog/persistent-memory/) (PMEM) instead
of standard DRAM. When deploying memcached through one of the app
controllers, generic ephemeral volumes make it possible to request a PMEM volume
of a certain size from a CSI driver like
[PMEM-CSI](https://intel.github.io/pmem-csi/).

## Local LVM storage as scratch space

Applications working with data sets that exceed the RAM size can
request local storage with performance characteristics or size that is
not met by the normal Kubernetes `EmptyDir` volumes. For example,
[TopoLVM](https://github.com/cybozu-go/topolvm) was written for that
purpose.
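
A sketch of what such a request could look like, embedded in a
Deployment; the storage class name `topolvm-provisioner` is an
assumption that depends on how the driver was deployed, and generic
ephemeral volumes themselves are explained in more detail below:

```yaml
kind: Deployment
apiVersion: apps/v1
metadata:
  name: data-cruncher
spec:
  replicas: 1
  selector:
    matchLabels:
      app: data-cruncher
  template:
    metadata:
      labels:
        app: data-cruncher
    spec:
      containers:
        - name: worker
          image: busybox
          command: [ "sleep", "100000" ]
          volumeMounts:
            - mountPath: "/scratch"
              name: scratch
      volumes:
        - name: scratch
          ephemeral:
            volumeClaimTemplate:
              spec:
                accessModes:
                  - ReadWriteOnce
                storageClassName: topolvm-provisioner # assumed class name
                resources:
                  requests:
                    storage: 100Gi
```

Each pod created for the Deployment gets its own claim and thus its
own scratch volume, which is removed again together with the pod.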

## Read-only access to volumes with data

Provisioning a volume might result in a non-empty volume:
- [restoring a snapshot](/docs/concepts/storage/persistent-volumes/#volume-snapshot-and-restore-volume-from-snapshot-support)
- [cloning a volume](/docs/concepts/storage/volume-pvc-datasource)
- [generic data populators](https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/20200120-generic-data-populators.md)

Such volumes can be mounted read-only.
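
A minimal sketch of such a read-only consumer, assuming that a
VolumeSnapshot called `existing-input-snapshot` exists and that the
storage class and driver support restoring snapshots:

```yaml
kind: Pod
apiVersion: v1
metadata:
  name: data-consumer
spec:
  containers:
    - name: reader
      image: busybox
      command: [ "sleep", "100000" ]
      volumeMounts:
        - mountPath: "/input"
          name: input-data
          readOnly: true
  volumes:
    - name: input-data
      ephemeral:
        volumeClaimTemplate:
          spec:
            accessModes:
              - ReadWriteOnce
            storageClassName: snapshot-capable-class # assumed name
            dataSource: # pre-populates the new volume
              apiGroup: snapshot.storage.k8s.io
              kind: VolumeSnapshot
              name: existing-input-snapshot # assumed to exist
            resources:
              requests:
                storage: 4Gi
```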

# How it works

## Generic ephemeral volumes

The key idea behind generic ephemeral volumes is that a new volume
source, the so-called
[`EphemeralVolumeSource`](/docs/reference/generated/kubernetes-api/#ephemeralvolumesource-v1alpha1-core),
contains all fields that are needed to create a volume claim
(historically called persistent volume claim, PVC). A new controller
in the `kube-controller-manager` waits for Pods which embed such a
volume source and then creates a PVC for that pod. To a CSI driver
deployment, that PVC looks like any other, so no special support is
needed.

As long as these PVCs exist, they can be used like any other volume claim. In
particular, they can be referenced as data source in volume cloning or
snapshotting. The PVC object also holds the current status of the
volume.

Naming of the automatically created PVCs is deterministic: the name is
a combination of Pod name and volume name, with a hyphen (`-`) in the
middle. This deterministic naming makes it easier to
interact with the PVC because one does not have to search for it once
the Pod name and volume name are known. The downside is that the name might
already be in use. Kubernetes detects such a conflict and then blocks Pod
startup.

To ensure that the volume gets deleted together with the pod, the
controller makes the Pod the owner of the volume claim. When the Pod
gets deleted, the normal garbage-collection mechanism also removes the
claim and thus the volume.

Claims select the storage driver through the normal storage class
mechanism. Although storage classes with both immediate and late
binding (aka `WaitForFirstConsumer`) are supported, for ephemeral
volumes it makes more sense to use `WaitForFirstConsumer`: then Pod
scheduling can take into account both node utilization and
availability of storage when choosing a node. This is where the other
new feature comes in.
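
For reference, a sketch of the late-binding storage class used in the
example at the end of this post; an actual deployment may set
additional, driver-specific parameters:

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: pmem-csi-sc-late-binding
provisioner: pmem-csi.intel.com
# Late binding: volume creation waits until a pod gets scheduled.
volumeBindingMode: WaitForFirstConsumer
```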

## Storage capacity tracking

Normally, the Kubernetes scheduler has no information about where a
CSI driver might be able to create a volume. It also has no way of
talking directly to a CSI driver to retrieve that information. It
therefore tries different nodes until it finds one where all volumes
can be made available (late binding) or leaves it entirely to the
driver to choose a location (immediate binding).

The new [`CSIStorageCapacity` alpha
API](/docs/reference/generated/kubernetes-api/v1.19/#csistoragecapacity-v1alpha1-storage-k8s-io)
allows storing the necessary information in etcd where it is available to the
scheduler. In contrast to support for generic ephemeral volumes,
storage capacity tracking must be [enabled when deploying a CSI
driver](https://github.com/kubernetes-csi/external-provisioner/blob/master/README.md#capacity-support):
the `external-provisioner` must be told to publish capacity
information that it then retrieves from the CSI driver through the normal
`GetCapacity` call.
<!-- TODO: update the link with a revision once https://github.com/kubernetes-csi/external-provisioner/pull/450 is merged -->
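
As a sketch, a driver deployment could enable this in the
external-provisioner sidecar roughly like below. The StatefulSet name
is hypothetical and the flag syntax may change between
external-provisioner releases, so check the README linked above for
the current one:

```yaml
kind: StatefulSet
apiVersion: apps/v1
metadata:
  name: hypothetical-csi-controller # assumed driver deployment
spec:
  serviceName: hypothetical-csi-controller
  selector:
    matchLabels:
      app: hypothetical-csi-controller
  template:
    metadata:
      labels:
        app: hypothetical-csi-controller
    spec:
      containers:
        - name: external-provisioner
          image: k8s.gcr.io/sig-storage/csi-provisioner:v2.0.0
          args:
            - --csi-address=/csi/csi.sock
            # Publish CSIStorageCapacity objects based on the driver's
            # GetCapacity results; flag syntax as in the 2.0 release,
            # see the README for the current one.
            - --enable-capacity=central
          # The CSI socket volume, RBAC, and the driver container
          # itself are omitted here.
```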

When the Kubernetes scheduler needs to choose a node for a Pod with an
unbound volume that uses late binding and the CSI driver deployment
has opted into the feature by setting the [`CSIDriver.storageCapacity`
flag](/docs/reference/generated/kubernetes-api/v1.19/#csidriver-v1beta1-storage-k8s-io),
the scheduler automatically filters out nodes that do not have
access to enough storage capacity. This works for generic ephemeral
and persistent volumes but *not* for CSI ephemeral volumes because the
parameters of those are opaque for Kubernetes.
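
For the driver used in the example below, opting in boils down to a
CSIDriver object like this sketch; the other spec fields depend on the
driver and are omitted:

```yaml
kind: CSIDriver
apiVersion: storage.k8s.io/v1beta1
metadata:
  name: pmem-csi.intel.com
spec:
  # Tell the scheduler to consider storage capacity for volumes
  # provisioned by this driver.
  storageCapacity: true
```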

As usual, volumes with immediate binding get created before scheduling
pods, with their location chosen by the storage driver. Therefore, the
external-provisioner's default configuration skips storage
classes with immediate binding as the information wouldn't be used anyway.

Because the Kubernetes scheduler must act on potentially outdated
information, it cannot be ensured that the capacity is still available
when a volume is to be created. Still, the chances that it can be created
without retries should be higher.

# Security

## CSIStorageCapacity

CSIStorageCapacity objects are namespaced. When deploying each CSI
driver in its own namespace and, as recommended, limiting the RBAC
permissions for CSIStorageCapacity to that namespace, it is
always obvious where the data came from. However, Kubernetes does
not check that and typically drivers get installed in the same
namespace anyway, so ultimately drivers are *expected to behave* and
not publish incorrect data.
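
A sketch of such namespace-scoped RBAC permissions; the role name and
namespace are assumptions, actual driver deployments ship their own
definitions:

```yaml
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: external-provisioner-capacity # assumed name
  namespace: pmem-csi # assumed driver namespace
rules:
  # Allow the external-provisioner to manage capacity objects
  # only inside the driver's own namespace.
  - apiGroups: ["storage.k8s.io"]
    resources: ["csistoragecapacities"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
```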

## Generic ephemeral volumes

If users have permission to create a Pod (directly or indirectly),
then they can also create generic ephemeral volumes even when they do
not have permission to create a volume claim. That's because RBAC
permission checks are applied to the controller which creates the
PVC, not the original user. This is a fundamental change that must be
[taken into
account](/docs/concepts/storage/ephemeral-volumes#security) before
enabling the feature in clusters where untrusted users are not
supposed to have permission to create volumes.

# Example

A [special branch](https://github.com/intel/pmem-csi/commits/kubernetes-1-19-blog-post)
in PMEM-CSI contains all the necessary changes to bring up a
Kubernetes 1.19 cluster inside QEMU VMs with both alpha features
enabled. The PMEM-CSI driver code is used unchanged, only the
deployment was updated.

On a suitable machine (Linux, a non-root user who can use Docker - see the
[QEMU and
Kubernetes](https://intel.github.io/pmem-csi/0.7/docs/autotest.html#qemu-and-kubernetes)
section in the PMEM-CSI documentation), the following commands bring
up a cluster and install the PMEM-CSI driver:

```console
git clone --branch=kubernetes-1-19-blog-post https://github.com/intel/pmem-csi.git
cd pmem-csi
export TEST_KUBERNETES_VERSION=1.19 TEST_FEATURE_GATES=CSIStorageCapacity=true,GenericEphemeralVolume=true TEST_PMEM_REGISTRY=intel
make start && echo && test/setup-deployment.sh
```

If all goes well, the output contains the following usage
instructions:

```
The test cluster is ready. Log in with [...]/pmem-csi/_work/pmem-govm/ssh.0, run
kubectl once logged in. Alternatively, use kubectl directly with the
following env variable:
   KUBECONFIG=[...]/pmem-csi/_work/pmem-govm/kube.config

secret/pmem-csi-registry-secrets created
secret/pmem-csi-node-secrets created
serviceaccount/pmem-csi-controller created
...
To try out the pmem-csi driver ephemeral volumes:
   cat deploy/kubernetes-1.19/pmem-app-ephemeral.yaml |
   [...]/pmem-csi/_work/pmem-govm/ssh.0 kubectl create -f -
```

The CSIStorageCapacity objects are not meant to be human-readable, so
some post-processing is needed. The following Go template filters
all objects by the storage class that the example uses and prints the
name, topology and capacity:

```console
kubectl get \
  -o go-template='{{range .items}}{{if eq .storageClassName "pmem-csi-sc-late-binding"}}{{.metadata.name}} {{.nodeTopology.matchLabels}} {{.capacity}}
{{end}}{{end}}' \
  csistoragecapacities
```

```
csisc-2js6n map[pmem-csi.intel.com/node:pmem-csi-pmem-govm-worker2] 30716Mi
csisc-sqdnt map[pmem-csi.intel.com/node:pmem-csi-pmem-govm-worker1] 30716Mi
csisc-ws4bv map[pmem-csi.intel.com/node:pmem-csi-pmem-govm-worker3] 30716Mi
```

One individual object has the following content:

```console
kubectl describe csistoragecapacities/csisc-sqdnt
```

```
Name:         csisc-sqdnt
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  storage.k8s.io/v1alpha1
Capacity:     30716Mi
Kind:         CSIStorageCapacity
Metadata:
  Creation Timestamp:  2020-08-11T15:41:03Z
  Generate Name:       csisc-
  Managed Fields:
    ...
  Owner References:
    API Version:     apps/v1
    Controller:      true
    Kind:            StatefulSet
    Name:            pmem-csi-controller
    UID:             590237f9-1eb4-4208-b37b-5f7eab4597d1
  Resource Version:  2994
  Self Link:         /apis/storage.k8s.io/v1alpha1/namespaces/default/csistoragecapacities/csisc-sqdnt
  UID:               da36215b-3b9d-404a-a4c7-3f1c3502ab13
Node Topology:
  Match Labels:
    pmem-csi.intel.com/node:  pmem-csi-pmem-govm-worker1
Storage Class Name:  pmem-csi-sc-late-binding
Events:              <none>
```

Now let's create the example app with one generic ephemeral
volume. The `pmem-app-ephemeral.yaml` file contains:

```yaml
# This example Pod definition demonstrates
# how to use generic ephemeral inline volumes
# with a PMEM-CSI storage class.
kind: Pod
apiVersion: v1
metadata:
  name: my-csi-app-inline-volume
spec:
  containers:
    - name: my-frontend
      image: intel/pmem-csi-driver-test:v0.7.14
      command: [ "sleep", "100000" ]
      volumeMounts:
        - mountPath: "/data"
          name: my-csi-volume
  volumes:
    - name: my-csi-volume
      ephemeral:
        volumeClaimTemplate:
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 4Gi
            storageClassName: pmem-csi-sc-late-binding
```

After creating that as shown in the usage instructions above, we have one additional Pod and PVC:

```console
kubectl get pods/my-csi-app-inline-volume -o wide
```

```
NAME                       READY   STATUS    RESTARTS   AGE     IP          NODE                         NOMINATED NODE   READINESS GATES
my-csi-app-inline-volume   1/1     Running   0          6m58s   10.36.0.2   pmem-csi-pmem-govm-worker1   <none>           <none>
```

```console
kubectl get pvc/my-csi-app-inline-volume-my-csi-volume
```

```
NAME                                     STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS               AGE
my-csi-app-inline-volume-my-csi-volume   Bound    pvc-c11eb7ab-a4fa-46fe-b515-b366be908823   4Gi        RWO            pmem-csi-sc-late-binding   9m21s
```

That PVC is owned by the Pod:

```console
kubectl get -o yaml pvc/my-csi-app-inline-volume-my-csi-volume
```

```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: pmem-csi.intel.com
    volume.kubernetes.io/selected-node: pmem-csi-pmem-govm-worker1
  creationTimestamp: "2020-08-11T15:44:57Z"
  finalizers:
  - kubernetes.io/pvc-protection
  managedFields:
    ...
  name: my-csi-app-inline-volume-my-csi-volume
  namespace: default
  ownerReferences:
  - apiVersion: v1
    blockOwnerDeletion: true
    controller: true
    kind: Pod
    name: my-csi-app-inline-volume
    uid: 75c925bf-ca8e-441a-ac67-f190b7a2265f
...
```

Eventually, the storage capacity information for `pmem-csi-pmem-govm-worker1` also gets updated:

```
csisc-2js6n map[pmem-csi.intel.com/node:pmem-csi-pmem-govm-worker2] 30716Mi
csisc-sqdnt map[pmem-csi.intel.com/node:pmem-csi-pmem-govm-worker1] 26620Mi
csisc-ws4bv map[pmem-csi.intel.com/node:pmem-csi-pmem-govm-worker3] 30716Mi
```

If another app needs more than 26620Mi, the Kubernetes
scheduler will not pick `pmem-csi-pmem-govm-worker1` anymore.
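
A sketch of what such a second app could look like - the Pod name is
hypothetical and the requested size is chosen to exceed the remaining
capacity on that node:

```yaml
kind: Pod
apiVersion: v1
metadata:
  name: my-csi-app-large-volume # hypothetical second app
spec:
  containers:
    - name: my-frontend
      image: intel/pmem-csi-driver-test:v0.7.14
      command: [ "sleep", "100000" ]
      volumeMounts:
        - mountPath: "/data"
          name: my-csi-volume
  volumes:
    - name: my-csi-volume
      ephemeral:
        volumeClaimTemplate:
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 28Gi # more than the 26620Mi left on worker1
            storageClassName: pmem-csi-sc-late-binding
```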


# Next steps

Both features are under development. Several open questions were
already raised during the alpha review process. The two enhancement
proposals document the work that will be needed for migration to beta and what
alternatives were already considered and rejected:

* [KEP-1698: generic ephemeral inline
  volumes](https://github.com/kubernetes/enhancements/blob/9d7a75d/keps/sig-storage/1698-generic-ephemeral-volumes/README.md)
* [KEP-1472: Storage Capacity
  Tracking](https://github.com/kubernetes/enhancements/tree/9d7a75d/keps/sig-storage/1472-storage-capacity-tracking)

Your feedback is crucial for driving that development. SIG-Storage
[meets
regularly](https://github.com/kubernetes/community/tree/master/sig-storage#meetings)
and can be reached via [Slack and a mailing
list](https://github.com/kubernetes/community/tree/master/sig-storage#contact).