---
layout: blog
title: "Local Storage: Storage Capacity Tracking, Distributed Provisioning and Generic Ephemeral Volumes hit Beta"
date: 2021-04-14
slug: local-storage-features-go-beta
---

**Authors:** Patrick Ohly (Intel)

The ["generic ephemeral
volumes"](/docs/concepts/storage/ephemeral-volumes/#generic-ephemeral-volumes)
and ["storage capacity
tracking"](/docs/concepts/storage/storage-capacity/)
features in Kubernetes are getting promoted to beta in Kubernetes
1.21. Together with the [distributed provisioning
support](https://github.com/kubernetes-csi/external-provisioner#deployment-on-each-node)
in the CSI external-provisioner, development and deployment of
Container Storage Interface (CSI) drivers which manage storage locally
on a node become a lot easier.

This blog post explains how such drivers worked before and how these
features can be used to make drivers simpler.

## Problems we are solving

There are drivers for local storage, like
[TopoLVM](https://github.com/cybozu-go/topolvm) for traditional disks
and [PMEM-CSI](https://intel.github.io/pmem-csi/latest/README.html)
for [persistent memory](https://pmem.io/). They work and are ready for
use today, also on older Kubernetes releases, but making that possible
was not trivial.

### Central component required

The first problem is volume provisioning: it is handled through the
Kubernetes control plane. Some component must react to
[PersistentVolumeClaims](/docs/concepts/storage/persistent-volumes/#persistentvolumeclaims)
(PVCs)
and create volumes. Usually, that is handled by a central deployment
of the [CSI
external-provisioner](https://kubernetes-csi.github.io/docs/external-provisioner.html)
and a CSI driver component that then connects to the storage
backplane. But for local storage, there is no such backplane.

TopoLVM solved this by having its different components communicate
with each other through the Kubernetes API server by creating and
reacting to custom resources. So although TopoLVM is based on CSI, a
standard that is independent of a particular container orchestrator,
TopoLVM only works on Kubernetes.

PMEM-CSI created its own storage backplane with communication through
gRPC calls. Securing that communication depends on TLS certificates,
which made driver deployment more complicated.

### Informing Pod scheduler about capacity

The next problem is scheduling. When volumes get created independently
of pods ("immediate binding"), the CSI driver must pick a node without
knowing anything about the pod(s) that are going to use it. Topology
information then forces those pods to run on the node where the volume
was created. If other resources like RAM or CPU are exhausted there,
the pod cannot start. This can be avoided by configuring in the
StorageClass that volume creation is meant to wait for the first pod
that uses a volume (`volumeBindingMode: WaitForFirstConsumer`). In that
mode, the Kubernetes scheduler tentatively picks a node based on other
constraints and then the external-provisioner is asked to create a
volume such that it is usable there. If local storage is exhausted,
the provisioner [can
ask](https://github.com/kubernetes-csi/external-provisioner/blob/master/doc/design.md)
for another scheduling round. But without information about available
capacity, the scheduler might always pick the same unsuitable node.
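
As a rough illustration, a StorageClass for such a driver could look
like this (the driver name `local.csi.example.org` is a placeholder,
not a real driver):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: local.csi.example.org        # hypothetical local storage CSI driver
volumeBindingMode: WaitForFirstConsumer   # delay volume creation until a pod is scheduled
```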

Both TopoLVM and PMEM-CSI solved this with scheduler extenders. This
works, but it is hard to configure when deploying the driver because
communication between kube-scheduler and the driver is very dependent
on how the cluster was set up.

### Rescheduling

A common use case for local storage is scratch space. Ephemeral
volumes that get created for a pod and destroyed together with it are
a better fit for that use case than persistent volumes. The initial
API for supporting ephemeral volumes with CSI drivers (hence called
["*CSI* ephemeral
volumes"](/docs/concepts/storage/ephemeral-volumes/#csi-ephemeral-volumes))
was [designed for light-weight
volumes](https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/20190122-csi-inline-volumes.md)
where volume creation is unlikely to fail. Volume creation happens
after pods have been permanently scheduled onto a node, in contrast to
the traditional provisioning where volume creation is tried before
scheduling a pod onto a node. CSI drivers must be modified to support
"CSI ephemeral volumes", which was done for TopoLVM and PMEM-CSI. But
due to the design of the feature in Kubernetes, pods can get stuck
permanently if storage capacity runs out on a node. The scheduler
extenders try to avoid that, but cannot be 100% reliable.
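
For reference, a "CSI ephemeral volume" is declared inline in the pod
spec. A minimal sketch, again with a made-up driver name and a made-up
`size` attribute (the supported attributes are driver-specific):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: csi-ephemeral-example
spec:
  containers:
    - name: app
      image: busybox
      command: ["sleep", "3600"]
      volumeMounts:
        - name: scratch
          mountPath: /scratch
  volumes:
    - name: scratch
      csi:                               # inline "CSI ephemeral volume"
        driver: local.csi.example.org    # hypothetical driver name
        volumeAttributes:
          size: 1Gi                      # illustrative, driver-specific attribute
```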

## Enhancements in Kubernetes 1.21

### Distributed provisioning

Starting with [external-provisioner
v2.1.0](https://github.com/kubernetes-csi/external-provisioner/releases/tag/v2.1.0),
released for Kubernetes 1.20, provisioning can be handled by
external-provisioner instances that get [deployed together with the
CSI driver on each
node](https://github.com/kubernetes-csi/external-provisioner#deployment-on-each-node)
and then cooperate to provision volumes ("distributed
provisioning"). There is no longer any need for a central component
and thus no need for communication between nodes, at least not for
provisioning.
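
In practice this means running the external-provisioner as an
additional sidecar in the driver's DaemonSet instead of in a central
Deployment. A rough sketch of what that sidecar might look like (the
`--node-deployment` flag and the `NODE_NAME` convention are described
in the external-provisioner README linked above; the image version,
socket path and names here are illustrative):

```yaml
# excerpt from a hypothetical CSI driver DaemonSet pod template
containers:
  - name: csi-provisioner
    image: k8s.gcr.io/sig-storage/csi-provisioner:v2.1.0
    args:
      - --csi-address=/csi/csi.sock
      - --node-deployment=true        # enable distributed provisioning on this node
    env:
      - name: NODE_NAME               # tells the provisioner which node it is responsible for
        valueFrom:
          fieldRef:
            fieldPath: spec.nodeName
    volumeMounts:
      - name: socket-dir
        mountPath: /csi
```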

### Storage capacity tracking

A scheduler extender still needs some way to find out about capacity
on each node. When PMEM-CSI switched to distributed provisioning in
v0.9.0, this was done by querying the metrics data exposed by the
local driver containers. But it is better for users to eliminate the
need for a scheduler extender completely, because that makes the
driver deployment simpler. [Storage capacity
tracking](/docs/concepts/storage/storage-capacity/), [introduced in
1.19](/blog/2020/09/01/ephemeral-volumes-with-storage-capacity-tracking/)
and promoted to beta in Kubernetes 1.21, achieves that. It works by
publishing information about capacity in `CSIStorageCapacity`
objects. The scheduler itself then uses that information to filter out
unsuitable nodes. Because the information might not be quite
up-to-date, pods may still get assigned to nodes with insufficient
storage; it's just less likely, and the next scheduling attempt for a
pod should work better once the information has been refreshed.
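
These objects are normally created and updated by the
external-provisioner, not by hand. For illustration, a published
object could look roughly like this (object name, namespace and the
topology label are made up; the real label depends on the driver):

```yaml
apiVersion: storage.k8s.io/v1beta1
kind: CSIStorageCapacity
metadata:
  name: csisc-worker-1-local-storage    # normally a generated name
  namespace: kube-system                # the namespace the driver runs in
storageClassName: local-storage         # the hypothetical StorageClass from above
nodeTopology:                           # which node(s) this capacity applies to
  matchLabels:
    topology.local.csi.example.org/node: worker-1
capacity: 5Gi                           # free space reported by the driver
```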

### Generic ephemeral volumes

So CSI drivers still need the ability to recover from a bad scheduling
decision, something that turned out to be impossible to implement for
"CSI ephemeral volumes". ["*Generic* ephemeral
volumes"](/docs/concepts/storage/ephemeral-volumes/#generic-ephemeral-volumes),
another feature that got promoted to beta in 1.21, don't have that
limitation. This feature adds a controller that will create and manage
PVCs with the lifetime of the Pod and therefore the normal recovery
mechanism also works for them. Existing storage drivers will be able
to process these PVCs without any new logic to handle this new
scenario.
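
Because the controller produces a normal PVC, the driver sees nothing
unusual. A sketch of a pod that uses a generic ephemeral volume,
reusing the hypothetical StorageClass from above:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: generic-ephemeral-example
spec:
  containers:
    - name: app
      image: busybox
      command: ["sleep", "3600"]
      volumeMounts:
        - name: scratch
          mountPath: /scratch
  volumes:
    - name: scratch
      ephemeral:                     # "generic ephemeral volume"
        volumeClaimTemplate:         # template for the PVC created with the pod's lifetime
          spec:
            accessModes: ["ReadWriteOnce"]
            storageClassName: local-storage   # hypothetical StorageClass
            resources:
              requests:
                storage: 1Gi
```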

## Known limitations

Both generic ephemeral volumes and storage capacity tracking increase
the load on the API server. Whether that is a problem depends a lot on
the kind of workload, in particular how many pods have volumes and how
often those need to be created and destroyed.

No attempt was made to model how scheduling decisions affect storage
capacity. That's because the effect can vary considerably depending on
how the storage system handles storage. The consequence is that
multiple pods with unbound volumes might get assigned to the same node
even though there is only sufficient capacity for one pod. Scheduling
should recover, but it would be more efficient if the scheduler knew
more about storage.

Because storage capacity gets published by a running CSI driver and
the cluster autoscaler needs information about a node that hasn't been
created yet, it will currently not scale up a cluster for pods that
need volumes. There is an [idea how to provide that
information](https://github.com/kubernetes/autoscaler/pull/3887), but
more work is needed in that area.

Distributed snapshotting and resizing are not currently supported. It
should be doable to adapt the respective sidecars, and tracking issues
for external-snapshotter and external-resizer are already open; they
just need a volunteer.

The recovery from a bad scheduling decision can fail for pods with
multiple volumes, in particular when those volumes are local to nodes:
if one volume can be created and then storage is insufficient for
another volume, the first volume continues to exist and forces the
scheduler to put the pod onto the node of that volume. There is an
idea for how to deal with this, [rolling back the provisioning of the
volume](https://github.com/kubernetes/enhancements/pull/1703), but
this is only in the very early stages of brainstorming and not even a
merged KEP yet. For now it is better to avoid creating pods with more
than one persistent volume.

## Enabling the new features and next steps

With these features entering beta in the 1.21 release, no additional
actions are needed to enable them. Generic ephemeral volumes also work
without changes in CSI drivers. For more information, see the
[documentation](/docs/concepts/storage/ephemeral-volumes/#generic-ephemeral-volumes)
and the [previous blog
post](/blog/2020/09/01/ephemeral-volumes-with-storage-capacity-tracking/)
about it. The API has not changed at all between alpha and beta.

For the other two features, the external-provisioner documentation
explains how CSI driver developers must change how their driver gets
deployed to support [storage capacity
tracking](https://github.com/kubernetes-csi/external-provisioner#capacity-support)
and [distributed
provisioning](https://github.com/kubernetes-csi/external-provisioner#deployment-on-each-node).
These two features are independent of each other, so it is okay to
enable only one of them.
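
For capacity tracking, the driver deployment also has to announce
support for it in its CSIDriver object, in addition to the
provisioner-side changes described in that documentation. A sketch,
again with a placeholder driver name:

```yaml
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: local.csi.example.org    # hypothetical driver name
spec:
  storageCapacity: true          # let the scheduler consult CSIStorageCapacity objects
  volumeLifecycleModes:
    - Persistent
```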

[SIG
Storage](https://github.com/kubernetes/community/tree/master/sig-storage)
would like to hear from you if you are using these new features. We
can be reached through
[email](https://groups.google.com/forum/#!forum/kubernetes-sig-storage),
[Slack](https://slack.k8s.io/) (channel [`#sig-storage`](https://kubernetes.slack.com/messages/sig-storage)) and in the
[regular SIG
meeting](https://github.com/kubernetes/community/tree/master/sig-storage#meeting).
A description of your workload would be very useful to validate design
decisions, set up performance tests and eventually promote these
features to GA.

## Acknowledgements

Thanks a lot to the members of the community who have contributed to
these features or given feedback, including members of SIG Scheduling,
SIG Auth, and of course SIG Storage!