---
layout: blog
title: "Local Storage: Storage Capacity Tracking, Distributed Provisioning and Generic Ephemeral Volumes hit Beta"
date: 2021-04-14
slug: local-storage-features-go-beta
---

**Authors:** Patrick Ohly (Intel)

The ["generic ephemeral volumes"](/docs/concepts/storage/ephemeral-volumes/#generic-ephemeral-volumes) and ["storage capacity tracking"](/docs/concepts/storage/storage-capacity/) features in Kubernetes are getting promoted to beta in Kubernetes 1.21. Together with the [distributed provisioning support](https://github.com/kubernetes-csi/external-provisioner#deployment-on-each-node) in the CSI external-provisioner, development and deployment of Container Storage Interface (CSI) drivers which manage storage locally on a node become a lot easier.

This blog post explains how such drivers worked before and how these features can be used to make drivers simpler.

## Problems we are solving

There are drivers for local storage, like [TopoLVM](https://github.com/cybozu-go/topolvm) for traditional disks and [PMEM-CSI](https://intel.github.io/pmem-csi/latest/README.html) for [persistent memory](https://pmem.io/). They work and are ready for use today, even on older Kubernetes releases, but making that possible was not trivial.

### Central component required

The first problem is volume provisioning: it is handled through the Kubernetes control plane. Some component must react to [PersistentVolumeClaims](/docs/concepts/storage/persistent-volumes/#persistentvolumeclaims) (PVCs) and create volumes. Usually, that is handled by a central deployment of the [CSI external-provisioner](https://kubernetes-csi.github.io/docs/external-provisioner.html) and a CSI driver component that then connects to the storage backplane. But for local storage, there is no such backplane.

TopoLVM solved this by having its different components communicate with each other through the Kubernetes API server by creating and reacting to custom resources. So although TopoLVM is based on CSI, a standard that is independent of any particular container orchestrator, TopoLVM only works on Kubernetes.

PMEM-CSI created its own storage backplane with communication through gRPC calls. Securing that communication depends on TLS certificates, which made driver deployment more complicated.

### Informing the Pod scheduler about capacity

The next problem is scheduling. When volumes get created independently of pods ("immediate binding"), the CSI driver must pick a node without knowing anything about the pod(s) that are going to use it. Topology information then forces those pods to run on the node where the volume was created. If other resources like RAM or CPU are exhausted there, the pod cannot start. This can be avoided by configuring in the StorageClass that volume creation is meant to wait for the first pod that uses a volume (`volumeBindingMode: WaitForFirstConsumer`). In that mode, the Kubernetes scheduler tentatively picks a node based on other constraints and then the external-provisioner is asked to create a volume such that it is usable there. If local storage is exhausted, the provisioner [can ask](https://github.com/kubernetes-csi/external-provisioner/blob/master/doc/design.md) for another scheduling round. But without information about available capacity, the scheduler might always pick the same unsuitable node.
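
For example, a CSI driver for node-local volumes would typically ship with a StorageClass like the following. The driver name is a hypothetical placeholder; the important part here is the binding mode:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: local.csi.example.org    # hypothetical CSI driver name
volumeBindingMode: WaitForFirstConsumer
```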

Both TopoLVM and PMEM-CSI solved this with scheduler extenders. This works, but it is hard to configure when deploying the driver because communication between kube-scheduler and the driver is very dependent on how the cluster was set up.
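
To illustrate why this is fragile: a scheduler extender has to be registered in the kube-scheduler configuration, which cluster administrators manage differently on every distribution. A rough sketch of such an entry; the service URL and endpoint are hypothetical and depend entirely on how the driver exposes its extender:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
extenders:
  - urlPrefix: https://local-storage-scheduler.kube-system.svc/  # hypothetical service
    filterVerb: filter       # endpoint that filters out nodes without enough capacity
    nodeCacheCapable: true
    ignorable: true          # don't block all scheduling if the extender is down
```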

### Rescheduling

A common use case for local storage is scratch space. Ephemeral volumes, which get created for a pod and destroyed together with it, are a better fit for that use case than persistent volumes. The initial API for supporting ephemeral volumes with CSI drivers (hence called ["*CSI* ephemeral volumes"](/docs/concepts/storage/ephemeral-volumes/#csi-ephemeral-volumes)) was [designed for lightweight volumes](https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/20190122-csi-inline-volumes.md) where volume creation is unlikely to fail. Volume creation happens after pods have been permanently scheduled onto a node, in contrast to traditional provisioning, where volume creation is attempted before scheduling a pod onto a node. CSI drivers must be modified to support "CSI ephemeral volumes", which was done for TopoLVM and PMEM-CSI. But due to the design of the feature in Kubernetes, pods can get stuck permanently if storage capacity runs out on a node. The scheduler extenders try to avoid that, but cannot be 100% reliable.
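
For reference, this is how a pod requests a *CSI* ephemeral volume: inline in the pod spec, with driver-specific attributes. The driver name and the `size` attribute below are hypothetical; each driver defines its own attributes:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: csi-ephemeral-example
spec:
  containers:
    - name: app
      image: k8s.gcr.io/pause:3.2
      volumeMounts:
        - name: scratch
          mountPath: /scratch
  volumes:
    - name: scratch
      csi:
        driver: local.csi.example.org   # hypothetical CSI driver name
        volumeAttributes:
          size: 1Gi                     # interpretation is driver-specific
```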

## Enhancements in Kubernetes 1.21

### Distributed provisioning

Starting with [external-provisioner v2.1.0](https://github.com/kubernetes-csi/external-provisioner/releases/tag/v2.1.0), released for Kubernetes 1.20, provisioning can be handled by external-provisioner instances that get [deployed together with the CSI driver on each node](https://github.com/kubernetes-csi/external-provisioner#deployment-on-each-node) and then cooperate to provision volumes ("distributed provisioning"). There is no longer any need for a central component and thus no need for communication between nodes, at least not for provisioning.
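
In practice, this means running the external-provisioner as a sidecar in each pod of the CSI driver's DaemonSet instead of in a central Deployment. A minimal sketch of that sidecar container, based on the external-provisioner README; the image version and socket path are examples, not requirements:

```yaml
# Sidecar container in the CSI driver's DaemonSet pod template.
- name: csi-provisioner
  image: k8s.gcr.io/sig-storage/csi-provisioner:v2.1.0
  args:
    - --csi-address=/csi/csi.sock
    - --node-deployment=true     # enable distributed provisioning
  env:
    # Each instance manages only volumes on its own node.
    - name: NODE_NAME
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName
  volumeMounts:
    - name: socket-dir           # shared with the CSI driver container
      mountPath: /csi
```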

### Storage capacity tracking

A scheduler extender still needs some way to find out about capacity on each node. When PMEM-CSI switched to distributed provisioning in v0.9.0, this was done by querying the metrics data exposed by the local driver containers. But it is better also for users to eliminate the need for a scheduler extender completely, because that makes the driver deployment simpler. [Storage capacity tracking](/docs/concepts/storage/storage-capacity/), [introduced in 1.19](/blog/2020/09/01/ephemeral-volumes-with-storage-capacity-tracking/) and promoted to beta in Kubernetes 1.21, achieves that. It works by publishing information about capacity in `CSIStorageCapacity` objects. The scheduler itself then uses that information to filter out unsuitable nodes. Because the information might not be quite up-to-date, pods may still get assigned to nodes with insufficient storage; it's just less likely, and the next scheduling attempt for a pod should work better once the information has been refreshed.
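
A published `CSIStorageCapacity` object looks roughly like this. The external-provisioner creates one per storage class and topology segment; all names, labels, and quantities below are illustrative:

```yaml
apiVersion: storage.k8s.io/v1beta1    # beta API in Kubernetes 1.21
kind: CSIStorageCapacity
metadata:
  name: csisc-worker-1-local-storage  # generated name, illustrative
  namespace: kube-system              # namespace of the driver deployment
storageClassName: local-storage
nodeTopology:
  matchLabels:
    topology.local.csi.example.org/node: worker-1   # hypothetical topology key
capacity: 92Gi
```

They can be inspected with `kubectl get csistoragecapacities --all-namespaces`.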

### Generic ephemeral volumes

So CSI drivers still need the ability to recover from a bad scheduling decision, something that turned out to be impossible to implement for "CSI ephemeral volumes". ["*Generic* ephemeral volumes"](/docs/concepts/storage/ephemeral-volumes/#generic-ephemeral-volumes), another feature that got promoted to beta in 1.21, don't have that limitation. This feature adds a controller that creates and manages PVCs with the lifetime of the Pod, and therefore the normal recovery mechanism also works for them. Existing storage drivers will be able to process these PVCs without any new logic to handle this new scenario.
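
A pod requests a generic ephemeral volume inline, with a full PVC template. A minimal sketch; the storage class and size are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: generic-ephemeral-example
spec:
  containers:
    - name: app
      image: k8s.gcr.io/pause:3.2
      volumeMounts:
        - name: scratch
          mountPath: /scratch
  volumes:
    - name: scratch
      ephemeral:
        volumeClaimTemplate:
          spec:
            accessModes: ["ReadWriteOnce"]
            storageClassName: local-storage
            resources:
              requests:
                storage: 1Gi
```

The controller creates a PVC named after the pod and the volume (here `generic-ephemeral-example-scratch`), which any existing provisioner can then handle like a manually created PVC.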

## Known limitations

Both generic ephemeral volumes and storage capacity tracking increase the load on the API server. Whether that is a problem depends a lot on the kind of workload, in particular how many pods have volumes and how often those need to be created and destroyed.

No attempt was made to model how scheduling decisions affect storage capacity. That's because the effect can vary considerably depending on how the storage system handles storage. The consequence is that multiple pods with unbound volumes might get assigned to the same node even though there is only sufficient capacity for one pod. Scheduling should recover, but it would be more efficient if the scheduler knew more about storage.

Because storage capacity gets published by a running CSI driver and the cluster autoscaler needs information about a node that hasn't been created yet, the autoscaler will currently not scale up a cluster for pods that need volumes. There is an [idea for how to provide that information](https://github.com/kubernetes/autoscaler/pull/3887), but more work is needed in that area.

Distributed snapshotting and resizing are not currently supported. Adapting the respective sidecars should be doable, and tracking issues are already open for external-snapshotter and external-resizer; they just need a volunteer.

The recovery from a bad scheduling decision can fail for pods with multiple volumes, in particular when those volumes are local to nodes: if one volume can be created and then storage is insufficient for another volume, the first volume continues to exist and forces the scheduler to put the pod onto the node of that volume. There is an idea for how to deal with this, [rolling back the provisioning of the volume](https://github.com/kubernetes/enhancements/pull/1703), but it is only in the very early stages of brainstorming and not even a merged KEP yet. For now it is better to avoid creating pods with more than one persistent volume.

## Enabling the new features and next steps

With the features entering beta in the 1.21 release, no additional actions are needed to enable them. Generic ephemeral volumes also work without changes in CSI drivers. For more information, see the [documentation](/docs/concepts/storage/ephemeral-volumes/#generic-ephemeral-volumes) and the [previous blog post](/blog/2020/09/01/ephemeral-volumes-with-storage-capacity-tracking/) about it. The API has not changed at all between alpha and beta.

For the other two features, the external-provisioner documentation explains the changes that CSI driver developers must make to how their driver gets deployed in order to support [storage capacity tracking](https://github.com/kubernetes-csi/external-provisioner#capacity-support) and [distributed provisioning](https://github.com/kubernetes-csi/external-provisioner#deployment-on-each-node). These two features are independent of each other, therefore it is okay to enable only one of them.
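
For storage capacity tracking, this boils down to a few more flags and environment variables for the csi-provisioner sidecar shown earlier. A sketch based on the external-provisioner v2.1.0 README; check the linked documentation for the authoritative list:

```yaml
  args:
    - --enable-capacity             # publish CSIStorageCapacity objects
    - --capacity-ownerref-level=1   # owner is the pod's owner, here the DaemonSet
  env:
    # Used to determine the namespace and owner of the published objects.
    - name: POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    - name: NAMESPACE
      valueFrom:
        fieldRef:
          fieldPath: metadata.namespace
```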

[SIG Storage](https://github.com/kubernetes/community/tree/master/sig-storage) would like to hear from you if you are using these new features. We can be reached through [email](https://groups.google.com/forum/#!forum/kubernetes-sig-storage), [Slack](https://slack.k8s.io/) (channel [`#sig-storage`](https://kubernetes.slack.com/messages/sig-storage)), and in the [regular SIG meeting](https://github.com/kubernetes/community/tree/master/sig-storage#meeting). A description of your workload would be very useful to validate design decisions, set up performance tests, and eventually promote these features to GA.

## Acknowledgements

Thanks a lot to the members of the community who have contributed to these features or given feedback, including members of SIG Scheduling, SIG Auth, and of course SIG Storage!