Commit a3d8015

storage capacity: move autoscaler prototype to alternatives

This makes it clear that the prototype is not the suggested solution.

1 parent 73cb6a3

File tree
  • keps/sig-storage/1472-storage-capacity-tracking

1 file changed, +83 -61 lines changed


keps/sig-storage/1472-storage-capacity-tracking/README.md

Lines changed: 83 additions & 61 deletions
@@ -63,6 +63,7 @@
 - [Example: local storage](#example-local-storage-2)
 - [Example: affect of storage classes](#example-affect-of-storage-classes-2)
 - [Example: network attached storage](#example-network-attached-storage-2)
+- [Generic autoscaler support](#generic-autoscaler-support)
 - [Prior work](#prior-work)
 <!-- /toc -->

@@ -1224,69 +1225,21 @@ that logic is implemented inside the CSI driver. The CSI driver doesn't know
 about hardware that hasn't been provisioned yet and doesn't know about
 autoscaling.
 
-This problem can be solved by the cluster administrator. They can find out how
-much storage will be made available by new nodes, for example by running
-experiments, and then configure the cluster so that this information is
-available to the autoscaler. This can be done with the existing
-CSIStorageCapacity API for node-local storage as follows:
+One potential approach gets discussed [below](#generic-autoscaler-support)
+under alternatives. However, it puts a considerable burden on the cluster
+administrator to configure everything correctly and only scales up a cluster
+one node at a time.
 
-- When creating a fictional Node object from an existing Node in
-  a node group, autoscaler must modify the topology labels of the CSI
-  driver(s) in the cluster so that they define a new topology segment.
-  For example, topology.hostpath.csi/node=aks-workerpool.* has to
-  be replaced with topology.hostpath.csi/node=aks-workerpool-template.
-  Because these labels are opaque to the autoscaler, the cluster
-  administrator must configure these transformations, for example
-  via regular expression search/replace.
-- For scale up from zero, a label like
-  topology.hostpath.csi/node=aks-workerpool-template must be added to the
-  configuration of the node pool.
-- For each storage class, the cluster administrator can then create
-  CSIStorageCapacity objects that provide the capacity information for these
-  fictional topology segments.
-- When the volume binder plugin for the scheduler runs inside the autoscaler,
-  it works exactly as in the scheduler and will accept nodes where the manually
-  created CSIStorageCapacity indicate that sufficient storage is (or rather,
-  will be) available.
-- Because the CSI driver will not run immediately on new nodes, autoscaler has
-  to wait for it before considering the node ready. If it doesn't do that, it
-  might incorrectly scale up further because storage capacity checks will fail
-  for a new, unused node until the CSI driver provides CSIStorageCapacity
-  objects for it. This can be implemented in a generic way for all CSI drivers
-  by adding a readiness check to the autoscaler that compares the existing
-  CSIStorageCapacity objects against the expected ones for the fictional node.
+To address these two problems, further work is needed to determine:
+- How the CSI driver can provide information to the autoscaler
+  to enable simulated volume provisioning (total capacity
+  of a pristine simulated node, constraints for volume sizes in
+  the storage system).
+- How to use that information to support batch scheduling in the
+  autoscaler.
 
-A proof-of-concept of this approach is available in
-https://github.com/kubernetes/autoscaler/pull/3887 and has been used
-successfully to scale an Azure cluster up and down with csi-driver-host-path as
-CSI driver. However, due to the lack of storage capacity modeling, scale up
-happens slowly and configuring the cluster correctly is complex. Whether that
-is good enough or insufficient depends on the use cases for storage in a
-cluster where autoscaling is enabled. The current understanding is that further
-work is needed.
-
-To improve scale up speed, the scheduler would have to take volumes that are in
-the process of being provisioned into account when deciding about other
-suitable nodes. This might not be the right decision for all CSI drivers, so
-further exploration and potentially an extension of the CSI API ("total
-capacity") will be needed.
-
-The approach above preserves the separation between the different
-components. Simpler solutions may be possible by adding support for specific
-CSI drivers into custom autoscaler binaries or into operators that control the
-cluster setup.
-
-Alternatively, additional information provided by the CSI driver might make it
-possible to simplify the cluster configuration, for example by providing
-machine-readable instructions for how labels should be changed.
-
-Network attached storage doesn't need renaming of labels when cloning an
-existing Node. The information published for that Node is also valid for the
-fictional one. Scale up from zero however is problematic: the CSI specification
-does not support listing topology segments that don't have some actual Nodes
-with a running CSI driver on them. Either a CSI specification change or manual
-configuration of the external-provisioner sidecar will be needed to close this
-gap.
+Depending on whether changes are needed in Kubernetes itself, this could be
+done in a new KEP or in a design document for the autoscaler.
 
 ### Alternative solutions
 
@@ -1752,6 +1705,75 @@ status:
 - us-west-1
 ```
 
+
+### Generic autoscaler support
+
+The problem of providing information about fictional nodes
+can be solved by the cluster administrator. They can find out how
+much storage will be made available by new nodes, for example by running
+experiments, and then configure the cluster so that this information is
+available to the autoscaler. This can be done with the existing
+CSIStorageCapacity API for node-local storage as follows:
+
+- When creating a fictional Node object from an existing Node in
+  a node group, autoscaler must modify the topology labels of the CSI
+  driver(s) in the cluster so that they define a new topology segment.
+  For example, topology.hostpath.csi/node=aks-workerpool.* has to
+  be replaced with topology.hostpath.csi/node=aks-workerpool-template.
+  Because these labels are opaque to the autoscaler, the cluster
+  administrator must configure these transformations, for example
+  via regular expression search/replace.
+- For scale up from zero, a label like
+  topology.hostpath.csi/node=aks-workerpool-template must be added to the
+  configuration of the node pool.
+- For each storage class, the cluster administrator can then create
+  CSIStorageCapacity objects that provide the capacity information for these
+  fictional topology segments.
+- When the volume binder plugin for the scheduler runs inside the autoscaler,
+  it works exactly as in the scheduler and will accept nodes where the manually
+  created CSIStorageCapacity indicate that sufficient storage is (or rather,
+  will be) available.
+- Because the CSI driver will not run immediately on new nodes, autoscaler has
+  to wait for it before considering the node ready. If it doesn't do that, it
+  might incorrectly scale up further because storage capacity checks will fail
+  for a new, unused node until the CSI driver provides CSIStorageCapacity
+  objects for it. This can be implemented in a generic way for all CSI drivers
+  by adding a readiness check to the autoscaler that compares the existing
+  CSIStorageCapacity objects against the expected ones for the fictional node.
+
+A proof-of-concept of this approach is available in
+https://github.com/kubernetes/autoscaler/pull/3887 and has been used
+successfully to scale an Azure cluster up and down with csi-driver-host-path as
+CSI driver. However, due to the lack of storage capacity modeling, scale up
+happens slowly and configuring the cluster correctly is complex. Whether that
+is good enough or insufficient depends on the use cases for storage in a
+cluster where autoscaling is enabled. The current understanding is that further
+work is needed.
+
+To improve scale up speed, the scheduler would have to take volumes that are in
+the process of being provisioned into account when deciding about other
+suitable nodes. This might not be the right decision for all CSI drivers, so
+further exploration and potentially an extension of the CSI API ("total
+capacity") will be needed.
+
+The approach above preserves the separation between the different
+components. Simpler solutions may be possible by adding support for specific
+CSI drivers into custom autoscaler binaries or into operators that control the
+cluster setup.
+
+Alternatively, additional information provided by the CSI driver might make it
+possible to simplify the cluster configuration, for example by providing
+machine-readable instructions for how labels should be changed.
+
+Network attached storage doesn't need renaming of labels when cloning an
+existing Node. The information published for that Node is also valid for the
+fictional one. Scale up from zero however is problematic: the CSI specification
+does not support listing topology segments that don't have some actual Nodes
+with a running CSI driver on them. Either a CSI specification change or manual
+configuration of the external-provisioner sidecar will be needed to close this
+gap.
+
+
 ### Prior work
 
 The [Topology-aware storage dynamic
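A manually created CSIStorageCapacity object for a fictional topology segment, as described in the moved section, might look roughly like this. Only the topology.hostpath.csi/node=aks-workerpool-template label comes from the text; the object name, namespace, storage class, capacity figure, and API version are illustrative (the API group version depends on the cluster release):

```yaml
# Hypothetical example created by the cluster administrator, one per
# storage class, for the fictional "template" topology segment.
apiVersion: storage.k8s.io/v1beta1
kind: CSIStorageCapacity
metadata:
  name: aks-workerpool-template-fast
  namespace: kube-system
storageClassName: fast
nodeTopology:
  matchLabels:
    topology.hostpath.csi/node: aks-workerpool-template
capacity: 100Gi
```

The volume binder plugin inside the autoscaler would then treat this capacity as if the CSI driver had published it for a real node.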
