|
63 | 63 | - [Example: local storage](#example-local-storage-2)
|
64 | 64 | - [Example: affect of storage classes](#example-affect-of-storage-classes-2)
|
65 | 65 | - [Example: network attached storage](#example-network-attached-storage-2)
|
| 66 | + - [Generic autoscaler support](#generic-autoscaler-support) |
66 | 67 | - [Prior work](#prior-work)
|
67 | 68 | <!-- /toc -->
|
68 | 69 |
|
@@ -1224,69 +1225,21 @@ that logic is implemented inside the CSI driver. The CSI driver doesn't know
|
1224 | 1225 | about hardware that hasn't been provisioned yet and doesn't know about
|
1225 | 1226 | autoscaling.
|
1226 | 1227 |
|
1227 |
| -This problem can be solved by the cluster administrator. They can find out how |
1228 |
| -much storage will be made available by new nodes, for example by running |
1229 |
| -experiments, and then configure the cluster so that this information is |
1230 |
| -available to the autoscaler. This can be done with the existing |
1231 |
| -CSIStorageCapacity API for node-local storage as follows: |
| 1228 | +One potential approach is discussed [below](#generic-autoscaler-support) |
| 1229 | +under alternatives. However, it puts a considerable burden on the cluster |
| 1230 | +administrator to configure everything correctly and only scales up a cluster |
| 1231 | +one node at a time. |
1232 | 1232 |
|
1233 |
| -- When creating a fictional Node object from an existing Node in |
1234 |
| - a node group, autoscaler must modify the topology labels of the CSI |
1235 |
| - driver(s) in the cluster so that they define a new topology segment. |
1236 |
| - For example, topology.hostpath.csi/node=aks-workerpool.* has to |
1237 |
| - be replaced with topology.hostpath.csi/node=aks-workerpool-template. |
1238 |
| - Because these labels are opaque to the autoscaler, the cluster |
1239 |
| - administrator must configure these transformations, for example |
1240 |
| - via regular expression search/replace. |
1241 |
| -- For scale up from zero, a label like |
1242 |
| - topology.hostpath.csi/node=aks-workerpool-template must be added to the |
1243 |
| - configuration of the node pool. |
1244 |
| -- For each storage class, the cluster administrator can then create |
1245 |
| - CSIStorageCapacity objects that provide the capacity information for these |
1246 |
| - fictional topology segments. |
1247 |
| -- When the volume binder plugin for the scheduler runs inside the autoscaler, |
1248 |
| - it works exactly as in the scheduler and will accept nodes where the manually |
1249 |
| - created CSIStorageCapacity indicate that sufficient storage is (or rather, |
1250 |
| - will be) available. |
1251 |
| -- Because the CSI driver will not run immediately on new nodes, autoscaler has |
1252 |
| - to wait for it before considering the node ready. If it doesn't do that, it |
1253 |
| - might incorrectly scale up further because storage capacity checks will fail |
1254 |
| - for a new, unused node until the CSI driver provides CSIStorageCapacity |
1255 |
| - objects for it. This can be implemented in a generic way for all CSI drivers |
1256 |
| - by adding a readiness check to the autoscaler that compares the existing |
1257 |
| - CSIStorageCapacity objects against the expected ones for the fictional node. |
| 1233 | +To address these two problems, further work is needed to determine: |
| 1234 | +- How the CSI driver can provide information to the autoscaler |
| 1235 | + to enable simulated volume provisioning (total capacity |
| 1236 | + of a pristine simulated node, constraints for volume sizes in |
| 1237 | + the storage system). |
| 1238 | +- How to use that information to support batch scheduling in the |
| 1239 | + autoscaler. |
1258 | 1240 |
|
1259 |
| -A proof-of-concept of this approach is available in |
1260 |
| -https://github.com/kubernetes/autoscaler/pull/3887 and has been used |
1261 |
| -successfully to scale an Azure cluster up and down with csi-driver-host-path as |
1262 |
| -CSI driver. However, due to the lack of storage capacity modeling, scale up |
1263 |
| -happens slowly and configuring the cluster correctly is complex. Whether that |
1264 |
| -is good enough or insufficient depends on the use cases for storage in a |
1265 |
| -cluster where autoscaling is enabled. The current understanding is that further |
1266 |
| -work is needed. |
1267 |
| - |
1268 |
| -To improve scale up speed, the scheduler would have to take volumes that are in |
1269 |
| -the process of being provisioned into account when deciding about other |
1270 |
| -suitable nodes. This might not be the right decision for all CSI drivers, so |
1271 |
| -further exploration and potentially an extension of the CSI API ("total |
1272 |
| -capacity") will be needed. |
1273 |
| - |
1274 |
| -The approach above preserves the separation between the different |
1275 |
| -components. Simpler solutions may be possible by adding support for specific |
1276 |
| -CSI drivers into custom autoscaler binaries or into operators that control the |
1277 |
| -cluster setup. |
1278 |
| - |
1279 |
| -Alternatively, additional information provided by the CSI driver might make it |
1280 |
| -possible to simplify the cluster configuration, for example by providing |
1281 |
| -machine-readable instructions for how labels should be changed. |
1282 |
| - |
1283 |
| -Network attached storage doesn't need renaming of labels when cloning an |
1284 |
| -existing Node. The information published for that Node is also valid for the |
1285 |
| -fictional one. Scale up from zero however is problematic: the CSI specification |
1286 |
| -does not support listing topology segments that don't have some actual Nodes |
1287 |
| -with a running CSI driver on them. Either a CSI specification change or manual |
1288 |
| -configuration of the external-provisioner sidecar will be needed to close this |
1289 |
| -gap. |
| 1241 | +Depending on whether changes are needed in Kubernetes itself, this could be |
| 1242 | +done in a new KEP or in a design document for the autoscaler. |
1290 | 1243 |
|
1291 | 1244 | ### Alternative solutions
|
1292 | 1245 |
|
@@ -1752,6 +1705,75 @@ status:
|
1752 | 1705 | - us-west-1
|
1753 | 1706 | ```
|
1754 | 1707 |
|
| 1708 | + |
| 1709 | +### Generic autoscaler support |
| 1710 | + |
| 1711 | +The problem of providing information about fictional nodes |
| 1712 | +can be solved by the cluster administrator. They can find out how |
| 1713 | +much storage will be made available by new nodes, for example by running |
| 1714 | +experiments, and then configure the cluster so that this information is |
| 1715 | +available to the autoscaler. This can be done with the existing |
| 1716 | +CSIStorageCapacity API for node-local storage as follows: |
| 1717 | + |
| 1718 | +- When creating a fictional Node object from an existing Node in |
| 1719 | + a node group, the autoscaler must modify the topology labels of the CSI |
| 1720 | + driver(s) in the cluster so that they define a new topology segment. |
| 1721 | + For example, topology.hostpath.csi/node=aks-workerpool.* has to |
| 1722 | + be replaced with topology.hostpath.csi/node=aks-workerpool-template. |
| 1723 | + Because these labels are opaque to the autoscaler, the cluster |
| 1724 | + administrator must configure these transformations, for example |
| 1725 | + via regular expression search/replace. |
| 1726 | +- For scale up from zero, a label like |
| 1727 | + topology.hostpath.csi/node=aks-workerpool-template must be added to the |
| 1728 | + configuration of the node pool. |
| 1729 | +- For each storage class, the cluster administrator can then create |
| 1730 | + CSIStorageCapacity objects that provide the capacity information for these |
| 1731 | + fictional topology segments. |
| 1732 | +- When the volume binder plugin for the scheduler runs inside the autoscaler, |
| 1733 | + it works exactly as in the scheduler and will accept nodes where the manually |
| 1734 | + created CSIStorageCapacity objects indicate that sufficient storage is (or rather, |
| 1735 | + will be) available. |
| 1736 | +- Because the CSI driver will not run immediately on new nodes, the autoscaler has |
| 1737 | + to wait for it before considering the node ready. If it doesn't do that, it |
| 1738 | + might incorrectly scale up further because storage capacity checks will fail |
| 1739 | + for a new, unused node until the CSI driver provides CSIStorageCapacity |
| 1740 | + objects for it. This can be implemented in a generic way for all CSI drivers |
| 1741 | + by adding a readiness check to the autoscaler that compares the existing |
| 1742 | + CSIStorageCapacity objects against the expected ones for the fictional node. |
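
The label rewriting described in the first bullet can be illustrated with a minimal sketch. It is not the autoscaler's implementation: the rule format, the function name `rewrite_topology_labels`, and the example node name are assumptions made for illustration. Only the label key and the mapping from `aks-workerpool.*` to `aks-workerpool-template` come from the text above.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical rule format chosen for this sketch:
# (label key, regular expression for the value, literal replacement value).
LabelRule = Tuple[str, str, str]


def rewrite_topology_labels(labels: Dict[str, str],
                            rules: List[LabelRule]) -> Dict[str, str]:
    """Return a copy of the labels with the configured rewrites applied."""
    result = dict(labels)
    for key, pattern, replacement in rules:
        value = result.get(key)
        # Only rewrite when the whole value matches the configured pattern.
        if value is not None and re.fullmatch(pattern, value):
            result[key] = replacement
    return result


# The transformation from the example in the text.
RULES = [
    ("topology.hostpath.csi/node", r"aks-workerpool.*", "aks-workerpool-template"),
]

# Labels of an existing node in the node group (the node name is made up).
existing_node_labels = {
    "kubernetes.io/hostname": "aks-workerpool-1",
    "topology.hostpath.csi/node": "aks-workerpool-1",
}

# Labels for the fictional node: a new topology segment, other labels untouched.
template_labels = rewrite_topology_labels(existing_node_labels, RULES)
```

Because the rules are plain data, the cluster administrator could supply them per CSI driver without the autoscaler having to understand what the labels mean.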
| 1743 | + |
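The readiness check from the last bullet could look roughly like the following sketch. It assumes a simplified in-memory model: the `Capacity` class stands in for a CSIStorageCapacity object with only a storage class name and exact-match topology labels, and `storage_ready`, the storage class name `fast`, and the node name are hypothetical, not existing autoscaler code or cluster data.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Capacity:
    """Simplified stand-in for a CSIStorageCapacity object (assumption)."""
    storage_class: str
    match_labels: Dict[str, str]  # reduced form of the node topology selector


def classes_for(labels: Dict[str, str], capacities: List[Capacity]) -> Set[str]:
    """Storage classes that have a capacity object selecting these labels."""
    return {
        c.storage_class
        for c in capacities
        if all(labels.get(k) == v for k, v in c.match_labels.items())
    }


def storage_ready(node_labels: Dict[str, str],
                  template_labels: Dict[str, str],
                  capacities: List[Capacity]) -> bool:
    """A node is ready once every storage class expected for the fictional
    node also has a capacity object published for the real node."""
    expected = classes_for(template_labels, capacities)
    published = classes_for(node_labels, capacities)
    return expected <= published


KEY = "topology.hostpath.csi/node"
template = {KEY: "aks-workerpool-template"}
new_node = {KEY: "aks-workerpool-1"}  # made-up node name

# Only the manually created object for the fictional segment exists so far.
capacities = [Capacity("fast", {KEY: "aks-workerpool-template"})]
ready_before = storage_ready(new_node, template, capacities)

# The CSI driver has started on the new node and published its own object.
capacities.append(Capacity("fast", {KEY: "aks-workerpool-1"}))
ready_after = storage_ready(new_node, template, capacities)
```

In this model the check needs no driver-specific knowledge: it only compares which storage classes have capacity objects for the fictional segment with those published for the real node.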
| 1744 | +A proof-of-concept of this approach is available in |
| 1745 | +https://github.com/kubernetes/autoscaler/pull/3887 and has been used |
| 1746 | +successfully to scale an Azure cluster up and down with csi-driver-host-path as |
| 1747 | +the CSI driver. However, due to the lack of storage capacity modeling, scale up |
| 1748 | +happens slowly and configuring the cluster correctly is complex. Whether that |
| 1749 | +is good enough or insufficient depends on the use cases for storage in a |
| 1750 | +cluster where autoscaling is enabled. The current understanding is that further |
| 1751 | +work is needed. |
| 1752 | + |
| 1753 | +To improve scale up speed, the scheduler would have to take volumes that are in |
| 1754 | +the process of being provisioned into account when deciding about other |
| 1755 | +suitable nodes. This might not be the right decision for all CSI drivers, so |
| 1756 | +further exploration and potentially an extension of the CSI API ("total |
| 1757 | +capacity") will be needed. |
| 1758 | + |
| 1759 | +The approach above preserves the separation between the different |
| 1760 | +components. Simpler solutions may be possible by adding support for specific |
| 1761 | +CSI drivers into custom autoscaler binaries or into operators that control the |
| 1762 | +cluster setup. |
| 1763 | + |
| 1764 | +Alternatively, additional information provided by the CSI driver might make it |
| 1765 | +possible to simplify the cluster configuration, for example by providing |
| 1766 | +machine-readable instructions for how labels should be changed. |
| 1767 | + |
| 1768 | +Network attached storage doesn't need renaming of labels when cloning an |
| 1769 | +existing Node. The information published for that Node is also valid for the |
| 1770 | +fictional one. Scale up from zero, however, is problematic: the CSI specification |
| 1771 | +does not support listing topology segments that don't have some actual Nodes |
| 1772 | +with a running CSI driver on them. Either a CSI specification change or manual |
| 1773 | +configuration of the external-provisioner sidecar will be needed to close this |
| 1774 | +gap. |
| 1775 | + |
| 1776 | + |
1755 | 1777 | ### Prior work
|
1756 | 1778 |
|
1757 | 1779 | The [Topology-aware storage dynamic
|
|