@@ -1189,9 +1189,54 @@ based on storage capacity:
to available storage and thus could run on a new node, the
simulation may decide otherwise.
- It may be possible to solve this by pre-configuring some information
- (local storage capacity of future nodes and their CSI topology). This
- needs to be explored further.
+ This gets further complicated by the independent development of CSI drivers,
+ autoscaler, and cloud provider: autoscaler and cloud provider don't know which
+ kinds of volumes a CSI driver will be able to make available on nodes because
+ that logic is implemented inside the CSI driver. The CSI driver doesn't know
+ about hardware that hasn't been provisioned yet and doesn't know about
+ autoscaling.
+
+ This problem can be solved by the cluster administrator. They can find out how
+ much storage will be made available by new nodes, for example by running
+ experiments, and then configure the cluster so that this information is
+ available to the autoscaler. This can be done with the existing
+ CSIStorageCapacity API as follows:
+
+ - When creating a fictional Node object from an existing Node in
+ a node group, autoscaler must modify the topology labels of the CSI
+ driver(s) in the cluster so that they define a new topology segment.
+ For example, topology.hostpath.csi/node=aks-workerpool.* has to
+ be replaced with topology.hostpath.csi/node=aks-workerpool-template.
+ Because these labels are opaque to the autoscaler, the cluster
+ administrator must configure these transformations, for example
+ via regular expression search/replace (see the sketch after this list).
+ - For scale up from zero, a label like
+ topology.hostpath.csi/node=aks-workerpool-template must be added to the
+ configuration of the node pool.
+ - For each storage class, the cluster administrator can then create
+ CSIStorageCapacity objects that provide the capacity information for these
+ fictional topology segments (an example object is sketched below).
+ - When the volume binder plugin for the scheduler runs inside the autoscaler,
+ it works exactly as in the scheduler and will accept nodes where the manually
+ created CSIStorageCapacity objects indicate that sufficient storage is (or rather,
+ will be) available.
+ - Because the CSI driver will not run immediately on new nodes, autoscaler has
+ to wait for it before considering the node ready. If it doesn't do that, it
+ might incorrectly scale up further because storage capacity checks will fail
+ for a new, unused node until the CSI driver provides CSIStorageCapacity
+ objects for it. This can be implemented in a generic way for all CSI drivers
+ by adding a readiness check to the autoscaler that compares the existing
+ CSIStorageCapacity objects against the expected ones for the fictional node.
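+
+ The label transformation in the first bullet could, for example, be driven by
+ administrator-provided rules consisting of a regular expression and a
+ replacement string. The following Go sketch shows one possible shape of such a
+ rule and how it would rewrite the labels of a copied Node object; the rule
+ format, the rewriteTopologyLabels helper, and the label values are
+ illustrative only and not part of any existing autoscaler API.
+
+ ```go
+ package main
+
+ import (
+ 	"fmt"
+ 	"regexp"
+ )
+
+ // LabelRewriteRule is an assumed configuration format: labels whose key
+ // equals Key get their value rewritten with Pattern -> Replacement.
+ type LabelRewriteRule struct {
+ 	Key         string
+ 	Pattern     *regexp.Regexp
+ 	Replacement string
+ }
+
+ // rewriteTopologyLabels returns a copy of a node's labels with the CSI
+ // topology values replaced so that they refer to the fictional topology
+ // segment of the node group template.
+ func rewriteTopologyLabels(labels map[string]string, rules []LabelRewriteRule) map[string]string {
+ 	out := make(map[string]string, len(labels))
+ 	for k, v := range labels {
+ 		out[k] = v
+ 		for _, r := range rules {
+ 			if k == r.Key {
+ 				out[k] = r.Pattern.ReplaceAllString(v, r.Replacement)
+ 			}
+ 		}
+ 	}
+ 	return out
+ }
+
+ func main() {
+ 	rules := []LabelRewriteRule{{
+ 		Key:         "topology.hostpath.csi/node",
+ 		Pattern:     regexp.MustCompile(`^aks-workerpool.*$`),
+ 		Replacement: "aks-workerpool-template",
+ 	}}
+ 	// Example labels of a real node in the node group.
+ 	labels := map[string]string{"topology.hostpath.csi/node": "aks-workerpool-vmss000003"}
+ 	fmt.Println(rewriteTopologyLabels(labels, rules))
+ 	// map[topology.hostpath.csi/node:aks-workerpool-template]
+ }
+ ```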
+
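+ The CSIStorageCapacity objects created by the administrator for such a
+ fictional topology segment might be constructed as in the following sketch.
+ It uses the Go types from k8s.io/api; the namespace, storage class name,
+ capacity value, and the beta API version are examples that depend on the
+ cluster, and the same expected object could be reused by the readiness check
+ described in the last bullet.
+
+ ```go
+ package main
+
+ import (
+ 	"fmt"
+
+ 	storagev1beta1 "k8s.io/api/storage/v1beta1"
+ 	"k8s.io/apimachinery/pkg/api/resource"
+ 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+ )
+
+ // expectedCapacityForTemplate builds the CSIStorageCapacity object that a
+ // cluster administrator would create for the fictional topology segment of a
+ // node group. The capacity value is what the administrator determined, for
+ // example by provisioning one node and checking the capacity reported by the
+ // CSI driver there.
+ func expectedCapacityForTemplate() *storagev1beta1.CSIStorageCapacity {
+ 	capacity := resource.MustParse("100Gi") // example value
+ 	return &storagev1beta1.CSIStorageCapacity{
+ 		ObjectMeta: metav1.ObjectMeta{
+ 			// CSIStorageCapacity is namespaced; kube-system is just an example.
+ 			Namespace: "kube-system",
+ 			Name:      "aks-workerpool-template-fast",
+ 		},
+ 		StorageClassName: "csi-hostpath-fast", // example storage class
+ 		NodeTopology: &metav1.LabelSelector{
+ 			MatchLabels: map[string]string{
+ 				"topology.hostpath.csi/node": "aks-workerpool-template",
+ 			},
+ 		},
+ 		Capacity: &capacity,
+ 	}
+ }
+
+ func main() {
+ 	fmt.Printf("%+v\n", expectedCapacityForTemplate())
+ }
+ ```
+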
+ A proof-of-concept of this approach is available in
+ https://github.com/kubernetes/autoscaler/pull/3887 and has been used
+ successfully to scale an Azure cluster up and down with csi-driver-host-path as
+ the CSI driver.
+
+ The approach above preserves the separation between the different
+ components. Simpler solutions may be possible by adding support for specific
+ CSI drivers into custom autoscaler binaries or into operators that control the
+ cluster setup.
### Alternative solutions