@@ -1151,33 +1151,31 @@ such a scenario, the scheduler has to make those decisions based on
outdated information, in particular when making one scheduling
decision affects the next decision.

- We need to investigate:
- - Whether this really is a problem in practice, i.e. identify
-   workloads and drivers where this problem occurs.
- - Whether introducing some simple modeling of capacity helps.
- - Whether prioritization of nodes helps.
-
- For a discussion around modeling storage capacity, see the proposal to
- add ["total capacity" to
- CSI](https://github.com/container-storage-interface/spec/issues/301).
-
- ### "Total available capacity" vs. "maximum volume size"
-
- The CSI spec around `GetCapacityResponse.capacity` [is
- vague](https://github.com/container-storage-interface/spec/issues/432)
- because it ignores fragmentation issues. The current Kubernetes API
- proposal follows the design principle that Kubernetes should deviate
- from the CSI spec as little as possible. It therefore directly copies
- that value and thus has the same issue.
-
- The proposed usage (comparison of volume size against available
- capacity) works either way, but having separate fields for "total
- available capacity" and "maximum volume size" would be more precise
- and enable additional features like even volume spreading by
- prioritizing nodes based on "total available capacity".
-
- The goal is to clarify that first in the CSI spec and then revise the
- Kubernetes API.
+ [Scale testing](https://github.com/kubernetes-csi/csi-driver-host-path/blob/d6d9639077691986d676984827ea4dd7ee0c5cce/docs/storage-capacity-tracking.md)
+ showed that this can occur for a fake workload that generates pods with
+ generic ephemeral inline volumes as quickly as possible: publishing
+ CSIStorageCapacity objects was sometimes too slow, so scheduling retries were
+ needed. However, those retries were harmless and the test completed. The same
+ test failed without storage capacity tracking because pod scheduling
+ eventually got stuck: pure chance was no longer good enough to find nodes
+ that still had free storage capacity. No cases have been reported where this
+ was a problem for real workloads either.
+
+ Modeling remaining storage capacity in the scheduler is an approach that the
+ storage community is not willing to support and considers likely to fail,
+ because storage is often not simply a linear amount of bytes that can be
+ split up arbitrarily. For some records of that discussion, see the proposal
+ to add ["total capacity" to
+ CSI](https://github.com/container-storage-interface/spec/issues/301), the newer
+ ["addition of
+ `maximum_volume_size`"](https://github.com/container-storage-interface/spec/pull/470)
+ and the [2021 Feb 03 CSI community
+ meeting](https://www.youtube.com/watch?v=ZB0Y05jo7-M).
+
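To make the filtering side of this concrete: with published capacity
information, the scheduler-side check boils down to comparing a pending
volume's requested size against a reported "maximum volume size" when the
driver provides one (the more precise signal, since a large total capacity may
be fragmented), and against the "total available capacity" otherwise. The Go
sketch below is illustrative only; the type and function are simplified
stand-ins that merely mirror the `capacity` and `maximumVolumeSize` fields of
a CSIStorageCapacity object, not the actual scheduler code.

```go
package main

import "fmt"

// capacityInfo is a simplified stand-in for the size-related fields of a
// published CSIStorageCapacity object, using plain byte counts instead of
// resource.Quantity values. Illustrative only.
type capacityInfo struct {
	capacity          int64  // "total available capacity" reported via GetCapacity
	maximumVolumeSize *int64 // optional "maximum volume size", nil if not reported
}

// fits reports whether a volume of the requested size can be assumed to fit.
// A reported maximum volume size is the more precise signal, because a large
// total capacity may be fragmented; otherwise fall back to total capacity.
func fits(requestedBytes int64, c capacityInfo) bool {
	if c.maximumVolumeSize != nil {
		return requestedBytes <= *c.maximumVolumeSize
	}
	return requestedBytes <= c.capacity
}

func main() {
	maxVolume := int64(4 << 30) // largest single volume the driver can provision
	node := capacityInfo{capacity: 10 << 30, maximumVolumeSize: &maxVolume}

	fmt.Println(fits(3<<30, node)) // true: within the maximum volume size
	fmt.Println(fits(6<<30, node)) // false: total capacity would suffice, but no
	// single volume of that size can be created
}
```

The second check in `main` shows why a separate maximum volume size matters:
total capacity alone would accept a 6 GiB volume that the driver cannot
actually provision as a single volume.
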
+ Lack of storage capacity modeling will cause the autoscaler to scale up
+ clusters more slowly because it cannot determine in advance that multiple new
+ nodes are needed. Scaling up one node at a time is still an improvement over
+ not scaling up at all.

### Prioritization of nodes

@@ -1193,6 +1191,11 @@ be achieved by prioritizing nodes, ideally with information about both
"maximum volume size" (for filtering) and "total available capacity"
1194
1192
(for prioritization).
1195
1193
+ Prioritizing nodes based on storage capacity was [discussed on
+ Slack](https://kubernetes.slack.com/archives/C09QZFCE5/p1629251024161700). The
+ conclusion was to handle this as a new KEP if there is sufficient demand for
+ it, which so far doesn't seem to be the case.
+
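For illustration only: if such a KEP were written, a capacity-based score
could be as simple as ranking nodes by their remaining free space so that new
volumes spread out instead of filling up one node first. The function below is
a hypothetical sketch (its name and parameters are made up for this example)
that maps free bytes onto the 0-100 score range used by kube-scheduler; it is
not an existing plugin.

```go
package main

import "fmt"

// scoreNodeByCapacity is a hypothetical prioritization function, not an
// existing scheduler plugin: it maps a node's remaining "total available
// capacity" onto the 0-100 score range used by kube-scheduler, so nodes with
// more free storage score higher and new volumes spread out across nodes.
func scoreNodeByCapacity(availableBytes, largestAvailableInCluster int64) int64 {
	if largestAvailableInCluster <= 0 || availableBytes <= 0 {
		return 0
	}
	score := availableBytes * 100 / largestAvailableInCluster
	if score > 100 {
		score = 100
	}
	return score
}

func main() {
	// Three nodes with different amounts of free storage: the emptiest node
	// scores highest and therefore attracts the next volume.
	largest := int64(100 << 30)
	for _, free := range []int64{100 << 30, 50 << 30, 10 << 30} {
		fmt.Println(scoreNodeByCapacity(free, largest))
	}
}
```
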
### Integration with [Cluster Autoscaler](https://github.com/kubernetes/autoscaler)

The autoscaler simulates the effect of adding more nodes to the