Commit 78df78d

storage capacity tracking: update "drawbacks" section

1 parent 2b0db3b

keps/sig-storage/1472-storage-capacity-tracking/README.md

Lines changed: 30 additions & 27 deletions
@@ -1151,33 +1151,31 @@ such a scenario, the scheduler has to make those decisions based on
 outdated information, in particular when making one scheduling
 decision affects the next decision.
 
-We need to investigate:
-- Whether this really is a problem in practice, i.e. identify
-  workloads and drivers where this problem occurs.
-- Whether introducing some simple modeling of capacity helps.
-- Whether prioritization of nodes helps.
-
-For a discussion around modeling storage capacity, see the proposal to
-add ["total capacity" to
-CSI](https://github.com/container-storage-interface/spec/issues/301).
-
-### "Total available capacity" vs. "maximum volume size"
-
-The CSI spec around `GetCapacityResponse.capacity` [is
-vague](https://github.com/container-storage-interface/spec/issues/432)
-because it ignores fragmentation issues. The current Kubernetes API
-proposal follows the design principle that Kubernetes should deviate
-from the CSI spec as little as possible. It therefore directly copies
-that value and thus has the same issue.
-
-The proposed usage (comparison of volume size against available
-capacity) works either way, but having separate fields for "total
-available capacity" and "maximum volume size" would be more precise
-and enable additional features like even volume spreading by
-prioritizing nodes based on "total available capacity"
-
-The goal is to clarify that first in the CSI spec and then revise the
-Kubernetes API.
+[Scale testing](https://github.com/kubernetes-csi/csi-driver-host-path/blob/d6d9639077691986d676984827ea4dd7ee0c5cce/docs/storage-capacity-tracking.md)
+showed that this can occur for a fake workload that generates pods with
+generic ephemeral inline volumes as quickly as possible: publishing
+CSIStorageCapacity objects was sometimes too slow, so scheduling retries were
+needed. However, this did not block the test, which completed successfully. The
+same test failed without storage capacity tracking because pod scheduling
+eventually got stuck: relying on pure chance to find nodes that still had free
+storage capacity was no longer good enough. No cases have been reported where
+this was a problem for real workloads either.
+
+Modeling remaining storage capacity in the scheduler is an approach that the
+storage community is not willing to support and considers likely to fail,
+because storage is often not simply a linear amount of bytes that can be split
+up arbitrarily. For some records of that discussion, see the proposal to add
+["total capacity" to
+CSI](https://github.com/container-storage-interface/spec/issues/301), the newer
+[addition of
+`maximum_volume_size`](https://github.com/container-storage-interface/spec/pull/470),
+and the [2021 Feb 03 CSI community
+meeting](https://www.youtube.com/watch?v=ZB0Y05jo7-M).
+
+Lack of storage capacity modeling will cause the autoscaler to scale up
+clusters more slowly because it cannot determine in advance that multiple new
+nodes are needed. Scaling up one node at a time is still an improvement over
+not scaling up at all.
 
 ### Prioritization of nodes
 
@@ -1193,6 +1191,11 @@ be achieved by prioritizing nodes, ideally with information about both
 "maximum volume size" (for filtering) and "total available capacity"
 (for prioritization).
 
+Prioritizing nodes based on storage capacity was [discussed on
+Slack](https://kubernetes.slack.com/archives/C09QZFCE5/p1629251024161700). The
+conclusion was to handle this as a new KEP if there is sufficient demand for
+it, which so far doesn't seem to be the case.
+
 ### Integration with [Cluster Autoscaler](https://github.com/kubernetes/autoscaler)
 
 The autoscaler simulates the effect of adding more nodes to the

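For context on the scale test mentioned in the diff: generic ephemeral inline volumes embed a PVC template directly in the pod spec, so each generated pod triggers provisioning of its own volume and, with late binding, a capacity-aware scheduling decision. The Go sketch below constructs such a pod and prints it as YAML; it is illustrative only, and the pod name, image, storage class, and volume size are assumptions rather than values from the actual test.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	// Storage class name is an assumption for illustration; any CSI storage
	// class with late binding (WaitForFirstConsumer) exercises capacity tracking.
	scName := "csi-hostpath-sc"

	pod := corev1.Pod{
		TypeMeta:   metav1.TypeMeta{APIVersion: "v1", Kind: "Pod"},
		ObjectMeta: metav1.ObjectMeta{Name: "capacity-test-pod"}, // hypothetical name
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:         "app",
				Image:        "registry.k8s.io/pause:3.9",
				VolumeMounts: []corev1.VolumeMount{{Name: "scratch", MountPath: "/data"}},
			}},
			Volumes: []corev1.Volume{{
				Name: "scratch",
				VolumeSource: corev1.VolumeSource{
					// Generic ephemeral inline volume: a PVC is created for each pod,
					// so every pod forces a fresh provisioning decision by the scheduler.
					Ephemeral: &corev1.EphemeralVolumeSource{
						VolumeClaimTemplate: &corev1.PersistentVolumeClaimTemplate{
							Spec: corev1.PersistentVolumeClaimSpec{
								AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
								StorageClassName: &scName,
								// Field type is VolumeResourceRequirements in recent
								// k8s.io/api releases (ResourceRequirements in older ones).
								Resources: corev1.VolumeResourceRequirements{
									Requests: corev1.ResourceList{
										corev1.ResourceStorage: resource.MustParse("1Gi"),
									},
								},
							},
						},
					},
				},
			}},
		},
	}

	// Print the pod as YAML, i.e. the manifest such a workload generator would create.
	out, err := yaml.Marshal(pod)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out))
}
```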
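The distinction between "total available capacity" and "maximum volume size" that runs through both the removed and the added text comes down to which `CSIStorageCapacity` field a size check may trust. The sketch below shows one plausible form of that check, assuming `MaximumVolumeSize` (when reported) is the authoritative bound and `Capacity` is only a fragmentation-blind fallback; the helper `nodeHasCapacity` is hypothetical, not the scheduler's actual implementation.

```go
package main

import (
	"fmt"

	storagev1 "k8s.io/api/storage/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// nodeHasCapacity is a hypothetical helper, not the scheduler's actual code.
// It prefers "maximum volume size" when the driver reports it and falls back
// to the fragmentation-blind "total capacity" otherwise.
func nodeHasCapacity(c *storagev1.CSIStorageCapacity, request resource.Quantity) bool {
	if c.MaximumVolumeSize != nil {
		// Authoritative: the largest single volume the driver can create.
		return c.MaximumVolumeSize.Cmp(request) >= 0
	}
	if c.Capacity != nil {
		// Vague fallback: total free bytes, which may be fragmented such that
		// no single volume of this size actually fits.
		return c.Capacity.Cmp(request) >= 0
	}
	return false // no information reported: assume the volume does not fit
}

func main() {
	capacity := resource.MustParse("100Gi")
	maxVolumeSize := resource.MustParse("10Gi")
	c := &storagev1.CSIStorageCapacity{
		Capacity:          &capacity,
		MaximumVolumeSize: &maxVolumeSize,
	}
	fmt.Println(nodeHasCapacity(c, resource.MustParse("8Gi")))  // true
	fmt.Println(nodeHasCapacity(c, resource.MustParse("20Gi"))) // false despite 100Gi total
}
```

Separate fields make the second case explicit: 100Gi of free space says nothing about whether a single 20Gi volume fits, which is exactly the ambiguity the `maximum_volume_size` addition to CSI resolves.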