Commit 053b05f
Merge pull request #350 from pohly/storage-capacity-experiment
docs: show benchmark results for storage capacity tracking
2 parents: b6331e2 + f053a7b
1 file changed: docs/storage-capacity-tracking.md (+312, -0)

# Distributed provisioning with and without storage capacity tracking

csi-driver-host-path can be deployed locally on nodes with simulated storage
capacity limits. The experiment below shows how Kubernetes [storage capacity
tracking](https://kubernetes.io/docs/concepts/storage/storage-capacity/) helps
schedule Pods that use volumes with "wait for first consumer" provisioning.
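
"Wait for first consumer" means that a volume is only provisioned once the
scheduler has chosen a node for the pod that uses it. As an illustration, a
storage class with delayed binding might look like the sketch below; the actual
`csi-hostpath-fast` class is created by the csi-driver-host-path deployment
scripts, and the `kind` parameter shown here is an assumption, not necessarily
its real definition.

```
# Hypothetical sketch of a storage class with delayed binding.
kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-hostpath-fast
provisioner: hostpath.csi.k8s.io   # matches --drivername in the plugin args
parameters:
  kind: fast                       # assumed parameter, see the deployment YAML
volumeBindingMode: WaitForFirstConsumer
EOF
```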

## Setup

Clusterloader from k8s.io/perf-tests master (1a46c4c54dd348) is used to
generate the load.

The cluster was created in the Azure cloud, initially with 10 nodes:

```
az aks create -g cloud-native --name pmem --generate-ssh-keys --node-count 10 --kubernetes-version 1.21.1
```

csi-driver-hostpath master (76efcbf8658291e) and external-provisioner canary
(2022-03-06) were used to test with the latest code in preparation for
Kubernetes 1.24.

### Baseline without volumes

```
go run cmd/clusterloader.go -v=3 --report-dir=/tmp/clusterloader2-no-volumes --kubeconfig=/home/pohly/.kube/config --provider=local --nodes=10 --testconfig=testing/experimental/storage/pod-startup/config.yaml --testoverrides=testing/experimental/storage/pod-startup/volume-types/genericephemeralinline/override.yaml --testoverrides=no-volumes.yaml
```

The relevant local configuration is `no-volumes.yaml`:

```
PODS_PER_NODE: 100
NODES_PER_NAMESPACE: 10
VOLUMES_PER_POD: 0
VOL_SIZE: 1Gi
STORAGE_CLASS: csi-hostpath-fast
GATHER_METRICS: false
```

This creates 1 namespace and 1000 pods (10 nodes x 100 pods per node). Without
volumes, nothing ties a pod to a particular node, so in principle all of them
could have run on a single node. This led to a moderate load for the cluster,
and the pods got spread out evenly:

```
$ kubectl top nodes
NAME                                CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
aks-nodepool1-15818640-vmss000000   546m         28%    2244Mi          49%
aks-nodepool1-15818640-vmss000001   1382m        72%    1776Mi          38%
aks-nodepool1-15818640-vmss000002   445m         23%    1816Mi          39%
aks-nodepool1-15818640-vmss000003   861m         45%    1852Mi          40%
aks-nodepool1-15818640-vmss000004   490m         25%    1798Mi          39%
aks-nodepool1-15818640-vmss000005   945m         49%    1896Mi          41%
aks-nodepool1-15818640-vmss000006   1355m        71%    1956Mi          42%
aks-nodepool1-15818640-vmss000007   543m         28%    1788Mi          39%
aks-nodepool1-15818640-vmss000008   426m         22%    1829Mi          40%
aks-nodepool1-15818640-vmss000009   721m         37%    1890Mi          41%
```

Test results were:
```
<?xml version="1.0" encoding="UTF-8"?>
<testsuite name="ClusterLoaderV2" tests="0" failures="0" errors="0" time="446.487">
<testcase name="storage overall (testing/experimental/storage/pod-startup/config.yaml)" classname="ClusterLoaderV2" time="446.483938942"></testcase>
<testcase name="storage: [step: 01] Starting measurement for waiting for deployments" classname="ClusterLoaderV2" time="0.101351119"></testcase>
<testcase name="storage: [step: 02] Creating deployments" classname="ClusterLoaderV2" time="100.609226598"></testcase>
<testcase name="storage: [step: 03] Waiting for deployments to be running" classname="ClusterLoaderV2" time="114.905364201"></testcase>
<testcase name="storage: [step: 04] Deleting deployments" classname="ClusterLoaderV2" time="100.616139236"></testcase>
```

### Without storage capacity tracking

For this, csi-driver-hostpath was deployed with `deploy/kubernetes-distributed/deploy.sh` after patching the code:

```
diff --git a/deploy/kubernetes-distributed/hostpath/csi-hostpath-driverinfo.yaml b/deploy/kubernetes-distributed/hostpath/csi-hostpath-driverinfo.yaml
index 54d455c6..c61efec4 100644
--- a/deploy/kubernetes-distributed/hostpath/csi-hostpath-driverinfo.yaml
+++ b/deploy/kubernetes-distributed/hostpath/csi-hostpath-driverinfo.yaml
@@ -17,5 +17,4 @@ spec:
   podInfoOnMount: true
   # No attacher needed.
   attachRequired: false
-  # alpha: opt into capacity-aware scheduling
-  storageCapacity: true
+  storageCapacity: false
diff --git a/deploy/kubernetes-distributed/hostpath/csi-hostpath-plugin.yaml b/deploy/kubernetes-distributed/hostpath/csi-hostpath-plugin.yaml
index ce9abc40..e212feb6 100644
--- a/deploy/kubernetes-distributed/hostpath/csi-hostpath-plugin.yaml
+++ b/deploy/kubernetes-distributed/hostpath/csi-hostpath-plugin.yaml
@@ -25,12 +25,12 @@ spec:
       serviceAccountName: csi-provisioner
       containers:
         - name: csi-provisioner
-          image: k8s.gcr.io/sig-storage/csi-provisioner:v3.0.0
+          image: gcr.io/k8s-staging-sig-storage/csi-provisioner:canary
           args:
-            - -v=5
+            - -v=3
             - --csi-address=/csi/csi.sock
             - --feature-gates=Topology=true
-            - --enable-capacity
+            - --enable-capacity=false
             - --capacity-ownerref-level=0 # pod is owner
             - --node-deployment=true
             - --strict-topology=true
@@ -88,7 +88,7 @@ spec:
           image: k8s.gcr.io/sig-storage/hostpathplugin:v1.7.3
           args:
             - --drivername=hostpath.csi.k8s.io
-            - --v=5
+            - --v=3
             - --endpoint=$(CSI_ENDPOINT)
             - --nodeid=$(KUBE_NODE_NAME)
             - --capacity=slow=10Gi
             - --capacity=fast=100Gi
```
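
One way to double-check which mode is active is to look at the CSIDriver object
created from the patched `csi-hostpath-driverinfo.yaml`; the field toggled above
is visible there:

```
# Should print "false" for this patched deployment, "true" for the default one.
kubectl get csidriver hostpath.csi.k8s.io -o jsonpath='{.spec.storageCapacity}{"\n"}'
```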

In this case, the local config was:

```
PODS_PER_NODE: 100
NODES_PER_NAMESPACE: 10
VOLUMES_PER_POD: 1
VOL_SIZE: 1Gi
STORAGE_CLASS: csi-hostpath-fast
GATHER_METRICS: false
POD_STARTUP_TIMEOUT: 45m
```

The number of namespaces and pods is the same, but now the pods have to be
distributed across all nodes: with `--capacity=fast=100Gi` and 1Gi volumes,
each node has storage for exactly 100 of the 1000 volumes.

```
<?xml version="1.0" encoding="UTF-8"?>
<testsuite name="ClusterLoaderV2" tests="0" failures="0" errors="0" time="806.468">
<testcase name="storage overall (testing/experimental/storage/pod-startup/config.yaml)" classname="ClusterLoaderV2" time="806.464585136"></testcase>
<testcase name="storage: [step: 01] Starting measurement for waiting for deployments" classname="ClusterLoaderV2" time="0.100971403"></testcase>
<testcase name="storage: [step: 02] Creating deployments" classname="ClusterLoaderV2" time="100.584344658"></testcase>
<testcase name="storage: [step: 03] Waiting for deployments to be running" classname="ClusterLoaderV2" time="414.865956542"></testcase>
<testcase name="storage: [step: 04] Deleting deployments" classname="ClusterLoaderV2" time="100.614270188"></testcase>
```

Despite this even spreading, which happens to be favorable for this particular
scenario, several scheduling retries were needed, with kube-scheduler often
picking nodes as candidates that were already full:

```
$ for i in `kubectl get pods | grep csi-hostpathplugin- | sed -e 's/ .*//'`; do echo "$i: $(kubectl logs $i hostpath | grep '^E.*code = ResourceExhausted desc = requested capacity .*exceeds remaining capacity for "fast"' | wc -l)"; done
csi-hostpathplugin-5c74t: 24
csi-hostpathplugin-8q9kf: 0
csi-hostpathplugin-g4gqp: 15
csi-hostpathplugin-hqxpv: 14
csi-hostpathplugin-jpvj8: 10
csi-hostpathplugin-l4bzm: 17
csi-hostpathplugin-m54cc: 16
csi-hostpathplugin-r26b4: 0
csi-hostpathplugin-rnkjn: 7
csi-hostpathplugin-xmvwf: 26
```

These failed volume creation attempts are handled without deleting the affected
pod. Instead, kube-scheduler tries again with a different node.
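
Those retries can also be observed from the outside, for example via the warning
events that external-provisioner records when provisioning fails (shown here
only as an illustration, this command was not part of the measured test):

```
# "ProvisioningFailed" is the event reason used by external-provisioner
# when CreateVolume fails, e.g. with ResourceExhausted.
kubectl get events --all-namespaces --field-selector reason=ProvisioningFailed
```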

The situation could have been a lot worse. If kube-scheduler had preferred to
pack as many pods as possible onto a single node, it would have always picked
the same node, because that node appeared to have room for the pod, and then the
test wouldn't have completed at all.

### With capacity tracking

This is almost the default deployment, with just a few tweaks to reduce logging
and with the newer external-provisioner. A small fix in the deploy script was
needed, too.

```
diff --git a/deploy/kubernetes-distributed/deploy.sh b/deploy/kubernetes-distributed/deploy.sh
index 985e7f7a..b163aefc 100755
--- a/deploy/kubernetes-distributed/deploy.sh
+++ b/deploy/kubernetes-distributed/deploy.sh
@@ -174,8 +174,7 @@ done
 # changed via CSI_PROVISIONER_TAG, so we cannot just check for the version currently
 # listed in the YAML file.
 case "$CSI_PROVISIONER_TAG" in
-    "") csistoragecapacities_api=v1alpha1;; # unchanged, assume version from YAML
-    *) csistoragecapacities_api=v1beta1;; # set, assume that it is more recent *and* a version that uses v1beta1 (https://github.com/kubernetes-csi/external-provisioner/pull/584)
+    *) csistoragecapacities_api=v1beta1;; # we currently always use that version
 esac
 get_csistoragecapacities=$(kubectl get csistoragecapacities.${csistoragecapacities_api}.storage.k8s.io 2>&1 || true)
 if echo "$get_csistoragecapacities" | grep -q "the server doesn't have a resource type"; then
diff --git a/deploy/kubernetes-distributed/hostpath/csi-hostpath-plugin.yaml b/deploy/kubernetes-distributed/hostpath/csi-hostpath-plugin.yaml
index ce9abc40..88983120 100644
--- a/deploy/kubernetes-distributed/hostpath/csi-hostpath-plugin.yaml
+++ b/deploy/kubernetes-distributed/hostpath/csi-hostpath-plugin.yaml
@@ -25,9 +25,9 @@ spec:
       serviceAccountName: csi-provisioner
       containers:
         - name: csi-provisioner
-          image: k8s.gcr.io/sig-storage/csi-provisioner:v3.0.0
+          image: gcr.io/k8s-staging-sig-storage/csi-provisioner:canary
           args:
-            - -v=5
+            - -v=3
             - --csi-address=/csi/csi.sock
             - --feature-gates=Topology=true
             - --enable-capacity
@@ -88,7 +88,7 @@ spec:
           image: k8s.gcr.io/sig-storage/hostpathplugin:v1.7.3
           args:
             - --drivername=hostpath.csi.k8s.io
-            - --v=5
+            - --v=3
             - --endpoint=$(CSI_ENDPOINT)
             - --nodeid=$(KUBE_NODE_NAME)
             - --capacity=slow=10Gi
```
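
With storage capacity tracking enabled, each external-provisioner instance
publishes CSIStorageCapacity objects for its node, and kube-scheduler consults
them before picking a node. They can be inspected directly; in this deployment
the objects use the `v1beta1` API that the deploy script checks for above:

```
# Lists the published capacity objects; with --node-deployment there should be
# roughly one object per node and storage class ("fast" and "slow").
kubectl get csistoragecapacities.v1beta1.storage.k8s.io --all-namespaces
```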

Starting pods was more than twice as fast as without storage capacity tracking
(193 seconds instead of 414 seconds for the "Waiting for deployments to be
running" step):

```
<?xml version="1.0" encoding="UTF-8"?>
<testsuite name="ClusterLoaderV2" tests="0" failures="0" errors="0" time="544.772">
<testcase name="storage overall (testing/experimental/storage/pod-startup/config.yaml)" classname="ClusterLoaderV2" time="544.769501842"></testcase>
<testcase name="storage: [step: 01] Starting measurement for waiting for deployments" classname="ClusterLoaderV2" time="0.100321716"></testcase>
<testcase name="storage: [step: 02] Creating deployments" classname="ClusterLoaderV2" time="100.602021053"></testcase>
<testcase name="storage: [step: 03] Waiting for deployments to be running" classname="ClusterLoaderV2" time="193.207935027"></testcase>
<testcase name="storage: [step: 04] Deleting deployments" classname="ClusterLoaderV2" time="100.607824368"></testcase>
```

There were still a few failed provisioning attempts (total shown here):

```
for i in `kubectl get pods | grep csi-hostpathplugin- | sed -e 's/ .*//'`; do kubectl logs $i hostpath ; done | grep '^E.*code = ResourceExhausted desc = requested capacity .*exceeds remaining capacity for "fast"' | wc -l
27
```

This is normal: the CSIStorageCapacity objects might not get updated quickly
enough in some cases, so the scheduler can still occasionally pick a node that
has just run out of space. The key point is that this doesn't happen repeatedly
for the same node.
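
If desired, that propagation delay can be observed directly (again just an
illustration, not part of the measured test):

```
# Watch how the published capacity changes while volumes are being created.
kubectl get csistoragecapacities.v1beta1.storage.k8s.io --all-namespaces -w
```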

### 100 nodes

For some reason, 1.22.1 was not accepted anymore when trying to create a
cluster with 100 nodes, so 1.22.6 was used instead:

```
az aks create -g cloud-native --name pmem --generate-ssh-keys --node-count 100 --kubernetes-version 1.22.6
```

When using the same clusterloader invocation as above with `--nodes=100`, the
number of pods gets scaled up to 10000 automatically (still 100 pods per node,
now spread over 10 namespaces of 10 nodes each).

The baseline without volumes turned out to be this:

```
<?xml version="1.0" encoding="UTF-8"?>
<testsuite name="ClusterLoaderV2" tests="0" failures="0" errors="0" time="3208.062">
<testcase name="storage overall (testing/experimental/storage/pod-startup/config.yaml)" classname="ClusterLoaderV2" time="3208.059435154"></testcase>
<testcase name="storage: [step: 01] Starting measurement for waiting for deployments" classname="ClusterLoaderV2" time="0.100575248"></testcase>
<testcase name="storage: [step: 02] Creating deployments" classname="ClusterLoaderV2" time="1005.908420547"></testcase>
<testcase name="storage: [step: 03] Waiting for deployments to be running" classname="ClusterLoaderV2" time="1125.187490211"></testcase>
<testcase name="storage: [step: 04] Deleting deployments" classname="ClusterLoaderV2" time="1005.74259478"></testcase>
```

Without storage capacity tracking, the test failed because pods didn't start
within the 45 minute timeout:

```
E0307 00:38:02.164402 175511 clusterloader.go:231] --------------------------------------------------------------------------------
E0307 00:38:02.164418 175511 clusterloader.go:232] Test Finished
E0307 00:38:02.164426 175511 clusterloader.go:233] Test: testing/experimental/storage/pod-startup/config.yaml
E0307 00:38:02.164436 175511 clusterloader.go:234] Status: Fail
E0307 00:38:02.164444 175511 clusterloader.go:236] Errors: [measurement call WaitForControlledPodsRunning - WaitForRunningDeployments error: 7684 objects timed out: Deployments: test-t6vzr2-3/deployment-337, test-t6vzr2-3/deployment-971,
```

The total number of failed volume allocations was:

```
$ for i in `kubectl get pods | grep csi-hostpathplugin- | sed -e 's/ .*//'`; do kubectl logs $i hostpath ; done | grep '^E.*code = ResourceExhausted desc = requested capacity .*exceeds remaining capacity for "fast"' | wc -l
181508
```

*Pure chance alone is not good enough anymore when the number of nodes is high.*

With storage capacity tracking it initially also failed:
```
I0307 08:36:55.412124 6877 simple_test_executor.go:145] Step "[step: 03] Waiting for deployments to be running" started
W0307 08:45:52.536411 6877 reflector.go:436] *v1.PodStore: namespace(test-pu6g95-10), labelSelector(name=deployment-652): watch of *v1.Pod ended with: very short watch: *
v1.PodStore: namespace(test-pu6g95-10), labelSelector(name=deployment-652): Unexpected watch close - watch lasted less than a second and no items received
...
```

There were other intermittent problems accessing the apiserver. Doing the
[CSIStorageCapacity updates in a separate Kubernetes client with smaller rate
limits](https://github.com/kubernetes-csi/external-provisioner/pull/711) solved
this problem, and the same test then passed all three times that it was run:

```
<?xml version="1.0" encoding="UTF-8"?>
<testsuite name="ClusterLoaderV2" tests="0" failures="0" errors="0" time="3989.135">
<testcase name="storage overall (testing/experimental/storage/pod-startup/config.yaml)" classname="ClusterLoaderV2" time="3989.131860537"></testcase>
<testcase name="storage: [step: 01] Starting measurement for waiting for deployments" classname="ClusterLoaderV2" time="0.100946346"></testcase>
<testcase name="storage: [step: 02] Creating deployments" classname="ClusterLoaderV2" time="1005.808055111"></testcase>
<testcase name="storage: [step: 03] Waiting for deployments to be running" classname="ClusterLoaderV2" time="1775.679433562"></testcase>
<testcase name="storage: [step: 04] Deleting deployments" classname="ClusterLoaderV2" time="1005.768258827"></testcase>
```

In this run there were 573 failed provisioning attempts.

The ratio between the "with volumes" and "no volumes" pod startup phases is 1.58
(1776 s vs. 1125 s for step 03). That is even better than for 10 nodes, where
that ratio was 1.68 (193 s vs. 115 s).
