
Commit 385599f

scheduler_perf + DRA: measure pod scheduling at a steady state
The previous tests were based on scheduling pods until the cluster was full. This is a valid scenario, but not necessarily a realistic one. More realistic is how quickly the scheduler can schedule new pods when some old pods have finished running, in particular in a cluster that is properly utilized (i.e., almost full). To test this, pods must get created, scheduled, and then immediately deleted. This runs for a configurable period of time.

Scenarios with an empty and a full cluster have different scheduling rates. This was previously visible for DRA because the 50th percentile of the scheduling throughput was lower than the average, but one had to guess in which scenario the throughput was lower. Now this can be measured for DRA with the new SteadyStateClusterResourceClaimTemplateStructured test.

The metrics collector must watch pod events to figure out how many pods got scheduled. Polling would miss pods that have already been deleted again.

There seems to be no relevant difference in the collected metrics (SchedulingWithResourceClaimTemplateStructured/2000pods_200nodes, 6 repetitions):

    │   before    │                after                 │
    │ SchedulingThroughput/Average │ SchedulingThroughput/Average vs base │
      157.1 ± 0%     157.1 ± 0%   ~ (p=0.329 n=6)

    │   before    │                after                 │
    │ SchedulingThroughput/Perc50  │ SchedulingThroughput/Perc50 vs base  │
      48.99 ± 8%     47.52 ± 9%   ~ (p=0.937 n=6)

    │   before    │                after                 │
    │ SchedulingThroughput/Perc90  │ SchedulingThroughput/Perc90 vs base  │
      463.9 ± 16%    460.1 ± 13%  ~ (p=0.818 n=6)

    │   before    │                after                 │
    │ SchedulingThroughput/Perc95  │ SchedulingThroughput/Perc95 vs base  │
      463.9 ± 16%    460.1 ± 13%  ~ (p=0.818 n=6)

    │   before    │                after                 │
    │ SchedulingThroughput/Perc99  │ SchedulingThroughput/Perc99 vs base  │
      463.9 ± 16%    460.1 ± 13%  ~ (p=0.818 n=6)
1 parent 51cafb0 commit 385599f
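The event-watching requirement from the commit message can be illustrated with a minimal client-go sketch. This is a hypothetical example, not the actual scheduler_perf collector: an informer update handler observes the transition from "unscheduled" to "scheduled" even when the pod is deleted moments later, which a periodic polling loop would miss.

package example

import (
	"sync/atomic"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// countScheduledPods counts pods that transition from unscheduled to scheduled.
// Because the handler fires on every pod update, pods that get deleted shortly
// after being scheduled are still counted, unlike with polling.
func countScheduledPods(client kubernetes.Interface, stopCh <-chan struct{}) *int64 {
	var scheduled int64
	factory := informers.NewSharedInformerFactory(client, 0)
	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			oldPod, newPod := oldObj.(*v1.Pod), newObj.(*v1.Pod)
			// A pod counts as newly scheduled when its NodeName gets set.
			if oldPod.Spec.NodeName == "" && newPod.Spec.NodeName != "" {
				atomic.AddInt64(&scheduled, 1)
			}
		},
	})
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)
	return &scheduled
}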

File tree

3 files changed (+389, -22 lines)

test/integration/scheduler_perf/config/performance-config.yaml

Lines changed: 101 additions & 1 deletion
@@ -1167,7 +1167,9 @@
       maxClaimsPerNode: 20
 
 # SchedulingWithResourceClaimTemplateStructured uses a ResourceClaimTemplate
-# and dynamically creates ResourceClaim instances for each pod.
+# and dynamically creates ResourceClaim instances for each pod. Node, pod and
+# device counts are chosen so that the cluster gets filled up completely.
+#
 # The driver uses structured parameters.
 - name: SchedulingWithResourceClaimTemplateStructured
   featureGates:
@@ -1234,6 +1236,104 @@
       measurePods: 2500
       maxClaimsPerNode: 10
 
+# SteadyStateResourceClaimTemplateStructured uses a ResourceClaimTemplate
+# and dynamically creates ResourceClaim instances for each pod, but never
+# more than 10 at a time. Then it waits for a pod to get scheduled
+# before deleting it and creating another one.
+#
+# The workload determines whether there are other pods in the cluster.
+#
+# The driver uses structured parameters.
+- name: SteadyStateClusterResourceClaimTemplateStructured
+  featureGates:
+    DynamicResourceAllocation: true
+    # SchedulerQueueingHints: true
+  workloadTemplate:
+  - opcode: createNodes
+    countParam: $nodesWithoutDRA
+  - opcode: createNodes
+    nodeTemplatePath: config/dra/node-with-dra-test-driver.yaml
+    countParam: $nodesWithDRA
+  - opcode: createResourceDriver
+    driverName: test-driver.cdi.k8s.io
+    nodes: scheduler-perf-dra-*
+    maxClaimsPerNodeParam: $maxClaimsPerNode
+    structuredParameters: true
+  - opcode: createAny
+    templatePath: config/dra/deviceclass-structured.yaml
+  - opcode: createAny
+    templatePath: config/dra/resourceclaimtemplate-structured.yaml
+    namespace: init
+  - opcode: createPods
+    namespace: init
+    countParam: $initPods
+    podTemplatePath: config/dra/pod-with-claim-template.yaml
+  - opcode: createAny
+    templatePath: config/dra/resourceclaimtemplate-structured.yaml
+    namespace: test
+  - opcode: createPods
+    namespace: test
+    count: 10
+    steadyState: true
+    durationParam: $duration
+    podTemplatePath: config/dra/pod-with-claim-template.yaml
+    collectMetrics: true
+  workloads:
+  - name: fast
+    labels: [integration-test, fast, short]
+    params:
+      # This testcase runs through all code paths without
+      # taking too long overall.
+      nodesWithDRA: 1
+      nodesWithoutDRA: 1
+      initPods: 0
+      maxClaimsPerNode: 10
+      duration: 2s
+  - name: empty_100nodes
+    params:
+      nodesWithDRA: 100
+      nodesWithoutDRA: 0
+      initPods: 0
+      maxClaimsPerNode: 2
+      duration: 10s
+  - name: empty_200nodes
+    params:
+      nodesWithDRA: 200
+      nodesWithoutDRA: 0
+      initPods: 0
+      maxClaimsPerNode: 2
+      duration: 10s
+  - name: empty_500nodes
+    params:
+      nodesWithDRA: 500
+      nodesWithoutDRA: 0
+      initPods: 0
+      maxClaimsPerNode: 2
+      duration: 10s
+  # In the "full" scenarios, the cluster can accommodate exactly one additional pod.
+  # These are slower because scheduling the initial pods takes time.
+  - name: full_100nodes
+    params:
+      nodesWithDRA: 100
+      nodesWithoutDRA: 0
+      initPods: 199
+      maxClaimsPerNode: 2
+      duration: 10s
+  - name: full_200nodes
+    params:
+      nodesWithDRA: 200
+      nodesWithoutDRA: 0
+      initPods: 399
+      maxClaimsPerNode: 2
+      duration: 10s
+  - name: full_500nodes
+    params:
+      nodesWithDRA: 500
+      nodesWithoutDRA: 0
+      initPods: 999
+      maxClaimsPerNode: 2
+      duration: 10s
+
 # SchedulingWithResourceClaimTemplate uses ResourceClaims
 # with deterministic names that are shared between pods.
 # There is a fixed ratio of 1:5 between claims and pods.
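The "full" workloads size initPods so that exactly one pod slot remains free: each pod consumes one claim and each DRA node can hold maxClaimsPerNode claims. A small sketch of that arithmetic (the helper name is made up, not part of scheduler_perf):

// Hypothetical helper: capacity is nodesWithDRA * maxClaimsPerNode pods,
// so starting with one pod fewer leaves room for exactly one more pod
// at steady state.
func initPodsForFullCluster(nodesWithDRA, maxClaimsPerNode int) int {
	return nodesWithDRA*maxClaimsPerNode - 1 // 100*2-1=199, 200*2-1=399, 500*2-1=999
}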
