Commit eba398a

Merge pull request #3439 from alaypatel07/dra-baseline

add dra-baseline tests with workloads without resourceclaims

2 parents 2255db7 + 93645fd
4 files changed: +278 −0 lines changed
Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@

### Usage

Follow the **Getting Started** guide at `clusterloader2/docs/GETTING_STARTED.md`
to bring up a kind cluster suitable for ClusterLoader² tests.

#### Steady-State CPU Baseline Test

This scenario saturates each worker node to ≈ 90 % of its *effective* CPU
capacity with long-running pods and then measures scheduler performance while
continuously creating short-lived pods that consume the remaining 10 %.

Unlike the original `testing/dra/` test, **no Dynamic Resource Allocation
(ResourceClaims) is used**: each pod simply requests CPU and memory.
This provides a clean baseline for measuring DRA overhead.

---

##### 1 Environment variables

```bash
export CL2_MODE=Indexed                    # Job completion mode (Indexed/NonIndexed)
export CL2_NODES_PER_NAMESPACE=1           # 1 namespace per node
export CL2_PODS_PER_NODE=8                 # target pods per node
export CL2_NODE_AVAILABLE_MILLICORES=8000  # node allocatable CPU
export CL2_SYSTEM_USED_MILLICORES=80       # CPU already used by system pods
export CL2_FILL_PERCENTAGE=90              # % of capacity for long-running pods
export CL2_LOAD_TEST_THROUGHPUT=20         # QPS for the fast fill phase
export CL2_STEADY_STATE_QPS=5              # QPS for steady-state churn
export CL2_LONG_JOB_RUNNING_TIME=1h        # runtime of long-running pods
export CL2_JOB_RUNNING_TIME=30s            # runtime of short-lived pods
export CL2_POD_MEMORY=128Mi                # memory request per pod
```

With the defaults above, each pod will request

```
(8000 m − 80 m) / 8 = 990 m CPU
```

##### 2 Run the test

```bash
# Ensure a Prometheus stack is running so metric-based measurements succeed.

./run-e2e.sh cluster-loader2 \
  --provider=kind \
  --kubeconfig=$HOME/.kube/config \
  --report-dir=/tmp/clusterloader2-results \
  --testconfig=testing/dra-baseline/config.yaml \
  --enable-prometheus-server=true \
  --nodes=1   # adjust to match your cluster size
```

##### What the test does

1. Calculates per-pod CPU from node capacity and `CL2_PODS_PER_NODE`.
2. Fills each node to ~90 % CPU utilisation with long-running Jobs.
3. Waits until all fill pods are running, then gathers startup & scheduler metrics.
4. Resets metrics and runs short-lived Jobs (churn) that consume the remaining capacity.
5. Gathers the same metrics for the churn phase.

Collected measurements include PodStartupLatency and Prometheus-based scheduler
metrics, allowing direct comparison with the DRA test (`testing/dra/config.yaml`).
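The per-pod CPU arithmetic can be checked in plain shell with the default values; the variable names below mirror the `CL2_*` environment variables, and the snippet itself is illustrative rather than part of the test:

```shell
# Integer arithmetic mirroring the test's per-pod CPU computation.
NODE_AVAILABLE_MILLICORES=8000   # CL2_NODE_AVAILABLE_MILLICORES
SYSTEM_USED_MILLICORES=80        # CL2_SYSTEM_USED_MILLICORES
PODS_PER_NODE=8                  # CL2_PODS_PER_NODE

# Same truncating integer division as the template's DivideInt.
POD_CPU_MILLICORES=$(( (NODE_AVAILABLE_MILLICORES - SYSTEM_USED_MILLICORES) / PODS_PER_NODE ))
echo "${POD_CPU_MILLICORES}m"    # prints 990m
```

Because the division is integer division, any remainder is simply left unrequested rather than split across pods.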
Lines changed: 160 additions & 0 deletions
@@ -0,0 +1,160 @@
{{$MODE := DefaultParam .CL2_MODE "Indexed"}}
{{$NODES_PER_NAMESPACE := MinInt .Nodes (DefaultParam .CL2_NODES_PER_NAMESPACE 100)}}
{{$LOAD_TEST_THROUGHPUT := DefaultParam .CL2_LOAD_TEST_THROUGHPUT 10}}
{{$STEADY_STATE_QPS := DefaultParam .CL2_STEADY_STATE_QPS 5}}
{{$NODE_AVAILABLE_MILLICORES := DefaultParam .CL2_NODE_AVAILABLE_MILLICORES 8000}}
{{$SYSTEM_USED_MILLICORES := DefaultParam .CL2_SYSTEM_USED_MILLICORES 80}}
{{$PODS_PER_NODE := DefaultParam .CL2_PODS_PER_NODE 8}}
{{$POD_MEMORY := DefaultParam .CL2_POD_MEMORY "128Mi"}}
{{$CPU_AVAILABLE_PER_NODE := SubtractInt $NODE_AVAILABLE_MILLICORES $SYSTEM_USED_MILLICORES}}
{{$POD_CPU_MILLICORES := DivideInt $CPU_AVAILABLE_PER_NODE $PODS_PER_NODE}}

{{$namespaces := DivideInt .Nodes $NODES_PER_NAMESPACE}}

{{$cpusPerNode := DefaultParam .CL2_CPUS_PER_NODE 8}}
{{$totalCPUs := MultiplyInt $cpusPerNode .Nodes}}

{{$fillPercentage := DefaultParam .CL2_FILL_PERCENTAGE 90}}
{{$fillPodsCount := DivideInt (MultiplyInt $totalCPUs $fillPercentage) 100}}
{{$fillPodsPerNamespace := DivideInt $fillPodsCount $namespaces}}
{{$longJobSize := 1}}
{{$longJobRunningTime := DefaultParam .CL2_LONG_JOB_RUNNING_TIME "1h"}}

{{$smallJobPodsCount := SubtractInt $totalCPUs (MultiplyInt $fillPodsPerNamespace $namespaces)}}
{{$smallJobsPerNamespace := DivideInt $smallJobPodsCount $namespaces}}
{{$smallJobSize := 1}}
{{$smallJobCompletions := 10}}
{{$jobRunningTime := DefaultParam .CL2_JOB_RUNNING_TIME "30s"}}

name: dra-baseline

namespace:
  number: {{$namespaces}}

tuningSets:
- name: FastFill
  qpsLoad:
    qps: {{$LOAD_TEST_THROUGHPUT}}
- name: SteadyState
  qpsLoad:
    qps: {{$STEADY_STATE_QPS}}

steps:
- name: Start measurements
  measurements:
  - Identifier: WaitForFinishedJobs
    Method: WaitForFinishedJobs
    Params:
      action: start
      labelSelector: job-type = short-lived
  - Identifier: WaitForControlledPodsRunning
    Method: WaitForControlledPodsRunning
    Params:
      action: start
      apiVersion: batch/v1
      kind: Job
      labelSelector: job-type = long-running
      operationTimeout: 120s
  - Identifier: FastFillPodStartupLatency
    Method: PodStartupLatency
    Params:
      action: start
      labelSelector: job-type = long-running
  - Identifier: FastFillSchedulingMetrics
    Method: PrometheusSchedulingMetrics
    Params:
      action: start

- name: Fill cluster to {{$fillPercentage}}% utilisation
  phases:
  - namespaceRange:
      min: 1
      max: {{$namespaces}}
    replicasPerNamespace: {{$fillPodsPerNamespace}}
    tuningSet: FastFill
    objectBundle:
    - basename: long-running
      objectTemplatePath: "long-running-job.yaml"
      templateFillMap:
        Replicas: {{$longJobSize}}
        Mode: {{$MODE}}
        Sleep: {{$longJobRunningTime}}
        CPUMilli: {{$POD_CPU_MILLICORES}}
        Memory: {{$POD_MEMORY}}

- name: Wait for fill pods to be running
  measurements:
  - Identifier: WaitForControlledPodsRunning
    Method: WaitForControlledPodsRunning
    Params:
      action: gather
      labelSelector: job-type = long-running
      timeout: 15m

- name: Gather measurements for long-running pods
  measurements:
  - Identifier: FastFillSchedulingMetrics
    Method: PrometheusSchedulingMetrics
    Params:
      action: gather
  - Identifier: FastFillPodStartupLatency
    Method: PodStartupLatency
    Params:
      action: gather

- name: Reset metrics for steady-state churn
  measurements:
  - Identifier: ChurnSchedulingMetrics
    Method: PrometheusSchedulingMetrics
    Params:
      action: start
  - Identifier: ChurnPodStartupLatency
    Method: PodStartupLatency
    Params:
      action: start
      labelSelector: job-type = short-lived
      perc50Threshold: 40s
      perc90Threshold: 60s
      perc99Threshold: 80s

- name: Create steady-state {{$MODE}} jobs
  phases:
  - namespaceRange:
      min: 1
      max: {{$namespaces}}
    replicasPerNamespace: {{$smallJobsPerNamespace}}
    tuningSet: SteadyState
    objectBundle:
    - basename: small
      objectTemplatePath: "job.yaml"
      templateFillMap:
        Replicas: {{$smallJobSize}}
        CompletionReplicas: {{$smallJobCompletions}}
        Mode: {{$MODE}}
        Sleep: {{$jobRunningTime}}
        CPUMilli: {{$POD_CPU_MILLICORES}}
        Memory: {{$POD_MEMORY}}

- name: Wait for short-lived jobs to finish
  measurements:
  - Identifier: WaitForFinishedJobs
    Method: WaitForFinishedJobs
    Params:
      action: gather
      labelSelector: job-type = short-lived
      timeout: 15m

- name: Measure scheduler metrics
  measurements:
  - Identifier: ChurnSchedulingMetrics
    Method: PrometheusSchedulingMetrics
    Params:
      action: gather
  - Identifier: ChurnPodStartupLatency
    Method: PodStartupLatency
    Params:
      action: gather
      perc50Threshold: 40s
      perc90Threshold: 60s
      perc99Threshold: 80s
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,27 @@
1+
apiVersion: batch/v1
2+
kind: Job
3+
metadata:
4+
name: {{.Name}}
5+
labels:
6+
group: baseline-job
7+
job-type: short-lived
8+
spec:
9+
parallelism: {{.Replicas}}
10+
completions: {{.CompletionReplicas}}
11+
completionMode: {{.Mode}}
12+
ttlSecondsAfterFinished: 300
13+
template:
14+
metadata:
15+
labels:
16+
group: baseline-pod
17+
job-type: short-lived
18+
spec:
19+
restartPolicy: Never
20+
containers:
21+
- name: {{.Name}}
22+
image: gcr.io/k8s-staging-perf-tests/sleep:v0.0.3
23+
args: ["{{.Sleep}}"]
24+
resources:
25+
requests:
26+
cpu: "{{.CPUMilli}}m"
27+
memory: "{{.Memory}}"
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
apiVersion: batch/v1
kind: Job
metadata:
  name: {{.Name}}
  labels:
    group: baseline-job
    job-type: long-running
spec:
  parallelism: {{.Replicas}}
  completions: {{.Replicas}}
  completionMode: {{.Mode}}
  activeDeadlineSeconds: 86400  # 24 h
  template:
    metadata:
      labels:
        group: baseline-pod
        job-type: long-running
    spec:
      restartPolicy: Never
      containers:
      - name: {{.Name}}
        image: gcr.io/k8s-staging-perf-tests/sleep:v0.0.3
        args: ["{{.Sleep}}"]
        resources:
          requests:
            cpu: "{{.CPUMilli}}m"
            memory: "{{.Memory}}"
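With the default parameters, the config's `templateFillMap` renders this template into a manifest roughly like the following. The object name is generated by ClusterLoader2 from the `basename`, so `long-running-0` here is illustrative:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: long-running-0          # generated from basename; illustrative
  labels:
    group: baseline-job
    job-type: long-running
spec:
  parallelism: 1                # Replicas ($longJobSize)
  completions: 1
  completionMode: Indexed       # CL2_MODE default
  activeDeadlineSeconds: 86400  # 24 h
  template:
    metadata:
      labels:
        group: baseline-pod
        job-type: long-running
    spec:
      restartPolicy: Never
      containers:
      - name: long-running-0
        image: gcr.io/k8s-staging-perf-tests/sleep:v0.0.3
        args: ["1h"]            # CL2_LONG_JOB_RUNNING_TIME
        resources:
          requests:
            cpu: "990m"         # (8000 − 80) / 8
            memory: "128Mi"     # CL2_POD_MEMORY
```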
