Commit 7cef5a6

add support for DRAExtendedResources
Signed-off-by: Alay Patel <[email protected]>
1 parent 48123e4 commit 7cef5a6

6 files changed (+491, -1 lines)

clusterloader2/pkg/dependency/dra/dra.go

Lines changed: 4 additions & 0 deletions
@@ -70,6 +70,10 @@ func (d *draDependency) Setup(config *dependency.Config) error {
 		"Namespace": namespace,
 		"WorkerNodeCount": getWorkerCount(config),
 	}
+
+	if extendedResourceName, ok := config.Params["ExtendedResourceName"]; ok {
+		mapping["ExtendedResourceName"] = extendedResourceName
+	}
 	if err := config.ClusterFramework.ApplyTemplatedManifests(
 		manifestsFS,
 		"manifests/*.yaml",
Lines changed: 4 additions & 1 deletion
@@ -1,10 +1,13 @@
 ---
 # Source: dra-example-driver/templates/deviceclass.yaml
-apiVersion: resource.k8s.io/v1beta1
+apiVersion: resource.k8s.io/v1
 kind: DeviceClass
 metadata:
   name: gpu.example.com
 spec:
   selectors:
   - cel:
       expression: "device.driver == 'gpu.example.com'"
+  {{- if .ExtendedResourceName}}
+  extendedResourceName: "{{.ExtendedResourceName}}"
+  {{- end}}
Lines changed: 183 additions & 0 deletions
@@ -0,0 +1,183 @@
# DRA Extended Resources Scale Test

## Overview

This test validates the performance and scalability of the Kubernetes DRA Extended Resources feature (KEP-5004). It measures how well the scheduler handles extended resource requests that are backed by Dynamic Resource Allocation (DRA), which lets applications keep the familiar extended resource syntax while benefiting from DRA's dynamic allocation capabilities.

## What This Test Does

This test scenario mirrors the structure of the regular DRA test (`testing/dra/config.yaml`) but uses **extended resources syntax** instead of explicit ResourceClaims:

1. **Setup Phase**: Creates a DeviceClass with an `extendedResourceName` field that maps DRA devices to traditional extended resources
2. **Fill Phase**: Fills the cluster to 90% utilization with long-running pods that request extended resources (e.g., `example.com/gpu: 1`)
3. **Measurement Phase**: Measures performance while continuously scheduling short-lived pods at a steady rate using the same extended resource requests
4. **Metrics Collection**: Collects detailed metrics on:
   - Pod startup latency
   - Job lifecycle latency
   - Scheduler performance metrics
   - DRA-specific metrics (PrepareResources/UnprepareResources latencies)
   - Extended resource allocation metrics

## Key Differences from Regular DRA Test

- **No ResourceClaimTemplates**: Uses a DeviceClass with `extendedResourceName` instead
- **Extended Resource Syntax**: Pods request `example.com/gpu: 1` in `resources.limits` instead of using `resourceClaims`
- **Transparent DRA**: The scheduler automatically creates ResourceClaims behind the scenes
- **Backward Compatibility**: Tests that existing extended resource workloads work with DRA
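
To make the syntax difference concrete, here is a small sketch that builds both pod shapes and prints them as JSON. The struct types are trimmed-down, hypothetical stand-ins for the Kubernetes API types (only the fields relevant to the comparison are kept), and the claim and template names are illustrative.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// The types below are hypothetical, minimal stand-ins for the Kubernetes
// Pod API; they are not the real k8s.io/api types.
type claimRef struct {
	Name string `json:"name"`
}

type resources struct {
	Limits map[string]string `json:"limits,omitempty"`
	Claims []claimRef        `json:"claims,omitempty"`
}

type container struct {
	Name      string    `json:"name"`
	Resources resources `json:"resources"`
}

type podClaim struct {
	Name                      string `json:"name"`
	ResourceClaimTemplateName string `json:"resourceClaimTemplateName"`
}

type podSpec struct {
	Containers     []container `json:"containers"`
	ResourceClaims []podClaim  `json:"resourceClaims,omitempty"`
}

// extendedResourcePod requests one device through resources.limits only.
func extendedResourcePod(resourceName string) podSpec {
	return podSpec{Containers: []container{{
		Name:      "worker",
		Resources: resources{Limits: map[string]string{resourceName: "1"}},
	}}}
}

// explicitClaimPod requests one device through a pod-level resourceClaims
// entry that the container references by name.
func explicitClaimPod(templateName string) podSpec {
	return podSpec{
		Containers: []container{{
			Name:      "worker",
			Resources: resources{Claims: []claimRef{{Name: "gpu"}}},
		}},
		ResourceClaims: []podClaim{{Name: "gpu", ResourceClaimTemplateName: templateName}},
	}
}

func main() {
	for _, spec := range []podSpec{
		extendedResourcePod("example.com/gpu"),
		explicitClaimPod("single-gpu"), // illustrative template name
	} {
		out, err := json.MarshalIndent(spec, "", "  ")
		if err != nil {
			panic(err)
		}
		fmt.Println(string(out))
	}
}
```

The first shape is what this test submits; the second is what the regular DRA test uses.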

## Prerequisites

1. **Feature Gate**: Enable the `DRAExtendedResource` feature gate (`DRAExtendedResource=true`) on:
   - kube-apiserver
   - kube-scheduler
   - kubelet

2. **DRA Driver**: A DRA driver must be running (installed automatically by the test)

3. **Prometheus**: Required for metric-based measurements

## Usage

### Environment Variables

```bash
export CL2_MODE=Indexed                             # Job completion mode
export CL2_NODES_PER_NAMESPACE=1                    # Nodes per namespace
export CL2_LOAD_TEST_THROUGHPUT=20                  # Fast initial fill rate
export CL2_STEADY_STATE_QPS=5                       # Controlled rate for measurement
export CL2_JOB_RUNNING_TIME=30s                     # Short-lived pods runtime
export CL2_LONG_JOB_RUNNING_TIME=1h                 # Long-running pods runtime
export CL2_GPUS_PER_NODE=8                          # Extended resources per node
export CL2_FILL_PERCENTAGE=90                       # Cluster fill percentage
export CL2_EXTENDED_RESOURCE_NAME="example.com/gpu" # Extended resource name
```

### Run the Test

```bash
# Make sure a Prometheus stack is deployed
./run-e2e.sh cluster-loader2 \
  --provider=kind \
  --kubeconfig=/root/.kube/config \
  --report-dir=/tmp/clusterloader2-results \
  --testconfig=testing/dra-extended-resources/config.yaml \
  --enable-prometheus-server=true \
  --nodes=5
```

## Test Flow

### 1. DeviceClass Creation
Creates a DeviceClass that maps DRA devices to extended resources:
```yaml
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: gpu-extended-resource
spec:
  selectors:
  - cel:
      expression: device.driver == 'gpu.example.com' && device.attributes['gpu.example.com'].type == 'gpu'
  extendedResourceName: example.com/gpu
```

### 2. Cluster Fill (90% utilization)
- Creates long-running Jobs with pods requesting `example.com/gpu: 1`
- Each pod gets a single extended resource unit
- Scheduler automatically creates ResourceClaims behind the scenes
- Fills cluster to specified percentage (default 90%)

### 3. Steady State Churn
- Creates short-lived Jobs at a controlled rate
- Uses remaining 10% of cluster capacity
- Measures scheduler performance under steady load
- Tests both pod creation and cleanup performance
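
As a worked example of the fill/churn split, the sketch below computes how many devices the fill phase occupies and how many remain for churn, using the values from the Usage section (5 nodes, 8 devices per node, 90% fill). The function `splitCapacity` is hypothetical, and rounding the fill count down is an assumption; the actual test config may round differently.

```go
package main

import "fmt"

// splitCapacity is a hypothetical helper. It returns the number of devices
// taken by long-running fill pods and the number left for short-lived churn
// pods, assuming one device per pod and integer (round-down) division.
func splitCapacity(nodes, devicesPerNode, fillPercent int) (fill, churn int) {
	total := nodes * devicesPerNode
	fill = total * fillPercent / 100
	return fill, total - fill
}

func main() {
	// 5 nodes * 8 devices = 40 devices in total.
	fill, churn := splitCapacity(5, 8, 90)
	fmt.Printf("fill pods: %d, churn slots: %d\n", fill, churn)
	// prints "fill pods: 36, churn slots: 4"
}
```

With only 4 free devices in this setup, the steady-state rate and the short job runtime together determine whether churn pods queue behind one another.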

### 4. Metrics Collection
Collects comprehensive metrics including:
- **Standard Metrics**: Pod startup latency, scheduling throughput
- **DRA Metrics**: PrepareResources/UnprepareResources latencies
- **Extended Resource Metrics**: Claim creation and allocation rates
- **Comparison Data**: Allows comparison with regular DRA and baseline tests

## Key Metrics

### Pod Startup Latency
- **FastFillPodStartupLatency**: Startup time for initial fill pods
- **ChurnPodStartupLatency**: Startup time for steady-state pods
- Thresholds: p50 < 40s, p90 < 60s, p99 < 80s

### DRA Operation Latencies
- **p99_dra_prepare_resources**: 99th percentile PrepareResources latency
- **p99_dra_unprepare_operations**: 99th percentile UnprepareResources latency
- **p99_dra_grpc_node_prepare_resources**: gRPC call latencies
- **p99_dra_grpc_node_unprepare_resources**: gRPC cleanup latencies

### Extended Resource Metrics
- **extended_resource_claims_created**: Number of auto-created ResourceClaims
- **extended_resource_allocation_attempts**: Allocation attempt rate

## Comparison with Other Tests

| Test | Resource Type | Syntax | Purpose |
|------|---------------|--------|---------|
| `dra/` | ResourceClaims | `resourceClaims` section | Test explicit DRA usage |
| `dra-baseline/` | CPU/Memory | `resources.requests` | Baseline without DRA |
| `dra-extended-resources/` | Extended Resources | `resources.limits` | Test DRA extended resources |

## Expected Behavior

1. **Transparent Operation**: Applications work without modification
2. **Automatic Claim Creation**: Scheduler creates ResourceClaims automatically
3. **DRA Driver Integration**: Same DRA driver calls as explicit ResourceClaims
4. **Performance**: Similar performance to explicit DRA with additional scheduler overhead for claim creation

## Troubleshooting

### Common Issues

1. **Feature Gate Not Enabled**
   - Error: Extended resource requests do not create ResourceClaims
   - Solution: Enable `DRAExtendedResource=true` on all components

2. **DeviceClass Missing**
   - Error: Pods stuck in Pending state
   - Solution: Verify that the DeviceClass exists with the correct `extendedResourceName`

3. **Resource Conflicts**
   - Error: Both a device plugin and DRA provide the same extended resource
   - Solution: Use different extended resource names or migrate fully to DRA

4. **Driver Issues**
   - Error: PrepareResources failures
   - Solution: Check DRA driver logs and CDI device creation

### Debug Information

- **Pod Status**: Check for `ExtendedResourceClaimStatus` showing claim mappings
- **ResourceClaim Status**: Verify allocation results in auto-created claims
- **Scheduler Logs**: Enable verbosity level 5 for extended resource processing
- **Kubelet Logs**: Enable verbosity level 3 for DRA manager operations

## Performance Expectations

### Compared to Regular DRA
- **Similar**: DRA driver operation latencies
- **Additional**: Scheduler overhead for automatic claim creation
- **Benefit**: Application compatibility without code changes

### Compared to Baseline
- **Additional**: DRA allocation and preparation overhead
- **Additional**: ResourceClaim lifecycle management
- **Benefit**: Dynamic device allocation and advanced scheduling

## Use Cases

1. **Migration Testing**: Validate migration from device plugins to DRA
2. **Performance Validation**: Ensure extended resources don't add excessive overhead
3. **Scale Testing**: Test scheduler performance with mixed resource types
4. **Compatibility Testing**: Verify existing applications work with DRA backend

---

*This test validates the DRA Extended Resources feature introduced in Kubernetes 1.34 (KEP-5004) and measures its performance characteristics at scale.*
