Commit c37374e

updated dependencies, troubleshooting, and implementation histories.
1 parent 6446577 commit c37374e

1 file changed

+13
-6
lines changed
  • keps/sig-scheduling/5004-dra-extended-resource

keps/sig-scheduling/5004-dra-extended-resource/README.md

Lines changed: 13 additions & 6 deletions
@@ -530,7 +530,7 @@ spec:
530530
Provided that the device class gpu.example.com is mapped to the extended
531531
resource example.com/gpu.
532532
```yaml
533-
apiVersion: resource.k8s.io/v1beta1
533+
apiVersion: resource.k8s.io/v1
534534
kind: DeviceClass
535535
metadata:
536536
name: gpu.example.com
@@ -1099,7 +1099,11 @@ and creating new ones, as well as about cluster-level services (e.g. DNS):
10991099
- Impact of its outage on the feature:
11001100
- Impact of its degraded performance or high-error rates on the feature:
11011101
-->
1102-
No.
1102+
The container runtime must support CDI.
1103+
1104+
A third-party DRA driver is required for publishing resource information and preparing resources on a node.
1105+
1106+
These are not new requirements of this feature; rather, they are required by DRA structured parameters.
11031107

11041108
### Scalability
11051109

@@ -1146,10 +1150,14 @@ The Troubleshooting section currently serves the `Playbook` role. We may conside
11461150
splitting it into a dedicated `Playbook` document (potentially with some monitoring
11471151
details). For now, we leave it here.
11481152
-->
1153+
The troubleshooting section in https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4381-dra-structured-parameters#troubleshooting
1154+
still applies.
11491155

11501156
###### How does this feature react if the API server and/or etcd is unavailable?
11511157

1152-
Will be considered for beta.
1158+
The Kubernetes control plane will be down, so no new Pods get scheduled. kubelet may
1159+
still be able to start or restart containers if it already received all the relevant
1160+
updates (Pod, ResourceClaim, etc.).
11531161

11541162
###### What are other known failure modes?
11551163

@@ -1169,15 +1177,14 @@ For each of them, fill in the following information by copying the below templat
11691177
- Detection: inspect pod status 'Pending'
11701178
- Mitigations: reduce the number of devices requested in one extended resource backed by DRA requests
11711179
- Diagnostics: scheduler logs at level 5 show the reason for the scheduling failure.
1172-
- Testing: Will be considered for beta.
1180+
- Testing: this is a known, deterministic failure mode due to a defined system limit, i.e., DRA requests must not exceed 128 devices.
11731181

11741182
###### What steps should be taken if SLOs are not being met to determine the problem?
11751183

1176-
Will be considered for beta.
1177-
11781184
## Implementation History
11791185

11801186
- Kubernetes 1.34: KEP accepted.
1187+
- Kubernetes 1.35: promotion to beta.
11811188

11821189
## Drawbacks
11831190
