Items marked with (R) are required *prior to targeting to a milestone / release*.
- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [x] (R) KEP approvers have approved the KEP status as `implementable`
- [x] (R) Design details are appropriately documented
- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- [x] e2e Tests for all Beta API Operations (endpoints)
- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
- [ ] (R) Graduation criteria is in place
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [x] (R) Production readiness review completed
- [x] (R) Production readiness review approved
- [x] "Implementation History" section is up-to-date for milestone
- [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

the two quota mechanisms above should keep track of the usages of the same
class of devices the same way.

But currently, the extended resource quota keeps track of the devices provided
from the device plugin, and from DRA resource slices requested via the pod's
extended resource requests. The resource claim quota currently keeps track of
the devices provided from DRA resource slices requested from resource claims.
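
As an illustration, a minimal sketch of an explicit extended resource quota of the
kind discussed here; the object name, namespace, and limit are assumptions for the
example, not values taken from the KEP:

```yaml
# Hypothetical ResourceQuota limiting example.com/gpu extended resource requests.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota        # illustrative name
  namespace: team-a      # illustrative namespace
spec:
  hard:
    # Counts example.com/gpu devices requested through pod extended resource
    # requests, whether they end up satisfied by a device plugin or by a DRA
    # resource slice.
    requests.example.com/gpu: "10"
```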

The extended resource quota usage needs to be adjusted to account for the device
requests from resource claims. On the other hand, the resource claim quota has
already accounted for the device requests from the pod's extended resources,
because the scheduler creates a special resource claim for the extended resource
requests.

For example, before the adjustment, the quota is as below. The explicit extended
resource quota `requests.example.com/gpu` counts 1 device (e.g. gpu-0) from the
device plugin, and 1 device (e.g. gpu-1) from a DRA resource slice. The implicit

- SLI Usage: Monitor workqueue depth and duration to detect resource claim processing bottlenecks. High depth or duration values indicate potential performance issues in resource claim handling that could affect pod scheduling times.
- SLI Usage: Track increases in 'unschedulable' queues to identify when extended resource availability is preventing pod scheduling. Sustained high values may indicate resource constraint issues or misconfigurations.
- SLI Usage: Monitor latencies for the NodeResourcesFit and DynamicResources plugins to ensure the extended resource integration doesn't introduce performance regressions.
  - We need to monitor NodeResourcesFit because this feature implicitly affects its filtering phase.
- SLI Usage: Monitor the ratio of allocated vs. total resource claims filtered by `source="extended-resource"` to track resource utilization. A low ratio of allocated claims may indicate DRA driver or resource claim controller issues.
  - The `source` label is newly added. It can be determined based on the `resource.kubernetes.io/extended-resource-claim` annotation of resource claims (see the sketch after this list).
- SLI Usage: Calculate the success rate to monitor the reliability of automatic resource claim creation. High failure rates indicate potential issues with extended resource configuration.
  - Because the resource claim is created in the scheduler PreBind phase by making a Kubernetes API call, we need a different metric from `resourceclaim_controller_creates_total`.
  - The metric is incremented based on the outcome of the API call, either success or failure.
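
A rough sketch of a resource claim carrying that annotation; only the annotation key
comes from the text above, while the API version, spec layout, annotation value, and
all names are assumptions for illustration:

```yaml
# Hypothetical ResourceClaim as the scheduler might create it for a pod's
# extended resource requests. Field values are illustrative only.
apiVersion: resource.k8s.io/v1beta1     # exact version depends on the cluster's DRA API level
kind: ResourceClaim
metadata:
  name: my-pod-extended-resources       # placeholder; the real name is generated
  namespace: team-a
  annotations:
    # The annotation key referenced above; the value shown is an assumption.
    resource.kubernetes.io/extended-resource-claim: "true"
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.example.com  # placeholder device class
      allocationMode: ExactCount
      count: 2
```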

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

<!--
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
implementation difficulties, etc.).
-->

No.

### Dependencies

The container runtime must support CDI.

A third-party DRA driver is required for publishing resource information and preparing resources on a node.

These are not new requirements from this feature; rather, they are required by DRA structured parameters.

### Scalability

### Troubleshooting

<!--
The Troubleshooting section currently serves the `Playbook` role. We may consider
splitting it into a dedicated `Playbook` document (potentially with some monitoring
details). For now, we leave it here.
-->

The troubleshooting section in https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4381-dra-structured-parameters#troubleshooting still applies.

###### How does this feature react if the API server and/or etcd is unavailable?

The Kubernetes control plane will be down, so no new Pods get scheduled. The kubelet
may still be able to start or restart containers if it has already received all the
relevant updates (Pod, ResourceClaim, etc.).

###### What are other known failure modes?

- Detection: inspect pod status 'Pending'
- Mitigations: reduce the number of devices requested in a single extended resource backed by DRA
- Diagnostics: scheduler logs at level 5 show the reason for the scheduling failure.
- Testing: this is a known, deterministic failure mode caused by a defined system limit, i.e., a DRA request must contain no more than 128 devices (see the sketch below).
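
To make the limit concrete, a hedged sketch of a pod spec that could hit this failure
mode, assuming `example.com/gpu` is an extended resource backed by DRA; the pod name,
image, and requested count are placeholders, not values from the KEP:

```yaml
# Hypothetical pod whose single extended resource request maps to more devices
# than the 128-device limit for one DRA request, so it would stay Pending.
apiVersion: v1
kind: Pod
metadata:
  name: too-many-gpus
spec:
  containers:
  - name: worker
    image: registry.example.com/worker:latest   # placeholder image
    resources:
      # For extended resources, requests and limits must match; specifying only
      # limits is enough because requests default to the limit value.
      limits:
        example.com/gpu: "200"   # exceeds the 128-device limit discussed above
```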
###### What steps should be taken if SLOs are not being met to determine the problem?