Commit 6ba73a6

Merge pull request #5537 from yliaog/roll

KEP-5004: DRA Extended Resource: graduate to beta

2 parents a99c799 + fa157d2

3 files changed: +178 −52 lines

keps/prod-readiness/sig-scheduling/5004.yaml

Lines changed: 2 additions & 0 deletions

@@ -4,3 +4,5 @@
 kep-number: 5004
 alpha:
   approver: "@johnbelamaric"
+beta:
+  approver: "@johnbelamaric"

keps/sig-scheduling/5004-dra-extended-resource/README.md

Lines changed: 166 additions & 49 deletions
@@ -51,19 +51,19 @@
 
 Items marked with (R) are required *prior to targeting to a milestone / release*.
 
-- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
-- [ ] (R) KEP approvers have approved the KEP status as `implementable`
-- [ ] (R) Design details are appropriately documented
-- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
-- [ ] e2e Tests for all Beta API Operations (endpoints)
+- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
+- [x] (R) KEP approvers have approved the KEP status as `implementable`
+- [x] (R) Design details are appropriately documented
+- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
+- [x] e2e Tests for all Beta API Operations (endpoints)
 - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
 - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
 - [ ] (R) Graduation criteria is in place
 - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
-- [ ] (R) Production readiness review completed
-- [ ] (R) Production readiness review approved
-- [ ] "Implementation History" section is up-to-date for milestone
-- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [x] (R) Production readiness review completed
+- [x] (R) Production readiness review approved
+- [x] "Implementation History" section is up-to-date for milestone
+- [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
 - [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
 
 <!--
@@ -507,16 +507,24 @@ the two quota mechanisms above should keep track of the usages of the same
 class of devices the same way.
 
 But currently, the extended resource quota keeps track of the devices provided
-from device plugin, and DRA resource slice. The resource claim quota currently
-only keeps track of the devices provided from DRA resource slice. This must be
-enhanced to have it keep track of the devices from device plugin too.
-
-As a device can be requested by resource claim, or by extended resource, the
-cluster admin MUST create two quotas with the same limit on one class of devices
-to effectively quota the usage of that device class.
-
-For example, a cluster admin plans to allow 10 example.com/gpu devices in a
-given namespace, they MUST create the following:
+from the device plugin, and from DRA resource slices via pods' extended resource
+requests. The resource claim quota currently keeps track of the devices provided
+from DRA resource slices via resource claims.
+
+The extended resource quota usage needs to be adjusted to account for the device
+requests from resource claims. On the other hand, the resource claim quota has
+already accounted for the device requests from pods' extended resources, as the
+scheduler creates a special resource claim for the extended resource requests.
+
+For example, before the adjustment, the quota is as below. The explicit extended
+resource quota `requests.example.com/gpu` counts 1 device (e.g. gpu-0) from the
+device plugin, and 1 device (e.g. gpu-1) from a DRA resource slice. The implicit
+extended resource quota `request.deviceclass.resource.kubernetes.io/mygpuclass`
+counts 1 device (e.g. gpu-2) from a DRA resource slice. The resource claim quota
+`gpu.example.com.deviceclass.resource.k8s.io/devices` counts 1 device (e.g. gpu-3)
+from a pod resource claim, and 1 device (e.g. gpu-4) from a resource claim template;
+in addition it also counts gpu-1 and gpu-2, as the scheduler generates extended
+resource claims for them.
 
 ```yaml
 apiVersion: v1
@@ -526,25 +534,49 @@ metadata:
 spec:
   hard:
     requests.example.com/gpu: 10
+    request.deviceclass.resource.kubernetes.io/mygpuclass: 10
     gpu.example.com.deviceclass.resource.k8s.io/devices: 10
+status:
+  used:
+    requests.example.com/gpu: 2
+    request.deviceclass.resource.kubernetes.io/mygpuclass: 1
+    gpu.example.com.deviceclass.resource.k8s.io/devices: 4
 ```
 
-Provided that the device class gpu.example.com is mapped to the extended
+Provided that the device class mygpuclass is mapped to the extended
 resource example.com/gpu.
 ```yaml
-apiVersion: resource.k8s.io/v1beta1
+apiVersion: resource.k8s.io/v1
 kind: DeviceClass
 metadata:
-  name: gpu.example.com
+  name: mygpuclass
 spec:
   extendedResourceName: example.com/gpu
 ```
 
-Resource Quota controller reconciles away the differences if any between the
-usage of the two quota, and ensures their usage are always kept the same. For
-that, the controller needs to have the permission to list the device classes
-in the cluster to establish the mapping between device class and extended
-resource.
+For the same example, the explicit extended resource quota `requests.example.com/gpu`
+needs to be adjusted to also count the devices requested via the implicit extended
+resource (e.g. gpu-2) and via resource claims (e.g. gpu-3 and gpu-4). The implicit
+extended resource quota `request.deviceclass.resource.kubernetes.io/mygpuclass` needs
+to be adjusted to also count the devices requested via resource claims (e.g. gpu-3
+and gpu-4) and the DRA devices requested via explicit extended resources (e.g. gpu-1),
+but not the device plugin devices (e.g. gpu-0). The adjusted quota is as below.
+
+```yaml
+apiVersion: v1
+kind: ResourceQuota
+metadata:
+  name: gpu
+spec:
+  hard:
+    requests.example.com/gpu: 10
+    request.deviceclass.resource.kubernetes.io/mygpuclass: 10
+    gpu.example.com.deviceclass.resource.k8s.io/devices: 10
+status:
+  used:
+    requests.example.com/gpu: 5
+    request.deviceclass.resource.kubernetes.io/mygpuclass: 4
+    gpu.example.com.deviceclass.resource.k8s.io/devices: 4
+```
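On the pod side, this accounting applies to ordinary extended resource requests; a minimal sketch (pod and container names are illustrative, the resource name comes from the example above):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod              # illustrative name
spec:
  containers:
  - name: app                # illustrative name
    image: registry.k8s.io/pause:3.9
    resources:
      limits:
        example.com/gpu: 1   # counted against requests.example.com/gpu; when a
                             # DeviceClass maps this name (see above), the
                             # scheduler backs it with a generated ResourceClaim
```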
 
 ### Scheduling for Extended Resource backed by DRA
 
@@ -601,7 +633,7 @@ extended resource backed by DRA requests.
 This registers all cluster events that might make an unschedulable pod schedulable,
 like finishing the allocation of a claim, or resource slice updates.
 
-The existing dynamicresource plugin has registered almost all the events needed or
+The existing dynamicresource plugin has registered almost all the events needed for
 extended resource backed by DRA, with one addition `framework.UpdateNodeAllocatable`
 for node action.

@@ -817,10 +849,9 @@ ensure `ExtendedResourceName`s are handled by the scheduler as described in this
 
 #### Beta
 
-- Reevaluate where to create the special resource claim, in scheduler or some
-  other controller, based on feedback from Alpha and the nomination concept.
+- The basic scoring in NodeResourcesFit has to be implemented, and the queueing hints have to work efficiently.
+- Keep the Alpha behavior of creating the special resource claim in the scheduler.
 - Gather feedback from developers and surveys
-- 3 examples of vendors making use of the extensions proposed in this KEP
 - Scalability tests that mirror real-world usage as determined by user feedback
 - Additional tests are in Testgrid and linked in KEP
 - All functionality completed
@@ -903,15 +934,37 @@ feature flags will be enabled on some API servers and not others during the
 rollout. Similarly, consider large clusters and how enablement/disablement
 will rollout across nodes.
 -->
-Will be considered for beta.
+Workloads that do not use the DRA Extended Resource feature should not be impacted,
+since the functionality is unchanged.
+
+If the feature is used in pods before support for it has been fully rolled out
+across the cluster (API server and scheduler in the control plane, and kubelet on
+the nodes), pods may fail to schedule or fail to run on the nodes.
+This will not affect already running workloads unless they have to be restarted.
+
+Device plugin drivers can be replaced with DRA drivers for the same devices on a
+per-node basis, one node at a time.
 
 ###### What specific metrics should inform a rollback?
 
 <!--
 What signals should users be paying attention to when the feature is young
 that might indicate a serious problem?
 -->
-Will be considered for beta.
+One indicator is unexpected restarts of the cluster control plane components
+(kube-scheduler, apiserver) or kubelet.
+
+If the scheduler_pending_pods metric in the kube-scheduler suddenly increases, it can
+suggest that pods are no longer getting scheduled, which might be due to a problem with
+the DRA scheduler plugin. Another is an increase in the number of pods that fail to
+start, as indicated by the kubelet_started_containers_errors_total metric.
+
+If node.status.Capacity for the devices' extended resources does not decrease to zero,
+or a pod fails to be scheduled or to run on the node, it may indicate that the device
+plugin driver for the devices on that node has not been properly replaced by the DRA
+driver.
+
+In all cases further analysis of logs and pod events is needed to determine whether
+errors are related to this feature.
 
 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
 
@@ -920,14 +973,17 @@ Describe manual testing that was done and the outcomes.
 Longer term, we may want to require automated upgrade/rollback tests, but we
 are missing a bunch of machinery and tooling and can't do that now.
 -->
-Will be considered for beta.
+This will be covered by automated tests before the transition to beta, by bringing up
+a KinD cluster and changing the feature gate for individual components.
+
+Roundtripping of API types is covered by unit tests.
 
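A sketch of such a test setup, assuming the feature gate is named `DRAExtendedResource` (the gate name is an assumption, not stated in this diff); a KinD cluster config toggles it for all components at once:

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  DRAExtendedResource: true   # assumed gate name; propagated to the API server,
                              # scheduler, controller manager, and kubelet
nodes:
- role: control-plane
- role: worker
```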
 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
 
 <!--
 Even if applying deprecation policies, they may still surprise some users.
 -->
-Will be considered for beta.
+No.
 
 ### Monitoring Requirements
 
@@ -937,7 +993,6 @@ This section must be completed when targeting beta to a release.
 For GA, this section is required: approvers should be able to confirm the
 previous answers based on experience in the field.
 -->
-Will be considered for beta.
 
 ###### How can an operator determine if the feature is in use by workloads?
 
@@ -946,7 +1001,14 @@ Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
 checking if there are objects with field X set) may be a last resort. Avoid
 logs or events for this purpose.
 -->
-Will be considered for beta.
+`kube_pod_resource_limit` and `kube_pod_resource_request`
+(labels: `namespace`, `pod`, `node`, `scheduler`, `priority`, **`resource`**, `unit`)
+can be used to determine if the feature is in use by workloads, though they do not
+differentiate between extended resources backed by DRA and those backed by a device plugin.
+
+We will add a new `source` label to `resourceclaim_controller_resource_claims` (labels: `admin_access`, `allocated`),
+which indicates whether a resource claim was created for an extended resource or from a resource claim template.
+It should be a good metric to determine whether a resource claim was created for an extended resource backed by DRA.
 
 ###### How can someone using this feature know that it is working for their instance?
 
@@ -959,14 +1021,16 @@ and operation of this feature.
 Recall that end users cannot usually observe component logs or access metrics.
 
 - [ ] Events
-  - Event Reason: 
+  - Event Reason:
 - [ ] API .status
-  - Condition name: 
-  - Other field: 
+  - Condition name:
+  - Other field:
 - [ ] Other (treat as last resort)
   - Details:
 -->
-Will be considered for beta.
+- [x] API .status
+  - Other field: Pod's `.status.extendedResourceClaimStatus` will have a list of resource
+    claims that are created for DRA extended resources.
 
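For illustration, a pod status excerpt might look as follows; the exact field layout and the generated claim name are assumptions for the sketch, not defined by this diff:

```yaml
# Hypothetical excerpt of `kubectl get pod gpu-pod -o yaml`; field layout assumed.
status:
  extendedResourceClaimStatus:
    resourceClaimName: gpu-pod-extended-resources-x7k2p   # scheduler-generated claim
```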
 ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
 
@@ -984,7 +1048,11 @@ high level (needs more precise definitions) those may be things like:
 These goals will help you determine what you need to measure (SLIs) in the next
 question.
 -->
-Will be considered for beta.
+
+Existing DRA and kube-scheduler SLOs continue to apply and must be maintained.
+Pod scheduling duration with this feature should match that of existing DRA.
+Since this feature implicitly affects the filtering phase of the NodeResourcesFit plugin,
+performance should be similar, with no visible degradation compared to baseline scheduling performance.
 
 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
 
@@ -998,15 +1066,50 @@ Pick one more of these and delete the rest.
 - [ ] Other (treat as last resort)
   - Details:
 -->
-Will be considered for beta.
+
+- [x] Metrics
+
+  The label values listed are not exhaustive; we provide some example values relevant to this feature's SLIs.
+
+  **Existing metrics:**
+  - Metric name: workqueue
+    - Type: Gauge/Counter (multiple workqueue metrics)
+    - Labels: `name` ("resource_claim")
+    - SLI usage: Monitor workqueue depth and duration to detect resource claim processing bottlenecks. High depth or duration values indicate potential performance issues in resource claim handling that could affect pod scheduling times.
+  - Metric name: scheduler_pending_pods
+    - Type: Gauge
+    - Labels: `queue` ("active", "backoff", "unschedulable", "gated")
+    - SLI usage: Track increases in the "unschedulable" queue to identify when extended resource availability is preventing pod scheduling. Sustained high values may indicate resource constraint issues or misconfigurations.
+  - Metric name: scheduler_plugin_execution_duration_seconds
+    - Type: Histogram
+    - Labels: `plugin` ("NodeResourcesFit", "DynamicResources"), `extension_point`, `status`
+    - SLI usage: Monitor latencies for the NodeResourcesFit and DynamicResources plugins to ensure the extended resource integration doesn't introduce performance regressions.
+      - We need to monitor NodeResourcesFit because this feature implicitly affects its filtering phase.
+  - Metric name: scheduler_pod_scheduling_sli_duration_seconds
+    - Type: Histogram
+    - Labels: `attempts`
+    - SLI usage: Track end-to-end scheduling performance for pods using extended resources.
+
+  **Updated metrics:**
+  - Metric name: resourceclaim_controller_resource_claims
+    - Type: Gauge
+    - Labels: `admin_access`, `allocated`, `source` ("extended-resource", "resource-claim-template")
+    - SLI usage: Monitor the ratio of allocated vs. total resource claims filtered by `source="extended-resource"` to track resource utilization. A low ratio of allocated claims may indicate DRA driver or resource claim controller issues.
+      - The `source` label is newly added. It can be determined based on the `resource.kubernetes.io/extended-resource-claim` annotation of resource claims.
+
+  **New metrics:**
+  - Metric name: scheduler_resourceclaim_creates_total
+    - Type: Counter
+    - Labels: `status` ("failure", "success")
+    - SLI usage: Calculate the success rate to monitor the reliability of automatic resource claim creation. High failure rates indicate potential issues with extended resource configuration.
+      - Because the resource claim is created in the scheduler PreBind phase by making a Kubernetes API call, we need a metric distinct from `resourceclaim_controller_creates_total`.
+      - The metric is incremented based on the API call outcome, either success or failure.
 
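As one way to consume the new counter as an SLI, a Prometheus alerting rule could track its failure ratio; a sketch with an assumed 5% threshold and illustrative rule/group names:

```yaml
groups:
- name: dra-extended-resource            # illustrative rule group name
  rules:
  - alert: ExtendedResourceClaimCreateFailures
    expr: |
      sum(rate(scheduler_resourceclaim_creates_total{status="failure"}[5m]))
        /
      sum(rate(scheduler_resourceclaim_creates_total[5m])) > 0.05
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: High failure rate creating ResourceClaims for DRA extended resources
```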
 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
 
 <!--
 Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
 implementation difficulties, etc.).
 -->
-Will be considered for beta.
+No.
 
 ### Dependencies
 
@@ -1030,7 +1133,11 @@ and creating new ones, as well as about cluster-level services (e.g. DNS):
   - Impact of its outage on the feature:
   - Impact of its degraded performance or high-error rates on the feature:
 -->
-No.
+The container runtime must support CDI.
+
+A third-party DRA driver is required for publishing resource information and preparing resources on a node.
+
+These are not new requirements introduced by this feature; rather, they are required by DRA structured parameters.
 
 ### Scalability
 
@@ -1077,10 +1184,14 @@ The Troubleshooting section currently serves the `Playbook` role. We may conside
 splitting it into a dedicated `Playbook` document (potentially with some monitoring
 details). For now, we leave it here.
 -->
+The troubleshooting section in https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4381-dra-structured-parameters#troubleshooting
+still applies.
 
 ###### How does this feature react if the API server and/or etcd is unavailable?
 
-Will be considered for beta.
+The Kubernetes control plane will be down, so no new Pods get scheduled. kubelet may
+still be able to start or restart containers if it already received all the relevant
+updates (Pod, ResourceClaim, etc.).
 
 ###### What are other known failure modes?
 
@@ -1100,15 +1211,21 @@ For each of them, fill in the following information by copying the below templat
   - Detection: inspect pod status 'Pending'
   - Mitigations: reduce the number of devices requested in one extended resource backed by DRA requests
   - Diagnostics: scheduler logs at level 5 show the reason for the scheduling failure.
-  - Testing: Will be considered for beta.
+  - Testing: this is a known, deterministic failure mode caused by a defined system limit, i.e., DRA requests must be no more than 128 devices.
 
-###### What steps should be taken if SLOs are not being met to determine the problem?
+- [API server priority & fairness limits extended resource claim creation requests]
+  - Detection: inspect the metric scheduler_resourceclaim_creates_total, and the API server priority & fairness limits
+  - Mitigations: adjust the API server priority and fairness limits if they are too low, to allow extended resource claim creation
+  - Diagnostics: API server and scheduler logs at level 5 show the reason for the extended resource claim creation failure.
+  - Testing: creating pods with DRA extended resource requests at a high rate while the API server
+    priority and fairness limits are too low could trigger extended resource claim creation failures in the scheduler.
 
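The mitigation above amounts to raising the concurrency of the priority level that serves this traffic; a minimal sketch, assuming a dedicated Limited priority level (the name and values are illustrative):

```yaml
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: PriorityLevelConfiguration
metadata:
  name: extended-resource-claims      # illustrative name
spec:
  type: Limited
  limited:
    nominalConcurrencyShares: 40      # raise if claim creation is being throttled
    limitResponse:
      type: Queue
      queuing:
        queues: 64
        queueLengthLimit: 50
        handSize: 6
```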
-Will be considered for beta.
+###### What steps should be taken if SLOs are not being met to determine the problem?
 
 ## Implementation History
 
 - Kubernetes 1.34: KEP accepted.
+- Kubernetes 1.35: promotion to beta.
 
 ## Drawbacks
 