Commit 3caec60

bitoku authored and yliaog committed
Add monitoring requirements in KEP-5004
Signed-off-by: Ayato Tokubi <[email protected]>
1 parent 665f217 commit 3caec60

1 file changed: +22 -6 lines changed

keps/sig-scheduling/5004-dra-extended-resource/README.md

Lines changed: 22 additions & 6 deletions
@@ -599,7 +599,7 @@ extended resource backed by DRA requests.
 This registers all cluster events that might make an unschedulable pod schedulable,
 like finishing the allocation of a claim, or resource slice updates.

-The existing dynamicresource plugin has registered almost all the events needed or
+The existing dynamicresource plugin has registered almost all the events needed for
 extended resource backed by DRA, with one addition `framework.UpdateNodeAllocatable`
 for node action.
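
The event registration described in this hunk can be pictured with a minimal Go sketch. It does not import the scheduler framework: `ActionType`, `ClusterEvent`, `registeredEvents`, and `shouldRequeue` are local stand-ins written for illustration, and the constant values and the exact set of registered events are assumptions, not the plugin's real list. The only detail taken from the text is that a node action for allocatable updates is added on top of the claim and slice events.

```go
package main

import "fmt"

// ActionType is a local stand-in for the scheduler framework's bitmask of
// actions. The real constant values may differ.
type ActionType int64

const (
	Add ActionType = 1 << iota
	Delete
	UpdateNodeAllocatable
)

// ClusterEvent is a simplified stand-in: a resource kind plus the actions on
// that kind a plugin wants to be notified about.
type ClusterEvent struct {
	Resource string
	Actions  ActionType
}

// registeredEvents returns a hypothetical registration list. Only the Node
// entry reflects the addition named in the KEP text; the others are examples.
func registeredEvents() []ClusterEvent {
	return []ClusterEvent{
		{Resource: "ResourceClaim", Actions: Add | Delete},
		{Resource: "ResourceSlice", Actions: Add},
		{Resource: "Node", Actions: UpdateNodeAllocatable},
	}
}

// shouldRequeue reports whether an incoming event matches any registration,
// meaning an unschedulable pod could be worth retrying.
func shouldRequeue(registered []ClusterEvent, resource string, action ActionType) bool {
	for _, ev := range registered {
		if ev.Resource == resource && ev.Actions&action != 0 {
			return true
		}
	}
	return false
}

func main() {
	events := registeredEvents()
	fmt.Println(shouldRequeue(events, "Node", UpdateNodeAllocatable))
	fmt.Println(shouldRequeue(events, "Node", Delete))
}
```

The bitmask lets one registration cover several actions on the same resource, which is why a single extra action is enough to cover the node case.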

@@ -969,7 +969,13 @@ Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
 checking if there are objects with field X set) may be a last resort. Avoid
 logs or events for this purpose.
 -->
-Will be considered for beta.
+`kube_pod_resource_limit` and `kube_pod_resource_request`
+(label: `namespace`, `pod`, `node`, `scheduler`, `priority`, **`resource`**, `unit`)
+can be used to determine if the feature is in use by workloads though it doesn't differentiate
+between extended resources backed by DRA or device plugin.
+
+`resourceclaim_controller_resource_claims` (label: `admin_access`, `allocated`, `source`)
+should be a good metric to determine if the resource claim is created by extended resource backed by DRA.

 ###### How can someone using this feature know that it is working for their instance?
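
As a rough illustration of how the `resource` label on `kube_pod_resource_request` could be used, the following Go sketch scans a Prometheus text exposition and sums the samples whose resource name looks like an extended resource. The helper name `extendedResourceTotals`, the sample data, and the heuristic (a name containing `/` that is not under `kubernetes.io/`) are assumptions made for this example. As the text notes, the result cannot tell DRA-backed resources from device-plugin ones.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// resourceLabel extracts the value of the resource label from a sample line.
var resourceLabel = regexp.MustCompile(`resource="([^"]*)"`)

// extendedResourceTotals sums sample values of the given metric per resource
// name, keeping only names that look like extended resources (heuristic).
func extendedResourceTotals(exposition, metric string) map[string]float64 {
	totals := make(map[string]float64)
	scanner := bufio.NewScanner(strings.NewReader(exposition))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if !strings.HasPrefix(line, metric+"{") {
			continue // comments, other metrics, blank lines
		}
		match := resourceLabel.FindStringSubmatch(line)
		if match == nil {
			continue
		}
		name := match[1]
		if !strings.Contains(name, "/") || strings.HasPrefix(name, "kubernetes.io/") {
			continue // cpu, memory, and other built-in resources
		}
		// The sample value is the first field after the closing brace.
		rest := strings.Fields(line[strings.LastIndex(line, "}")+1:])
		if len(rest) == 0 {
			continue
		}
		value, err := strconv.ParseFloat(rest[0], 64)
		if err != nil {
			continue
		}
		totals[name] += value
	}
	return totals
}

func main() {
	// Made-up samples for illustration only.
	sample := `
kube_pod_resource_request{namespace="default",pod="a",node="n1",scheduler="default-scheduler",priority="0",resource="cpu",unit="cores"} 2
kube_pod_resource_request{namespace="default",pod="a",node="n1",scheduler="default-scheduler",priority="0",resource="example.com/gpu",unit=""} 1
kube_pod_resource_request{namespace="default",pod="b",node="n2",scheduler="default-scheduler",priority="0",resource="example.com/gpu",unit=""} 2
`
	for name, total := range extendedResourceTotals(sample, "kube_pod_resource_request") {
		fmt.Printf("%s: %g\n", name, total)
	}
}
```

A non-empty result suggests workloads are requesting extended resources; pairing it with `resourceclaim_controller_resource_claims` would be needed to narrow that down to DRA.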

@@ -989,7 +995,9 @@ Recall that end users cannot usually observe component logs or access metrics.
 - [ ] Other (treat as last resort)
 - Details:
 -->
-Will be considered for beta.
+- [ x ] API .status
+- Other field: `.status.extendedResourceClaimStatus` will have a list of resource claims that are created for
+DRA extended resources.

 ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
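
A user-side check for the `.status.extendedResourceClaimStatus` field named in this hunk might look like the Go sketch below. It works on a Pod serialized as JSON (for example, the output of a `get` with JSON output) and only tests whether the field is present and non-null. The helper name `hasExtendedResourceClaimStatus` is hypothetical, and the inner shape of the field is deliberately left opaque because the diff does not spell it out; the payload in `main` is a placeholder.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// podStatusProbe decodes only the one status field of interest and keeps its
// contents as raw JSON, so no assumption is made about its inner structure.
type podStatusProbe struct {
	Status struct {
		ExtendedResourceClaimStatus json.RawMessage `json:"extendedResourceClaimStatus"`
	} `json:"status"`
}

// hasExtendedResourceClaimStatus reports whether the Pod JSON carries a
// non-null .status.extendedResourceClaimStatus field.
func hasExtendedResourceClaimStatus(podJSON []byte) (bool, error) {
	var probe podStatusProbe
	if err := json.Unmarshal(podJSON, &probe); err != nil {
		return false, err
	}
	raw := probe.Status.ExtendedResourceClaimStatus
	return len(raw) > 0 && string(raw) != "null", nil
}

func main() {
	// Placeholder payload: the field's real contents are not modeled here.
	pod := []byte(`{"status":{"phase":"Running","extendedResourceClaimStatus":{"placeholder":true}}}`)
	present, err := hasExtendedResourceClaimStatus(pod)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println("extended resource claim status present:", present)
}
```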

@@ -1007,7 +1015,8 @@ high level (needs more precise definitions) those may be things like:
 These goals will help you determine what you need to measure (SLIs) in the next
 question.
 -->
-Will be considered for beta.
+Existing DRA and related SLOs continue to apply.
+Pod scheduling duration with this feature should be as fast as existing DRA.

 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

@@ -1021,15 +1030,22 @@ Pick one more of these and delete the rest.
 - [ ] Other (treat as last resort)
 - Details:
 -->
-Will be considered for beta.
+These are the same as for the main DRA feature:
+
+- [x] Metrics
+- Metric name: resourceclaim_controller_creates_total
+- Metric name: resourceclaim_controller_resource_claims
+- Metric name: workqueue with name="resource_claim"
+- Metric name: scheduler_pending_pods
+- Metric name: scheduler_plugin_execution_duration_seconds

 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?

 <!--
 Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
 implementation difficulties, etc.).
 -->
-Will be considered for beta.
+No

 ### Dependencies
