@@ -960,7 +960,6 @@ This section must be completed when targeting beta to a release.
 For GA, this section is required: approvers should be able to confirm the
 previous answers based on experience in the field.
 -->
-Will be considered for beta.
 
 ###### How can an operator determine if the feature is in use by workloads?
 
@@ -971,11 +970,12 @@ logs or events for this purpose.
 -->
 `kube_pod_resource_limit` and `kube_pod_resource_request`
 (label: `namespace`, `pod`, `node`, `scheduler`, `priority`, **`resource`**, `unit`)
-can be used to determine if the feature is in use by workloads though it doesn't differentiate
+can be used to determine if the feature is in use by workloads though it doesn't differentiate
 between extended resources backed by DRA or device plugin.
 
-`resourceclaim_controller_resource_claims` (label: `admin_access`, `allocated`, `source`)
-should be a good metric to determine if the resource claim is created by extended resource backed by DRA.
+We will add a new `source` label to `resourceclaim_controller_resource_claims` (existing labels: `admin_access`, `allocated`),
+which indicates whether the resource claim was created for an extended resource or from a resource claim template.
+This should make it a good metric for determining whether a resource claim was created for an extended resource backed by DRA.
 
 ###### How can someone using this feature know that it is working for their instance?
 
@@ -995,7 +995,7 @@ Recall that end users cannot usually observe component logs or access metrics.
 - [ ] Other (treat as last resort)
   - Details:
 -->
-- [ x ] API .status
+- [x] API .status
   - Other field: `.status.extendedResourceClaimStatus` will have a list of resource claims that are created for
     DRA extended resources.
 
@@ -1015,8 +1015,11 @@ high level (needs more precise definitions) those may be things like:
 These goals will help you determine what you need to measure (SLIs) in the next
 question.
 -->
-Existing DRA and related SLOs continue to apply.
+
+Existing DRA and kube-scheduler SLOs continue to apply and must be maintained.
 Pod scheduling duration with this feature should be as fast as existing DRA.
+Since this feature implicitly affects the filtering phase of the NodeResourcesFit plugin,
+scheduling performance should show no visible degradation compared to the baseline.
 
 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
 
@@ -1030,14 +1033,41 @@ Pick one more of these and delete the rest.
 - [ ] Other (treat as last resort)
   - Details:
 -->
-These are the same as for the main DRA feature:
 
 - [x] Metrics
-  - Metric name: resourceclaim_controller_creates_total
-  - Metric name: resourceclaim_controller_resource_claims
-  - Metric name: workqueue with name="resource_claim"
+  The label values listed below are not exhaustive; they are examples relevant to this feature's SLIs.
+  **Existing metrics:**
+  - Metric name: workqueue
+    - Type: Gauge/Counter (multiple workqueue metrics)
+    - Labels: `name` ("resource_claim")
+    - Description: Multiple workqueue metrics including adds, depth, duration, and retries handled by workqueue
   - Metric name: scheduler_pending_pods
+    - Type: Gauge
+    - Labels: `queue` ("active", "backoff", "unschedulable", "gated")
+    - Description: Number of pending pods, by the queue type. 'active' means number of pods in activeQ; 'backoff' means number of pods in backoffQ; 'unschedulable' means number of pods in unschedulablePods that the scheduler attempted to schedule and failed; 'gated' is the number of unschedulable pods that the scheduler never attempted to schedule because they are gated
   - Metric name: scheduler_plugin_execution_duration_seconds
+    - Type: Histogram
+    - Labels: `plugin` ("NodeResourcesFit", "DynamicResources"), `extension_point`, `status`
+    - Description: Duration for running a plugin at a specific extension point
+    - We need to monitor NodeResourcesFit, because this feature implicitly affects its filtering phase.
+  - Metric name: scheduler_pod_scheduling_sli_duration_seconds
+    - Type: Histogram
+    - Labels: `attempts`
+    - Description: E2e latency for a pod being scheduled, from the time the pod enters the scheduling queue and might involve multiple scheduling attempts
+
+  **Updating metrics:**
+  - Metric name: resourceclaim_controller_resource_claims
+    - Type: Gauge
+    - Labels: `admin_access`, `allocated`, `source` ("extended-resource", "resource-claim-template")
+    - Description: Number of ResourceClaims
+    - The `source` label is newly added. Its value can be determined from the `resource.kubernetes.io/extended-resource-claim` annotation on resource claims.
+
+  **New metrics:**
+  - Metric name: scheduler_resourceclaim_creates_total
+    - Type: Counter
+    - Labels: `status` ("failure", "success")
+    - Description: Total number of resource claim creation attempts by the scheduler
+    - Because the resource claim is created in the scheduler, a metric separate from `resourceclaim_controller_creates_total` is needed.
 
 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
 