Addressed comments and updates SLIs

bitoku · yliaog · commit c421c3b7a4ba · 2025-10-02T15:36:11.000Z
Signed-off-by: Ayato Tokubi &lt;atokubi@redhat.com&gt;
diff --git a/keps/sig-scheduling/5004-dra-extended-resource/README.md b/keps/sig-scheduling/5004-dra-extended-resource/README.md
@@ -960,7 +960,6 @@ This section must be completed when targeting beta to a release.
 For GA, this section is required: approvers should be able to confirm the
 previous answers based on experience in the field.
 -->
-Will be considered for beta.
 
 ###### How can an operator determine if the feature is in use by workloads?
 
@@ -971,11 +970,12 @@ logs or events for this purpose.
 -->
 `kube_pod_resource_limit` and `kube_pod_resource_request`
 (label: `namespace`, `pod`, `node`, `scheduler`, `priority`, **`resource`**, `unit`)
-can be used to determine if the feature is in use by workloads though it doesn't differentiate 
+can be used to determine if the feature is in use by workloads though it doesn't differentiate
 between extended resources backed by DRA or device plugin.
 
-`resourceclaim_controller_resource_claims` (label: `admin_access`, `allocated`, `source`)
-should be a good metric to determine if the resource claim is created by extended resource backed by DRA.
+We will add a new `source` label to`resourceclaim_controller_resource_claims` (label: `admin_access`, `allocated`),
+which can determine if the resource claim is created by extended resource or resource claim template.
+It should be a good metric to determine if the resource claim is created by extended resource backed by DRA.
 
 ###### How can someone using this feature know that it is working for their instance?
 
@@ -995,7 +995,7 @@ Recall that end users cannot usually observe component logs or access metrics.
 - [ ] Other (treat as last resort)
   - Details:
 -->
-- [ x ] API .status
+- [x] API .status
     - Other field: `.status.extendedResourceClaimStatus` will have a list of resource claims that are created for
       DRA extended resources.
 
@@ -1015,8 +1015,11 @@ high level (needs more precise definitions) those may be things like:
 These goals will help you determine what you need to measure (SLIs) in the next
 question.
 -->
-Existing DRA and related SLOs continue to apply.
+
+Existing DRA and kube-scheduler SLOs continue to apply and must be maintained.
 Pod scheduling duration with this feature should be as fast as existing DRA.
+Since this feature implicitly affects the filtering phase of the NodeResourcesFit plugin,
+the performance should be similar with no visible degradation compared to the baseline scheduling performance.
 
 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
 
@@ -1030,14 +1033,41 @@ Pick one more of these and delete the rest.
 - [ ] Other (treat as last resort)
   - Details:
 -->
-These are the same as for the main DRA feature:
 
 - [x] Metrics
-    - Metric name: resourceclaim_controller_creates_total
-    - Metric name: resourceclaim_controller_resource_claims
-    - Metric name: workqueue with name="resource_claim"
+  Values of each label are not thorough, picking up some example values which is related to this feature SLI.
+  **Existing metrics:**
+    - Metric name: workqueue
+        - Type: Gauge/Counter (multiple workqueue metrics)
+        - Labels: `name` ("resource_claim")
+        - Description: Multiple workqueue metrics including adds, depth, duration, and retries handled by workqueue
     - Metric name: scheduler_pending_pods
+        - Type: Gauge
+        - Labels: `queue` ("active", "backoff", "unschedulable", "gated")
+        - Description: Number of pending pods, by the queue type. 'active' means number of pods in activeQ; 'backoff' means number of pods in backoffQ; 'unschedulable' means number of pods in unschedulablePods that the scheduler attempted to schedule and failed; 'gated' is the number of unschedulable pods that the scheduler never attempted to schedule because they are gated
     - Metric name: scheduler_plugin_execution_duration_seconds
+        - Type: Histogram
+        - Labels: `plugin` ("NodeResourcesFit", "DynamicResources"), `extension_point`, `status`
+        - Description: Duration for running a plugin at a specific extension point
+        - We need to monitor NodeResourcesFit, because this feature implicitly affects its filtering phase.
+    - Metric name: scheduler_pod_scheduling_sli_duration_seconds
+        - Type: Histogram
+        - Labels: `attempts`
+        - Description: E2e latency for a pod being scheduled, from the time the pod enters the scheduling queue and might involve multiple scheduling attempts
+
+**Updating metrics:**
+- Metric name: resourceclaim_controller_resource_claims
+    - Type: Gauge
+    - Labels: `admin_access` , `allocated`, `source` ("extended-resource", "resource-claim-template")
+    - Description: Number of ResourceClaims
+    - `source` label is newly added. It can be determined based on the `resource.kubernetes.io/extended-resource-claim` annotation of resource claims.
+
+**New metrics:**
+- Metric name: scheduler_resourceclaim_creates_total
+    - Type: Counter
+    - Labels: `status` ("failure", "success")
+    - Description: Total number of resource claim creation attempts by the scheduler
+    - Because the resource claim is created in scheduler, we need a different one from `resourceclaim_controller_creates_total`.
 
 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?