Commit 6446577

bitokuyliaog
authored and committed
Update SLIs to explain how to use them as SLIs.
1 parent c421c3b commit 6446577

File tree

1 file changed

+15
-15
lines changed
  • keps/sig-scheduling/5004-dra-extended-resource

keps/sig-scheduling/5004-dra-extended-resource/README.md

Lines changed: 15 additions & 15 deletions
Original file line number | Diff line number | Diff line change
@@ -973,7 +973,7 @@ logs or events for this purpose.
973973
can be used to determine if the feature is in use by workloads though it doesn't differentiate
974974
between extended resources backed by DRA or device plugin.
975975

976-
We will add a new `source` label to`resourceclaim_controller_resource_claims` (label: `admin_access`, `allocated`),
976+
We will add a new `source` label to `resourceclaim_controller_resource_claims` (label: `admin_access`, `allocated`),
977977
which can determine if the resource claim is created by extended resource or resource claim template.
978978
It should be a good metric to determine if the resource claim is created by extended resource backed by DRA.
979979

@@ -988,10 +988,10 @@ and operation of this feature.
988988
Recall that end users cannot usually observe component logs or access metrics.
989989

990990
- [ ] Events
991-
- Event Reason:
991+
- Event Reason:
992992
- [ ] API .status
993-
- Condition name:
994-
- Other field:
993+
- Condition name:
994+
- Other field:
995995
- [ ] Other (treat as last resort)
996996
- Details:
997997
-->
@@ -1035,39 +1035,39 @@ Pick one more of these and delete the rest.
10351035
-->
10361036

10371037
- [x] Metrics
1038-
Values of each label are not thorough, picking up some example values which is related to this feature SLI.
1038+
Values of each label are not exhaustive; we are providing some example values that are related to this feature's SLI.
10391039
**Existing metrics:**
10401040
- Metric name: workqueue
10411041
- Type: Gauge/Counter (multiple workqueue metrics)
10421042
- Labels: `name` ("resource_claim")
1043-
- Description: Multiple workqueue metrics including adds, depth, duration, and retries handled by workqueue
1043+
- SLI Usage: Monitor workqueue depth and duration to detect resource claim processing bottlenecks. High depth or duration values indicate potential performance issues in resource claim handling that could affect pod scheduling times.
10441044
- Metric name: scheduler_pending_pods
10451045
- Type: Gauge
10461046
- Labels: `queue` ("active", "backoff", "unschedulable", "gated")
1047-
- Description: Number of pending pods, by the queue type. 'active' means number of pods in activeQ; 'backoff' means number of pods in backoffQ; 'unschedulable' means number of pods in unschedulablePods that the scheduler attempted to schedule and failed; 'gated' is the number of unschedulable pods that the scheduler never attempted to schedule because they are gated
1047+
- SLI Usage: Track increases in 'unschedulable' queues to identify when extended resource availability is preventing pod scheduling. Sustained high values may indicate resource constraint issues or misconfigurations.
10481048
- Metric name: scheduler_plugin_execution_duration_seconds
10491049
- Type: Histogram
10501050
- Labels: `plugin` ("NodeResourcesFit", "DynamicResources"), `extension_point`, `status`
1051-
- Description: Duration for running a plugin at a specific extension point
1052-
- We need to monitor NodeResourcesFit, because this feature implicitly affects its filtering phase.
1051+
- SLI Usage: Monitor latencies for NodeResourcesFit and DynamicResources plugins to ensure the extended resource integration doesn't introduce performance regressions.
1052+
- We need to monitor NodeResourcesFit because this feature implicitly affects its filtering phase.
10531053
- Metric name: scheduler_pod_scheduling_sli_duration_seconds
10541054
- Type: Histogram
10551055
- Labels: `attempts`
1056-
- Description: E2e latency for a pod being scheduled, from the time the pod enters the scheduling queue and might involve multiple scheduling attempts
1056+
- SLI Usage: Track end-to-end scheduling performance for pods using extended resources.
10571057

10581058
**Updating metrics:**
10591059
- Metric name: resourceclaim_controller_resource_claims
10601060
- Type: Gauge
1061-
- Labels: `admin_access` , `allocated`, `source` ("extended-resource", "resource-claim-template")
1062-
- Description: Number of ResourceClaims
1063-
- `source` label is newly added. It can be determined based on the `resource.kubernetes.io/extended-resource-claim` annotation of resource claims.
1061+
- Labels: `admin_access`, `allocated`, `source` ("extended-resource", "resource-claim-template")
1062+
- SLI Usage: Monitor the ratio of allocated vs. total resource claims filtered by `source="extended-resource"` to track resource utilization. A low ratio of allocated claims may indicate DRA driver or resource claim controller issues.
1063+
- The `source` label is newly added. It can be determined based on the `resource.kubernetes.io/extended-resource-claim` annotation of resource claims.
10641064

10651065
**New metrics:**
10661066
- Metric name: scheduler_resourceclaim_creates_total
10671067
- Type: Counter
10681068
- Labels: `status` ("failure", "success")
1069-
- Description: Total number of resource claim creation attempts by the scheduler
1070-
- Because the resource claim is created in scheduler, we need a different one from `resourceclaim_controller_creates_total`.
1069+
- SLI Usage: Calculate success rate to monitor the reliability of automatic resource claim creation. High failure rates indicate potential issues with extended resource configuration.
1070+
- Because the resource claim is created in the scheduler, we need a different metric from `resourceclaim_controller_creates_total`.
10711071
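
As an illustration outside the commit itself, the two ratio-style SLIs described above (the creation success rate from `scheduler_resourceclaim_creates_total` and the allocated share of `resourceclaim_controller_resource_claims` filtered by `source="extended-resource"`) can be sketched in Python. The helper names and sample numbers are hypothetical, and the sketch assumes the per-label values have already been read from a metrics backend; it is not part of the KEP.

```python
from typing import Optional


def ratio(numerator: float, denominator: float) -> Optional[float]:
    """Return numerator / denominator, or None when there is no data."""
    if denominator <= 0:
        return None
    return numerator / denominator


def creation_success_rate(success: float, failure: float) -> Optional[float]:
    # Based on scheduler_resourceclaim_creates_total, split by the `status` label
    # ("success" vs. "failure").
    return ratio(success, success + failure)


def allocated_ratio(allocated: float, not_allocated: float) -> Optional[float]:
    # Based on resourceclaim_controller_resource_claims filtered by
    # source="extended-resource", split by the `allocated` label.
    return ratio(allocated, allocated + not_allocated)


if __name__ == "__main__":
    # Hypothetical sample values, not real measurements.
    print(creation_success_rate(success=980, failure=20))
    print(allocated_ratio(allocated=45, not_allocated=5))
```

Returning `None` when there are no samples keeps "no data" distinct from a genuine 0% ratio, which matters if the result feeds an alert threshold.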

10721072
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
10731073
