- Description: Multiple workqueue metrics including adds, depth, duration, and retries handled by workqueue
+ - SLI Usage: Monitor workqueue depth and duration to detect resource claim processing bottlenecks. High depth or duration values indicate potential performance issues in resource claim handling that could affect pod scheduling times.
- Description: Number of pending pods, by the queue type. 'active' means number of pods in activeQ; 'backoff' means number of pods in backoffQ; 'unschedulable' means number of pods in unschedulablePods that the scheduler attempted to schedule and failed; 'gated' is the number of unschedulable pods that the scheduler never attempted to schedule because they are gated
+ - SLI Usage: Track increases in 'unschedulable' queues to identify when extended resource availability is preventing pod scheduling. Sustained high values may indicate resource constraint issues or misconfigurations.
- Description: Duration for running a plugin at a specific extension point
- - We need to monitor NodeResourcesFit, because this feature implicitly affects its filtering phase.
+ - SLI Usage: Monitor latencies for NodeResourcesFit and DynamicResources plugins to ensure the extended resource integration doesn't introduce performance regressions.
+ - We need to monitor NodeResourcesFit because this feature implicitly affects its filtering phase.
- SLI Usage: Monitor the ratio of allocated vs. total resource claims filtered by `source="extended-resource"` to track resource utilization. A low ratio of allocated claims may indicate DRA driver or resource claim controller issues.
+ - The `source` label is newly added. It can be determined based on the `resource.kubernetes.io/extended-resource-claim` annotation of resource claims.
- Description: Total number of resource claim creation attempts by the scheduler
- - Because the resource claim is created in scheduler, we need a different one from `resourceclaim_controller_creates_total`.
+ - SLI Usage: Calculate success rate to monitor the reliability of automatic resource claim creation. High failure rates indicate potential issues with extended resource configuration.
+ - Because the resource claim is created in the scheduler, we need a different metric from `resourceclaim_controller_creates_total`.
###### Are there any missing metrics that would be useful to have to improve observability of this feature?