Commit c421c3b

bitokuyliaog
authored and committed
Addressed comments and updates SLIs
Signed-off-by: Ayato Tokubi <[email protected]>
1 parent 3caec60 commit c421c3b

File tree

1 file changed (+40, -10 lines changed)
  • keps/sig-scheduling/5004-dra-extended-resource

keps/sig-scheduling/5004-dra-extended-resource/README.md

Lines changed: 40 additions & 10 deletions
Original file line number / Diff line number / Diff line change
@@ -960,7 +960,6 @@ This section must be completed when targeting beta to a release.
960960
For GA, this section is required: approvers should be able to confirm the
961961
previous answers based on experience in the field.
962962
-->
963-
Will be considered for beta.
964963

965964
###### How can an operator determine if the feature is in use by workloads?
966965

@@ -971,11 +970,12 @@ logs or events for this purpose.
971970
-->
972971
`kube_pod_resource_limit` and `kube_pod_resource_request`
973972
(label: `namespace`, `pod`, `node`, `scheduler`, `priority`, **`resource`**, `unit`)
974-
can be used to determine if the feature is in use by workloads though it doesn't differentiate
973+
can be used to determine if the feature is in use by workloads though it doesn't differentiate
975974
between extended resources backed by DRA or device plugin.
976975

977-
`resourceclaim_controller_resource_claims` (label: `admin_access`, `allocated`, `source`)
978-
should be a good metric to determine if the resource claim is created by extended resource backed by DRA.
976+
We will add a new `source` label to `resourceclaim_controller_resource_claims` (label: `admin_access`, `allocated`),
977+
which indicates whether the resource claim was created for an extended resource or from a resource claim template.
978+
It should be a good metric to determine if the resource claim is created by extended resource backed by DRA.
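
As a rough illustration of the check described above, the following Python sketch parses a Prometheus text-format scrape and sums `resourceclaim_controller_resource_claims` by its `source` label. It assumes the `source` label has been added as proposed and that claims created for extended resources carry the value `extended-resource`. The sample scrape text and the helper names `claims_by_source` and `feature_in_use` are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical scrape output; the sample values are made up for illustration.
SAMPLE = '''\
# TYPE resourceclaim_controller_resource_claims gauge
resourceclaim_controller_resource_claims{admin_access="false",allocated="true",source="extended-resource"} 3
resourceclaim_controller_resource_claims{admin_access="false",allocated="false",source="extended-resource"} 1
resourceclaim_controller_resource_claims{admin_access="false",allocated="true",source="resource-claim-template"} 5
'''

# Matches simple `name{labels} value` sample lines without timestamps.
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def claims_by_source(text, metric="resourceclaim_controller_resource_claims"):
    """Sum the gauge over all other labels, grouped by the `source` label."""
    totals = defaultdict(float)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        # Series without a `source` label are grouped under the empty string.
        totals[labels.get("source", "")] += float(match.group("value"))
    return dict(totals)


def feature_in_use(text):
    """True if at least one claim is attributed to an extended resource."""
    return claims_by_source(text).get("extended-resource", 0) > 0
```

Calling `feature_in_use(SAMPLE)` on the sample above reports that extended-resource claims exist. Before the `source` label is available, every series falls into the empty-string group and the check returns `False`.
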
979979

980980
###### How can someone using this feature know that it is working for their instance?
981981

@@ -995,7 +995,7 @@ Recall that end users cannot usually observe component logs or access metrics.
995995
- [ ] Other (treat as last resort)
996996
- Details:
997997
-->
998-
- [ x ] API .status
998+
- [x] API .status
999999
- Other field: `.status.extendedResourceClaimStatus` will have a list of resource claims that are created for
10001000
DRA extended resources.
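
A minimal sketch of how a client could check this field on a Pod object that has already been fetched and decoded into a Python dictionary (for example from `kubectl get pod -o json`). The text above does not spell out the shape of `.status.extendedResourceClaimStatus`, so the sketch only tests whether the field is present and non-empty. The helper name and the placeholder contents in the usage note are hypothetical.

```python
def extended_resource_claim_status(pod):
    """Return .status.extendedResourceClaimStatus, or None if unset or empty.

    `pod` is a Pod object decoded from JSON into a dict. The contents of the
    field are treated as opaque because their schema is an assumption here.
    """
    status = pod.get("status") or {}
    return status.get("extendedResourceClaimStatus") or None


def uses_dra_extended_resources(pod):
    """True if the pod reports resource claims created for extended resources."""
    return extended_resource_claim_status(pod) is not None
```

For instance, `uses_dra_extended_resources({"status": {"extendedResourceClaimStatus": {"placeholder": "opaque"}}})` is true, while a pod whose status lacks the field yields false.
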
10011001

@@ -1015,8 +1015,11 @@ high level (needs more precise definitions) those may be things like:
10151015
These goals will help you determine what you need to measure (SLIs) in the next
10161016
question.
10171017
-->
1018-
Existing DRA and related SLOs continue to apply.
1018+
1019+
Existing DRA and kube-scheduler SLOs continue to apply and must be maintained.
10191020
Pod scheduling duration with this feature should be as fast as existing DRA.
1021+
Since this feature implicitly affects the filtering phase of the NodeResourcesFit plugin,
1022+
the performance should be similar with no visible degradation compared to the baseline scheduling performance.
10201023

10211024
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
10221025

@@ -1030,14 +1033,41 @@ Pick one more of these and delete the rest.
10301033
- [ ] Other (treat as last resort)
10311034
- Details:
10321035
-->
1033-
These are the same as for the main DRA feature:
10341036

10351037
- [x] Metrics
1036-
- Metric name: resourceclaim_controller_creates_total
1037-
- Metric name: resourceclaim_controller_resource_claims
1038-
- Metric name: workqueue with name="resource_claim"
1038+
The label values listed below are not exhaustive; they are examples relevant to this feature's SLIs.
1039+
**Existing metrics:**
1040+
- Metric name: workqueue
1041+
- Type: Gauge/Counter (multiple workqueue metrics)
1042+
- Labels: `name` ("resource_claim")
1043+
- Description: Multiple workqueue metrics including adds, depth, duration, and retries handled by workqueue
10391044
- Metric name: scheduler_pending_pods
1045+
- Type: Gauge
1046+
- Labels: `queue` ("active", "backoff", "unschedulable", "gated")
1047+
- Description: Number of pending pods, by the queue type. 'active' means number of pods in activeQ; 'backoff' means number of pods in backoffQ; 'unschedulable' means number of pods in unschedulablePods that the scheduler attempted to schedule and failed; 'gated' is the number of unschedulable pods that the scheduler never attempted to schedule because they are gated
10401048
- Metric name: scheduler_plugin_execution_duration_seconds
1049+
- Type: Histogram
1050+
- Labels: `plugin` ("NodeResourcesFit", "DynamicResources"), `extension_point`, `status`
1051+
- Description: Duration for running a plugin at a specific extension point
1052+
- We need to monitor NodeResourcesFit, because this feature implicitly affects its filtering phase.
1053+
- Metric name: scheduler_pod_scheduling_sli_duration_seconds
1054+
- Type: Histogram
1055+
- Labels: `attempts`
1056+
- Description: E2e latency for a pod being scheduled, from the time the pod enters the scheduling queue and might involve multiple scheduling attempts
1057+
1058+
**Updating metrics:**
1059+
- Metric name: resourceclaim_controller_resource_claims
1060+
- Type: Gauge
1061+
- Labels: `admin_access` , `allocated`, `source` ("extended-resource", "resource-claim-template")
1062+
- Description: Number of ResourceClaims
1063+
- `source` label is newly added. It can be determined based on the `resource.kubernetes.io/extended-resource-claim` annotation of resource claims.
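
The derivation of the `source` label from the annotation could look roughly like the Python sketch below. It is an illustration, not the controller's implementation. It assumes that the mere presence of the `resource.kubernetes.io/extended-resource-claim` annotation marks a claim as `extended-resource` (the annotation value is not inspected), and that every other claim is counted as `resource-claim-template`. The function names are hypothetical.

```python
from collections import Counter

EXTENDED_RESOURCE_ANNOTATION = "resource.kubernetes.io/extended-resource-claim"


def claim_source(claim):
    """Classify a ResourceClaim (decoded into a dict) for the `source` label.

    Assumption: only the presence of the annotation key matters.
    """
    metadata = claim.get("metadata") or {}
    annotations = metadata.get("annotations") or {}
    if EXTENDED_RESOURCE_ANNOTATION in annotations:
        return "extended-resource"
    # Simplification: all remaining claims are attributed to a template.
    return "resource-claim-template"


def count_by_source(claims):
    """Count claims per `source` value, mirroring what the gauge would report."""
    return Counter(claim_source(claim) for claim in claims)
```

A real gauge would additionally be partitioned by `admin_access` and `allocated`; those labels are left out here to keep the sketch short.
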
1064+
1065+
**New metrics:**
1066+
- Metric name: scheduler_resourceclaim_creates_total
1067+
- Type: Counter
1068+
- Labels: `status` ("failure", "success")
1069+
- Description: Total number of resource claim creation attempts by the scheduler
1070+
- Because the resource claim is created in the scheduler, a metric separate from `resourceclaim_controller_creates_total` is needed.
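
One way an operator might turn the proposed counter into a health signal is to compute a failure ratio between two scrapes. The Python sketch below assumes the `status` label takes the values `success` and `failure` as listed above. The function name, the input format (a dict from status to counter value), and the reset handling are illustrative choices, not part of the proposal.

```python
def creation_failure_ratio(before, after):
    """Fraction of failed claim creations between two counter snapshots.

    `before` and `after` map the `status` label value to the value of
    `scheduler_resourceclaim_creates_total` at two points in time.
    Returns None when no creation attempts happened in the interval.
    """
    deltas = {
        status: after.get(status, 0) - before.get(status, 0)
        for status in ("success", "failure")
    }
    if any(delta < 0 for delta in deltas.values()):
        # A counter that went down means the process restarted in between.
        raise ValueError("counter reset detected between snapshots")
    total = sum(deltas.values())
    if total == 0:
        return None
    return deltas["failure"] / total
```

With `before = {"success": 100, "failure": 2}` and `after = {"success": 190, "failure": 12}`, there are 90 successes and 10 failures in the interval, so one in ten attempts failed.
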
10411071

10421072
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
10431073
