Commit 954ae0c

address PRR review comments
1 parent c1f4e66 commit 954ae0c

File tree

1 file changed: +9 -3 lines changed
  • keps/sig-scheduling/5004-dra-extended-resource

keps/sig-scheduling/5004-dra-extended-resource/README.md

Lines changed: 9 additions & 3 deletions
@@ -818,7 +818,6 @@ ensure `ExtendedResourceName`s are handled by the scheduler as described in this
 - The basic scoring in NodeResourcesFit has to be implemented and that the queueing hints have to work efficiently.
 - Keep the Alpha behavior to create the special resource claim in scheduler.
 - Gather feedback from developers and surveys
-- 3 examples of vendors making use of the extensions proposed in this KEP
 - Scalability tests that mirror real-world usage as determined by user feedback
 - Additional tests are in Testgrid and linked in KEP
 - All functionality completed
@@ -922,12 +921,12 @@ One indicator are unexpected restarts of the cluster control plane components
 (kube-scheduler, apiserver) or kubelet.

 If the scheduler_pending_pods metric in the kube-scheduler suddenly increases, it can
-suggest that pods are no longer gettings scheduled which might be due to a problem with
+suggest that pods are no longer getting scheduled which might be due to a problem with
 the DRA scheduler plugin. Another are an increase in the number of pods that fail to start,
 as indicated by the kubelet_started_containers_errors_total metric.

 If the node.status.Capacity for the extended resources for the devices do not decrease to zero,
-or a pod fail to be scheduled, or run on the node, it may indicate that the device plugin driver
+or a pod fails to be scheduled, or run on the node, it may indicate that the device plugin driver
 on the node for the devices is not properly replaced by the DRA driver.

 In all cases further analysis of logs and pod events is needed to determine whether
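As an illustration of the node-capacity check described in this hunk, the following is a minimal client-go sketch, not part of this KEP: the extended resource name `example.com/gpu` and the kubeconfig location are assumptions. It flags nodes whose device-plugin-advertised capacity has not yet dropped to zero, which per the text above may mean the device plugin driver was not properly replaced by the DRA driver.

```go
package main

import (
	"context"
	"fmt"
	"path/filepath"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	// Assumed extended resource name; replace with the resource your driver serves.
	const resourceName = corev1.ResourceName("example.com/gpu")

	// Build a client from the local kubeconfig (assumed to be ~/.kube/config).
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// List all nodes and report any that still advertise a non-zero capacity
	// for the extended resource, i.e. the device plugin driver on that node
	// may not have been replaced by the DRA driver yet.
	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, node := range nodes.Items {
		if qty, ok := node.Status.Capacity[resourceName]; ok && !qty.IsZero() {
			fmt.Printf("node %s still advertises %s=%s via the device plugin\n",
				node.Name, resourceName, qty.String())
		}
	}
}
```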
@@ -1180,6 +1179,13 @@ For each of them, fill in the following information by copying the below templat
 - Diagnostics: scheduler logs at level 5 show the reason for the scheduling failure.
 - Testing: this is known, determinstic failure mode due to defined system limit, i.e., DRA requests must be no more than 128 devices.

+- [API server priority & fairness limits extended resource claim creation requests]
+- Detection: inspect the scheduler_resourceclaim_creates_total metric and the API server priority & fairness limits
+- Mitigations: raise the API server priority and fairness limits if they are too low, to allow extended resource claim creation
+- Diagnostics: API server and scheduler logs at level 5 show the reason for the extended resource claim creation failure.
+- Testing: creating pods with DRA extended resource requests at a high rate while the API server
+  priority and fairness limit is too low can trigger extended resource claim creation failures in the scheduler.
+
 ###### What steps should be taken if SLOs are not being met to determine the problem?

 ## Implementation History
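For the mitigation named in the added failure mode, here is a rough sketch of how an operator might first inspect the configured API Priority and Fairness limits before raising them. It is an assumption-laden illustration, not part of this commit: it assumes a cluster serving the flowcontrol.apiserver.k8s.io/v1 API (Kubernetes 1.29+) and a kubeconfig at the default location.

```go
package main

import (
	"context"
	"fmt"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	// Build a client from the local kubeconfig (assumed to be ~/.kube/config).
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Print the nominal concurrency shares of each limited priority level.
	// A low value on the level that the scheduler's ResourceClaim create
	// requests map to is a candidate for being raised.
	plcs, err := client.FlowcontrolV1().PriorityLevelConfigurations().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, plc := range plcs.Items {
		if plc.Spec.Limited != nil && plc.Spec.Limited.NominalConcurrencyShares != nil {
			fmt.Printf("priority level %s: nominalConcurrencyShares=%d\n",
				plc.Name, *plc.Spec.Limited.NominalConcurrencyShares)
		}
	}
}
```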
