@@ -818,7 +818,6 @@ ensure `ExtendedResourceName`s are handled by the scheduler as described in this
 - The basic scoring in NodeResourcesFit has to be implemented, and the queueing hints have to work efficiently.
 - Keep the Alpha behavior to create the special resource claim in the scheduler.
 - Gather feedback from developers and surveys
-- 3 examples of vendors making use of the extensions proposed in this KEP
 - Scalability tests that mirror real-world usage as determined by user feedback
 - Additional tests are in Testgrid and linked in KEP
 - All functionality completed
@@ -922,12 +921,12 @@ One indicator is unexpected restarts of the cluster control plane components
 (kube-scheduler, apiserver) or kubelet.
 
 If the scheduler_pending_pods metric in the kube-scheduler suddenly increases, it can
-suggest that pods are no longer gettings scheduled which might be due to a problem with
+suggest that pods are no longer getting scheduled, which might be due to a problem with
 the DRA scheduler plugin. Another indicator is an increase in the number of pods that fail to start,
 as indicated by the kubelet_started_containers_errors_total metric.
 
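+As a quick check (a minimal sketch, assuming a Prometheus server scraping the
+control plane is reachable at the hypothetical address $PROM), both metrics can
+be queried directly:
+
+```sh
+# Pods stuck in the scheduler's unschedulable queue.
+curl -s "$PROM/api/v1/query" \
+  --data-urlencode 'query=scheduler_pending_pods{queue="unschedulable"}'
+
+# Rate of container start errors reported by the kubelets.
+curl -s "$PROM/api/v1/query" \
+  --data-urlencode 'query=rate(kubelet_started_containers_errors_total[5m])'
+```
+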
 If node.status.Capacity for the extended resources for the devices does not decrease to zero,
-or a pod fail to be scheduled, or run on the node, it may indicate that the device plugin driver
+or a pod fails to be scheduled or to run on the node, it may indicate that the device plugin driver
 on the node for the devices is not properly replaced by the DRA driver.
 
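+The remaining capacity can be inspected per node, for example (a minimal
+sketch; example.com/gpu is a hypothetical extended resource name, substitute
+the one advertised by the driver):
+
+```sh
+# Should report zero once the DRA driver has replaced the device plugin.
+kubectl get node <node-name> -o jsonpath='{.status.capacity.example\.com/gpu}'
+```
+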
 In all cases, further analysis of logs and pod events is needed to determine whether
@@ -1180,6 +1179,13 @@ For each of them, fill in the following information by copying the below template
   - Diagnostics : scheduler logs at level 5 show the reason for the scheduling failure.
   - Testing : this is a known, deterministic failure mode due to a defined system limit, i.e., DRA requests must be no more than 128 devices.
 
+- [API server priority & fairness limits extended resource claim creation requests]
+  - Detection : inspect the metric scheduler_resourceclaim_creates_total and the API server priority & fairness limits (see the sketch below)
+  - Mitigations : raise the API server priority and fairness limits if they are too low, to allow extended resource claim creation
+  - Diagnostics : API server and scheduler logs at level 5 show the reason for the extended resource claim creation failure.
+  - Testing : creating pods with DRA extended resource requests at a high rate while the API server
+    priority and fairness limits are too low can trigger extended resource claim creation failures in the scheduler.
+
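+As a starting point for detection (a minimal sketch, assuming kubectl access to
+the API server's /metrics endpoint), API Priority and Fairness rejections can be
+checked alongside the scheduler metric above:
+
+```sh
+# Requests rejected by API Priority and Fairness; a rising count together with
+# failed extended resource claim creations suggests the limits are too low.
+kubectl get --raw /metrics | grep '^apiserver_flowcontrol_rejected_requests_total'
+```
+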
 ###### What steps should be taken if SLOs are not being met to determine the problem?
 
 ## Implementation History