-# KEP-1923: Try Nominated Node First
+# KEP-1923: Prefer Nominated Node
 
 <!-- toc -->
 - [Release Signoff Checklist](#release-signoff-checklist)
@@ -44,36 +44,35 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 
 ## Summary
 
-If the scheduler fails to fit an incoming pod on any node, the scheduler will try to preempt lower
-priority pods running on a selected node to make room for the pod. The name of this node will be set
-in the pod's `pod.Status.NominatedNodeName`.
+This KEP proposes to change the scheduling cycle such that nominated node of a pod is evaluated first
+and schedule the pod on that node if it fits. If the nominated node doesn't fit the pod, only then the
+scheduling cycle continues with the standard logic of evaluating the rest of the nodes in the cluster.
+
+## Motivation
+
+If the scheduler fails to fit an incoming pod on any node, it will try to preempt lower priority pods
+running on a selected node to make room for the pod. The name of this node will be set in the
+pod's `.status.nominatedNodeName`.
 
 The Node is called *Nominated* to indicate the intent for the Pod to be scheduled on it once preemption
-of other Pods finish. However, the `Pod.status.nominatedNodeName` information is not directly used in
-the Pod's following scheduling attempts.
+of other Pods finishes. However, the Pod's `.status.nominatedNodeName` information is not fully utilized
+in the Pod's following scheduling attempts.
 
 Pod scheduling is split into two phases, the scheduling cycle and the binding cycle, the scheduling cycle
 primarily includes filtering and scoring.
 
 When preemption happens in a previous scheduling cycle, there is a high chance that the nominated node is
 the *only* node that satisfies the filters for the unscheduled Pod that triggered preemption.
 
-This KEP proposes to change the scheduling cycle such that nominated node of a pod is evaluated first
-and schedule the pod on that node if it fits. If the nominated node doesn't fit the pod, only then the
-scheduling cycle continues with the standard logic of evaluating the rest of the nodes in the cluster.
-
-## Motivation
-
 In real production environment, pods can have different priorites due to business needs, the preemption
 could happen to make sure higher priority pods could get scheduled.
 
 In cluster with large number of computing nodes, evaluating all nodes when scheduling a pod is time consuming.
 
 ### Goals
 
-In the case where `pod.Status.NominatedNodeName` is set for an incoming pod, the scheduler will evaluate the
-nominated node first; if the nominated node doesn't fit the pod, the scheduling cycle will continue to evaluate
-the rest of the nodes in the cluster just like we do today.
+Prefer scheduling a pod to its `.status.nominatedNodeName` if set, if the nominated node doesn't fit the pod,
+the scheduling cycle will continue to evaluate the rest of the nodes in the cluster just like we do today.
 
 
 ## Proposal
@@ -94,7 +93,7 @@ nominated node.
 ### Implementation Details
 
 1. In filtering phase, which is currently implemented in the method of `findNodesThatFitPod`, check the nominated node
-first if the incoming pod has the `pod.Status.NominatedNodeName` defined and the feature gate is enabled.
+first if the incoming pod has the `.status.nominatedNodeName` defined and the feature gate is enabled.
 
 2. In case the nominated node doesn't suit for the incoming pod anymore, get `err` from `findNodesThatPassFilters` where
 `NominatedNode` is firstly evaluated, the `err` will be padded with more information to tell that scheduler is evaluating
@@ -108,7 +107,7 @@ nominated node.
 
 Scheduler will retry until matching either of the following cases,
 - `NominatedNode` eventually released all the resource and the preemptor pod can be scheduled on that node.
-- Another node in the cluster released enough release and pod get scheduled on that node instead.
+- Another node in the cluster released enough resources and pod get scheduled on that node instead.
 [Discuss] Should scheduler clear the `NominatedNode` in this case?
 - Resource cannot be released on the `NominatedNode` and no other candidate node could be found in the cluster, this will
 be covered by [issue 95752](https://github.com/kubernetes/kubernetes/issues/95752).
@@ -144,7 +143,7 @@ _This section must be completed when targeting alpha to a release._
 
 * **How can this feature be enabled / disabled in a live cluster?**
 - [x] Feature gate (also fill in values in `kep.yaml`)
-- Feature gate name: TryNominatedNodeFirst
+- Feature gate name: PreferNominatedNode
 - Components depending on the feature gate: kube-scheduler
 
 * **Are there any tests for feature enablement/disablement?**
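
The filtering behavior described under Implementation Details (evaluate the nominated node first when the feature gate is on, otherwise fall back to evaluating every node) could look roughly like the sketch below. It is only an illustration under stated assumptions: `Pod`, `Node`, `fits`, and `feasibleNodes` are hypothetical stand-ins, not the kube-scheduler types or the real `findNodesThatFitPod` signature, and the single CPU check stands in for the full set of filter plugins.

```go
package main

import "fmt"

// Pod is a simplified, hypothetical stand-in for the Kubernetes Pod type.
type Pod struct {
	Name              string
	CPURequest        int
	NominatedNodeName string // mirrors .status.nominatedNodeName
}

// Node is a simplified, hypothetical stand-in for a cluster node.
type Node struct {
	Name    string
	FreeCPU int
}

// fits is a hypothetical filter: the node passes if it has enough free CPU.
func fits(p Pod, n Node) bool {
	return n.FreeCPU >= p.CPURequest
}

// feasibleNodes returns the nodes that pass the filter for the pod.
// When preferNominatedNode (standing in for the feature gate) is true and
// the pod has a nominated node, that node is evaluated first and returned
// alone if it fits. Otherwise every node is evaluated, as before.
func feasibleNodes(p Pod, nodes []Node, preferNominatedNode bool) []Node {
	if preferNominatedNode && p.NominatedNodeName != "" {
		for _, n := range nodes {
			if n.Name == p.NominatedNodeName && fits(p, n) {
				return []Node{n}
			}
		}
	}
	// Nominated node absent, unknown, or no longer fitting: evaluate all nodes.
	var feasible []Node
	for _, n := range nodes {
		if fits(p, n) {
			feasible = append(feasible, n)
		}
	}
	return feasible
}

func main() {
	nodes := []Node{
		{Name: "node-a", FreeCPU: 4},
		{Name: "node-b", FreeCPU: 2},
		{Name: "node-c", FreeCPU: 8},
	}
	pod := Pod{Name: "web", CPURequest: 2, NominatedNodeName: "node-b"}

	fmt.Println(len(feasibleNodes(pod, nodes, true)))  // gate on: only node-b is returned
	fmt.Println(len(feasibleNodes(pod, nodes, false))) // gate off: all fitting nodes are returned
}
```

Returning early with a single node is what saves the work of filtering the rest of the cluster. The fallback path keeps the existing behavior when the nominated node no longer fits.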