Commit c575369

Don't clear the nominated node and continue to evaluate rest of nodes
Signed-off-by: Dave Chen <[email protected]>
1 parent 1a81f09 commit c575369

File tree

1 file changed: +16 -66 lines changed
  • keps/sig-scheduling/1923-try-nominated-node-first


keps/sig-scheduling/1923-try-nominated-node-first/README.md

Lines changed: 16 additions & 66 deletions
@@ -10,7 +10,6 @@
  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
  - [Design Details](#design-details)
  - [Implementation Details](#implementation-details)
- - [Alternatives](#alternatives)
  - [Test Plan](#test-plan)
  - [Graduation Criteria](#graduation-criteria)
  - [Alpha (v1.21):](#alpha-v121)
@@ -47,7 +46,7 @@ Items marked with (R) are required *prior to targeting to a milestone / release*

  If the scheduler fails to fit an incoming pod on any node, the scheduler will try to preempt lower
  priority pods running on a selected node to make room for the pod. The name of this node will be set
- in the pods' `pod.Status.NominatedNodeName`.
+ in the pod's `pod.Status.NominatedNodeName`.

  The Node is called *Nominated* to indicate the intent for the Pod to be scheduled on it once preemption
  of other Pods finish. However, the `Pod.status.nominatedNodeName` information is not directly used in
@@ -97,70 +96,21 @@ nominated node.
  1. In filtering phase, which is currently implemented in the method of `findNodesThatFitPod`, check the nominated node
  first if the incoming pod has the `pod.Status.NominatedNodeName` defined and the feature gate is enabled.

- 2. In case the nominated node no longer suits the incoming pod, return `ErrNominateNode`
- instead of `core.FitError`, because this gives the scheduler a chance to clean up the nominated
- node from the pod and find a new node to schedule onto instead of preempting again (another node
- might already have become available for scheduling in the meantime). A fresh scheduling cycle will be
- started later.
-
- A new error `ErrNominateNode` should be defined to describe what went wrong on the nominated node.
-
- If the nominated node no longer suits the pod, the scheduling failure will be recorded
- and `updatePod` will be called; here we change the logic to update the pod whenever the
- `nominatedNode` parameter differs from what the pod holds in `pod.Status.NominatedNodeName`.
- In this case the `nominatedNode` parameter is an empty string, so the nominated node will be
- cleared from the pod and the pod will be moved to the active queue. This lets the scheduler find another
- place for the pod in the next scheduling cycle.
-
- ### Alternatives
-
- - Should the scheduler keep trying the nominated node when scheduling fails?
-
-   This is the case when pod deletion is still in flight: the deletion of the victim pods has
-   been triggered and sent to the apiserver, but they have not actually been deleted by `kubelet` or the container runtime.
-
-   There are several things to consider here, which is why this approach was not adopted:
-
-   1. Keep retrying within this scheduling cycle until the deletion is done.
-
-      This would block the scheduler, and we never know when the deletion will finish; something might
-      block it for a long time, for example the docker service going down and never recovering.
-
-   2. Reserve `pod.Status.NominatedNodeName` for the preemptor pod, so that the nominated node will be
-      tried in the following scheduling cycle (i.e. do not clear `nominatedNode` on failure).
-
-      This does not resolve the issue above either; it produces an infinite loop on the
-      nominated node.
-
-   3. Cases other than pod deletion should also be considered, since they can make the nominated node
-      no longer fit the preemptor, for example: the nominated node becomes unschedulable, another
-      node in the cluster releases enough room for the incoming pod, or a topology update due to pod deletion on another
-      node makes the nominated node fail the `PodTopologySpread` filter.
-
-   All of these cases require starting a fresh scheduling cycle and finding a better node than the
-   nominated node selected in the previous cycle.
-
-
- - Should the scheduler go on to the preemption evaluation and try the nominated node there, updating the nominated node if necessary?
-
-   In order to continue with preemption after a failure on the nominated node, the scheduler would have to return `core.FitError`
-   so that preemption proceeds.
-
-   1. [Debatable]: For case #3 above, assume scheduling continues into the preemption evaluation;
-      if there is another candidate node that does not require preempting any victim pods, that node should be chosen as the
-      new nominated node. The same holds if this is done in a new scheduling cycle: the same node would be chosen
-      under both approaches, so there is nearly no difference. However, a case like this looks as if it should be handled
-      by the normal scheduling process rather than the pod preemption phase; handling it in preemption sounds like an anti-pattern
-      of what preemption is designed for.
-
-   2. If the nominated node does not fit because the victim pods' deletion is still in flight, and nothing else has
-      changed, the nominated node will be chosen again whether scheduling goes straight to the preemption evaluation or through a
-      normal scheduling cycle followed by a preemption evaluation.
-      The latter gives the deletion more time to finish, and the nominated node will be chosen
-      either in the normal scheduling cycle or as the selected node in the following preemption evaluation phase.
-      Pod deletion might be triggered again on the victim pod(s) if scheduling does finally go to preemption; there is no harm
-      in doing that.
-      Note, however, that the former case might take less time.
+ 2. In case the nominated node no longer suits the incoming pod, return the `err` obtained from `findNodesThatPassFilters`;
+ the `err` will be wrapped with additional information to tell that the scheduler was evaluating the feasibility of the `NominatedNode`
+ and failed on that node.
+
+ If no error is returned but the node does not pass all the filters, this is possibly because resources that are claimed to be
+ released have not been fully freed yet; the scheduler will continue to evaluate the rest of the nodes to check whether
+ any node is already available for the incoming pod.
+
+ If the scheduler still cannot find any node for the pod, scheduling will be retried until one of the following cases is met:
+ - The `NominatedNode` eventually releases all the resources and the preemptor pod can be scheduled on that node.
+ - Another node in the cluster releases enough resources and the pod gets scheduled on that node instead.
+   [Discuss] Should the scheduler clear the `NominatedNode` in this case?
+ - Resources cannot be released on the `NominatedNode` and no other candidate node can be found in the cluster; this will
+   be covered by [issue 95752](https://github.com/kubernetes/kubernetes/issues/95752).
+

  ### Test Plan

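As an illustration of the flow described in the added step 2 above, here is a minimal Go sketch of the intended evaluation order: the nominated node is filtered first, an error on it is wrapped with `NominatedNode` context, and a plain "does not fit" result keeps the nomination and falls through to the remaining nodes. The types and helpers (`nodeInfo`, `podSpec`, `runFilters`, `findNodeNominatedFirst`) are hypothetical stand-ins, not the actual kube-scheduler `findNodesThatFitPod`/`findNodesThatPassFilters` implementation.

```go
package main

import "fmt"

// nodeInfo is a hypothetical, simplified stand-in for the scheduler's node
// snapshot entry: just a name and the currently free milli-CPU.
type nodeInfo struct {
	name         string
	freeMilliCPU int64
}

// podSpec is a hypothetical, simplified pod: its CPU request and the
// NominatedNodeName set by an earlier preemption.
type podSpec struct {
	name              string
	requestMilliCPU   int64
	nominatedNodeName string
}

// runFilters stands in for the scheduler's filter plugins. The error return is
// reserved for infrastructure failures; "does not fit" is (false, nil).
func runFilters(p podSpec, n nodeInfo) (bool, error) {
	return n.freeMilliCPU >= p.requestMilliCPU, nil
}

// findNodeNominatedFirst sketches the KEP's idea: when the feature gate is on
// and the pod carries a nominated node, evaluate that node first. An error is
// wrapped with the NominatedNode context; a plain "does not fit" (e.g. victims
// not fully terminated yet) keeps the nomination and falls through to the
// remaining nodes instead of giving up.
func findNodeNominatedFirst(featureEnabled bool, p podSpec, nodes []nodeInfo) (string, error) {
	nominatedChecked := false
	if featureEnabled && p.nominatedNodeName != "" {
		for _, n := range nodes {
			if n.name != p.nominatedNodeName {
				continue
			}
			nominatedChecked = true
			fits, err := runFilters(p, n)
			if err != nil {
				return "", fmt.Errorf("evaluating feasibility of NominatedNode %q: %w", n.name, err)
			}
			if fits {
				return n.name, nil
			}
			break // not feasible yet; keep the nomination and try the other nodes
		}
	}
	for _, n := range nodes {
		if nominatedChecked && n.name == p.nominatedNodeName {
			continue // already evaluated above
		}
		if fits, err := runFilters(p, n); err == nil && fits {
			return n.name, nil
		}
	}
	return "", fmt.Errorf("no feasible node for pod %q; retry while keeping NominatedNodeName=%q",
		p.name, p.nominatedNodeName)
}

func main() {
	nodes := []nodeInfo{
		{name: "node-a", freeMilliCPU: 100},  // nominated, victims still terminating
		{name: "node-b", freeMilliCPU: 4000}, // another node already has room
	}
	pod := podSpec{name: "preemptor", requestMilliCPU: 2000, nominatedNodeName: "node-a"}

	node, err := findNodeNominatedFirst(true, pod, nodes)
	fmt.Println(node, err) // node-b <nil>
}
```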
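The retry behaviour described by the added bullet list (keep retrying, keep `NominatedNodeName`, stop once either the nominated node or some other node frees enough resources) can be pictured with the toy loop below. This is only a sketch under assumed names (`freeCPUByAttempt`, `feasibleNode`, `nominatedNode`); real retries go through the scheduler's queues, not a loop like this.

```go
package main

import "fmt"

// freeCPUByAttempt is a hypothetical view of free milli-CPU per node as seen
// by successive scheduling attempts: the victims on the nominated node are
// still terminating during the first two attempts and are gone by the third.
var freeCPUByAttempt = []map[string]int64{
	{"node-a": 100, "node-b": 500},  // attempt 0: nothing feasible yet
	{"node-a": 100, "node-b": 500},  // attempt 1: still nothing feasible
	{"node-a": 4000, "node-b": 500}, // attempt 2: nominated node finally released
}

const (
	requestCPU    = 2000
	nominatedNode = "node-a"
)

// feasibleNode checks the nominated node first and then the rest, mirroring
// the evaluation order sketched earlier.
func feasibleNode(free map[string]int64) (string, bool) {
	if free[nominatedNode] >= requestCPU {
		return nominatedNode, true
	}
	for name, cpu := range free {
		if name != nominatedNode && cpu >= requestCPU {
			return name, true
		}
	}
	return "", false
}

func main() {
	// The pod keeps its NominatedNodeName across failed attempts and is simply
	// retried; it lands once the nominated node (or any other node) frees up.
	for attempt, free := range freeCPUByAttempt {
		if node, ok := feasibleNode(free); ok {
			fmt.Printf("attempt %d: scheduled on %s\n", attempt, node)
			return
		}
		fmt.Printf("attempt %d: no feasible node yet, keeping NominatedNodeName=%s\n", attempt, nominatedNode)
	}
}
```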