 - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
 - [Design Details](#design-details)
   - [Implementation Details](#implementation-details)
-  - [Alternatives](#alternatives)
   - [Test Plan](#test-plan)
   - [Graduation Criteria](#graduation-criteria)
     - [Alpha (v1.21):](#alpha-v121)
@@ -47,7 +46,7 @@ Items marked with (R) are required *prior to targeting to a milestone / release*

 If the scheduler fails to fit an incoming pod on any node, the scheduler will try to preempt lower
 priority pods running on a selected node to make room for the pod. The name of this node will be set
-in the pods' `pod.Status.NominatedNodeName`.
+in the pod's `pod.Status.NominatedNodeName`.

 The Node is called *Nominated* to indicate the intent for the Pod to be scheduled on it once preemption
 of other Pods finish. However, the `Pod.status.nominatedNodeName` information is not directly used in
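For readers unfamiliar with the field: `NominatedNodeName` is a plain string on `PodStatus` in `k8s.io/api/core/v1`. The sketch below is purely illustrative and only shows how the nomination can be read from a typed Pod object; the helper `nominatedNodeOf` is hypothetical and is not scheduler code.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// nominatedNodeOf is an illustrative helper (not part of the scheduler):
// it reports the node a preemptor pod is waiting on, if any. An empty
// pod.Status.NominatedNodeName means the pod carries no nomination.
func nominatedNodeOf(pod *v1.Pod) (string, bool) {
	name := pod.Status.NominatedNodeName
	return name, name != ""
}

func main() {
	pod := &v1.Pod{}
	// The scheduler sets this after it decides to preempt victims on a node.
	pod.Status.NominatedNodeName = "node-a"

	if node, ok := nominatedNodeOf(pod); ok {
		fmt.Printf("pod is nominated to land on %s once preemption finishes\n", node)
	}
}
```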
@@ -97,70 +96,21 @@ nominated node.
 1. In filtering phase, which is currently implemented in the method of `findNodesThatFitPod`, check the nominated node
    first if the incoming pod has the `pod.Status.NominatedNodeName` defined and the feature gate is enabled.

-2. In case the nominated node doesn't suit for the incoming pod anymore, return `ErrNominateNode`
-   instead of `core.FitError`, because this will give scheduler a chance to clean up the nominated
-   node from the pod and find a new node to schedule instead of preemption again (might already have
-   another node available for scheduling during the period). A fresh new scheduling cycle will be
-   started later.
-
-   A new error `ErrNominateNode` should be defined to describe what's going wrong on the nominated node.
-
-   If the nominated node doesn't suit for the pod anymore, the scheduling failure will be recorded
-   and the `updatePod` will be called, here we change the logic to update the pod as long as the
-   parameter `nominatedNode` is different with what pod holds in `pod.Status.NominatedNodeName`.
-   In this case, parameter `nominatedNode` is an empty string so that the nominated node will be
-   cleaned from the pod and the pod will be moved to the active queue. It lets scheduler find another
-   place for the pod in the next scheduling cycle.
-
-### Alternatives
-
-- Should keep trying on the nominated node in case the failure of scheduling?
-
-  This is the case when the pod deletion is still on the fly, the deletion of preemptor pods has
-  been triggered and sent to apiserver but has not actually been deleted by `kubelet` or container runtime.
-
-  Here are several things we need to consider, and this is why this approach is not adopted,
-
-  1. Keep trying in this scheduling cycle until the deletion is done
-
-     This will block the scheduler and we never know when the deletion will be done, something might
-     block this for a long time, for example, docker service is down and cannot get recovered.
-
-  2. Reserve the `pod.Status.NominatedNodeName` for the preemptor pod, so that the nominated node will be
-     tried in the following scheduling cycle (not clean up the `nominatedNode` on failure)
-
-     This will not resolve the issue mentioned above either, this will generate an infinite looping on the
-     nominated node.
-
-  3. There are other cases should be considered beside the pod deletion, which cause the nominated node
-     not able to fit for the preemptor anymore, for example, nominated node becomes unschedulable, another
-     node in the cluster releases enough room for the coming pod, topology update due to pod deletion on another
-     node which makes the nominated node not fits for `PodTopologySpread` filter anymore.
-
-  All those cases require us to start a fresh new scheduling cycle and find a better one instead of the
-  selected nominated node in previous cycle.
-
-
-- Should go on the preemption evaluation and try the nominated node there? update the nominated node if necessary.
-
-  In order to continue the preemption on the failure on the nominated node, scheduler should return `core.FitError`
-  so that preemption will continue.
-
-  1. [Debatable]: For the issue #3 mentioned above, assume it will continue to go on the preemption evaluation,
-     if there is another candidate node which doesn't preempt any victim pods, this node should be chosen as the
-     new nominated node, it is also true if this is done in the new scheduling cycle, the same node will be chosen
-     for both approaches, there is nearly no major difference, but the case like this looks more like should be handled
-     by the normal scheduling process instead of pod preemption phase, this is sound like anti-pattern of what is the
-     preemption designed.
-
-  2. If the nominated node doesn't fit due to the victim pods deletion is still on the fly, and nothing else is
-     changed, the nominated node will be chosen again either it goes to preemption evaluation or after the normal
-     scheduling cycle following by a preemption evaluation.
-     We got more chance to finish the deletion for the latter case, and the nominated node will be chosen
-     in the normal scheduling cycle or as the selected node in the following preemption evaluation phase.
-     pod deletion might be triggered again on that victim pod/pods if finally go to the preemption, there is no harm
-     to do that.
-     But we need to note the shorter time might be needed for the former case.
+2. If the nominated node no longer fits the incoming pod, return the `err` obtained from `findNodesThatPassFilters`,
+   wrapped with additional context stating that the scheduler was evaluating the feasibility of the `NominatedNode`
+   and failed on that node.
+
+   If no error is returned but the node still does not pass all the filters, the most likely cause is that resources
+   claimed by the preempted victims have not been fully released yet. The scheduler will then continue to evaluate the
+   remaining nodes to check whether any other node is already available for the incoming pod.
+
+   If the scheduler still cannot find any node for the pod, scheduling will be retried until one of the following happens:
+   - The `NominatedNode` eventually releases all the required resources and the preemptor pod is scheduled on that node.
+   - Another node in the cluster releases enough resources and the pod is scheduled on that node instead.
+     [Discuss] Should the scheduler clear the `NominatedNode` in this case?
+   - Resources cannot be released on the `NominatedNode` and no other candidate node can be found in the cluster; this
+     case is covered by [issue 95752](https://github.com/kubernetes/kubernetes/issues/95752).

 ### Test Plan

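To make the step-2 flow in the hunk above easier to follow, here is a minimal, self-contained sketch of the intended decision order: try the `NominatedNode` first, surface a wrapped error on a hard failure, and otherwise fall back to the remaining nodes while the victims finish terminating. All types and helpers here (`Node`, `Pod`, `fits`, `scheduleOne`) are toy stand-ins, not the real kube-scheduler APIs or signatures.

```go
package main

import (
	"errors"
	"fmt"
)

// Toy stand-ins for scheduler concepts; none of these are real kube-scheduler types.
type Node struct {
	Name         string
	FreeCPUMilli int64
}

type Pod struct {
	Name              string
	NominatedNodeName string // mirrors pod.Status.NominatedNodeName
	CPUMilli          int64
}

var errNodeGone = errors.New("node no longer exists")

// fits reports whether the pod passes the (toy) filter on the node.
func fits(pod *Pod, node *Node) bool {
	return node.FreeCPUMilli >= pod.CPUMilli
}

// scheduleOne mirrors the step-2 decision order: try the nominated node first;
// if it cannot host the pod yet (for example because victims are still
// terminating), keep the nomination and fall back to the remaining nodes
// instead of preempting again.
func scheduleOne(pod *Pod, nodes map[string]*Node) (string, error) {
	if pod.NominatedNodeName != "" {
		nominated, ok := nodes[pod.NominatedNodeName]
		if !ok {
			// Hard failure: wrap the error so callers can see it happened
			// while evaluating the NominatedNode.
			return "", fmt.Errorf("evaluating NominatedNode %q for pod %q: %w",
				pod.NominatedNodeName, pod.Name, errNodeGone)
		}
		if fits(pod, nominated) {
			return nominated.Name, nil
		}
		// Not feasible yet but no hard error: resources on the nominated
		// node may simply not be released. Look at the other nodes.
	}
	for _, n := range nodes {
		if n.Name != pod.NominatedNodeName && fits(pod, n) {
			return n.Name, nil
		}
	}
	return "", fmt.Errorf("pod %q is unschedulable in this cycle; it will be retried", pod.Name)
}

func main() {
	nodes := map[string]*Node{
		"node-a": {Name: "node-a", FreeCPUMilli: 200},  // victims still terminating
		"node-b": {Name: "node-b", FreeCPUMilli: 1500}, // already has room
	}
	pod := &Pod{Name: "preemptor", NominatedNodeName: "node-a", CPUMilli: 1000}

	placed, err := scheduleOne(pod, nodes)
	fmt.Println(placed, err) // node-b <nil> in this toy setup
}
```

In the real scheduler the fallback and retry are driven by the scheduling queue rather than a single function call; the sketch only mirrors the order of decisions described above.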