keps/sig-scheduling/5278-nominated-node-name-for-expectation/README.md
Lines changed: 22 additions & 43 deletions
@@ -89,13 +89,10 @@ tags, and then generate with `hack/update-toc.sh`.
89
89
- [Story 1: Prevent inappropriate scale downs by Cluster Autoscaler](#story-1-prevent-inappropriate-scale-downs-by-cluster-autoscaler)
90
90
- [Story 2: Scheduler can resume its work after restart](#story-2-scheduler-can-resume-its-work-after-restart)
91
91
- [Risks and Mitigations](#risks-and-mitigations)
92
-
- [NominatedNodeName can be set by other components now.](#nominatednodename-can-be-set-by-other-components-now)
93
92
- [Confusing semantics of <code>NominatedNodeName</code>](#confusing-semantics-of-nominatednodename)
94
93
- [Increasing the load to kube-apiserver](#increasing-the-load-to-kube-apiserver)
95
-
- [Confusion if <code>NominatedNodeName</code> is different from <code>NodeName</code> after all](#confusion-if-nominatednodename-is-different-from-nodename-after-all)
- [External components put <code>NominatedNodeName</code>](#external-components-put-nominatednodename)
99
96
- [The scheduler's cache for <code>NominatedNodeName</code>](#the-schedulers-cache-for-nominatednodename)
100
97
- [The scheduler clears <code>NominatedNodeName</code> after scheduling failure](#the-scheduler-clears-nominatednodename-after-scheduling-failure)
101
98
- [Kube-apiserver clears <code>NominatedNodeName</code> when receiving binding requests](#kube-apiserver-clears-nominatednodename-when-receiving-binding-requests)
@@ -126,6 +123,7 @@ tags, and then generate with `hack/update-toc.sh`.
126
123
- [Non-Goals](#non-goals-1)
127
124
- [User stories](#user-stories)
128
125
- [Risks and Mitigations](#risks-and-mitigations-1)
126
+
- [Confusion if <code>NominatedNodeName</code> is different from <code>NodeName</code> after all](#confusion-if-nominatednodename-is-different-from-nodename-after-all)
@@ -264,14 +262,6 @@ We need a mechanism to be able to resume the already started work in majority of
264
262
265
263
### Risks and Mitigations
266
264
267
-
#### NominatedNodeName can be set by other components now.
268
-
269
-
There aren't any guardrails preventing other components from setting `NominatedNodeName` now.
270
-
In such cases, the semantic is not well defined now and the outcome of it may not match user
271
-
expectations.
272
-
273
-
This KEP is a step towards clarifying this semantic and scheduler's behavior instead of maintaining status-quo.
274
-
275
265
#### Confusing semantics of `NominatedNodeName`
276
266
277
267
Up until now, `NominatedNodeName` was expressing the decision made by scheduler to put a given
@@ -296,18 +286,6 @@ If we look from consumption point of view - these are effectively the same. We w
296
286
to expose the information, that as of now a given node is considered as a potential placement
297
287
for a given pod. It may change, but for now that's what is considered.
298
288
299
-
On top of the simple state machine above we introduce the following rules:
300
-
- Scheduler is allowed to overwrite `NominatedNodeName` at any time in case of preemption or
301
-
the beginning of the binding cycle.
302
-
- No external components are expected to overwrite `NominatedNodeName` set by the scheduler (although technically there are no guardrails).
303
-
304
-
Moreover:
305
-
- Regardless of who set `NominatedNodeName`, its readers should always take that into
306
-
consideration (e.g. ClusterAutoscaler or Karpenter when trying to scale down nodes).
307
-
- In case of faulty components (e.g. overallocation of nodes), these decisions will
308
-
simply be rejected by the scheduler (and the `NominatedNodeName` will be cleared before
309
-
moving the rejected pod to unschedulable).
310
-
311
289
#### Increasing the load to kube-apiserver
312
290
313
291
Setting a NominatedNodeName is an additional API call that then multiple components in the system
@@ -323,19 +301,6 @@ For cases with delayed binding, we make an argument that the additional calls ar
323
301
there are other calls related to those operations (e.g. PV creation, PVC binding, etc.) - so the
324
302
overhead of setting `NNN` is a smaller percentage of the whole e2e pod startup flow.
325
303
326
-
#### Confusion if `NominatedNodeName` is different from `NodeName` after all
327
-
328
-
If an external component adds `NominatedNodeName`, but the scheduler picks up a different node,
329
-
`NominatedNodeName` is just overwritten by a final decision of the scheduler.
330
-
331
-
But, if an external component updates `NominatedNodeName` that is set by the scheduler,
332
-
the pod could end up having different `NominatedNodeName` and `NodeName`.
333
-
334
-
We will update the logic so that the `NominatedNodeName` field is cleared during the `binding` call.
335
-
336
-
We believe that ensuring that `NominatedNodeName` can't be set after the pod is already bound
337
-
is a niche enough feature that it doesn't justify an attempt to strengthen the validation.
338
-
339
304
## Design Details
340
305
341
306
<!--
@@ -377,11 +342,6 @@ We determine if each plugin is relevant to the pod by Skip status from PreFilter
377
342
In this way, even if users have some PreBind custom plugins, they can implement `PreBindPreFlight()` appropriately
378
343
so that the scheduler can wisely skip setting `NominatedNodeName`, taking their custom logic into consideration.
379
344
380
-
### External components put `NominatedNodeName`
381
-
382
-
There aren't any restrictions preventing other components from setting NominatedNodeName as of now.
383
-
However, we don't have any validation of how that currently works.
384
-
385
345
### The scheduler's cache for `NominatedNodeName`
386
346
387
347
Here, we'll ensure that the cache also works for non-existent nodes, and that it won't leak memory if those nodes never appear.
@@ -405,8 +365,7 @@ found the nominated node unschedulable for the pod. This logic remains unchanged
405
365
406
366
### Kube-apiserver clears `NominatedNodeName` when receiving binding requests
407
367
408
-
As discussed at [Confusion if `NominatedNodeName` is different from `NodeName` after all](#confusion-if-nominatednodename-is-different-from-nodename-after-all),
409
-
we update kube-apiserver so that it clears `NominatedNodeName` when receiving binding requests.
368
+
We update kube-apiserver so that it clears `NominatedNodeName` when receiving binding requests.
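The clearing rule can be illustrated with a minimal sketch. This is not the kube-apiserver implementation: the `Pod` struct and the `applyBinding` function below are hypothetical stand-ins, introduced here only to show the intended effect of a binding request on the two fields.

```go
package main

import (
	"errors"
	"fmt"
)

// Pod is a hypothetical, trimmed-down stand-in for the real Pod object.
type Pod struct {
	Name              string
	NodeName          string // stand-in for spec.nodeName
	NominatedNodeName string // stand-in for status.nominatedNodeName
}

// applyBinding is a hypothetical helper that mimics the described behavior:
// a successful binding sets NodeName and clears NominatedNodeName.
func applyBinding(pod *Pod, targetNode string) error {
	if targetNode == "" {
		return errors.New("binding target node must not be empty")
	}
	if pod.NodeName != "" {
		return fmt.Errorf("pod %s is already assigned to node %s", pod.Name, pod.NodeName)
	}
	pod.NodeName = targetNode
	// Clear the nomination so it can never disagree with the final placement.
	pod.NominatedNodeName = ""
	return nil
}

func main() {
	// The nomination points at node-a, but the final decision is node-b.
	pod := &Pod{Name: "web-0", NominatedNodeName: "node-a"}
	if err := applyBinding(pod, "node-b"); err != nil {
		fmt.Println("binding failed:", err)
		return
	}
	fmt.Printf("nodeName=%q nominatedNodeName=%q\n", pod.NodeName, pod.NominatedNodeName)
}
```

In this sketch a bound pod never carries a stale nomination, regardless of which component wrote `NominatedNodeName` before the binding.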
410
369
411
370
### Test Plan
412
371
@@ -648,6 +607,13 @@ Unknown.
648
607
649
608
###### What steps should be taken if SLOs are not being met to determine the problem?
650
609
610
+
Since SLOs can be impacted by multiple components and mechanisms in Kubernetes, there is no straightforward algorithm to determine the problem. The general approach to investigating issues is described below.
611
+
612
+
If kube-scheduler SLOs are not being met, we should first check whether other Kubernetes components (e.g. kube-apiserver) are also experiencing a slowdown or increased error rates. If that is the case, we should find out whether there is a global issue with an already-determined cause.
613
+
A longer turnaround in kube-apiserver handling API requests may result in rising values of `scheduling_algorithm_duration_seconds` and lower values of `schedule_attempts_total`.
614
+
615
+
If we suspect that there is an ongoing problem inside kube-scheduler and that it is triggered by handling nominated node names, we should check kube-scheduler logs for failed scheduling of pods that had been waiting for preemption of victims, or for failed binding of pods that have `NominatedNodeName` set, and investigate further.
616
+
651
617
## Implementation History
652
618
653
619
- 7th May 2025: The initial KEP is submitted.
@@ -837,6 +803,19 @@ for big clusters where the performance is critical) because it's just one iterat
837
803
(e.g., if you have 1000 nodes and 16 parallelism (default value), the scheduler needs around 62 iterations of
838
804
Filter plugins, approximately. So, adding one iteration on top of that doesn't matter).
839
805
806
+
#### Confusion if `NominatedNodeName` is different from `NodeName` after all
807
+
808
+
If an external component adds `NominatedNodeName`, but the scheduler picks up a different node,
809
+
`NominatedNodeName` is just overwritten by a final decision of the scheduler.
810
+
811
+
But, if an external component updates `NominatedNodeName` that is set by the scheduler,
812
+
the pod could end up having different `NominatedNodeName` and `NodeName`.
813
+
814
+
We will update the logic so that the `NominatedNodeName` field is cleared during the `binding` call.
815
+
816
+
We believe that ensuring that `NominatedNodeName` can't be set after the pod is already bound
817
+
is a niche enough feature that it doesn't justify an attempt to strengthen the validation.
818
+
840
819
##### Design Details
841
820
842
821
If we take into account external components setting `NominatedNodeName`, the design needs to be extended as follows: