Commit e2533f7

review comments applied
1 parent 7d504eb commit e2533f7

2 files changed

+23
-43
lines changed

keps/sig-scheduling/5278-nominated-node-name-for-expectation/README.md

Lines changed: 22 additions & 43 deletions
@@ -89,13 +89,10 @@ tags, and then generate with `hack/update-toc.sh`.
8989
- [Story 1: Prevent inappropriate scale downs by Cluster Autoscaler](#story-1-prevent-inappropriate-scale-downs-by-cluster-autoscaler)
9090
- [Story 2: Scheduler can resume its work after restart](#story-2-scheduler-can-resume-its-work-after-restart)
9191
- [Risks and Mitigations](#risks-and-mitigations)
92-
- [NominatedNodeName can be set by other components now.](#nominatednodename-can-be-set-by-other-components-now)
9392
- [Confusing semantics of <code>NominatedNodeName</code>](#confusing-semantics-of-nominatednodename)
9493
- [Increasing the load to kube-apiserver](#increasing-the-load-to-kube-apiserver)
95-
- [Confusion if <code>NominatedNodeName</code> is different from <code>NodeName</code> after all](#confusion-if-nominatednodename-is-different-from-nodename-after-all)
9694
- [Design Details](#design-details)
9795
- [The scheduler puts <code>NominatedNodeName</code>](#the-scheduler-puts-nominatednodename)
98-
- [External components put <code>NominatedNodeName</code>](#external-components-put-nominatednodename)
9996
- [The scheduler's cache for <code>NominatedNodeName</code>](#the-schedulers-cache-for-nominatednodename)
10097
- [The scheduler clears <code>NominatedNodeName</code> after scheduling failure](#the-scheduler-clears-nominatednodename-after-scheduling-failure)
10198
- [Kube-apiserver clears <code>NominatedNodeName</code> when receiving binding requests](#kube-apiserver-clears-nominatednodename-when-receiving-binding-requests)
@@ -126,6 +123,7 @@ tags, and then generate with `hack/update-toc.sh`.
126123
- [Non-Goals](#non-goals-1)
127124
- [User stories](#user-stories)
128125
- [Risks and Mitigations](#risks-and-mitigations-1)
126+
- [Confusion if <code>NominatedNodeName</code> is different from <code>NodeName</code> after all](#confusion-if-nominatednodename-is-different-from-nodename-after-all)
129127
- [Design Details](#design-details-1)
130128
- [Test plan: Integration tests](#test-plan-integration-tests)
131129
- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
@@ -264,14 +262,6 @@ We need a mechanism to be able to resume the already started work in majority of
264262

265263
### Risks and Mitigations
266264

267-
#### NominatedNodeName can be set by other components now.
268-
269-
There aren't any guardrails preventing other components from setting `NominatedNodeName` now.
270-
In such cases, the semantic is not well defined now and the outcome of it may not match user
271-
expectations.
272-
273-
This KEP is a step towards clarifying this semantic and scheduler's behavior instead of maintaining status-quo.
274-
275265
#### Confusing semantics of `NominatedNodeName`
276266

277267
Up until now, `NominatedNodeName` was expressing the decision made by scheduler to put a given
@@ -296,18 +286,6 @@ If we look from consumption point of view - these are effectively the same. We w
296286
to expose the information, that as of now a given node is considered as a potential placement
297287
for a given pod. It may change, but for now that's what is considered.
298288

299-
On top of the simple state machine above we introduce the following rules:
300-
- Scheduler is allowed to overwrite `NominatedNodeName` at any time in case of preemption or
301-
the beginning of the binding cycle.
302-
- No external components are expected to overwrite `NominatedNodeName` set by the scheduler (although technically there are no guardrails).
303-
304-
Moreover:
305-
- Regardless of who set `NominatedNodeName`, its readers should always take that into
306-
consideration (e.g. ClusterAutoscaler or Karpenter when trying to scale down nodes).
307-
- In case of faulty components (e.g. overallocation of nodes), these decisions will
308-
simply be rejected by the scheduler (and the `NominatedNodeName` will be cleared before
309-
moving the rejected pod to unschedulable).
310-
311289
#### Increasing the load to kube-apiserver
312290

313291
Setting a NominatedNodeName is an additional API call that then multiple components in the system
@@ -323,19 +301,6 @@ For cases with delayed binding, we make an argument that the additional calls ar
323301
there are other calls related to those operations (e.g. PV creation, PVC binding, etc.) - so the
324302
overhead of setting `NNN` is a smaller percentage of the whole e2e pod startup flow.
325303

326-
#### Confusion if `NominatedNodeName` is different from `NodeName` after all
327-
328-
If an external component adds `NominatedNodeName`, but the scheduler picks up a different node,
329-
`NominatedNodeName` is just overwritten by a final decision of the scheduler.
330-
331-
But, if an external component updates `NominatedNodeName` that is set by the scheduler,
332-
the pod could end up having different `NominatedNodeName` and `NodeName`.
333-
334-
We will update the logic so that `NominatedNodeName` field is cleared during `binding` call
335-
336-
We believe that ensuring that `NominatedNodeName` can't be set after the pod is already bound
337-
is niche enough feature that doesn't justify an attempt to strengthening the validation.
338-
339304
## Design Details
340305

341306
<!--
@@ -377,11 +342,6 @@ We determine if each plugin is relevant to the pod by Skip status from PreFilter
377342
In this way, even if users have some PreBind custom plugins, they can implement `PreBindPreFlight()` appropriately
378343
so that the scheduler can wisely skip setting `NominatedNodeName`, taking their custom logic into consideration.
379344

380-
### External components put `NominatedNodeName`
381-
382-
There aren't any restrictions preventing other components from setting NominatedNodeName as of now.
383-
However, we don't have any validation of how that currently works.
384-
385345
### The scheduler's cache for `NominatedNodeName`
386346

387347
Here, we'll ensure that the cache works for non-existent nodes too, and that it won't leak memory if those nodes never appear.
@@ -405,8 +365,7 @@ found the nominated node unschedulable for the pod. This logic remains unchanged
405365

406366
### Kube-apiserver clears `NominatedNodeName` when receiving binding requests
407367

408-
As discussed at [Confusion if `NominatedNodeName` is different from `NodeName` after all](#confusion-if-nominatednodename-is-different-from-nodename-after-all),
409-
we update kube-apiserver so that it clears `NominatedNodeName` when receiving binding requests.
368+
We update kube-apiserver so that it clears `NominatedNodeName` when receiving binding requests.
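
A minimal sketch of the intended behavior, written in Python for brevity rather than taken from kube-apiserver. The `Pod` dataclass and the `bind` function are hypothetical names, and the sketch assumes that handling a binding request sets the node name and drops any nomination in the same step:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pod:
    # Hypothetical, simplified stand-in for the relevant Pod fields.
    name: str
    node_name: Optional[str] = None            # models spec.nodeName
    nominated_node_name: Optional[str] = None  # models status.nominatedNodeName


def bind(pod: Pod, target_node: str) -> Pod:
    """Model a binding request: set the node and clear the nomination."""
    if pod.node_name is not None:
        raise ValueError(f"pod {pod.name} is already bound to {pod.node_name}")
    pod.node_name = target_node
    # Clearing here means a bound pod never carries a stale nomination,
    # even if the nomination pointed at a different node.
    pod.nominated_node_name = None
    return pod


pod = bind(Pod(name="web-0", nominated_node_name="node-a"), "node-b")
print(pod.node_name, pod.nominated_node_name)
```

Clearing the field as part of binding keeps readers of `NominatedNodeName` from seeing a value that disagrees with `NodeName` on a bound pod.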
410369

411370
### Test Plan
412371

@@ -648,6 +607,13 @@ Unknown.
648607

649608
###### What steps should be taken if SLOs are not being met to determine the problem?
650609

610+
Since SLOs can be impacted by multiple components and mechanisms in Kubernetes, there is no straightforward algorithm to determine the problem. The general approach to investigating issues is described below.
611+
612+
If kube-scheduler SLOs are not being met, we should first check whether other components of Kubernetes (e.g. kube-apiserver) are experiencing a slowdown or increased error rates as well. If that is the case, we should find out whether there is a global issue with an already-determined cause.
613+
A longer turnaround in kube-apiserver handling API requests may result in rising values of `scheduling_algorithm_duration_seconds` and lower values of `schedule_attempts_total`.
614+
615+
If we suspect that there is an ongoing problem inside kube-scheduler and that it is triggered by handling nominated node names, we should check kube-scheduler logs for failed scheduling of pods that had been waiting for preemption of victims, or for failed binding of pods that have `NominatedNodeName` set, and investigate further.
616+
651617
## Implementation History
652618

653619
- 7th May 2025: The initial KEP is submitted.
@@ -837,6 +803,19 @@ for big clusters where the performance is critical) because it's just one iterat
837803
(e.g., if you have 1000 nodes and 16 parallelism (default value), the scheduler needs around 62 iterations of
838804
Filter plugins, approximately. So, adding one iteration on top of that doesn't matter).
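
The estimate above can be reproduced with a short calculation. This is a rough sketch that assumes each iteration evaluates `parallelism` nodes at once, which simplifies how the scheduler actually distributes work; `filter_iterations` is a hypothetical helper name:

```python
import math


def filter_iterations(num_nodes: int, parallelism: int) -> int:
    """Rough number of Filter rounds needed to cover every node."""
    return math.ceil(num_nodes / parallelism)


baseline = filter_iterations(1000, 16)
# Relative cost of adding one more iteration on top of the baseline.
overhead = 1 / baseline
print(baseline, f"{overhead:.1%}")
```

With 1000 nodes and a parallelism of 16, one extra iteration adds well under 2% to the Filter work.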
839805

806+
#### Confusion if `NominatedNodeName` is different from `NodeName` after all
807+
808+
If an external component adds `NominatedNodeName`, but the scheduler picks up a different node,
809+
`NominatedNodeName` is just overwritten by a final decision of the scheduler.
810+
811+
But, if an external component updates `NominatedNodeName` that is set by the scheduler,
812+
the pod could end up having different `NominatedNodeName` and `NodeName`.
813+
814+
We will update the logic so that the `NominatedNodeName` field is cleared during the `binding` call.
815+
816+
We believe that ensuring that `NominatedNodeName` can't be set after the pod is already bound
817+
is a niche enough feature that it doesn't justify an attempt to strengthen the validation.
818+
840819
##### Design Details
841820

842821
If we take into account external components setting `NominatedNodeName`, the design needs to be extended as follows:

keps/sig-scheduling/5278-nominated-node-name-for-expectation/kep.yaml

Lines changed: 1 addition & 0 deletions
@@ -3,6 +3,7 @@ kep-number: 5278
33
authors:
44
- "@sanposhiho"
55
- "@wojtek-t"
6+
- "@ania-borowiec"
67
owning-sig: sig-scheduling
78
participating-sigs:
89
- sig-autoscaling
