Commit bd7922d
committed
apply dom4ha's comments
1 parent e2533f7 commit bd7922d

File tree

1 file changed: +19 −1 lines

  • keps/sig-scheduling/5278-nominated-node-name-for-expectation

keps/sig-scheduling/5278-nominated-node-name-for-expectation/README.md

Lines changed: 19 additions & 1 deletion
```diff
@@ -90,12 +90,14 @@ tags, and then generate with `hack/update-toc.sh`.
 - [Story 2: Scheduler can resume its work after restart](#story-2-scheduler-can-resume-its-work-after-restart)
 - [Risks and Mitigations](#risks-and-mitigations)
 - [Confusing semantics of <code>NominatedNodeName</code>](#confusing-semantics-of-nominatednodename)
+- [Node nominations need to be considered together with reserving DRA resources](#node-nominations-need-to-be-considered-together-with-reserving-dra-resources)
 - [Increasing the load to kube-apiserver](#increasing-the-load-to-kube-apiserver)
 - [Design Details](#design-details)
 - [The scheduler puts <code>NominatedNodeName</code>](#the-scheduler-puts-nominatednodename)
 - [The scheduler's cache for <code>NominatedNodeName</code>](#the-schedulers-cache-for-nominatednodename)
 - [The scheduler clears <code>NominatedNodeName</code> after scheduling failure](#the-scheduler-clears-nominatednodename-after-scheduling-failure)
 - [Kube-apiserver clears <code>NominatedNodeName</code> when receiving binding requests](#kube-apiserver-clears-nominatednodename-when-receiving-binding-requests)
+- [Handling ResourceClaim status updates](#handling-resourceclaim-status-updates)
 - [Test Plan](#test-plan)
 - [Prerequisite testing updates](#prerequisite-testing-updates)
 - [Unit tests](#unit-tests)
```
```diff
@@ -211,6 +213,8 @@ misunderstands the node is low-utilized (because the scheduler keeps the place o
 We can expose those internal reservations with `NominatedNodeName` so that external components can take a more appropriate action
 based on the expected pod placement.
 
+Please note that `NominatedNodeName` can express the reservation of node resources only; some resources are managed by a DRA plugin and expressed through a ResourceClaim allocation. To correctly account for all the resources needed by a pod, both the nomination and the ResourceClaim status update need to be reflected in the kube-apiserver.
+
 ### Retain the scheduling decision
 
 At the binding cycle (e.g., PreBind), some plugins could handle something (e.g., volumes, devices) based on the pod's scheduling result.
```
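To make the accounting idea above concrete, here is a minimal, hypothetical sketch (plain Python dicts stand in for the Pod and ResourceClaim API objects; the field names mirror `status.nominatedNodeName` and `status.allocation`, but the helper itself is invented for illustration): an external component counts a pending pod's node resources once the nomination is visible, and its DRA devices only once the ResourceClaim allocation is also published.

```python
# Hypothetical sketch: an external component (e.g. a cluster autoscaler)
# accounts a pending pod against a node only when the nomination is
# persisted, and counts its DRA devices only when the ResourceClaim
# allocation is also published. Plain dicts stand in for API objects.

def pending_reservations(pod, claims):
    """Return (node_name, cpu_millis, dra_devices) reserved by a pending pod,
    or None if no nomination has been persisted yet."""
    node = pod.get("status", {}).get("nominatedNodeName")
    if not node:
        return None  # nothing persisted yet; nothing to account

    cpu = sum(c["requests"]["cpu_millis"] for c in pod["spec"]["containers"])

    devices = []
    for ref in pod["spec"].get("resourceClaims", []):
        claim = claims[ref]
        allocation = claim.get("status", {}).get("allocation")
        if allocation:  # count DRA resources only once the allocation is published
            devices.extend(allocation["devices"])
    return node, cpu, devices


pod = {
    "spec": {
        "containers": [{"requests": {"cpu_millis": 500}}],
        "resourceClaims": ["claim-a"],
    },
    "status": {"nominatedNodeName": "node-1"},
}
claims = {"claim-a": {"status": {"allocation": {"devices": ["gpu-0"]}}}}

print(pending_reservations(pod, claims))  # ('node-1', 500, ['gpu-0'])
```

If only the nomination were published, the same pod would be accounted with an empty device list, which is exactly the gap the KEP text warns about.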
```diff
@@ -284,7 +288,15 @@ probably isn't that important - the content of `NominatedNodeName` can be interp
 
 If we look from consumption point of view - these are effectively the same. We want
 to expose the information, that as of now a given node is considered as a potential placement
-for a given pod. It may change, but for now that's what considered.
+for a given pod. It may change, but for now that's what is being considered.
+
+#### Node nominations need to be considered together with reserving DRA resources
+
+The semantics of node nomination are in fact a resource reservation, either in scheduler memory or in external components (after the nomination has been persisted to the kube-apiserver). Since pods consume both node resources and DRA resources, it is important to persist both at the same (or almost the same) point in time.
+
+This is consistent with the current implementation: the ResourceClaim allocation is stored in the claim's status during the PreBind phase, so in conjunction with the node nomination it effectively reserves a complete set of resources (both node and DRA) and enables their correct accounting.
+
+Note that the node nomination is set before the WaitOnPermit phase, while the ResourceClaim status is only published in PreBind; pods waiting on WaitOnPermit therefore have only their nominations published, not their ResourceClaim statuses. This is not considered an issue as long as no in-tree plugins support WaitOnPermit and the Gang Scheduling feature is starting in alpha; fixing it will, however, block the promotion of Gang Scheduling to beta.
 
 #### Increasing the load to kube-apiserver
```
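The ordering described in the added paragraphs can be illustrated with a toy timeline. This is a simplified model, not the real scheduler framework: the phase names are reduced to four steps, and the point at which each piece of state becomes visible is hard-coded to match the text (nomination before WaitOnPermit, ResourceClaim allocation in PreBind).

```python
# Toy model of the phase ordering: the nomination is persisted before
# WaitOnPermit, while the ResourceClaim allocation is only published in
# PreBind, so a pod parked at WaitOnPermit exposes the nomination but
# not the DRA reservation. Phase names and state are illustrative only.

visible = {"nominatedNodeName": None, "claimAllocation": None}
timeline = []

def run_phase(phase):
    if phase == "Reserve":   # nomination persisted before WaitOnPermit
        visible["nominatedNodeName"] = "node-1"
    if phase == "PreBind":   # allocation published only here
        visible["claimAllocation"] = {"devices": ["gpu-0"]}
    timeline.append((phase, dict(visible)))  # snapshot what is visible

for phase in ["Reserve", "WaitOnPermit", "PreBind", "Bind"]:
    run_phase(phase)

for phase, state in timeline:
    print(phase, state)
```

The `WaitOnPermit` snapshot shows the window the KEP describes: the node nomination is visible while the DRA reservation is not.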

```diff
@@ -362,11 +374,17 @@ We'll ensure this scenario works correctly via tests.
 
 As of now the scheduler clears the `NominatedNodeName` field at the end of failed scheduling cycle, if it
 found the nominated node unschedulable for the pod. This logic remains unchanged.
+
+NOTE: A previous version of this KEP, which allowed external components to set `NominatedNodeName`, deliberately left the field unchanged after a scheduling failure. The KEP update for v1.35 reverts that logic, and the scheduler goes back to clearing the field after a scheduling failure.
 
 ### Kube-apiserver clears `NominatedNodeName` when receiving binding requests
 
 We update kube-apiserver so that it clears `NominatedNodeName` when receiving binding requests.
 
+### Handling ResourceClaim status updates
+
+Since the ResourceClaim status update is complementary to the node nomination (it reserves resources in a similar way), both should be set at the beginning of the PreBind phase, before the pod starts waiting for resources to become ready for binding. The order of actions in the device management plugin is correct; however, the scheduler runs the PreBind actions of the different plugins sequentially, so a long-lasting operation such as PVC provisioning may delay publishing the ResourceClaim allocation status. This is undesirable, as it opens a window in which DRA resources are not reserved, causing problems similar to the ones originally fixed by this KEP (kubernetes/kubernetes#125491).
+
 ### Test Plan
 
 [x] I/we understand the owners of the involved components may require updates to
```
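The sequential-PreBind concern can be sketched numerically. This is a hypothetical illustration: the plugin names are only suggestive of real scheduler plugins, the costs are invented, and the helper is not part of any Kubernetes API.

```python
# Hypothetical illustration of the sequential-PreBind issue: each plugin's
# PreBind step runs in turn, so a slow plugin ahead of the DRA plugin
# delays the point at which the ResourceClaim allocation becomes visible.
# Plugin names are suggestive only; costs (in seconds) are invented.

prebind_plugins = [
    ("VolumeBinding", 30),    # e.g. waiting for PVC provisioning
    ("DynamicResources", 1),  # publishes the ResourceClaim allocation
]

def allocation_publish_time(plugins):
    """Elapsed time until the DRA plugin finishes, given sequential execution."""
    elapsed = 0
    for name, cost in plugins:
        elapsed += cost
        if name == "DynamicResources":
            return elapsed
    return None

# Order matters under sequential execution: with the slow plugin first, the
# DRA reservation stays invisible for 31s; DRA-first shrinks that window to 1s.
print(allocation_publish_time(prebind_plugins))                  # 31
print(allocation_publish_time(list(reversed(prebind_plugins))))  # 1
```

The gap between those two numbers is the unreserved-resources window the section describes.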
