keps/sig-scheduling/5278-nominated-node-name-for-expectation/README.md
+19 -1 (19 additions, 1 deletion)
@@ -90,12 +90,14 @@ tags, and then generate with `hack/update-toc.sh`.
   - [Story 2: Scheduler can resume its work after restart](#story-2-scheduler-can-resume-its-work-after-restart)
 - [Risks and Mitigations](#risks-and-mitigations)
   - [Confusing semantics of <code>NominatedNodeName</code>](#confusing-semantics-of-nominatednodename)
+  - [Node nominations need to be considered together with reserving DRA resources](#node-nominations-need-to-be-considered-together-with-reserving-dra-resources)
   - [Increasing the load to kube-apiserver](#increasing-the-load-to-kube-apiserver)
@@ -211,6 +213,8 @@ misunderstands the node is low-utilized (because the scheduler keeps the place o
 We can expose those internal reservations with `NominatedNodeName` so that external components can take a more appropriate action
 based on the expected pod placement.
 
+Please note that `NominatedNodeName` can express the reservation of node resources only; some resources may be managed by a DRA plugin and expressed through a ResourceClaim allocation. To correctly account for all the resources needed by a pod, both the nomination and the ResourceClaim status update need to be reflected in the api-server.
+
 ### Retain the scheduling decision
 
 At the binding cycle (e.g., PreBind), some plugins could handle something (e.g., volumes, devices) based on the pod's scheduling result.
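
To make the two complementary signals concrete, here is a minimal Go sketch of how an external consumer (an autoscaler-like component) might check that a pod's expected placement is fully reflected in the api-server. It assumes a client-go release where the resource.k8s.io group has reached v1 (the `ResourceV1()` client); the helper name `isFullyReserved` and the overall accounting logic are illustrative, not part of this KEP.

```go
// Illustrative only: treat a pod as "fully reserved" when both its node
// nomination and the allocations of all its ResourceClaims are visible.
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// isFullyReserved (hypothetical helper) reports whether the pod's expected
// placement is fully persisted: status.nominatedNodeName is set, and every
// ResourceClaim referenced via status.resourceClaimStatuses has a recorded
// allocation.
func isFullyReserved(ctx context.Context, cs kubernetes.Interface, pod *corev1.Pod) (bool, error) {
	if pod.Status.NominatedNodeName == "" {
		return false, nil // node-side reservation not published yet
	}
	for _, rcs := range pod.Status.ResourceClaimStatuses {
		if rcs.ResourceClaimName == nil {
			continue // no claim was generated for this entry
		}
		claim, err := cs.ResourceV1().ResourceClaims(pod.Namespace).Get(ctx, *rcs.ResourceClaimName, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		if claim.Status.Allocation == nil {
			return false, nil // DRA-side reservation not published yet
		}
	}
	return true, nil
}
```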
@@ -284,7 +288,15 @@ probably isn't that important - the content of `NominatedNodeName` can be interp
 
 If we look from the consumption point of view - these are effectively the same. We want
 to expose the information that, as of now, a given node is considered as a potential placement
-for a given pod. It may change, but for now that's what considered.
+for a given pod. It may change, but for now that's what is being considered.
+
+#### Node nominations need to be considered together with reserving DRA resources
+
+The semantics of node nomination are in fact resource reservation, either in the scheduler's memory or in external components (after the nomination has been persisted to the api-server). Since pods consume both node resources and DRA resources, it is important to persist both at the same (or almost the same) point in time.
+
+This is consistent with the current implementation: the ResourceClaim allocation is stored in the claim's status during the PreBind phase, so in conjunction with the node nomination it effectively reserves a complete set of resources (both node and DRA) and enables their correct accounting.
+
+Note that the node nomination is set before the WaitOnPermit phase, while the ResourceClaim status is published in PreBind; pods waiting on WaitOnPermit would therefore have only their nominations published, not their ResourceClaim statuses. This is not considered an issue as long as no in-tree plugins support WaitOnPermit and the Gang Scheduling feature is starting in alpha. It does mean that fixing this issue will block the promotion of Gang Scheduling to beta.
 
 #### Increasing the load to kube-apiserver
 
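
As a sketch of where these two writes happen relative to each other: the scheduler persists the nomination before WaitOnPermit, while a DRA-style plugin publishes the claim allocation in PreBind. The snippet below uses the classic in-tree `PreBindPlugin` signature for illustration (newer releases extend this interface); the plugin and the `publishAllocation` helper are hypothetical, not the actual DynamicResources plugin.

```go
// Illustrative sketch of a PreBind plugin that persists the ResourceClaim
// allocation. By the time PreBind runs, the scheduler has already set
// status.nominatedNodeName (that happens before WaitOnPermit), which is
// exactly the gap described above.
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

type draLikePlugin struct{}

func (p *draLikePlugin) Name() string { return "DRALikeSketch" }

// PreBind publishes the DRA-side reservation to the api-server.
func (p *draLikePlugin) PreBind(ctx context.Context, state *framework.CycleState, pod *corev1.Pod, nodeName string) *framework.Status {
	if err := publishAllocation(ctx, pod, nodeName); err != nil {
		return framework.AsStatus(err)
	}
	return nil
}

// publishAllocation stands in for writing claim.Status.Allocation; the real
// logic lives in the in-tree DynamicResources plugin.
func publishAllocation(ctx context.Context, pod *corev1.Pod, nodeName string) error {
	return nil // elided
}
```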
@@ -362,11 +374,17 @@ We'll ensure this scenario works correctly via tests.
 
 As of now the scheduler clears the `NominatedNodeName` field at the end of a failed scheduling cycle if it
 found the nominated node unschedulable for the pod. This logic remains unchanged.
+
+NOTE: The previous version of this KEP, which allowed external components to set `NominatedNodeName`, deliberately left the field unchanged after a scheduling failure. With the KEP update for v1.35 this logic is reverted, and the scheduler goes back to clearing the field after a scheduling failure.
 
 ### Kube-apiserver clears `NominatedNodeName` when receiving binding requests
 
 We update kube-apiserver so that it clears `NominatedNodeName` when receiving binding requests.
 
+### Handling ResourceClaim status updates
+
+Since the ResourceClaim status update is complementary to node nomination (it reserves resources in a similar way), both should be set at the beginning of the PreBind phase (before the pod starts waiting for resources to be ready for binding). The order of actions in the device management plugin is correct; however, the scheduler runs the PreBind actions of different plugins sequentially, so a long-lasting operation such as PVC provisioning may delay publishing the ResourceClaim allocation status. This is undesirable, as it leaves a window in which DRA resources are not reserved, causing problems similar to the ones originally fixed by this KEP - kubernetes/kubernetes#125491.
+
 ### Test Plan
 
 [x] I/we understand the owners of the involved components may require updates to
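
The delay described above follows from how the framework executes this phase: PreBind plugins run one after another. A simplified sketch, mirroring (not copying) the framework's `RunPreBindPlugins` logic:

```go
// Simplified, for illustration: PreBind plugins run sequentially, so a
// long-lasting step (e.g. volume provisioning) postpones every later
// plugin, including the one that publishes the ResourceClaim allocation.
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

func runPreBindSequentially(ctx context.Context, plugins []framework.PreBindPlugin,
	state *framework.CycleState, pod *corev1.Pod, nodeName string) *framework.Status {
	for _, pl := range plugins {
		// A slow PreBind here blocks all remaining plugins for this pod.
		if status := pl.PreBind(ctx, state, pod, nodeName); !status.IsSuccess() {
			return status
		}
	}
	return nil
}
```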