This KEP proposes a new `AsyncPostFilter` extension point to enhance the scheduling throughput in the failure scenarios
188
-
by decoupling the API calls for the preemption from the scheduling cycle.
187
+
This KEP proposes decoupling the preemption API calls from the scheduling cycle to improve scheduling throughput in scheduling-failure scenarios.
189
188
190
189
## Motivation
191
190
@@ -214,9 +213,7 @@ This flow allows us to decouple the API call to assign Pod to the Node from the
214
213
But we have a similar problem with preemption: it runs at the PostFilter extension point, which is part of the scheduling cycle.
215
214
Preemption ultimately has to make some API calls to update Pods' conditions and delete Pods, which could reduce the scheduling throughput.
216
215
217
-
Similarly, DRA's PostFilter also makes some API calls to update ResourceClaim's status for the deallocation.
218
-
219
-
This KEP proposes introducing a new extension point to run something asynchronously so that we address this common problem with the existing PostFilter.
216
+
scheduler-perf [shows](https://github.com/kubernetes/kubernetes/blob/342da505bdefbd849b808cca3cb76c24a993025f/test/integration/scheduler_perf/config/performance-config.yaml#L641) that the preemption scenario currently takes much longer than the other scenarios.
220
217
221
218
### Goals
222
219
@@ -225,10 +222,8 @@ List the specific goals of the KEP. What is it trying to achieve? How will we
225
222
know that this has succeeded?
226
223
-->
227
224
228
-
- Introduce a new `AsyncPostFilter` extension point in the scheduling framework.
229
-
-`AsyncPostFilter` is literally run asynchronously after `PostFilter` extension point.
230
-
- Until `AsyncPostFilter` is done, the Pod won't be rescheduled.
231
-
- Move API calls of `DefaultPreemption` plugin from `PostFilter` to `AsyncPostFilter`.
225
+
- The preemption plugin makes the preemption API calls asynchronously after the `PostFilter` extension point, so that the scheduler can move on to scheduling other Pods while those API calls are in flight.
226
+
- Until the preemption goroutine is done, the Pod won't be rescheduled.
232
227
233
228
### Non-Goals
234
229
@@ -237,7 +232,7 @@ What is out of scope for this KEP? Listing non-goals helps to focus discussion
237
232
and make progress.
238
233
-->
239
234
240
-
-Moving DRA's logic from `PostFilter` to `AsyncPostFilter` is not a goal of this KEP because it's an under-construction feature yet.
235
+
- Making the same enhancement for DRA is not a goal of this KEP because DRA is still under construction.
241
236
- Technically, DRA maintainers could make the same change alongside this KEP if they want to. But this KEP doesn't discuss how.
242
237
243
238
## Proposal
@@ -267,14 +262,7 @@ but we don't want to make any API calls at PostFilter because it slows down the
267
262
268
263
After this KEP is implemented, we determine which Pod(s) to preempt at `PostFilter`,
269
264
nominate the Pod based on the calculation,
270
-
and actually makes the API calls at `AsyncPostFilter`.
271
-
272
-
#### Story 2
273
-
274
-
We have a plugin running the dealocation of resource claim,
275
-
which requires some API calls.
276
-
277
-
After this KEP is implemented, we can move the whole logic to `AsyncPostFilter`.
265
+
and actually make the API calls in a goroutine.
278
266
279
267
### Notes/Constraints/Caveats (Optional)
280
268
@@ -301,15 +289,15 @@ Consider including folks who also work outside the SIG or subproject.
301
289
302
290
#### When kube-apiserver is unstable
303
291
304
-
When kube-apiserver is unstable and API calls at `AsyncPostFilter` fails frequently,
292
+
When kube-apiserver is unstable and API calls in the preemption goroutine fail frequently,
305
293
the scheduler could produce suboptimal scheduling results
306
294
because, although the scheduler nominates Pods at `PostFilter`, those Pods won't be scheduled on the nodes while the preemption API calls keep failing.
307
295
308
296
Let's say many mid-priority Pods are making the preemption API calls.
309
-
During `AsyncPostFilter` for them are runnning, the scheduler assumes they'll be scheduled at the Nodes eventually
297
+
While the preemption goroutines for them are running, the scheduler assumes they'll eventually be scheduled on the Nodes
310
298
that the preemptions are targeting via `.Status.NominatedNodeName`.
311
299
So, the scheduling of other mid-priority or lower-priority Pods takes those preempter Pods into consideration,
312
-
which is correct if `AsyncPostFilter` finishes successful actually, while which results in non-best scheduling results.
300
+
which is correct if the preemption goroutine actually finishes successfully, but otherwise leads to suboptimal scheduling results.
313
301
(Higher-priority Pods won't be affected; they can take the place reserved for lower-priority Pods via `.Status.NominatedNodeName`.)
314
302
315
303
But in the first place, when kube-apiserver is unstable, the scheduler doesn't behave well
@@ -327,34 +315,22 @@ required) or even code snippets. If there's any ambiguity about HOW your
327
315
proposal will be implemented, this is the place to discuss them.
328
316
-->
329
317
330
-
We'll introduce a new extension point `AsyncPostFilter`.
331
-
`AsyncPostFilter` is placed after `PostFilter` so that we can do something which should be done synchronously at `PostFilter`
332
-
and then proceed to `AsyncPostFilter`.
333
-
334
-
The target Pod won't be queued back to the scheduling queue during `AsyncPostFilter` is running for them.
318
+
To achieve asynchronous preemption, we will change the preemption plugin's implementation as follows:
319
+
1. The preemption PostFilter plugin calculates the preemption target and nominates the Pod for the Node. (We'll use the `AddNominatedPod` API that the scheduling framework exposes to plugins.)
320
+
2. The preemption PostFilter plugin starts a goroutine that makes the API calls, and returns a success status without waiting for the goroutine to finish.
321
+
3. The preemption plugin gates the Pod whose preemption is in progress at the PreEnqueue extension point, so that the Pod won't be retried during the preemption.
335
322
336
-
### Asynchronous Preemption
337
-
338
-
To achieve an asynchronous preemption, we'll calculate which Pods to preempt at `PostFilter`,
339
-
and then make API calls actually at `AsyncPostFilter`
340
-
341
-
Preemption `PostFilter` calculates which Pods to preempt like it does currently.
342
-
And, after the calculation, `PostFilter` nominates the preempter Pod for the Node via `AddNominatedPod`,
343
-
which makes the next scheduling cycle take this preempter Pod into consideration.
344
-
345
-
This `AddNominatedPod` only operates the scheduler's internal cache, and doesn't make any API calls, which means light weight.
346
-
347
-
Then, afterwards `AsyncPostFilter` makes actual API calls.
348
-
If `AsyncPostFilter` fails at some point, it reverts the nomination via `AddNominatedPod` with [`clearNominatedNode`](https://github.com/kubernetes/kubernetes/blob/f5c538418189e119a8dbb60e2a2b22394548e326/pkg/scheduler/schedule_one.go#L135).
349
-
If `AsyncPostFilter` succeeds, the Pod is queued back to the queue, and (hopefully) scheduled on the nominated node.
323
+
Afterwards, the preemption goroutine makes the actual API calls.
324
+
If the preemption goroutine fails at some point, it reverts the nomination via `AddNominatedPod` with [`clearNominatedNode`](https://github.com/kubernetes/kubernetes/blob/f5c538418189e119a8dbb60e2a2b22394548e326/pkg/scheduler/schedule_one.go#L135).
325
+
If the preemption goroutine succeeds, the Pod is put back into the scheduling queue and (hopefully) scheduled on the nominated node.
350
326
351
327
### Consideration to race condition
352
328
353
329
Thanks to the nomination at `PostFilter`, this new asynchronous preemption shouldn't cause any race conditions between scheduling cycles.
354
330
355
331
Here, we walk through what happens in each scenario to confirm that there is nothing to worry about.
356
332
357
-
Let's say pod1 is during the preemption process (node1) at `AsyncPostFilter`, the next scheduling cycle is scheduling pod2.
333
+
Let's say pod1 is in the middle of the preemption process (targeting node1) in the preemption goroutine, and the next scheduling cycle is scheduling pod2.
358
334
359
335
#### The pod2's scheduling is successful (pod2 is equal or lower priority than pod1)
360
336
@@ -567,7 +543,7 @@ enhancement:
567
543
568
544
**Upgrade**
569
545
570
-
During the alpha period, users have to enable the feature gate `SchedulerAsyncPostFilter` to opt in this feature.
546
+
During the alpha period, users have to enable the feature gate `SchedulerAsyncPreemption` to opt in to this feature.
571
547
This is a purely internal feature of kube-scheduler, so no other special actions are required outside the scheduler.
572
548
573
549
**Downgrade**
@@ -634,7 +610,7 @@ well as the [existing list] of feature gates.
634
610
-->
635
611
636
612
- [x] Feature gate (also fill in values in `kep.yaml`)
637
-
- Feature gate name: `SchedulerAsyncPostFilter`
613
+
- Feature gate name: `SchedulerAsyncPreemption`
638
614
- Components depending on the feature gate:
639
615
- [ ] Other
640
616
- Describe the mechanism:
@@ -721,9 +697,7 @@ What signals should users be paying attention to when the feature is young
721
697
that might indicate a serious problem?
722
698
-->
723
699
724
-
Maybe something goes wrong with the preemption if
725
-
-`plugin_execution_duration_seconds{extension_point=AsyncPostFilter, plugin=DefaultPreemption}` takes too long time.
726
-
-`framework_extension_point_duration_seconds{extension_point=AsyncPostFilter}`: takes too long time.
700
+
Something may be wrong with the preemption if `goroutines_duration_seconds{operation=preemption}` shows unusually long durations.
727
701
728
702
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
729
703
@@ -795,7 +769,7 @@ These goals will help you determine what you need to measure (SLIs) in the next
795
769
question.
796
770
-->
797
771
798
-
- The failure rate of AsyncPostFilter (`plugin_execution_total{status=error, extension_point=AsyncPostFilter, plugin=DefaultPreemption}`/`plugin_execution_total{extension_point=AsyncPostFilter, plugin=DefaultPreemption}`) should be < 0.01.
772
+
- The failure rate of the preemption goroutine (`goroutines_execution_total{result=error, operation=preemption}`/`goroutines_execution_total{operation=preemption}`) should be < 0.01.
799
773
800
774
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
801
775
@@ -804,7 +778,7 @@ Pick one more of these and delete the rest.
The scheduler starts to run more goroutines for `AsyncPostFilter`, so maybe the CPU usage go up.
905
+
The scheduler runs more goroutines in the preemption plugin, so CPU usage may go up.
931
906
932
907
###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
933
908
@@ -958,7 +933,7 @@ details). For now, we leave it here.
958
933
959
934
###### How does this feature react if the API server and/or etcd is unavailable?
960
935
961
-
In such cases, the preemption fails at `AsyncPostFilter`.
936
+
In such cases, the API calls for the preemption fail in the preemption goroutines.
962
937
But in that situation the scheduler cannot do essentially anything, not just preemption, because it cannot get objects, bind Pods to Nodes, etc.
963
938
964
939
###### What are other known failure modes?
@@ -1007,15 +982,15 @@ not need to be as detailed as the proposal, but should include enough
1007
982
information to express the idea and why it was not acceptable.
1008
983
-->
1009
984
1010
-
### Implement asynchronous preemption only, not introduce a new extension
985
+
### Introduce a new extension point
986
+
987
+
To make this kind of scenario easier to implement for other plugins, we could introduce a new extension point `AsyncPostFilter`.
988
+
We would calculate the preemption target and nominate the Pod for the Node at `PostFilter`; `AsyncPostFilter` would then start asynchronously, and the preemption plugin would make the preemption API calls there.
1011
989
1012
-
If we target the preemption plugin only, we can implement -
1013
-
1. The preemption PostFilter plugin calculates the preemption target and nominate the Pod for the Node.
1014
-
2. The preemption PostFilter plugin starts the goroutine to run API calls, and return success status (= not wait for the goroutine to finish).
1015
-
3. The preemption plugin gates the Pod, which the preemption is in-progress, at PreEnqueue extension point.
990
+
The Pod won't be queued back to the queue until `AsyncPostFilter` is done.
1016
991
1017
-
But, we have in-tree DRA plugin that also makes API calls at PostFilter, and maybe custom plugins also do.
1018
-
Therefore, this KEP proposes `AsyncPostFilter`extension point to enable all plugins to implement this kind of async behaviour.
992
+
We don't go with this idea because we can implement the async preemption without introducing a new extension point.
993
+
Adding an unnecessary extension point is something we may regret later, and we can still add it in the future if it turns out to be really necessary.