Commit 83e27db

fix: swap the main proposal and the alternative
1 parent 058dfeb commit 83e27db

2 files changed: +35 -60 lines changed


keps/sig-scheduling/4832-async-postfilter/README.md

Lines changed: 34 additions & 59 deletions
@@ -184,8 +184,7 @@ updates.
184184
[documentation style guide]: https://github.com/kubernetes/community/blob/master/contributors/guide/style-guide.md
185185
-->
186186

187-
This KEP proposes a new `AsyncPostFilter` extension point to enhance the scheduling throughput in the failure scenarios
188-
by decoupling the API calls for the preemption from the scheduling cycle.
187+
This KEP proposes decoupling the preemption API calls from the scheduling cycle, to improve scheduling throughput in scheduling failure scenarios.
189188

190189
## Motivation
191190

@@ -214,9 +213,7 @@ This flow allows us to decouple the API call to assign Pod to the Node from the
214213
But we have a similar problem with preemption: it runs at the PostFilter extension point, which is part of the scheduling cycle.
215214
Preemption has to make several API calls to update Pods' conditions and eventually delete Pods, which can block the scheduling cycle and reduce scheduling throughput.
216215

217-
Similarly, DRA's PostFilter also makes some API calls to update ResourceClaim's status for the deallocation.
218-
219-
This KEP proposes introducing a new extension point to run something asynchronously so that we address this common problem with the existing PostFilter.
216+
scheduler-perf [actually shows](https://github.com/kubernetes/kubernetes/blob/342da505bdefbd849b808cca3cb76c24a993025f/test/integration/scheduler_perf/config/performance-config.yaml#L641) that the preemption scenario currently takes much longer than the others.
220217

221218
### Goals
222219

@@ -225,10 +222,8 @@ List the specific goals of the KEP. What is it trying to achieve? How will we
225222
know that this has succeeded?
226223
-->
227224

228-
- Introduce a new `AsyncPostFilter` extension point in the scheduling framework.
229-
- `AsyncPostFilter` is literally run asynchronously after `PostFilter` extension point.
230-
- Until `AsyncPostFilter` is done, the Pod won't be rescheduled.
231-
- Move API calls of `DefaultPreemption` plugin from `PostFilter` to `AsyncPostFilter`.
225+
- The preemption plugin makes the preemption API calls asynchronously after the `PostFilter` extension point, so that the scheduler can move on to scheduling other Pods while those API calls are in flight.
226+
- Until the preemption goroutine is done, the Pod won't be rescheduled.
232227

233228
### Non-Goals
234229

@@ -237,7 +232,7 @@ What is out of scope for this KEP? Listing non-goals helps to focus discussion
237232
and make progress.
238233
-->
239234

240-
- Moving DRA's logic from `PostFilter` to `AsyncPostFilter` is not a goal of this KEP because it's an under-construction feature yet.
235+
- Making the same enhancement for DRA is not a goal of this KEP because DRA is still under construction.
241236
- If DRA maintainers want to, they can technically do so alongside this KEP. But, at least in this KEP, we don't discuss how.
242237

243238
## Proposal
@@ -267,14 +262,7 @@ but we don't want to make any API calls at PostFilter because it slows down the
267262

268263
After this KEP is implemented, we determine which Pod(s) to preempt at `PostFilter`,
269264
nominate the Pod based on the calculation,
270-
and actually makes the API calls at `AsyncPostFilter`.
271-
272-
#### Story 2
273-
274-
We have a plugin running the deallocation of resource claim,
275-
which requires some API calls.
276-
277-
After this KEP is implemented, we can move the whole logic to `AsyncPostFilter`.
265+
and actually make the API calls in a goroutine.
278266

279267
### Notes/Constraints/Caveats (Optional)
280268

@@ -301,15 +289,15 @@ Consider including folks who also work outside the SIG or subproject.
301289

302290
#### When kube-apiserver is unstable
303291

304-
When kube-apiserver is unstable and API calls at `AsyncPostFilter` fails frequently,
292+
When kube-apiserver is unstable and API calls in the preemption goroutine fail frequently,
305293
the scheduler could produce suboptimal scheduling results,
306294
because although the scheduler nominates Pods at `PostFilter`, those Pods won't be scheduled on the Nodes if the preemption API calls fail.
307295

308296
Let's say many mid-priority Pods are making the preemption API calls.
309-
During `AsyncPostFilter` for them are running, the scheduler assumes they'll be scheduled at the Nodes eventually
297+
While the preemption goroutines for them are running, the scheduler assumes they'll eventually be scheduled on the Nodes
310298
that the preemptions are targeting via `.Status.NominatedNodeName`.
311299
So, the scheduling of other mid-priority or lower-priority Pods takes those preempter Pods into consideration,
312-
which is correct if `AsyncPostFilter` finishes successful actually, while which results in non-best scheduling results.
300+
which is correct if the preemption goroutine actually finishes successfully, but leads to suboptimal scheduling results if it fails.
313301
(Higher-priority Pods won't be affected; they can take the place reserved for lower-priority Pods via `.Status.NominatedNodeName`.)
314302

315303
But in the first place, when kube-apiserver is unstable, the scheduler doesn't behave well
@@ -327,34 +315,22 @@ required) or even code snippets. If there's any ambiguity about HOW your
327315
proposal will be implemented, this is the place to discuss them.
328316
-->
329317

330-
We'll introduce a new extension point `AsyncPostFilter`.
331-
`AsyncPostFilter` is placed after `PostFilter` so that we can do something which should be done synchronously at `PostFilter`
332-
and then proceed to `AsyncPostFilter`.
333-
334-
The target Pod won't be queued back to the scheduling queue during `AsyncPostFilter` is running for them.
318+
To achieve asynchronous preemption, we will change the preemption plugin's implementation as follows:
319+
1. The preemption PostFilter plugin calculates the preemption target and nominates the Pod for the Node. (We'll use the `AddNominatedPod` API that the scheduling framework exposes to plugins.)
320+
2. The preemption PostFilter plugin starts a goroutine that makes the API calls, and returns a success status without waiting for the goroutine to finish.
321+
3. The preemption plugin gates the Pod whose preemption is in progress at the PreEnqueue extension point, so that the target Pod won't be retried during the preemption.
335322

336-
### Asynchronous Preemption
337-
338-
To achieve an asynchronous preemption, we'll calculate which Pods to preempt at `PostFilter`,
339-
and then make API calls actually at `AsyncPostFilter`
340-
341-
Preemption `PostFilter` calculates which Pods to preempt like it does currently.
342-
And, after the calculation, `PostFilter` nominates the preempter Pod for the Node via `AddNominatedPod`,
343-
which makes the next scheduling cycle take this preempter Pod into consideration.
344-
345-
This `AddNominatedPod` only operates the scheduler's internal cache, and doesn't make any API calls, which means light weight.
346-
347-
Then, afterwards `AsyncPostFilter` makes actual API calls.
348-
If `AsyncPostFilter` fails at some point, it reverts the nomination via `AddNominatedPod` with [`clearNominatedNode`](https://github.com/kubernetes/kubernetes/blob/f5c538418189e119a8dbb60e2a2b22394548e326/pkg/scheduler/schedule_one.go#L135).
349-
If `AsyncPostFilter` succeeds, the Pod is queued back to the queue, and (hopefully) scheduled on the nominated node.
323+
Afterwards, the preemption goroutine makes the actual API calls.
324+
If the preemption goroutine fails at some point, it reverts the nomination via `AddNominatedPod` with [`clearNominatedNode`](https://github.com/kubernetes/kubernetes/blob/f5c538418189e119a8dbb60e2a2b22394548e326/pkg/scheduler/schedule_one.go#L135).
325+
If the preemption goroutine succeeds, the Pod is put back into the scheduling queue, and (hopefully) scheduled on the nominated node.
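
The three steps and the failure handling described here can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not the plugin's implementation: the `asyncPreemption` type, its `nominated` and `inProgress` maps, the injected `evict` function, and `errAPIUnavailable` are all hypothetical stand-ins for the scheduler's nominator cache and the real API calls. It also assumes the gate is lifted on failure as well as on success, so that the Pod can be retried.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errAPIUnavailable is a hypothetical error standing in for a failed API call.
var errAPIUnavailable = errors.New("kube-apiserver unavailable")

// asyncPreemption is a hypothetical, simplified model of the preemption plugin.
type asyncPreemption struct {
	mu         sync.Mutex
	wg         sync.WaitGroup
	nominated  map[string]string // pod -> nominated node (in-memory only)
	inProgress map[string]bool   // pods whose preemption goroutine is running
	evict      func(victim string) error
}

func newAsyncPreemption(evict func(victim string) error) *asyncPreemption {
	return &asyncPreemption{
		nominated:  map[string]string{},
		inProgress: map[string]bool{},
		evict:      evict,
	}
}

// PostFilter nominates the pod (step 1), then starts a goroutine for the
// API calls and returns without waiting for it (step 2).
func (p *asyncPreemption) PostFilter(pod, node string, victims []string) {
	p.mu.Lock()
	p.nominated[pod] = node
	p.inProgress[pod] = true
	p.mu.Unlock()

	p.wg.Add(1)
	go func() {
		defer p.wg.Done()
		var err error
		for _, v := range victims {
			if err = p.evict(v); err != nil {
				break
			}
		}
		p.mu.Lock()
		defer p.mu.Unlock()
		if err != nil {
			delete(p.nominated, pod) // revert the nomination on failure
		}
		delete(p.inProgress, pod) // lift the gate so the pod can be queued again
	}()
}

// PreEnqueue reports whether the pod may enter the queue (step 3).
func (p *asyncPreemption) PreEnqueue(pod string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	return !p.inProgress[pod]
}

// NominatedNode returns the node the pod is currently nominated for, if any.
func (p *asyncPreemption) NominatedNode(pod string) (string, bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	node, ok := p.nominated[pod]
	return node, ok
}

// Wait blocks until all preemption goroutines have finished.
func (p *asyncPreemption) Wait() { p.wg.Wait() }

func main() {
	release := make(chan struct{})
	p := newAsyncPreemption(func(victim string) error {
		<-release // simulate a slow API call
		if victim == "unreachable" {
			return errAPIUnavailable
		}
		return nil
	})
	p.PostFilter("pod1", "node1", []string{"victim-a"})
	fmt.Println("pod1 may enqueue during preemption:", p.PreEnqueue("pod1"))
	close(release)
	p.Wait()
	fmt.Println("pod1 may enqueue after preemption:", p.PreEnqueue("pod1"))
}
```

The point of the sketch is that `PostFilter` only touches in-memory state before returning, so the scheduling cycle is never blocked on the API calls, while the nomination keeps later scheduling cycles aware of the preempter Pod.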
350326

351327
### Consideration to race condition
352328

353329
Thanks to the nomination at `PostFilter`, this new asynchronous preemption shouldn't introduce any race conditions between scheduling cycles.
354330

355331
Here, we walk through what happens in each scenario to confirm that there is no problem.
356332

357-
Let's say pod1 is during the preemption process (node1) at `AsyncPostFilter`, the next scheduling cycle is scheduling pod2.
333+
Let's say pod1 is in the middle of the preemption process (targeting node1) in the preemption goroutine, and the next scheduling cycle is scheduling pod2.
358334

359335
#### The pod2's scheduling is successful (pod2 is equal or lower priority than pod1)
360336

@@ -567,7 +543,7 @@ enhancement:
567543

568544
**Upgrade**
569545

570-
During the alpha period, users have to enable the feature gate `SchedulerAsyncPostFilter` to opt in this feature.
546+
During the alpha period, users have to enable the feature gate `SchedulerAsyncPreemption` to opt in to this feature.
571547
This is a purely internal feature of kube-scheduler, so no other special actions are required outside the scheduler.
572548

573549
**Downgrade**
@@ -634,7 +610,7 @@ well as the [existing list] of feature gates.
634610
-->
635611

636612
- [x] Feature gate (also fill in values in `kep.yaml`)
637-
- Feature gate name: `SchedulerAsyncPostFilter`
613+
- Feature gate name: `SchedulerAsyncPreemption`
638614
- Components depending on the feature gate:
639615
- [ ] Other
640616
- Describe the mechanism:
@@ -721,9 +697,7 @@ What signals should users be paying attention to when the feature is young
721697
that might indicate a serious problem?
722698
-->
723699

724-
Maybe something goes wrong with the preemption if
725-
- `plugin_execution_duration_seconds{extension_point=AsyncPostFilter, plugin=DefaultPreemption}` takes too long time.
726-
- `framework_extension_point_duration_seconds{extension_point=AsyncPostFilter}`: takes too long time.
700+
Something may be wrong with the preemption if `goroutines_duration_seconds{operation=preemption}` shows that the goroutines take too long.
727701

728702
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
729703

@@ -795,7 +769,7 @@ These goals will help you determine what you need to measure (SLIs) in the next
795769
question.
796770
-->
797771

798-
- The failure rate of AsyncPostFilter (`plugin_execution_total{status=error, extension_point=AsyncPostFilter, plugin=DefaultPreemption}`/`plugin_execution_total{extension_point=AsyncPostFilter, plugin=DefaultPreemption}`) should be < 0.01.
772+
- The failure rate of the preemption goroutine (`goroutines_execution_total{result=error, operation=preemption}`/`goroutines_execution_total{operation=preemption}`) should be < 0.01.
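
As a worked example of this SLO, the ratio can be computed from two counter readings. This is only an illustrative sketch: the `failureRate` helper and the sample values are hypothetical, and treating "no executions yet" as a rate of 0 is an assumption.

```go
package main

import "fmt"

// failureRate divides the error count by the total count of preemption
// goroutine executions. It returns 0 when nothing has run yet (an assumption).
func failureRate(errorCount, totalCount float64) float64 {
	if totalCount == 0 {
		return 0
	}
	return errorCount / totalCount
}

func main() {
	// Hypothetical readings: 3 failed goroutines out of 1000 executions.
	rate := failureRate(3, 1000)
	fmt.Printf("failure rate %.4f, within SLO: %t\n", rate, rate < 0.01)
}
```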
799773

800774
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
801775

@@ -804,7 +778,7 @@ Pick one more of these and delete the rest.
804778
-->
805779

806780
- [x] Metrics
807-
- Metric name: `plugin_execution_total{status=error, extension_point=AsyncPostFilter, plugin=DefaultPreemption}`
781+
- Metric name: `goroutines_execution_total{result=error, operation=preemption}`
808782
- Components exposing the metric: kube-scheduler
809783

810784
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
@@ -814,7 +788,8 @@ Describe the metrics themselves and the reasons why they weren't added (e.g., co
814788
implementation difficulties, etc.).
815789
-->
816790

817-
- `plugin_execution_total` (w/ labels: `status`, `extension_point`, `plugin`): to observe how many times a new preemption plugin fails to run.
791+
- `goroutines_duration_seconds` (w/ label: `operation`): to observe how long each preemption goroutine takes to complete.
792+
- `goroutines_execution_total` (w/ labels: `operation`, `result`): to observe how many preemption goroutines have failed.
818793

819794
### Dependencies
820795

@@ -868,7 +843,7 @@ Focusing mostly on:
868843
heartbeats, leader election, etc.)
869844
-->
870845

871-
No. Just move the existing API calls from `PostFilter` to `AsyncPostFilter`.
846+
No. This only moves the existing API calls from `PostFilter` into goroutines.
872847

873848
###### Will enabling / using this feature result in introducing new API types?
874849

@@ -927,7 +902,7 @@ This through this both in small and large cases, again with respect to the
927902
[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
928903
-->
929904

930-
The scheduler starts to run more goroutines for `AsyncPostFilter`, so maybe the CPU usage go up.
905+
The scheduler starts to run more goroutines in the preemption plugin, so CPU usage may go up.
931906

932907
###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
933908

@@ -958,7 +933,7 @@ details). For now, we leave it here.
958933

959934
###### How does this feature react if the API server and/or etcd is unavailable?
960935

961-
In such cases, the preemption fails at `AsyncPostFilter`.
936+
In such cases, API calls for the preemption fail in the preemption goroutines.
962937
But in that situation the scheduler can do essentially nothing, not just the preemption, because it cannot get objects, bind Pods to Nodes, etc.
963938

964939
###### What are other known failure modes?
@@ -1007,15 +982,15 @@ not need to be as detailed as the proposal, but should include enough
1007982
information to express the idea and why it was not acceptable.
1008983
-->
1009984

1010-
### Implement asynchronous preemption only, not introduce a new extension
985+
### Introduce a new extension point
986+
987+
To make this kind of scenario easier to implement for other plugins, we can implement a new extension point `AsyncPostFilter`.
988+
We calculate the preemption target and nominate the Pod for the Node at `PostFilter`, and then `AsyncPostFilter` starts asynchronously, in which the preemption plugin makes API calls for the preemption.
1011989

1012-
If we target the preemption plugin only, we can implement -
1013-
1. The preemption PostFilter plugin calculates the preemption target and nominate the Pod for the Node.
1014-
2. The preemption PostFilter plugin starts the goroutine to run API calls, and return success status (= not wait for the goroutine to finish).
1015-
3. The preemption plugin gates the Pod, which the preemption is in-progress, at PreEnqueue extension point.
990+
The Pod won't be queued back to the queue until `AsyncPostFilter` is done.
1016991

1017-
But, we have in-tree DRA plugin that also makes API calls at PostFilter, and maybe custom plugins also do.
1018-
Therefore, this KEP proposes `AsyncPostFilter` extension point to enable all plugins to implement this kind of async behaviour.
992+
We don't go with this idea because we can implement the async preemption without introducing a new extension point.
993+
Adding a new extension point unnecessarily may be something we regret in the future, and we can still add it later if it turns out to be really necessary.
1019994

1020995
## Infrastructure Needed (Optional)
1021996

keps/sig-scheduling/4832-async-postfilter/kep.yaml

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@ milestone:
2424
# The following PRR answers are required at alpha release
2525
# List the feature gate name and the components for which it must be enabled
2626
feature-gates:
27-
- name: SchedulerAsyncPostFilter
27+
- name: SchedulerAsyncPreemption
2828
components:
2929
- kube-scheduler
3030
disable-supported: true
