## Summary
This KEP proposes improving scheduling queue behavior by popping pods from the backoffQ when the activeQ is empty.
This would allow the scheduler to process potentially schedulable pods as soon as possible, eliminating the penalty effect of the backoff queue.
## Motivation
There are three queues in the scheduler:
- activeQ contains pods ready for scheduling,
- unschedulableQ holds pods that were unschedulable in their scheduling cycle and are waiting for cluster state to change,
- backoffQ stores pods that failed scheduling attempts (either due to being unschedulable or errors) and could be schedulable again,
but with a backoff penalty applied, scaled with the number of attempts.
When the activeQ is not empty, the scheduler pops the highest priority pod from the activeQ.
However, when the activeQ is empty, the kube-scheduler idles,
even if pods are in the backoffQ waiting for their backoff period to expire.
To avoid delaying assessment of potentially schedulable pods,
kube-scheduler should consider those pods for scheduling, even if the backoff time hasn't expired yet.
However, pods that are in the backoffQ due to errors should not bypass the backoff time,
since it also plays a rate-limiting role, preventing system overload caused by too frequent retries.
### Goals
- Improve scheduling throughput and kube-scheduler utilization when the activeQ is empty, but pods are waiting in the backoffQ.
- Run `PreEnqueue` plugins when putting a pod into the backoffQ.
### Non-Goals
- Refactor the scheduling queue by changing backoff logic or merging the activeQ with the backoffQ.
## Proposal
At the beginning of the scheduling cycle, a pod is popped from the activeQ.
Currently, when the activeQ is empty, the scheduler waits until some pod is placed into the queue.
This KEP proposes popping a pod from the backoffQ when the activeQ is empty.
However, moving pods from the backoffQ to the activeQ will still work as before, to avoid the pod starvation problem
that was the original reason for introducing the backoffQ.
To ensure `PreEnqueue` is called for each pod taken into the scheduling cycle,
`PreEnqueue` plugins would be called before putting pods into the backoffQ.
It won't be done again when moving pods from the backoffQ to the activeQ.
### Risks and Mitigations
#### A tiny delay on the first scheduling attempts for newly created pods
While the scheduler handles a pod popped directly from the backoffQ, another pod that should be scheduled ahead of it may appear in the activeQ.
However, in the real world, if the scheduling latency is short enough, there won't be a visible degradation in throughput.
This will only happen if there are no pods in the activeQ, so this can be mitigated by an appropriate rate of pod creation.
#### Backoff won't be working as a natural rate limiter in case of errors
In case of API call errors (e.g., network issues), the backoffQ limits the number of retries in the short term.
This proposal would take those pods earlier, losing this mechanism.
After merging [kubernetes#128748](github.com/kubernetes/kubernetes/pull/128748),
it will be possible to distinguish pods backing off because of errors from those backing off because of an unschedulable attempt.
This information could be used when popping, by filtering to only the pods that come from an unschedulable attempt, or even by splitting the backoffQ.
This has to be resolved before the beta release, which means before the feature is released at all.
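
A minimal sketch of what such filtering could look like, assuming a hypothetical `backedOffDueToError` marker on the queued pod info (the actual field introduced by kubernetes#128748 may differ):

```go
package queue

// backoffPodInfo is a simplified stand-in for the scheduler's queued pod wrapper.
// The backedOffDueToError field is hypothetical, representing whatever marker
// kubernetes#128748 ends up exposing.
type backoffPodInfo struct {
	name                string
	backedOffDueToError bool
}

// firstSkippableBackoffPod returns the first pod whose backoff could be
// bypassed, i.e. a pod backing off only because of an unschedulable attempt.
// Removal from the underlying heap is elided in this sketch.
func firstSkippableBackoffPod(backoffQ []*backoffPodInfo) *backoffPodInfo {
	for _, p := range backoffQ {
		if p.backedOffDueToError {
			// Error backoff keeps acting as a rate limiter; never bypass it.
			continue
		}
		return p
	}
	return nil
}
```
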
#### One pod in the backoffQ could starve the others
The head of the backoffQ is the pod with the closest backoff expiration,
and the backoff time is calculated based on the number of scheduling failures that the pod has experienced.
If one pod has a smaller attempt counter than others,
could the scheduler keep popping this pod ahead of other pods because the pod's backoff expires faster than others?
Actually, that wouldn't happen because the scheduler would increment the attempt counter of pods from the backoffQ as well,
which would increase the pod's backoff time after every scheduling attempt,
so the pod that started with a smaller attempt number won't keep being popped ahead of the others.
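
A rough sketch of how the backoff penalty grows with the attempt counter; the 1s initial and 10s maximum values mirror the scheduler's defaults, but treat the exact numbers here as illustrative:

```go
package queue

import "time"

const (
	podInitialBackoff = 1 * time.Second  // default initial backoff per attempt
	podMaxBackoff     = 10 * time.Second // default cap on the backoff penalty
)

// calculateBackoffDuration doubles the penalty with every failed attempt,
// so a pod popped early from the backoffQ pays a larger penalty next time.
func calculateBackoffDuration(attempts int) time.Duration {
	backoff := podInitialBackoff
	for i := 1; i < attempts; i++ {
		backoff *= 2
		if backoff >= podMaxBackoff {
			return podMaxBackoff
		}
	}
	return backoff
}
```
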
## Design Details
### Popping from the backoffQ in activeQ's pop()
To achieve the goal, activeQ's `pop()` method needs to be changed:
1. If the activeQ is empty, then instead of waiting for a pod to arrive at the activeQ, popping from the backoffQ is tried.
2. If the backoffQ is also empty, then `pop()` waits for a pod as before.
3. If the backoffQ is not empty, then the pod is processed as if it had been taken from the activeQ, including increasing its attempt number.
It is popping from a heap data structure, so it should be fast enough not to cause any performance problems.
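
A rough sketch of the changed `pop()` flow, using simplified slices in place of the real heaps; all types and names below are illustrative stand-ins for the scheduler's internal activeQ implementation, not the actual code:

```go
package queue

import "sync"

type queuedPodInfo struct {
	name     string
	attempts int
}

// activeQueue is a simplified stand-in for the scheduler's activeQ structure.
type activeQueue struct {
	lock    sync.Mutex
	cond    *sync.Cond
	active  []*queuedPodInfo // stand-in for the activeQ heap
	backoff []*queuedPodInfo // stand-in for the backoffQ heap
	closed  bool
}

func newActiveQueue() *activeQueue {
	aq := &activeQueue{}
	aq.cond = sync.NewCond(&aq.lock)
	return aq
}

// pop returns the next pod for a scheduling cycle. The new behavior is the
// backoffQ fallback inside the wait loop; everything else works as before.
func (aq *activeQueue) pop() *queuedPodInfo {
	aq.lock.Lock()
	defer aq.lock.Unlock()
	for len(aq.active) == 0 {
		// New in this KEP: if the activeQ is empty, try the backoffQ first.
		if len(aq.backoff) > 0 {
			p := aq.backoff[0]
			aq.backoff = aq.backoff[1:]
			p.attempts++ // processed exactly like a pod taken from the activeQ
			// The real implementation would also increment
			// scheduler_queue_incoming_pods_total with the PopFromBackoffQ label here.
			return p
		}
		if aq.closed {
			return nil
		}
		aq.cond.Wait() // both queues are empty: block until something arrives
	}
	p := aq.active[0]
	aq.active = aq.active[1:]
	p.attempts++
	return p
}
```
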
To support monitoring, when popping from the backoffQ,
the `scheduler_queue_incoming_pods_total` metric will be incremented with the `activeQ` queue label and a new `PopFromBackoffQ` event label.
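
A small sketch of that metric update, using a locally defined Prometheus counter vector that mirrors `scheduler_queue_incoming_pods_total` (the real counter lives in the scheduler's metrics package; the Help text here is illustrative):

```go
package queue

import "github.com/prometheus/client_golang/prometheus"

// schedulerQueueIncomingPods mirrors the scheduler's incoming-pods counter;
// label names and values follow the KEP.
var schedulerQueueIncomingPods = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "scheduler_queue_incoming_pods_total",
		Help: "Number of pods added to scheduling queues, by event and queue type.",
	},
	[]string{"queue", "event"},
)

// recordPopFromBackoffQ is called whenever a pod is handed to a scheduling
// cycle directly from the backoffQ.
func recordPopFromBackoffQ() {
	schedulerQueueIncomingPods.WithLabelValues("activeQ", "PopFromBackoffQ").Inc()
}
```
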
### Notifying activeQ condition when a new pod appears in the backoffQ
Pods might appear in the backoffQ while `pop()` is blocked in step 2.
That's why `broadcast()` will need to be called on the condition after adding a pod to the backoffQ.
It shouldn't cause any performance issues.
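
Reusing the simplified types from the `pop()` sketch above, the notification path could look like this (again an illustrative sketch, not the actual implementation):

```go
// addToBackoffQ puts a failed pod back into the backoffQ and wakes up any
// pop() call that is blocked because both queues looked empty.
func (aq *activeQueue) addToBackoffQ(p *queuedPodInfo) {
	aq.lock.Lock()
	defer aq.lock.Unlock()
	aq.backoff = append(aq.backoff, p)
	// Without this broadcast, a pop() waiting in step 2 would keep sleeping
	// even though the new pod could already be popped directly.
	aq.cond.Broadcast()
}
```
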
We might eventually want to move the backoffQ under the activeQ's lock, but that is out of scope for this KEP.
### Calling PreEnqueue for the backoffQ
Currently, we call `PreEnqueue` at a single place, every time pods are being moved to the activeQ.
But, with this proposal, `PreEnqueue` will be called before moving a pod to the backoffQ, not when popping pods directly from the backoffQ.
Otherwise, direct popping would be inefficient: it would have to take the top backoffQ pod, check whether it passes the `PreEnqueue` plugins,
and, if not, check the next backoffQ pod, and so on until it finds a pod that passes all `PreEnqueue` plugins.
Also, it means there would be two paths where `PreEnqueue` plugins are invoked: when new pods are created and enter the scheduling queue,
and when pods are pushed into the backoffQ.
At the `moveToActiveQ` level, these two paths could be distinguished by checking whether the event is equal to `BackoffComplete`.
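
An illustrative sketch of that check; the `BackoffComplete` event name comes from the scheduler, while the surrounding types and helpers are assumptions made for this example:

```go
package queue

// BackoffComplete is the event used when a pod's backoff period has expired.
const BackoffComplete = "BackoffComplete"

type podInfo struct{ name string }

// schedulingQueue is a simplified stand-in for the real priority queue.
type schedulingQueue struct {
	runPreEnqueue func(*podInfo) bool // stand-in for running all PreEnqueue plugins
	activeQ       []*podInfo
	unschedulable []*podInfo
}

// moveToActiveQ gates pods through PreEnqueue only on the "new pod" path;
// pods completing backoff already passed PreEnqueue before entering the backoffQ.
func (q *schedulingQueue) moveToActiveQ(p *podInfo, event string) {
	if event != BackoffComplete && !q.runPreEnqueue(p) {
		// A PreEnqueue plugin rejected the pod; keep it out of the activeQ.
		q.unschedulable = append(q.unschedulable, p)
		return
	}
	q.activeQ = append(q.activeQ, p)
}
```
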
### Test Plan
##### e2e tests
The feature is scoped within the kube-scheduler internally, so there is no interaction with other components.
The whole feature should already be covered by integration tests.
### Graduation Criteria
The feature will start in beta and be enabled by default, because it is an internal kube-scheduler feature and is guarded by a feature gate.
#### Alpha
N/A
#### Beta
- Feature implemented behind a feature flag and enabled by default.
- All tests from [Test Plan](#test-plan) implemented.
- Make sure [backoff in case of error](#backoff-wont-be-working-as-a-natural-rate-limiter-in-case-of-errors) is not skipped.
#### GA
**Upgrade**

During the beta period, the feature gate `SchedulerPopFromBackoffQ` is enabled by default, so users don't need to opt in.
This is a purely in-memory feature for the kube-scheduler, so no special actions are required outside the scheduler.

**Downgrade**

Users need to disable the feature gate.
### Version Skew Strategy
This is a purely in-memory feature for the kube-scheduler, and hence there is no version skew strategy.
## Production Readiness Review Questionnaire
###### Does enabling the feature change any default behavior?
Pods that are backing off might be scheduled earlier when the activeQ is empty.
###### Can the feature be disabled once it has been enabled (i.e., can we roll back the enablement)?
Yes.
The feature can be disabled in the beta version by restarting the kube-scheduler with the feature gate off.
###### What happens if we re-enable the feature if it was previously rolled back?
The scheduler again starts to pop pods from the backoffQ when the activeQ is empty.
###### Are there any tests for feature enablement/disablement?
Given it's a purely in-memory feature and enablement/disablement requires restarting the component (to change the value of the feature flag),
having feature tests is enough.
### Rollout, Upgrade and Rollback Planning
###### How can a rollout or rollback fail? Can it impact already running workloads?
A partial failure of the rollout cannot happen, because the scheduler is the only component that rolls out this feature.
But, if upgrading the scheduler itself fails somehow, new Pods won't be scheduled anymore,
while Pods that are already scheduled won't be affected in any case.
###### What specific metrics should inform a rollback?
Abnormal values of metrics related to the scheduling queue, meaning pods are stuck in the activeQ:
- The `scheduler_schedule_attempts_total` metric with the `scheduled` label is almost constant, while there are pending pods that should be schedulable.
This could mean that pods from the backoffQ are taken instead of those from the activeQ.
- The `scheduler_pending_pods` metric with the `active` label is not decreasing, while the one with the `backoff` label is almost constant.
- The `scheduler_queue_incoming_pods_total` metric with the `PopFromBackoffQ` label is increasing when there are pods in the activeQ.
If this metric with this specific label is always higher than for other labels, it could also mean that this feature should be rolled back.
- The `scheduler_pod_scheduling_sli_duration_seconds` metric is visibly higher for schedulable pods.
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
No. This feature is an in-memory feature of the scheduler
and thus calculations start from the beginning every time the scheduler is restarted.
So, a plain upgrade and the upgrade->downgrade->upgrade path behave the same.
###### How can an operator determine if the feature is in use by workloads?
Operators can check `scheduler_queue_incoming_pods_total` with the `PopFromBackoffQ` event label.
###### How can someone using this feature know that it is working for their instance?
N/A
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
In the default scheduler, we should see a throughput of around 100-150 pods/s ([ref](https://perf-dash.k8s.io/#/?jobname=gce-5000Nodes&metriccategoryname=Scheduler&metricname=LoadSchedulingThroughput&TestName=load)),
and this feature shouldn't bring any regression there.
Based on that, `schedule_attempts_total` shouldn't be less than 100 per second,
No
###### Will enabling / using this feature result in a non-negligible increase of resource usage (CPU, RAM, disk, IO,...) in any components?
No
### Move pods in flushBackoffQCompleted when activeQ is empty
Moving the pod popping from the backoffQ to the existing `flushBackoffQCompleted` function (which already periodically moves pods to the activeQ) avoids changing `PreEnqueue` behavior, but it has some downsides.
Because flushing runs only every second, it would be necessary to pop more pods at once when the activeQ is empty.
This requires figuring out how many pods to pop, either by making it configurable or calculating it.
Also, if schedulable pods show up in the activeQ between flushes, a bunch of pods from the backoffQ might break activeQ priorities and slow down scheduling for the pods that are ready to go.