keps/sig-scheduling/5142-pop-backoffq-when-activeq-empty/README.md
## Motivation
There are three queues in the scheduling queue:
- activeQ contains pods ready for scheduling,
- unschedulableQ holds pods that were unschedulable in their scheduling cycle and are waiting for a cluster state change,
- backoffQ stores pods that failed scheduling attempts (either due to being unschedulable or errors) and could be schedulable again, but with a backoff penalty applied, scaled with the number of attempts.
When activeQ is not empty, the scheduler pops the highest-priority pod from activeQ.
However, when activeQ is empty, kube-scheduler idles, waiting for a pod to appear in activeQ, even if there are pods in backoffQ whose backoff period hasn't expired yet.
In scenarios where pods are waiting in backoffQ, kube-scheduler should be able to consider those pods for scheduling, even if their backoff is not completed, to avoid the idle time.
### Goals
- Improve scheduling throughput and kube-scheduler utilization when activeQ is empty, but pods are waiting in backoffQ.
- Run `PreEnqueue` plugins when putting a pod into backoffQ.
### Non-Goals
At the beginning of a scheduling cycle, a pod is popped from activeQ.
If activeQ is empty, the scheduler waits until a pod is placed into the queue.
This KEP proposes to pop a pod from backoffQ when activeQ is empty.

To ensure that `PreEnqueue` is called for each pod taken into a scheduling cycle, `PreEnqueue` plugins would be called before putting pods into backoffQ.
They won't be called again when moving pods from backoffQ to activeQ.
### Risks and Mitigations
#### Scheduling throughput might be affected
While popping from backoffQ, another pod might appear in activeQ, ready to be scheduled.
If the pop operation is short enough, there won't be a visible throughput degradation.
The only concern might be that fewer pods from activeQ are taken in a given period of time in favor of backoffQ, but it is the user's responsibility to create enough pods to be scheduled from activeQ so that this KEP's behavior is not triggered.
#### Backoff won't work as a natural rate limiter in case of errors
#### One pod in backoffQ could starve the others
If a pod popped from backoffQ fails its scheduling attempt and comes back to the queue, it might be selected again, ahead of other pods.

To prevent this, when a pod is popped from backoffQ, its attempt counter will be incremented as if it had been taken from activeQ.
This will give other pods a chance to be scheduled.
## Design Details
### Popping from backoffQ in activeQ's pop()
To achieve the goal, activeQ's `pop()` method needs to be changed:
1. If activeQ is empty, then instead of waiting on the condition, popping from backoffQ is tried.
2. If backoffQ is empty, then `pop()` waits on the condition as before.
3. If backoffQ is not empty, then the pod is processed as if it had been taken from activeQ, including incrementing the attempt counter.

Popping is done from a heap data structure, so it should be fast enough not to cause any performance troubles.
### Notifying activeQ condition when new pod appears in backoffQ
Pods might appear in backoffQ while `pop()` is waiting at step 2.
That's why it will be required to call `broadcast()` on the condition after adding a pod to backoffQ.
It shouldn't cause any performance issues.
We might eventually want to move backoffQ under activeQ's lock, but that's out of the scope of this KEP.
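A minimal sketch of that notification, with deliberately simplified types: a `pop()` goroutine blocked on the condition is woken by `Broadcast()` once a pod lands in backoffQ. This is illustrative only, not the scheduler's actual locking scheme:

```go
package main

import (
	"fmt"
	"sync"
)

// queue is a simplified stand-in: backoffQ is just a slice, and a single
// condition variable guards both the wait and the wake-up.
type queue struct {
	mu       sync.Mutex
	cond     *sync.Cond
	backoffQ []string
}

func newQueue() *queue {
	q := &queue{}
	q.cond = sync.NewCond(&q.mu)
	return q
}

func (q *queue) addToBackoffQ(pod string) {
	q.mu.Lock()
	q.backoffQ = append(q.backoffQ, pod)
	q.mu.Unlock()
	q.cond.Broadcast() // wake any pop() waiting for work
}

func (q *queue) pop() string {
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.backoffQ) == 0 {
		q.cond.Wait() // released lock; woken by Broadcast()
	}
	p := q.backoffQ[0]
	q.backoffQ = q.backoffQ[1:]
	return p
}

func main() {
	q := newQueue()
	done := make(chan string)
	go func() { done <- q.pop() }() // blocks until a pod arrives
	q.addToBackoffQ("pod-a")
	fmt.Println(<-done)
}
```

Because `pop()` rechecks the queue length in a loop, it is safe regardless of whether the broadcast fires before or after the goroutine starts waiting.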
### Calling PreEnqueue for backoffQ
`PreEnqueue` plugins have to be called for every pod before it is taken into a scheduling cycle.
Initially, those plugins were called before moving a pod to activeQ.
With this proposal, `PreEnqueue` will need to be called before moving a pod to backoffQ,
and those calls need to be skipped for the pods that are moved later from backoffQ to activeQ.
At the `moveToActiveQ` level, these two paths could be distinguished by checking whether the event is equal to `BackoffComplete`.
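The check could look roughly like the sketch below. The `moveToActiveQ` signature and the plugin type are hypothetical simplifications; only the `BackoffComplete` event name comes from the proposal:

```go
package main

import "fmt"

type pod struct{ name string }

// preEnqueuePlugin is a simplified stand-in for the real plugin interface:
// it returns true if the pod may enter activeQ.
type preEnqueuePlugin func(p pod) bool

func shouldRunPreEnqueue(event string) bool {
	// PreEnqueue already ran when the pod entered backoffQ, so it is
	// skipped when the pod is only moving from backoffQ to activeQ.
	return event != "BackoffComplete"
}

// moveToActiveQ reports whether the pod was admitted to activeQ.
func moveToActiveQ(p pod, event string, plugins []preEnqueuePlugin, activeQ *[]pod) bool {
	if shouldRunPreEnqueue(event) {
		for _, pl := range plugins {
			if !pl(p) {
				return false // gated: pod stays out of activeQ
			}
		}
	}
	*activeQ = append(*activeQ, p)
	return true
}

func main() {
	var activeQ []pod
	gate := func(p pod) bool { return false } // a plugin that rejects everything
	// Coming from backoffQ: PreEnqueue is skipped, the pod is admitted.
	fmt.Println(moveToActiveQ(pod{"a"}, "BackoffComplete", []preEnqueuePlugin{gate}, &activeQ))
	// Any other event: PreEnqueue runs and rejects the pod.
	fmt.Println(moveToActiveQ(pod{"b"}, "NodeAdd", []preEnqueuePlugin{gate}, &activeQ))
}
```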
while Pods, which are already scheduled, won't be affected in any case.
###### What specific metrics should inform a rollback?
Abnormal values of metrics related to the scheduling queue, meaning pods are stuck in activeQ:
- `scheduler_schedule_attempts_total` metric with `scheduled` label is almost constant, while there are pending pods that should be schedulable. This could mean that pods from backoffQ are taken instead of those from activeQ.
- `scheduler_pending_pods` metric with `active` label is not decreasing, while with `backoff` it is almost constant.
- `scheduler_pod_scheduling_sli_duration_seconds` metric is visibly higher for schedulable pods.
N/A
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
In the default scheduler, we should see the throughput around 100-150 pods/s ([ref](https://perf-dash.k8s.io/#/?jobname=gce-5000Nodes&metriccategoryname=Scheduler&metricname=LoadSchedulingThroughput&TestName=load)), and this feature shouldn't bring any regression there.

Based on that, `schedule_attempts_total` shouldn't be less than 100 per second, if there are enough unscheduled pods within the cluster.
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
Why should this KEP _not_ be implemented?
### Move pods in flushBackoffQCompleted when activeQ is empty
Moving the pod popping from backoffQ to the existing `flushBackoffQCompleted` function (which already periodically moves pods to activeQ) avoids changing `PreEnqueue` behavior, but it has some downsides.
Because flushing runs every second, more pods would need to be popped at once when activeQ is empty.
This requires figuring out how many pods to pop, either by making the number configurable or by calculating it.
Also, if schedulable pods show up in activeQ between flushes, a bunch of pods from backoffQ might break activeQ priorities and slow down scheduling for the pods that are ready to go.
0 commit comments