Commit d2fc2db

Fill all todos and elaborate on some points

1 parent 58bc648 commit d2fc2db

File tree (1 file changed)

  • keps/sig-scheduling/5142-pop-backoffq-when-activeq-empty

keps/sig-scheduling/5142-pop-backoffq-when-activeq-empty/README.md

Lines changed: 39 additions & 17 deletions
@@ -105,14 +105,22 @@ that were previously unschedulable.
 
 ## Motivation
 
-When activeQ is empty, kube-scheduler is wasting its potential of scheduling pods.
+There are three queues in the scheduling queue:
+- activeQ contains pods ready for scheduling,
+- unschedulableQ holds pods that were unschedulable in their scheduling cycle and are waiting for the cluster state to change,
+- backoffQ stores pods that failed scheduling attempts (either due to being unschedulable or due to errors) and could be schedulable again,
+but with a backoff penalty that scales with the number of attempts.
+
+When activeQ is not empty, the scheduler pops the highest-priority pod from activeQ.
+However, when activeQ is empty, kube-scheduler idles, waiting for a pod to appear in activeQ,
+even if there are pods in backoffQ whose backoff period hasn't expired yet.
 In scenarios when pods are waiting, but in backoffQ,
-kube-scheduler should have a possibility of scheduling those pods even if the backoff is not completed.
+kube-scheduler should be able to consider those pods for scheduling, even if the backoff is not completed, to avoid the idle time.
 
 ### Goals
 
 - Improve scheduling throughput and kube-scheduler utilization when activeQ is empty, but pods are waiting in backoffQ.
-- Run PreEnqueue plugins when putting pod into backoffQ.
+- Run `PreEnqueue` plugins when putting a pod into backoffQ.
 
 ### Non-Goals
 
@@ -124,15 +132,18 @@ At the beginning of scheduling cycle, pod is popped from activeQ.
 If activeQ is empty, it waits until a pod is placed into the queue.
 This KEP proposes to pop the pod from backoffQ when activeQ is empty.
 
-To ensure the PreEnqueue is called for each pod taken into scheduling cycle,
-PreEnqueue plugins would be called before putting pods into backoffQ.
+To ensure `PreEnqueue` is called for each pod taken into a scheduling cycle,
+`PreEnqueue` plugins would be called before putting pods into backoffQ.
 It won't be done again when moving pods from backoffQ to activeQ.
 
 ### Risks and Mitigations
 
 #### Scheduling throughput might be affected
 
-TODO
+While popping from backoffQ, another pod might appear in activeQ, ready to be scheduled.
+If the pop operation is short enough, there won't be a visible drop in throughput.
+The only concern is that fewer pods from activeQ might be taken in a given period in favor of backoffQ,
+but it is the user's responsibility to create enough pods to be scheduled from activeQ so that this KEP's behavior is not triggered.
 
 #### Backoff won't be working as natural rate limiter in case of errors
 
@@ -145,29 +156,34 @@ This information could be used when popping, by filtering only the pods that are
 
 #### One pod in backoffQ could starve the others
 
-TODO
+If a pod popped from backoffQ fails its scheduling attempt and comes back to the queue, it might be selected again, ahead of other pods.
+
+To prevent this, while popping a pod from backoffQ, its attempt counter will be incremented as if it had been taken from activeQ.
+This will give other pods a chance to be scheduled.
 
 ## Design Details
 
 ### Popping from backoffQ in activeQ's pop()
 
-To achieve the goal, activeQ's pop() method needs to be changed:
+To achieve the goal, activeQ's `pop()` method needs to be changed (see the sketch below this list):
 1. If activeQ is empty, then instead of waiting on condition, popping from backoffQ is tried.
-2. If backoffQ is empty, then pop() is waiting on condition as previously.
+2. If backoffQ is empty, then `pop()` waits on the condition as previously.
 3. If backoffQ is not empty, then the pod is processed like the pod would be taken from activeQ, including increasing attempts number.
+Popping is from a heap data structure, so it should be fast enough not to cause any performance troubles.
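
As an illustration, here is a minimal Go sketch of the proposed `pop()` flow. The types and helper names (`queue`, `heapQ`, `popOrNil`, `newQueue`) are hypothetical stand-ins for the scheduler's real data structures, not the actual kube-scheduler code:

```go
package scheduler

import "sync"

// podInfo is a simplified stand-in for the scheduler's pod wrapper.
type podInfo struct {
	name     string
	attempts int
}

// heapQ models a priority heap; a plain slice keeps the sketch short.
type heapQ struct{ pods []*podInfo }

// popOrNil returns the top pod, or nil if the queue is empty.
func (h *heapQ) popOrNil() *podInfo {
	if len(h.pods) == 0 {
		return nil
	}
	p := h.pods[0]
	h.pods = h.pods[1:]
	return p
}

// queue guards both heaps with one lock and a condition variable.
type queue struct {
	lock     sync.Mutex
	cond     *sync.Cond
	activeQ  heapQ
	backoffQ heapQ
}

func newQueue() *queue {
	q := &queue{}
	q.cond = sync.NewCond(&q.lock)
	return q
}

// pop implements the three steps above: prefer activeQ, fall back to
// backoffQ, and only then wait on the condition.
func (q *queue) pop() *podInfo {
	q.lock.Lock()
	defer q.lock.Unlock()
	for {
		if p := q.activeQ.popOrNil(); p != nil {
			p.attempts++
			return p
		}
		// Steps 1 and 3: activeQ is empty, so try backoffQ even though
		// the pod's backoff hasn't expired. The attempt counter is
		// incremented as if the pod came from activeQ, which prevents
		// one pod from starving the others.
		if p := q.backoffQ.popOrNil(); p != nil {
			p.attempts++
			return p
		}
		// Step 2: both queues are empty; sleep until Broadcast()
		// signals that a pod was added to either queue.
		q.cond.Wait()
	}
}
```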
 
 ### Notifying activeQ condition when new pod appears in backoffQ
 
-Pods might appear in backoffQ while pop() is hanging on point 2.
-That's why it will be required to call broadcast() on condition after adding a pod to backoffQ.
+Pods might appear in backoffQ while `pop()` is hanging on point 2.
+That's why it will be required to call `broadcast()` on the condition after adding a pod to backoffQ.
+It shouldn't cause any performance issues.
 
 We could eventually want to move backoffQ under activeQ's lock, but it's out of scope of this KEP.
 
 ### Calling PreEnqueue for backoffQ
 
-PreEnqueue plugins have to be called for every pod before they are taken to scheduling cycle.
+`PreEnqueue` plugins have to be called for every pod before it is taken into a scheduling cycle.
 Initially, those plugins were called before moving pod to activeQ.
-With this proposal, PreEnqueue will need to be called before moving pod to backoffQ
+With this proposal, `PreEnqueue` will need to be called before moving a pod to backoffQ,
 and those calls need to be skipped for the pods that are moved later from backoffQ to activeQ.
 At moveToActiveQ level, these two paths could be distinguished by checking if event is equal to `BackoffComplete`.
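
A sketch of how these two pieces could fit together, reusing the hypothetical types from the previous sketch. `moveToActiveQ` and the `BackoffComplete` event come from the KEP text; `addToBackoffQ` and `runPreEnqueuePlugins` are illustrative names, not real scheduler APIs:

```go
// runPreEnqueuePlugins is a placeholder: the real scheduler runs each
// PreEnqueue plugin and rejects the pod if any of them fails.
func (q *queue) runPreEnqueuePlugins(p *podInfo) bool { return true }

// addToBackoffQ runs PreEnqueue before the pod enters backoffQ and
// wakes any pop() blocked in step 2 via Broadcast().
func (q *queue) addToBackoffQ(p *podInfo) {
	q.lock.Lock()
	defer q.lock.Unlock()
	if !q.runPreEnqueuePlugins(p) {
		return
	}
	q.backoffQ.pods = append(q.backoffQ.pods, p)
	q.cond.Broadcast()
}

// moveToActiveQ skips PreEnqueue when the pod arrives via the
// BackoffComplete event, because the plugins already ran on the
// pod's way into backoffQ.
func (q *queue) moveToActiveQ(p *podInfo, event string) {
	q.lock.Lock()
	defer q.lock.Unlock()
	if event != "BackoffComplete" && !q.runPreEnqueuePlugins(p) {
		return
	}
	q.activeQ.pods = append(q.activeQ.pods, p)
	q.cond.Broadcast()
}
```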

@@ -268,7 +284,8 @@ while Pods, which are already scheduled, won't be affected in any case.
 ###### What specific metrics should inform a rollback?
 
 Abnormal values of metrics related to scheduling queue, meaning pods are stuck in activeQ:
-- `scheduler_schedule_attempts_total` metric with `scheduled` label is almost constant, while there are pending pods that should be schedulable. This could mean that pods from backoffQ are taken instead of those from activeQ.
+- `scheduler_schedule_attempts_total` metric with `scheduled` label is almost constant, while there are pending pods that should be schedulable.
+This could mean that pods from backoffQ are taken instead of those from activeQ.
 - `scheduler_pending_pods` metric with `active` label is not decreasing, while with `backoff` is almost constant.
 - `scheduler_pod_scheduling_sli_duration_seconds` metric is visibly higher for schedulable pods.
 
@@ -294,9 +311,11 @@ N/A
 
 ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
 
-In the default scheduler, we should see the throughput around 100-150 pods/s ([ref](https://perf-dash.k8s.io/#/?jobname=gce-5000Nodes&metriccategoryname=Scheduler&metricname=LoadSchedulingThroughput&TestName=load)), and this feature shouldn't bring any regression there.
+In the default scheduler, we should see throughput of around 100-150 pods/s ([ref](https://perf-dash.k8s.io/#/?jobname=gce-5000Nodes&metriccategoryname=Scheduler&metricname=LoadSchedulingThroughput&TestName=load)),
+and this feature shouldn't bring any regression there.
 
-Based on that `schedule_attempts_total` shouldn't be less than 100 in a second.
+Based on that, `schedule_attempts_total` shouldn't be less than 100 per second,
+provided there are enough unscheduled pods in the cluster.
 
 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
 
@@ -374,4 +393,7 @@ Why should this KEP _not_ be implemented?
 
 ### Move pods in flushBackoffQCompleted when activeQ is empty
 
-TODO
+Moving the pod popping from backoffQ to the existing `flushBackoffQCompleted` function (which already periodically moves pods to activeQ) avoids changing `PreEnqueue` behavior, but it has some downsides.
+Because flushing runs every second, more than one pod would need to be popped when activeQ is empty.
+This requires figuring out how many pods to pop, either by making the number configurable or by calculating it.
+Also, if schedulable pods show up in activeQ between flushes, a batch of pods moved from backoffQ might break activeQ priorities and slow down scheduling for the pods that are ready to go.
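
For contrast, a sketch of this rejected alternative, again using the hypothetical types from the sketches above. The `maxExtraPods` knob is invented here purely to illustrate why sizing the batch is the hard part:

```go
// flushBackoffQCompleted, extended per the rejected alternative: when
// activeQ is empty at flush time, speculatively move up to maxExtraPods
// pods whose backoff has not expired yet. maxExtraPods is exactly the
// number that would have to be configured or calculated.
func (q *queue) flushBackoffQCompleted(maxExtraPods int) {
	q.lock.Lock()
	defer q.lock.Unlock()
	// Normal path (elided): move pods whose backoff period has expired.
	if len(q.activeQ.pods) > 0 {
		return
	}
	for i := 0; i < maxExtraPods; i++ {
		p := q.backoffQ.popOrNil()
		if p == nil {
			break
		}
		q.activeQ.pods = append(q.activeQ.pods, p)
	}
	q.cond.Broadcast()
}
```

Because the flush only runs once a second, any pods moved this way sit in activeQ even if schedulable pods arrive in the meantime, which is the priority-inversion downside described above.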
