keps/sig-scheduling/5142-pop-backoffq-when-activeq-empty/README.md
38 additions & 5 deletions
@@ -106,7 +106,8 @@ This would allow to process potentially schedulable pods ASAP, eliminating a pen
106
106
107
107
There are three queues in the scheduler:
108
108
- activeQ contains pods ready for scheduling,
109
-
- unschedulableQ holds pods that were unschedulable in their scheduling cycle and are waiting for cluster state to change,
109
+
- unschedulableQ holds pods that were unschedulable in their scheduling cycle and are waiting for cluster state to change.
110
+
These pods are then moved to backoffQ to apply a penalty.
110
111
- backoffQ stores pods that failed scheduling attempts (either due to being unschedulable or errors) and could be schedulable again,
111
112
but are held back by a backoff penalty that scales with the number of attempts.
112
113
@@ -154,7 +155,9 @@ This proposal will take those pods earlier, leading to losing this mechanism.
154
155
155
156
After merging [kubernetes#128748](https://github.com/kubernetes/kubernetes/pull/128748),
156
157
it will be possible to distinguish pods backing off because of errors from those backing off because of an unschedulable attempt.
157
-
This information could be used when popping, by filtering only the pods that are from an unschedulable attempt or even splitting the backoffQ.
158
+
To preserve the efficiency of the pop() function, it will be necessary to divide the backoffQ into two queues:
159
+
one for pods that were unschedulable, and another for those rejected due to an error.
160
+
Then popping will be performed only from the former, keeping the error backoff intact.
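The split described above can be sketched roughly as follows. This is a minimal illustration, not the scheduler's implementation: `backoffPod`, `expiryHeap`, `splitBackoffQ`, `popEarly`, and the `ts` helper are hypothetical names introduced only for this example.

```go
package main

import (
	"container/heap"
	"fmt"
	"time"
)

// backoffPod is a hypothetical stand-in for the scheduler's queued pod info.
type backoffPod struct {
	name   string
	expiry time.Time // when the backoff penalty ends
}

// ts builds a timestamp from Unix milliseconds (helper for the example only).
func ts(ms int64) time.Time { return time.UnixMilli(ms) }

// expiryHeap orders pods by backoff expiration, earliest first.
type expiryHeap []backoffPod

func (h expiryHeap) Len() int           { return len(h) }
func (h expiryHeap) Less(i, j int) bool { return h[i].expiry.Before(h[j].expiry) }
func (h expiryHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *expiryHeap) Push(x any)        { *h = append(*h, x.(backoffPod)) }
func (h *expiryHeap) Pop() any {
	old := *h
	n := len(old)
	p := old[n-1]
	*h = old[:n-1]
	return p
}

// splitBackoffQ keeps error backoff separate from unschedulable backoff.
type splitBackoffQ struct {
	unschedulable expiryHeap // pods rejected as unschedulable
	errored       expiryHeap // pods rejected due to an error
}

// add routes a pod to one of the two heaps based on why it is backing off.
func (q *splitBackoffQ) add(p backoffPod, dueToError bool) {
	if dueToError {
		heap.Push(&q.errored, p)
		return
	}
	heap.Push(&q.unschedulable, p)
}

// popEarly takes a pod before its backoff expires, but only from the
// unschedulable heap, so pods backing off after an error keep their penalty.
func (q *splitBackoffQ) popEarly() (backoffPod, bool) {
	if q.unschedulable.Len() == 0 {
		return backoffPod{}, false
	}
	return heap.Pop(&q.unschedulable).(backoffPod), true
}

func main() {
	q := &splitBackoffQ{}
	q.add(backoffPod{name: "errored", expiry: ts(1000)}, true)
	q.add(backoffPod{name: "unschedulable", expiry: ts(2000)}, false)
	if p, ok := q.popEarly(); ok {
		fmt.Println("popped early:", p.name)
	}
	fmt.Println("still backing off after an error:", q.errored.Len())
}
```

Keeping two heaps means the early pop never has to scan or filter entries, which is the efficiency concern raised above.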
158
161
159
162
This has to be resolved before the beta is released, which means before the release of the feature.
160
163
@@ -165,9 +168,18 @@ and the backoff time is calculated based on the number of scheduling failures th
165
168
If one pod has a smaller attempt counter than others,
166
169
could the scheduler keep popping this pod ahead of other pods because the pod's backoff expires faster than others?
167
170
Actually, that wouldn't happen because the scheduler would increment the attempt counter of pods from the backoffQ as well,
168
-
which would make the backoff time of pods bigger every after the scheduling attempt,
171
+
which would make the backoff time larger after each scheduling attempt,
169
172
and the pod that had a smaller attempt number eventually won't be popped out.
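To illustrate why a growing attempt counter prevents one pod from repeatedly jumping ahead, here is a small sketch of an exponential backoff that doubles with each attempt up to a cap. The function name `backoffDuration` and the 1s initial and 10s maximum values are assumptions chosen for illustration; the scheduler's actual calculation and configured values may differ.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDuration doubles the penalty for every attempt after the first,
// starting from initial and never exceeding max.
func backoffDuration(attempts int, initial, max time.Duration) time.Duration {
	d := initial
	for i := 1; i < attempts; i++ {
		if d > max-d { // doubling would exceed the cap
			return max
		}
		d += d
	}
	return d
}

func main() {
	// Each early pop still increments the attempt counter,
	// so the next backoff for the same pod is longer.
	for attempts := 1; attempts <= 5; attempts++ {
		fmt.Println(attempts, backoffDuration(attempts, time.Second, 10*time.Second))
	}
}
```

Under these assumed values the penalty grows as 1s, 2s, 4s, 8s, 10s, so a pod that is popped early a few times quickly ends up with an expiration no earlier than those of the other pods.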
170
173
174
+
### Low priority pod could be chosen to pop, even if high priority pod has a slightly later backoff expiration
175
+
176
+
Pods are flushed from backoffQ to activeQ once per second, moving every pod whose backoff has expired.
177
+
This means that once the flushed pods reach activeQ, they are sorted by priority there and popped in that order.
178
+
This ordering matters, because a higher-priority pod that is scheduled after a lower-priority pod could preempt it.
179
+
180
+
To mitigate this, the key function of backoffQ's heap will be changed to quantize expiration times into one-second windows, within which pods will be sorted by priority.
181
+
Each whole window will eventually be flushed to activeQ at once, so the current behavior does not change.
182
+
171
183
## Design Details
172
184
173
185
### Popping from the backoffQ in activeQ's pop()
@@ -176,7 +188,7 @@ To achieve the goal, activeQ's `pop()` method needs to be changed:
176
188
1. If the activeQ is empty, then instead of waiting for a pod to arrive at the activeQ, popping from the backoffQ is tried.
177
189
2. If the backoffQ is empty, then `pop()` waits for a pod as before.
178
190
3. If the backoffQ is not empty, then the pod is processed as if it had been taken from the activeQ, including incrementing its attempt counter.
179
-
It is poping from a heap data structure, so it should be fast enough not to cause any performance troubles.
191
+
This is a pop from a heap data structure, so it should be fast enough not to cause any performance problems.
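The three steps above can be sketched as follows. This is a simplified, non-blocking illustration: the real `pop()` waits for a pod when both queues are empty, whereas the hypothetical `tryPop` here just reports that the caller would have to wait. The types `pod` and `schedQueue` are stand-ins, and plain slices replace the heaps for brevity.

```go
package main

import "fmt"

// pod is a hypothetical stand-in for the scheduler's queued pod info.
type pod struct {
	name     string
	attempts int
}

// schedQueue holds simplified stand-ins for the two queues involved in pop().
type schedQueue struct {
	activeQ  []pod // pods ready for scheduling
	backoffQ []pod // pods still backing off, ordered by expiration
}

// tryPop returns the next pod and the queue it came from.
// ok is false when both queues are empty, which is where pop() would wait.
func (q *schedQueue) tryPop() (p pod, from string, ok bool) {
	switch {
	case len(q.activeQ) > 0:
		p, q.activeQ = q.activeQ[0], q.activeQ[1:]
		from = "activeQ"
	case len(q.backoffQ) > 0:
		// activeQ is empty: fall back to the backoffQ instead of waiting.
		p, q.backoffQ = q.backoffQ[0], q.backoffQ[1:]
		from = "backoffQ"
	default:
		return pod{}, "", false
	}
	// The attempt counter grows regardless of which queue the pod came from.
	p.attempts++
	return p, from, true
}

func main() {
	q := &schedQueue{
		activeQ:  []pod{{name: "ready"}},
		backoffQ: []pod{{name: "backing-off", attempts: 2}},
	}
	for {
		p, from, ok := q.tryPop()
		if !ok {
			fmt.Println("both queues empty: pop() would wait here")
			return
		}
		fmt.Printf("%s from %s, attempts=%d\n", p.name, from, p.attempts)
	}
}
```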
180
192
181
193
To support monitoring, when popping from the backoffQ,
182
194
the `scheduler_queue_incoming_pods_total` metric with an `activeQ` queue and a new `PopFromBackoffQ` event label will be incremented.
@@ -199,6 +211,27 @@ Also, it means we'd have two paths that `PreEnqueue` plugins are invoked: when n
199
211
and when pods are pushed into the backoffQ.
200
212
At the moveToActiveQ level, these two paths could be distinguished by checking if the event is equal to `BackoffComplete`.
201
213
214
+
### Change backoffQ key function
215
+
216
+
As [mentioned](#low-priority-pod-could-be-chosen-to-pop-even-if-high-priority-pod-has-a-slightly-later-backoff-expiration) in risks,
217
+
backoffQ's heap key function has to be changed to order pods by priority within one-second windows.
218
+
The current implementation compares the backoff expiration times of two pods and orders the pod with the earlier expiration first.
219
+
The new version will truncate the expiration times to whole seconds and compare pods that fall within the same one-second window by priority.
220
+
To make ordering predictable, in case of equal priorities within the same window,
221
+
the full, untruncated backoff expiration times will be compared as a tie-breaker. See the pseudocode:
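The comparison described in this section can be sketched in Go roughly as follows. This is an illustrative sketch under stated assumptions, not the scheduler's code: `queuedPod`, `backoffLess`, the `ts` helper, and the field names are hypothetical, and a higher `priority` value is assumed to mean a more important pod.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// queuedPod is a hypothetical stand-in for the scheduler's queued pod info.
type queuedPod struct {
	name     string
	priority int32
	expiry   time.Time // backoff expiration time
}

// ts builds a timestamp from Unix milliseconds (helper for the example only).
func ts(ms int64) time.Time { return time.UnixMilli(ms) }

// backoffLess reports whether a should be popped before b.
func backoffLess(a, b queuedPod) bool {
	// 1. Compare the one-second windows the expirations fall into.
	wa, wb := a.expiry.Truncate(time.Second), b.expiry.Truncate(time.Second)
	if !wa.Equal(wb) {
		return wa.Before(wb)
	}
	// 2. Within the same window, the higher-priority pod goes first.
	if a.priority != b.priority {
		return a.priority > b.priority
	}
	// 3. Equal priorities: fall back to the full expiration time.
	return a.expiry.Before(b.expiry)
}

func main() {
	pods := []queuedPod{
		{"low-early", 1, ts(1100)},
		{"high-late", 10, ts(1900)},
		{"next-window", 100, ts(2050)},
	}
	sort.Slice(pods, func(i, j int) bool { return backoffLess(pods[i], pods[j]) })
	for _, p := range pods {
		fmt.Println(p.name)
	}
}
```

In this example `high-late` is ordered before `low-early` because both expire within the same one-second window, while `next-window` stays last despite its higher priority because its expiration falls into a later window.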