Skip to content

Commit 57bd202

Browse files
committed
Fix nits, add a part about ordering
1 parent 259cf07 commit 57bd202

File tree

1 file changed

+38
-5
lines changed
  • keps/sig-scheduling/5142-pop-backoffq-when-activeq-empty

1 file changed

+38
-5
lines changed

keps/sig-scheduling/5142-pop-backoffq-when-activeq-empty/README.md

Lines changed: 38 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -106,7 +106,8 @@ This would allow to process potentially schedulable pods ASAP, eliminating a pen
106106

107107
There are three queues in the scheduler:
108108
- activeQ contains pods ready for scheduling,
109-
- unschedulableQ holds pods that were unschedulable in their scheduling cycle and are waiting for cluster state to change,
109+
- unschedulableQ holds pods that were unschedulable in their scheduling cycle and are waiting for cluster state to change.
110+
These pods are then moved to backoffQ to apply a penalty.
110111
- backoffQ stores pods that failed scheduling attempts (either due to being unschedulable or errors) and could be schedulable again,
111112
but applying a backoff penalty, scaled with the number of attempts.
112113

@@ -154,7 +155,9 @@ This proposal will take those pods earlier, leading to losing this mechanism.
154155

155156
After merging [kubernetes#128748](github.com/kubernetes/kubernetes/pull/128748),
156157
it will be possible to distinguish pods backing off because of errors from those backing off because of an unschedulable attempt.
157-
This information could be used when popping, by filtering only the pods that are from an unschedulable attempt or even splitting the backoffQ.
158+
To preserve the efficiency of the pop() function, it will be necessary to divide the backoffQ into two queues:
159+
one for pods that were unschedulable, and another for those rejected due to an error.
160+
Then popping will be performed only from the former, keeping the error backoff intact.
158161

159162
This has to be resolved before the beta is released, which means before the release of the feature.
160163

@@ -165,9 +168,18 @@ and the backoff time is calculated based on the number of scheduling failures th
165168
If one pod has a smaller attempt counter than others,
166169
could the scheduler keep popping this pod ahead of other pods because the pod's backoff expires faster than others?
167170
Actually, that wouldn't happen because the scheduler would increment the attempt counter of pods from the backoffQ as well,
168-
which would make the backoff time of pods bigger every after the scheduling attempt,
171+
which would make the backoff time larger after each after the scheduling attempt,
169172
and the pod that had a smaller attempt number eventually won't be popped out.
170173

174+
### Low priority pod could be chosen to pop, even if high priority pod has a slightly later backoff expiration
175+
176+
Flushing from backoffQ to activeQ is done each second, taking all pods with backoff expired.
177+
It means that, when they come to activeQ, they are sorted by priority there and taken in this order from activeQ.
178+
It is important, because preemption of a lower priority pod could happen if a higher priority pod is scheduled later.
179+
180+
To mitigate this, key function of backoffQ's heap will be changed, quantifying the time to make one second windows in which pods will be sorted by priority.
181+
Those whole windows will be eventually flushed to activeQ, making no change in current behavior.
182+
171183
## Design Details
172184

173185
### Popping from the backoffQ in activeQ's pop()
@@ -176,7 +188,7 @@ To achieve the goal, activeQ's `pop()` method needs to be changed:
176188
1. If the activeQ is empty, then instead of waiting for a pod to arrive at the activeQ, popping from the backoffQ is tried.
177189
2. If the backoffQ is empty, then `pop()` is waiting for a pod as previously.
178190
3. If the backoffQ is not empty, then the pod is processed like the pod would be taken from the activeQ, including increasing attempts number.
179-
It is poping from a heap data structure, so it should be fast enough not to cause any performance troubles.
191+
It is popping from a heap data structure, so it should be fast enough not to cause any performance troubles.
180192

181193
To support monitoring, when popping from the backoffQ,
182194
the `scheduler_queue_incoming_pods_total` metric with an `activeQ` queue and a new `PopFromBackoffQ` event label will be incremented.
@@ -199,6 +211,27 @@ Also, it means we'd have two paths that `PreEnqueue` plugins are invoked: when n
199211
and when pods are pushed into the backoffQ.
200212
At the moveToActiveQ level, these two paths could be distinguished by checking if the event is equal to `BackoffComplete`.
201213

214+
### Change backoffQ key function
215+
216+
As [mentioned](#low-priority-pod-could-be-chosen-to-pop-even-if-high-priority-pod-has-a-slightly-later-backoff-expiration) in risks,
217+
backoffQ's heap key function has to be changed to apply priority within 1 second windows.
218+
The actual implementation takes backoff expiration times of two pods and compares which is lower.
219+
The new version will cut the milliseconds and use priorities to compare pods within those windows.
220+
To make ordering predictable, in case of equal priorities within the same window,
221+
the whole backoff time expiration will be eventually compared. See the pseudocode:
222+
223+
```go
224+
func podsCompareBackoffCompleted(pInfo1, pInfo2 *framework.QueuedPodInfo) bool {
225+
if pInfo1.BackoffTime.InSeconds() != pInfo2.BackoffTime.InSeconds() {
226+
return pInfo1.BackoffTime.Before(pInfo2.BackoffTime)
227+
}
228+
if pInfo1.Priority != pInfo2.Priority {
229+
return pInfo1.Priority < pInfo2.Priority
230+
}
231+
return pInfo1.BackoffTime.Before(pInfo2.BackoffTime)
232+
}
233+
```
234+
202235
### Test Plan
203236

204237
[x] I/we understand the owners of the involved components may require updates to
@@ -233,7 +266,7 @@ N/A
233266

234267
- Feature implemented behind a feature flag and enabled by default.
235268
- All tests from [Test Plan](#test-plan) implemented.
236-
- Make sure [backoff in case of error](#backoff-wont-be-working-as-natural-rate-limiter-in-case-of-errors) is not skipped.
269+
- Make sure [backoff in case of error](#backoff-wont-be-working-as-a-natural-rate-limiter-in-case-of-errors) is not skipped.
237270

238271
#### GA
239272

0 commit comments

Comments
 (0)