
Commit afc3295

Improve wording, change target to beta
1 parent a205b3e commit afc3295

File tree

2 files changed: +87 -85 lines changed


keps/sig-scheduling/5142-pop-backoffq-when-activeq-empty/README.md

Lines changed: 82 additions & 81 deletions
@@ -99,100 +99,105 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 
 ## Summary
 
-This KEP proposes improving scheduling queue behavior by popping pods from backoffQ when activeQ is empty.
-This would allow to increase utilization of kube-scheduler cycles as well as reduce waiting time for pending pods
-that were previously unschedulable.
+This KEP proposes improving scheduling queue behavior by popping pods from the backoffQ when the activeQ is empty.
+This would allow potentially schedulable pods to be processed as soon as possible, eliminating the penalty effect of the backoff queue.
 
 ## Motivation
 
-There are three queues in scheduling queue:
-- activeQ contains pods ready for scheduling,
+There are three queues in the scheduler:
+- activeQ contains pods ready for scheduling,
 - unschedulableQ holds pods that were unschedulable in their scheduling cycle and are waiting for cluster state to change,
 - backoffQ stores pods that failed scheduling attempts (either due to being unschedulable or errors) and could be schedulable again,
 but applying a backoff penalty, scaled with the number of attempts.
 
-When activeQ is not empty, scheduler pops the highest priority pod from activeQ.
-However, when activeQ is empty, kube-scheduler idles, waiting for any pod being present in activeQ,
-even if pods are in the backoffQ but their backoff period hasn't expired.
-In scenarios when pods are waiting, but in backoffQ,
-kube-scheduler should be able to consider those pods for scheduling, even if the backoff is not completed, to avoid the idle time.
+When the activeQ is not empty, the scheduler pops the highest-priority pod from the activeQ.
+However, when the activeQ is empty, the kube-scheduler idles,
+even if pods are in the backoffQ waiting for their backoff period to expire.
+To avoid delaying assessment of potentially schedulable pods,
+the kube-scheduler should consider those pods for scheduling, even if the backoff time hasn't expired yet.
+However, pods that are in the backoffQ due to errors should not bypass the backoff time,
+since it also plays a rate-limiting role, preventing system overload from overly frequent retries.
 
 ### Goals
 
-- Improve scheduling throughput and kube-scheduler utilization when activeQ is empty, but pods are waiting in backoffQ.
-- Run `PreEnqueue` plugins when putting pod into backoffQ.
+- Improve scheduling throughput and kube-scheduler utilization when the activeQ is empty, but pods are waiting in the backoffQ.
+- Run `PreEnqueue` plugins when putting a pod into the backoffQ.
 
 ### Non-Goals
 
-- Refactor scheduling queue by changing backoff logic or merging activeQ with backoffQ.
+- Refactor the scheduling queue by changing backoff logic or merging the activeQ with the backoffQ.
 
 ## Proposal
 
-At the beginning of scheduling cycle, pod is popped from activeQ.
-If activeQ is empty, it waits until a pod is placed into the queue.
-This KEP proposes to pop the pod from backoffQ when activeQ is empty.
+At the beginning of the scheduling cycle, a pod is popped from the activeQ.
+Currently, when the activeQ is empty, the scheduler waits until some pod is placed into the queue.
+This KEP proposes popping a pod from the backoffQ when the activeQ is empty.
+However, moving pods from the backoffQ to the activeQ will still work as before, to avoid the pod starvation problem
+that was the original reason for introducing the backoffQ.
 
-To ensure the `PreEnqueue` is called for each pod taken into scheduling cycle,
-`PreEnqueue` plugins would be called before putting pods into backoffQ.
-It won't be done again when moving pods from backoffQ to activeQ.
+To ensure `PreEnqueue` is called for each pod taken into the scheduling cycle,
+`PreEnqueue` plugins would be called before putting pods into the backoffQ.
+They won't be called again when moving pods from the backoffQ to the activeQ.
 
 ### Risks and Mitigations
 
 #### A tiny delay on the first scheduling attempts for newly created pods
 
-While the scheduler handles a pod directly popping from backoffQ, another pod that should be scheduled before the pod being scheduled now, may appear in activeQ.
+While the scheduler handles a pod popped directly from the backoffQ, another pod that should be scheduled ahead of it may appear in the activeQ.
 However, in the real world, if the scheduling latency is short enough, there won't be a visible downgrade in throughput.
-This will only happen if there are no pods in activeQ, so this can be mitigated by an appropriate rate of pod creation.
+This will only happen if there are no pods in the activeQ, so this can be mitigated by an appropriate rate of pod creation.
 
-#### Backoff won't be working as natural rate limiter in case of errors
+#### Backoff won't be working as a natural rate limiter in case of errors
 
-In case of API calls errors (e.g. network issues), backoffQ allows to limit number of retries in a short term.
+In case of API call errors (e.g., network issues), the backoffQ limits the number of retries in the short term.
 This proposal will take those pods earlier, losing this rate-limiting mechanism.
 
-After merging [kubernetes#128748](github.com/kubernetes/kubernetes/pull/128748),
-it will be possible to distinguish pods backing off because of errors from those backing off because of unschedulable attempt.
-This information could be used when popping, by filtering only the pods that are from unschedulable attempt or even splitting backoffQ.
+After merging [kubernetes#128748](github.com/kubernetes/kubernetes/pull/128748),
+it will be possible to distinguish pods backing off because of errors from those backing off because of an unschedulable attempt.
+This information could be used when popping, by filtering only the pods that come from an unschedulable attempt, or even by splitting the backoffQ.
 
-This has to be resolved before the beta is released.
+This has to be resolved before the beta is released, which means before the release of the feature.
 
-#### One pod in backoffQ could starve the others
+#### One pod in the backoffQ could starve the others
 
-The head of BackoffQ is the pod with the closest backoff expiration,
+The head of the backoffQ is the pod with the closest backoff expiration,
 and the backoff time is calculated based on the number of scheduling failures that the pod has experienced.
 If one pod has a smaller attempt counter than others,
 could the scheduler keep popping this pod ahead of other pods because the pod's backoff expires faster than others?
-Actually, that wouldn't happen because the scheduler would increment the attempt counter of pods from backoffQ as well,
+Actually, that wouldn't happen, because the scheduler would increment the attempt counter of pods popped from the backoffQ as well,
 which would make the backoff time of those pods grow after every scheduling attempt,
 so the pod that had a smaller attempt counter eventually won't keep being popped first.
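 
To make the relationship between the attempt counter and the backoff time concrete, here is a minimal Go sketch of the exponential backoff computation, assuming the kube-scheduler defaults of a 1s initial and 10s maximum pod backoff (`backoffDuration` is an illustrative name, not the actual scheduler identifier):

```go
package main

import (
	"fmt"
	"time"
)

const (
	initialBackoff = 1 * time.Second  // default pod initial backoff duration
	maxBackoff     = 10 * time.Second // default pod max backoff duration
)

// backoffDuration doubles the penalty with each failed scheduling
// attempt, capped at maxBackoff.
func backoffDuration(attempts int) time.Duration {
	d := initialBackoff
	for i := 1; i < attempts; i++ {
		d *= 2
		if d >= maxBackoff {
			return maxBackoff
		}
	}
	return d
}

func main() {
	// attempt 1 -> 1s, 2 -> 2s, 3 -> 4s, 4 -> 8s, 5 and above -> 10s
	for a := 1; a <= 6; a++ {
		fmt.Printf("attempt %d -> backoff %v\n", a, backoffDuration(a))
	}
}
```

Because every direct pop also increments the counter, a pod that started with a small counter quickly catches up with its peers, which is why the starvation scenario above doesn't materialize.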
 
 ## Design Details
 
-### Popping from backoffQ in activeQ's pop()
+### Popping from the backoffQ in activeQ's pop()
 
 To achieve the goal, activeQ's `pop()` method needs to be changed:
-1. If activeQ is empty, then instead of waiting for a pod to arrive at activeQ, popping from backoffQ is tried.
-2. If backoffQ is empty, then `pop()` is waiting for pod as previously.
-3. If backoffQ is not empty, then the pod is processed like the pod would be taken from activeQ, including increasing attempts number.
+1. If the activeQ is empty, then instead of waiting for a pod to arrive at the activeQ, popping from the backoffQ is tried.
+2. If the backoffQ is also empty, then `pop()` waits for a pod as before.
+3. If the backoffQ is not empty, then the pod is processed as if it had been taken from the activeQ, including incrementing its attempt counter.
 This is popping from a heap data structure, so it should be fast enough not to cause any performance troubles.
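 
The control flow could look roughly like the following Go sketch; the types are deliberately simplified stand-ins (plain slices instead of the real heap-backed queues, and none of the names are the actual kube-scheduler identifiers):

```go
package main

import "sync"

// podInfo and queue are simplified stand-ins for the scheduler's
// internal types.
type podInfo struct {
	name     string
	attempts int
}

type queue struct {
	lock     sync.Mutex
	cond     *sync.Cond
	activeQ  []*podInfo
	backoffQ []*podInfo
}

func newQueue() *queue {
	q := &queue{}
	q.cond = sync.NewCond(&q.lock)
	return q
}

// pop returns the next pod to schedule. When the activeQ is empty, it
// falls back to the backoffQ instead of blocking right away.
func (q *queue) pop() *podInfo {
	q.lock.Lock()
	defer q.lock.Unlock()
	for {
		// Normal path: take the pod at the head of the activeQ.
		if len(q.activeQ) > 0 {
			p := q.activeQ[0]
			q.activeQ = q.activeQ[1:]
			p.attempts++
			return p
		}
		// Point 1: the activeQ is empty, so try the head of the backoffQ,
		// even if its backoff period hasn't expired yet.
		if len(q.backoffQ) > 0 {
			// Point 3: process it as if it came from the activeQ,
			// including incrementing the attempt counter.
			p := q.backoffQ[0]
			q.backoffQ = q.backoffQ[1:]
			p.attempts++
			return p
		}
		// Point 2: both queues are empty; block until a pod is added.
		// Whatever adds a pod must broadcast on q.cond (see below).
		q.cond.Wait()
	}
}
```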
 
-To support monitoring, when popping from backoffQ,
+To support monitoring, when popping from the backoffQ,
 the `scheduler_queue_incoming_pods_total` metric with an `activeQ` queue and a new `PopFromBackoffQ` event label will be incremented.
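 
Hypothetically, the new event label would be recorded the same way the existing incoming-pod events are; a sketch using the Prometheus client library (the variable and function names are illustrative, and the real counter lives in the scheduler's metrics package):

```go
import "github.com/prometheus/client_golang/prometheus"

// queueIncomingPods mirrors the shape of the real
// scheduler_queue_incoming_pods_total counter vector.
var queueIncomingPods = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "scheduler_queue_incoming_pods_total",
		Help: "Number of pods added to scheduling queues by event and queue type.",
	},
	[]string{"queue", "event"},
)

// recordPopFromBackoffQ would be called on the new direct-pop path.
func recordPopFromBackoffQ() {
	queueIncomingPods.WithLabelValues("activeQ", "PopFromBackoffQ").Inc()
}
```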
 
-### Notifying activeQ condition when new pod appears in backoffQ
+### Notifying activeQ condition when a new pod appears in the backoffQ
 
-Pods might appear in backoffQ while `pop()` is hanging on point 2.
-That's why it will be required to call `broadcast()` on condition after adding a pod to backoffQ.
+Pods might appear in the backoffQ while `pop()` is blocked at point 2.
+That's why `broadcast()` will have to be called on the condition after adding a pod to the backoffQ.
 It shouldn't cause any performance issues.
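 
Continuing the sketch above, the add path just has to wake up a potentially blocked `pop()` (again, simplified and illustrative rather than the actual implementation):

```go
// addToBackoffQ appends a pod and wakes up pop() in case it is blocked
// at point 2 with both queues empty.
func (q *queue) addToBackoffQ(p *podInfo) {
	q.lock.Lock()
	defer q.lock.Unlock()
	// In the real flow, PreEnqueue plugins run before this point
	// (see "Calling PreEnqueue for the backoffQ" below).
	q.backoffQ = append(q.backoffQ, p)
	// Without this broadcast, a pod added while pop() is waiting would
	// sit unnoticed until something else lands in the activeQ.
	q.cond.Broadcast()
}

func main() {
	q := newQueue()
	q.addToBackoffQ(&podInfo{name: "pod-a", attempts: 1})
	p := q.pop()                // the activeQ is empty, so pod-a comes from the backoffQ
	println(p.name, p.attempts) // pod-a 2
}
```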
 
-We could eventually want to move backoffQ under activeQ's lock, but it's out of scope of this KEP.
+We might eventually want to move the backoffQ under activeQ's lock, but that's outside the scope of this KEP.
 
-### Calling PreEnqueue for backoffQ
+### Calling PreEnqueue for the backoffQ
 
-`PreEnqueue` plugins have to be called for every pod before they are taken to scheduling cycle.
-Initially, those plugins were called before moving pod to activeQ.
-With this proposal, `PreEnqueue` will need to be called before moving pod to backoffQ
-and those calls need to be skipped for the pods that are moved later from backoffQ to activeQ.
-At moveToActiveQ level, these two paths could be distinguished by checking if event is equal to `BackoffComplete`.
+Currently, we call `PreEnqueue` at a single place, every time pods are being moved to the activeQ.
+But with this proposal, `PreEnqueue` will be called before moving a pod to the backoffQ, not when popping pods directly from the backoffQ.
+Otherwise, direct popping would be inefficient: it would have to take the top backoffQ pod, check whether it passes the `PreEnqueue` plugins,
+and if not, check the next backoffQ pod, until it finds a pod that passes all `PreEnqueue` plugins.
+Also, it means there would be two paths through which `PreEnqueue` plugins are invoked: when new pods are created and enter the scheduling queue,
+and when pods are pushed into the backoffQ.
+At the moveToActiveQ level, these two paths could be distinguished by checking whether the event is equal to `BackoffComplete`.
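 
A sketch of that check, reusing the simplified types from above (`runPreEnqueuePlugins` and `addToUnschedulable` are hypothetical stubs, not the actual scheduler API; only `moveToActiveQ` and the `BackoffComplete` event come from this KEP):

```go
// Illustrative stubs standing in for the real plugin and queue logic.
func (q *queue) runPreEnqueuePlugins(p *podInfo) bool { return true }
func (q *queue) addToUnschedulable(p *podInfo)        {}

// moveToActiveQ sketches how the two PreEnqueue paths are told apart.
func (q *queue) moveToActiveQ(p *podInfo, event string) {
	// Pods completing backoff already passed PreEnqueue when they were
	// put into the backoffQ, so the plugins must not run a second time.
	if event != "BackoffComplete" {
		if !q.runPreEnqueuePlugins(p) {
			q.addToUnschedulable(p) // gated pods stay out of the activeQ
			return
		}
	}
	q.lock.Lock()
	defer q.lock.Unlock()
	q.activeQ = append(q.activeQ, p)
	q.cond.Broadcast() // wake pop() if it is waiting
}
```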
 
 ### Test Plan
 
@@ -213,20 +218,21 @@ to implement this enhancement.
 
 ##### e2e tests
 
-Feature is scoped within kube-scheduler internally, so there is no interaction between other components.
-Whole feature should be already covered by integration tests.
+The feature is scoped within the kube-scheduler internally, so there is no interaction with other components.
+The whole feature should already be covered by integration tests.
 
 ### Graduation Criteria
 
+The feature will start in beta and be enabled by default, because it is an internal kube-scheduler feature guarded by a feature flag.
+
 #### Alpha
 
-- Feature implemented behind a feature flag.
-- All tests from [Test Plan](#test-plan) implemented.
+N/A
 
 #### Beta
 
-- Gather feedback from users and fix reported bugs.
-- Change the feature flag to be enabled by default.
+- Feature implemented behind a feature flag and enabled by default.
+- All tests from [Test Plan](#test-plan) implemented.
 - Make sure [backoff in case of error](#backoff-wont-be-working-as-natural-rate-limiter-in-case-of-errors) is not skipped.
 
 #### GA
237243

238244
**Upgrade**
239245

240-
During the alpha period, users have to enable the feature gate `SchedulerPopFromBackoffQ` to opt in this feature.
241-
This is purely in-memory feature for kube-scheduler, so no special actions are required outside the scheduler.
246+
During the beta period, the feature gate `SchedulerPopFromBackoffQ` is enabled by default, so users don't need to opt in.
247+
This is a purely in-memory feature for the kube-scheduler, so no special actions are required outside the scheduler.
242248

243249
**Downgrade**
244250

245251
Users need to disable the feature gate.
246252

247253
### Version Skew Strategy
248254

249-
This is purely in-memory feature for kube-scheduler, and hence no version skew strategy.
255+
This is a purely in-memory feature for the kube-scheduler, and hence no version skew strategy.
250256

251257
## Production Readiness Review Questionnaire
252258

@@ -260,47 +266,43 @@ This is purely in-memory feature for kube-scheduler, and hence no version skew s
260266

261267
###### Does enabling the feature change any default behavior?
262268

263-
Pods that are backing off might be scheduled earlier when activeQ is empty.
269+
Pods that are backing off might be scheduled earlier when the activeQ is empty.
264270

265-
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
271+
###### Can the feature be disabled once it has been enabled (i.e., can we roll back the enablement)?
266272

267273
Yes.
268-
The feature can be disabled in Alpha and Beta versions
269-
by restarting kube-scheduler with the feature-gate off.
274+
The feature can be disabled in Beta version by restarting the kube-scheduler with the feature-gate off.
270275

271-
###### What happens if we reenable the feature if it was previously rolled back?
276+
###### What happens if we re-enable the feature if it was previously rolled back?
272277

273-
The scheduler again starts to pop pods from backoffQ when activeQ is empty.
278+
The scheduler again starts to pop pods from the backoffQ when the activeQ is empty.
274279

275280
###### Are there any tests for feature enablement/disablement?
276281

277-
Given it's purely in-memory feature and enablement/disablement requires restarting the component (to change the value of feature flag),
282+
Given it's a purely in-memory feature and enablement/disablement requires restarting the component (to change the value of the feature flag),
278283
having feature tests is enough.
279284

280285
### Rollout, Upgrade and Rollback Planning
281286

282-
<!--
283-
This section must be completed when targeting beta to a release.
284-
-->
285-
286287
###### How can a rollout or rollback fail? Can it impact already running workloads?
287288

288-
The partly failure in the rollout isn't there because the scheduler is the only component to rollout this feature.
289+
The partial failure in the rollout isn't there because the scheduler is the only component to roll out this feature.
289290
But, if upgrading the scheduler itself fails somehow, new Pods won't be scheduled anymore,
290291
while Pods, which are already scheduled, won't be affected in any case.
291292

292293
###### What specific metrics should inform a rollback?
293294

294-
Abnormal values of metrics related to scheduling queue, meaning pods are stuck in activeQ:
295-
- `scheduler_schedule_attempts_total` metric with `scheduled` label is almost constant, while there are pending pods that should be schedulable.
296-
This could mean that pods from backoffQ are taken instead of those from activeQ.
297-
- `scheduler_pending_pods` metric with `active` label is not decreasing, while with `backoff` is almost constant.
298-
- `scheduler_queue_incoming_pods_total` metric with `PopFromBackoffQ` label is increasing when there are pods in activeQ.
299-
- `scheduler_pod_scheduling_sli_duration_seconds` metric is visibly higher for schedulable pods.
295+
Abnormal values of metrics related to the scheduling queue, meaning pods are stuck in the activeQ:
296+
- The `scheduler_schedule_attempts_total` metric with the `scheduled` label is almost constant, while there are pending pods that should be schedulable.
297+
This could mean that pods from the backoffQ are taken instead of those from the activeQ.
298+
- The `scheduler_pending_pods` metric with the `active` label is not decreasing, while with the `backoff` is almost constant.
299+
- The `scheduler_queue_incoming_pods_total` metric with the `PopFromBackoffQ` label is increasing when there are pods in the activeQ.
300+
If this metric with this specific label is always higher than for other labels, it could also mean that this feature should be rolled back.
301+
- The `scheduler_pod_scheduling_sli_duration_seconds` metric is visibly higher for schedulable pods.
300302

301303
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
302304

303-
No. This feature is a in-memory feature of the scheduler
305+
No. This feature is an in-memory feature of the scheduler
304306
and thus calculations start from the beginning every time the scheduler is restarted.
305307
So, just upgrading it and upgrade->downgrade->upgrade are both the same.
306308

@@ -312,16 +314,15 @@ No
312314

313315
###### How can an operator determine if the feature is in use by workloads?
314316

315-
This feature is used during scheduling when activeQ is empty and if the feature gate is enabled.
316-
Also, `scheduler_queue_incoming_pods_total` could be checked, by querying for new `PopFromBackoffQ` event label.
317+
They can check `scheduler_queue_incoming_pods_total` with the `PopFromBackoffQ` event label.
317318

318319
###### How can someone using this feature know that it is working for their instance?
319320

320321
N/A
321322

322323
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
323324

324-
In the default scheduler, we should see the throughput around 100-150 pods/s ([ref](https://perf-dash.k8s.io/#/?jobname=gce-5000Nodes&metriccategoryname=Scheduler&metricname=LoadSchedulingThroughput&TestName=load)),
325+
In the default scheduler, we should see the throughput around 100-150 pods/s ([ref](https://perf-dash.k8s.io/#/?jobname=gce-5000Nodes&metriccategoryname=Scheduler&metricname=LoadSchedulingThroughput&TestName=load)),
325326
and this feature shouldn't bring any regression there.
326327

327328
Based on that `schedule_attempts_total` shouldn't be less than 100 in a second,
@@ -369,7 +370,7 @@ No
369370

370371
No
371372

372-
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
373+
###### Will enabling / using this feature result in a non-negligible increase of resource usage (CPU, RAM, disk, IO,...) in any components?
373374

374375
No
375376

@@ -403,7 +404,7 @@ Why should this KEP _not_ be implemented?
403404

404405
### Move pods in flushBackoffQCompleted when activeQ is empty
405406

406-
Moving the pod popping from backoffQ to the existing `flushBackoffQCompleted` function (which already periodically moves pods to activeQ) avoids changing `PreEnqueue` behavior, but it has some downsides.
407-
Because flushing runs every second, it would be needed to pop more pods when activeQ is empty.
408-
This require to figure out how many pods to pop, either by making it configurable it or calculating it.
409-
Also, if schedulable pods show up in activeQ between flushes, a bunch of pods from backoffQ might break activeQ priorities and slow down scheduling for the pods that are ready to go.
407+
Moving the pod popping from the backoffQ to the existing `flushBackoffQCompleted` function (which already periodically moves pods to the activeQ) avoids changing `PreEnqueue` behavior, but it has some downsides.
408+
Because flushing runs every second, it would be needed to pop more pods when the activeQ is empty.
409+
This requires figuring out how many pods to pop, either by making it configurable or calculating it.
410+
Also, if schedulable pods show up in the activeQ between flushes, a bunch of pods from the backoffQ might break activeQ priorities and slow down scheduling for the pods that are ready to go.

keps/sig-scheduling/5142-pop-backoffq-when-activeq-empty/kep.yaml

Lines changed: 5 additions & 4 deletions
@@ -6,17 +6,18 @@ owning-sig: sig-scheduling
 status: implementable
 creation-date: 2025-02-06
 reviewers:
-  -
+  - dom4ha
+  - sanposhiho
 approvers:
-  -
+  - alculquicondor
 
-stage: alpha
+stage: beta
 
 latest-milestone: "v1.33"
 
 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
-  alpha: "v1.33"
+  beta: "v1.33"
 
 # The following PRR answers are required at alpha release
 # List the feature gate name and the components for which it must be enabled
