Commit 58bc648 (1 parent 930d228)

KEP-5142: Pop pod from backoffQ when activeQ is empty

3 files changed: +407 −0 lines changed
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
kep-number: 5142
alpha:
  approver: "@wojtek-t"
Lines changed: 377 additions & 0 deletions
@@ -0,0 +1,377 @@
# KEP-5142: Pop pod from backoffQ when activeQ is empty

<!--
This is the title of your KEP. Keep it short, simple, and descriptive. A good
title can help communicate what the KEP is and should be considered as part of
any review.
-->

<!--
A table of contents is helpful for quickly jumping to sections of a KEP and for
highlighting any additional information provided beyond the standard KEP
template.

Ensure the TOC is wrapped with
<code>&lt;!-- toc --&gt;&lt;!-- /toc --&gt;</code>
tags, and then generate with `hack/update-toc.sh`.
-->
<!-- toc -->
- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [Risks and Mitigations](#risks-and-mitigations)
    - [Scheduling throughput might be affected](#scheduling-throughput-might-be-affected)
    - [Backoff won't be working as natural rate limiter in case of errors](#backoff-wont-be-working-as-natural-rate-limiter-in-case-of-errors)
    - [One pod in backoffQ could starve the others](#one-pod-in-backoffq-could-starve-the-others)
- [Design Details](#design-details)
  - [Popping from backoffQ in activeQ's pop()](#popping-from-backoffq-in-activeqs-pop)
  - [Notifying activeQ condition when new pod appears in backoffQ](#notifying-activeq-condition-when-new-pod-appears-in-backoffq)
  - [Calling PreEnqueue for backoffQ](#calling-preenqueue-for-backoffq)
  - [Test Plan](#test-plan)
    - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit tests](#unit-tests)
    - [Integration tests](#integration-tests)
    - [e2e tests](#e2e-tests)
  - [Graduation Criteria](#graduation-criteria)
    - [Alpha](#alpha)
    - [Beta](#beta)
    - [GA](#ga)
  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
  - [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
  - [Monitoring Requirements](#monitoring-requirements)
  - [Dependencies](#dependencies)
  - [Scalability](#scalability)
  - [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
  - [Move pods in flushBackoffQCompleted when activeQ is empty](#move-pods-in-flushbackoffqcompleted-when-activeq-is-empty)
<!-- /toc -->
## Release Signoff Checklist

<!--
**ACTION REQUIRED:** In order to merge code into a release, there must be an
issue in [kubernetes/enhancements] referencing this KEP and targeting a release
milestone **before the [Enhancement Freeze](https://git.k8s.io/sig-release/releases)
of the targeted release**.

For enhancements that make changes to code or processes/procedures in core
Kubernetes—i.e., [kubernetes/kubernetes], we require the following Release
Signoff checklist to be completed.

Check these off as they are completed for the Release Team to track. These
checklist items _must_ be updated for the enhancement to be released.
-->

Items marked with (R) are required *prior to targeting to a milestone / release*.

- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [ ] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
  - [ ] e2e Tests for all Beta API Operations (endpoints)
  - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
  - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
- [ ] (R) Graduation criteria is in place
  - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [ ] (R) Production readiness review completed
- [ ] (R) Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

<!--
**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
-->

[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
[kubernetes/website]: https://git.k8s.io/website
## Summary

This KEP proposes improving scheduling queue behavior by popping pods from backoffQ when activeQ is empty.
This would increase utilization of kube-scheduler cycles and reduce the waiting time for pending pods
that were previously unschedulable.
## Motivation

When activeQ is empty, kube-scheduler is wasting its potential for scheduling pods.
In scenarios where pods are waiting in backoffQ,
kube-scheduler should be able to schedule those pods even if their backoff is not completed.
### Goals

- Improve scheduling throughput and kube-scheduler utilization when activeQ is empty, but pods are waiting in backoffQ.
- Run PreEnqueue plugins when putting a pod into backoffQ.

### Non-Goals

- Refactoring the scheduling queue by changing the backoff logic or merging activeQ with backoffQ.
## Proposal

At the beginning of a scheduling cycle, a pod is popped from activeQ.
If activeQ is empty, the scheduler waits until a pod is placed into the queue.
This KEP proposes popping a pod from backoffQ when activeQ is empty.

To ensure that PreEnqueue is called for each pod taken into a scheduling cycle,
PreEnqueue plugins would be called before putting pods into backoffQ.
They won't be called again when moving pods from backoffQ to activeQ.
### Risks and Mitigations

#### Scheduling throughput might be affected

TODO

#### Backoff won't be working as natural rate limiter in case of errors

In case of API call errors (e.g., network issues), backoffQ limits the number of retries in the short term.
This proposal will take those pods earlier, losing this rate-limiting mechanism.

After merging [kubernetes#128748](https://github.com/kubernetes/kubernetes/pull/128748),
it will be possible to distinguish pods backing off because of errors from those backing off after an unschedulable scheduling attempt.
This information could be used when popping, by considering only pods from unschedulable attempts, or even by splitting backoffQ in two.

#### One pod in backoffQ could starve the others

TODO
## Design Details

### Popping from backoffQ in activeQ's pop()

To achieve the goal, activeQ's pop() method needs to be changed:
1. If activeQ is empty, then instead of waiting on the condition, popping from backoffQ is tried.
2. If backoffQ is also empty, then pop() waits on the condition as before.
3. If backoffQ is not empty, then the pod is processed as if it had been taken from activeQ, including incrementing its attempt counter.
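The three steps above can be sketched as a toy model. This is a minimal illustration under simplifying assumptions: slice-backed FIFO queues instead of the scheduler's heaps, and illustrative names (`queue`, `podInfo`, `pop`) rather than the actual kube-scheduler identifiers.

```go
package main

import (
	"fmt"
	"sync"
)

// podInfo is a simplified stand-in for the scheduler's queued pod record.
type podInfo struct {
	name     string
	attempts int
}

// queue is a toy model of the scheduling queue: two FIFO slices guarded by
// a single lock and condition variable.
type queue struct {
	lock     sync.Mutex
	cond     *sync.Cond
	activeQ  []*podInfo
	backoffQ []*podInfo
}

func newQueue() *queue {
	q := &queue{}
	q.cond = sync.NewCond(&q.lock)
	return q
}

// pop implements the proposed flow:
// 1) take from activeQ if possible,
// 2) otherwise fall back to backoffQ (the new behavior),
// 3) only when both queues are empty, block on the condition.
func (q *queue) pop() *podInfo {
	q.lock.Lock()
	defer q.lock.Unlock()
	for {
		if len(q.activeQ) > 0 {
			p := q.activeQ[0]
			q.activeQ = q.activeQ[1:]
			p.attempts++ // counted as a regular scheduling attempt
			return p
		}
		if len(q.backoffQ) > 0 {
			// New in this KEP: the pod is taken even though its backoff
			// has not expired, and processed exactly like an activeQ pod.
			p := q.backoffQ[0]
			q.backoffQ = q.backoffQ[1:]
			p.attempts++
			return p
		}
		q.cond.Wait() // both queues empty: wait as before
	}
}

func main() {
	q := newQueue()
	q.backoffQ = append(q.backoffQ, &podInfo{name: "backoff-pod"})
	p := q.pop() // activeQ is empty, so the backoff pod is popped
	fmt.Println(p.name, p.attempts)
}
```

Note that activeQ always wins when both queues are non-empty, so backoffQ is only consulted when the scheduler would otherwise sit idle.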
### Notifying activeQ condition when new pod appears in backoffQ

Pods might appear in backoffQ while pop() is blocked in point 2.
That's why it will be required to call broadcast() on the condition after adding a pod to backoffQ.

We could eventually want to move backoffQ under activeQ's lock, but that is out of scope for this KEP.
### Calling PreEnqueue for backoffQ

PreEnqueue plugins have to be called for every pod before it is taken into a scheduling cycle.
Previously, those plugins were called before moving a pod to activeQ.
With this proposal, PreEnqueue will need to be called before moving a pod to backoffQ,
and those calls need to be skipped for pods that are later moved from backoffQ to activeQ.
At the moveToActiveQ level, these two paths can be distinguished by checking whether the event is equal to `BackoffComplete`.
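The event-based distinction could look roughly like this. The helper names (`runPreEnqueue`, `moveToActiveQ`) and the plain-string event are simplified stand-ins for the scheduler's internals; only the `BackoffComplete` check mirrors the KEP's description.

```go
package main

import "fmt"

// BackoffComplete marks pods arriving in activeQ from backoffQ,
// for which PreEnqueue already ran when they entered backoffQ.
const BackoffComplete = "BackoffComplete"

type pod struct{ name string }

// runPreEnqueue stands in for running all PreEnqueue plugins.
func runPreEnqueue(p pod) bool {
	fmt.Println("PreEnqueue called for", p.name)
	return true // pretend all plugins accept the pod
}

// moveToActiveQ skips PreEnqueue when the pod arrives via BackoffComplete,
// so the plugins run exactly once per queueing round.
func moveToActiveQ(p pod, event string) bool {
	if event != BackoffComplete {
		if !runPreEnqueue(p) {
			return false // pod stays unschedulable
		}
	}
	fmt.Println(p.name, "added to activeQ on", event)
	return true
}

func main() {
	moveToActiveQ(pod{name: "pod-a"}, "PodAdd")        // PreEnqueue runs
	moveToActiveQ(pod{name: "pod-b"}, BackoffComplete) // PreEnqueue skipped
}
```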
### Test Plan

[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

##### Prerequisite testing updates

##### Unit tests

- `k8s.io/kubernetes/pkg/scheduler/backend/queue`: `2025-02-06` - `91.4`

##### Integration tests

- [`k8s.io/kubernetes/test/integration/scheduler/queueing`](https://github.com/kubernetes/kubernetes/tree/master/test/integration/scheduler/queueing) - add test cases covering the scenario.
- [scheduler_perf](https://github.com/kubernetes/kubernetes/tree/master/test/integration/scheduler_perf) - add test cases measuring performance in this scenario.

##### e2e tests

The feature is scoped within kube-scheduler internally, so there is no interaction with other components.
The whole feature should already be covered by integration tests.
### Graduation Criteria

#### Alpha

- Feature implemented behind a feature flag.
- All tests from [Test Plan](#test-plan) implemented.

#### Beta

- Gather feedback from users and fix reported bugs.
- Change the feature flag to be enabled by default.

#### GA

- Gather feedback from users and fix reported bugs.
### Upgrade / Downgrade Strategy

**Upgrade**

During the alpha period, users have to enable the feature gate `PopBackoffQWhenEmptyActiveQ` to opt in to this feature.
This is a purely in-memory feature of kube-scheduler, so no special actions are required outside the scheduler.

**Downgrade**

Users need to disable the feature gate.
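For illustration, enabling the gate follows the usual kube-scheduler feature-gate mechanism; the flags below are a hypothetical invocation, not a prescribed configuration (the kubeconfig path in particular is just an example).

```shell
# Opt in to the alpha feature by toggling the gate on the scheduler binary.
kube-scheduler \
  --feature-gates=PopBackoffQWhenEmptyActiveQ=true \
  --kubeconfig=/etc/kubernetes/scheduler.conf
```

Downgrading is the same invocation with `PopBackoffQWhenEmptyActiveQ=false` (or the flag removed), followed by a scheduler restart.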
### Version Skew Strategy

This is a purely in-memory feature of kube-scheduler, and hence no version skew strategy is needed.
## Production Readiness Review Questionnaire

### Feature Enablement and Rollback

###### How can this feature be enabled / disabled in a live cluster?

- [x] Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: `PopBackoffQWhenEmptyActiveQ`
  - Components depending on the feature gate: kube-scheduler

###### Does enabling the feature change any default behavior?

Pods that are in backoffQ might be scheduled earlier when activeQ is empty.

###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes.
The feature can be disabled in Alpha and Beta versions
by restarting kube-scheduler with the feature gate off.

###### What happens if we reenable the feature if it was previously rolled back?

The scheduler again starts to pop pods from backoffQ when activeQ is empty.

###### Are there any tests for feature enablement/disablement?

Given it's a purely in-memory feature and enablement/disablement requires restarting the component (to change the value of the feature flag),
having feature tests is enough.
### Rollout, Upgrade and Rollback Planning

<!--
This section must be completed when targeting beta to a release.
-->

###### How can a rollout or rollback fail? Can it impact already running workloads?

A partial rollout failure cannot happen, because the scheduler is the only component rolling out this feature.
But if upgrading the scheduler itself fails somehow, new Pods won't be scheduled anymore,
while already-scheduled Pods won't be affected in any case.

###### What specific metrics should inform a rollback?

Abnormal values of metrics related to the scheduling queue, meaning pods are stuck in activeQ:
- `scheduler_schedule_attempts_total` metric with `scheduled` label is almost constant, while there are pending pods that should be schedulable. This could mean that pods from backoffQ are taken instead of those from activeQ.
- `scheduler_pending_pods` metric with `active` label is not decreasing, while with `backoff` it is almost constant.
- `scheduler_pod_scheduling_sli_duration_seconds` metric is visibly higher for schedulable pods.

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

No. This feature is an in-memory feature of the scheduler,
and thus calculations start from the beginning every time the scheduler is restarted.
So, a plain upgrade and the upgrade->downgrade->upgrade path are effectively the same.

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No
### Monitoring Requirements

###### How can an operator determine if the feature is in use by workloads?

This feature is used during scheduling when activeQ is empty and the feature gate is enabled.

###### How can someone using this feature know that it is working for their instance?

N/A

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

In the default scheduler, we should see a throughput of around 100-150 pods/s ([ref](https://perf-dash.k8s.io/#/?jobname=gce-5000Nodes&metriccategoryname=Scheduler&metricname=LoadSchedulingThroughput&TestName=load)), and this feature shouldn't bring any regression there.

Based on that, `schedule_attempts_total` shouldn't increase at a rate lower than 100 per second.

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

- [x] Metrics
  - Metric name:
    - `schedule_attempts_total`
    - `scheduler_schedule_attempts_total` with `scheduled` label
    - `scheduler_pending_pods` with `active` and `backoff` labels
    - `scheduler_pod_scheduling_sli_duration_seconds`
  - Components exposing the metric: kube-scheduler

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

No
### Dependencies

###### Does this feature depend on any specific services running in the cluster?

No
### Scalability

###### Will enabling / using this feature result in any new API calls?

No

###### Will enabling / using this feature result in introducing new API types?

No

###### Will enabling / using this feature result in any new calls to the cloud provider?

No

###### Will enabling / using this feature result in increasing size or count of the existing API objects?

No

###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No

###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

No

###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No
### Troubleshooting

###### How does this feature react if the API server and/or etcd is unavailable?

N/A

###### What are other known failure modes?

Unknown

###### What steps should be taken if SLOs are not being met to determine the problem?
## Implementation History

- 6th Feb 2025: The initial KEP is submitted.

## Drawbacks

<!--
Why should this KEP _not_ be implemented?
-->

## Alternatives

### Move pods in flushBackoffQCompleted when activeQ is empty

TODO
