
Commit c35321d

fix based on suggestions
1 parent 7189bfe commit c35321d

File tree

1 file changed: +86 -82 lines changed
  • content/en/blog/_posts/2024-12-12-scheduler-queueinghint


content/en/blog/_posts/2024-12-12-scheduler-queueinghint/index.md

Lines changed: 86 additions & 82 deletions
@@ -9,109 +9,113 @@ Author: >
---

The Kubernetes [scheduler](/docs/concepts/scheduling-eviction/kube-scheduler/) is the core
-component that decides which node any new Pods should run on.
-Basically, it schedules Pods **one by one**,
-and thus the larger your cluster is, the more crucial the throughput of the scheduler is.
+component that selects the nodes on which new Pods run. The scheduler processes
+these new Pods **one by one**. Therefore, the larger your clusters, the more important
+the throughput of the scheduler becomes.

-For the Kubernetes project, the throughput of the scheduler has been an eternal challenge
-over the years, SIG Scheduling have been putting effort to improve the scheduling throughput by many enhancements.
-
-In this blog post, I'll introduce a recent major improvement in the scheduler: a new
+Over the years, the Kubernetes project (and SIG Scheduling in particular) has improved the throughput
+of the scheduler in multiple enhancements. This blog post describes a major improvement to the
+scheduler in Kubernetes v1.32: a
[scheduling context element](/docs/concepts/scheduling-eviction/scheduling-framework/#extension-points)
-named _QueueingHint_.
-We'll go through the explanation of the basic background knowledge of the scheduler,
-and how QueueingHint improves our scheduling throughput.
+named _QueueingHint_. This page provides background knowledge of the scheduler and explains how
+QueueingHint improves scheduling throughput.

## Scheduling queue

-The scheduler stores all unscheduled Pods in an internal component that we - SIG Scheduling -
-call the _scheduling queue_.
+The scheduler stores all unscheduled Pods in an internal component called the _scheduling queue_.

-The scheduling queue is composed of three data structures: _ActiveQ_, _BackoffQ_ and _Unschedulable Pod Pool_.
-- ActiveQ: It holds newly created Pods or Pods which are ready to be retried for scheduling.
-- BackoffQ: It holds Pods which are ready to be retried, but are waiting for a backoff period, which depends on the number of times the scheduled attempted to schedule the Pod.
-- Unschedulable Pod Pool: It holds Pods which should not be scheduled for now, because they have a Scheduling Gate or because the scheduler attempted to schedule them and nothing has changed in the cluster that could make the Pod schedulable.
+The scheduling queue consists of the following data structures:
+- **ActiveQ**: holds newly created Pods or Pods that are ready to be retried for scheduling.
+- **BackoffQ**: holds Pods that are ready to be retried but are waiting for a backoff period to end. The
+  backoff period depends on the number of times the scheduler failed to schedule the Pod.
+- **Unschedulable Pod Pool**: holds Pods that the scheduler won't attempt to schedule for one of the
+  following reasons:
+  - The scheduler previously attempted, and was unable to, schedule the Pods. Since that attempt, the cluster
+    hasn't changed in a way that makes those Pods schedulable.
+  - The Pods are blocked from entering the scheduling cycles by PreEnqueue Plugins,
+    for example, they have a [scheduling gate](/docs/concepts/scheduling-eviction/pod-scheduling-readiness/#configuring-pod-schedulinggates),
+    and get blocked by the scheduling gate plugin.
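To make these three components concrete, here is a minimal Go sketch of the idea. The type names, fields, and backoff values below are illustrative assumptions for this post, not the actual kube-scheduler implementation (where the initial and maximum backoff durations are configurable):

```go
package main

import (
	"fmt"
	"time"
)

// PodInfo is an illustrative stand-in for the scheduler's per-Pod bookkeeping.
type PodInfo struct {
	Name     string
	Attempts int // number of failed scheduling attempts so far
}

// SchedulingQueue models the three components described above. This is NOT the
// real kube-scheduler code; it only mirrors the idea.
type SchedulingQueue struct {
	activeQ           []PodInfo          // Pods ready to be scheduled now
	backoffQ          []PodInfo          // Pods waiting out a backoff period before retrying
	unschedulablePods map[string]PodInfo // Pods parked until a relevant cluster change happens
}

// backoffDuration doubles with each failed attempt and is capped; the real
// scheduler uses configurable initial and maximum backoff durations.
func backoffDuration(attempts int) time.Duration {
	d := time.Second
	for i := 1; i < attempts; i++ {
		d *= 2
		if d >= 10*time.Second {
			return 10 * time.Second
		}
	}
	return d
}

func main() {
	q := SchedulingQueue{unschedulablePods: map[string]PodInfo{}}

	// A newly created Pod always enters activeQ first.
	q.activeQ = append(q.activeQ, PodInfo{Name: "pod-a"})

	// After a failed attempt with no helpful cluster change, the Pod is parked.
	q.unschedulablePods["pod-a"] = PodInfo{Name: "pod-a", Attempts: 3}

	fmt.Println("backoff after 3 attempts:", backoffDuration(3)) // 4s
}
```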

## Scheduling framework and plugins

The Kubernetes scheduler is implemented following the Kubernetes
[scheduling framework](/docs/concepts/scheduling-eviction/scheduling-framework/).

-And, each scheduling requirements are implemented as a plugin.
+And, all scheduling features are implemented as plugins
(e.g., [Pod affinity](/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity)
-is implemented in the `PodAffinity` plugin.)
-
-The first phase, called the _scheduling cycle_, takes Pods from activeQ **one by one**, runs all plugins' logic,
-and lastly decides in which Node to run the Pod, or concludes that the Pod cannot go to anywhere for now.
-
-If the scheduling is successful, the second phase, called the _binding cycle_, binds the Pod with
-the Node by communicating the decision to the API server.
-But, if it turns out that the Pod cannot go to anywhere during the scheduling cycle,
-the binding cycle isn't executed; instead the Pod is moved back to the scheduling queue.
-Although there are some exceptions, unscheduled Pods enter the _unschedulable pod pool_.
-
-Pods in Unschedulable Pod Pool are moved to ActiveQ/BackoffQ
-only when Scheduling Queue identifies changes in the cluster that might be schedulable if we retry the scheduling.
-
-That is a crucial step because scheduling cycle is performed for Pods one by one -
-if we didn't have Unschedulable Pod Pool and kept retrying the scheduling of any Pods,
-multiple scheduling cycles would be wasted for Pods that have no chance to be scheduled.
-
-Then, how do they decide when to move a Pod back into the ActiveQ? How do they notice that Pods might be schedulable now?
-Here QueueingHints come into play.
-
-## QueueingHint
-
-QueueingHint is callback function per plugin to notice an object addition/update/deletion in the cluster (we call them cluster events)
-that may make Pods schedulable.
-
-Let's say the Pod `pod-a` has a required Pod affinity, and got rejected in scheduling cycle by the `PodAffinity` plugin
-because no Node has any Pod matching the Pod affinity specification for `pod-a`.
-
-![pod-a got rejected by PodAffinity](./queueinghint1.svg)
-
-When an unscheduled Pod is put into the unschedulable pod pool, the scheduling queue
-records which plugins caused the scheduling failure of the Pod.
-In this example, scheduling queue notes that `pod-a` was rejected by `PodAffinity`.
-
-`pod-a` will never be schedulable until the PodAffinity failure is resolved somehow.
-The scheduling queue uses the queueing hints from plugins that rejected the Pod, which is `PodAffinity` in the example.
-
-A QueueingHint subscribes to a particular kind of cluster event and make a decision whether an incoming event could make the Pod schedulable.
-Thinking about when PodAffinity failure could be resolved,
-one possible scenario is that an existing Pod gets a new label which matches with `pod-a`'s PodAffinity.
-
-The `PodAffinity` plugin's `QueueingHint` callback checks on all Pod updates happening in the cluster,
-and when it catches such update, the scheduling queue moves `pod-a` to either ActiveQ or BackoffQ.
-
-![pod-a is moved by PodAffinity QueueingHint](./queueinghint2.svg)
-
-We actually already had a similar functionality (called `preCheck`) inside the scheduling queue,
-which filters out cluster events based on Kubernetes core scheduling constraints -
-for example, filtering out node related events when nodes aren't ready.
-
-But, it's not ideal because this hard-coded `preCheck` refers to in-tree plugins logic,
-and it causes issues for custom plugins (for example: [#110175](https://github.com/kubernetes/kubernetes/issues/110175)).
+is implemented in the `InterPodAffinity` plugin.)
+
+The scheduler processes pending Pods in phases called _cycles_ as follows:
+1. **Scheduling cycle**: the scheduler takes pending Pods from the activeQ component of the scheduling
+   queue _one by one_. For each Pod, the scheduler runs the filtering/scoring logic from every scheduling plugin. The
+   scheduler then decides on the best node for the Pod, or decides that the Pod can't be scheduled at that time.
+
+   If the scheduler decides that a Pod can't be scheduled, that Pod enters the Unschedulable Pod Pool
+   component of the scheduling queue. However, if the scheduler decides to place the Pod on a node,
+   the next cycle executes for that Pod.
+
+1. **Binding cycle**: the scheduler communicates the node placement decision to the Kubernetes API
+   server. The Pod is then bound to the selected node.
+
+Aside from some exceptions, most unscheduled Pods enter the unschedulable pod pool after each scheduling
+cycle. The Unschedulable Pod Pool component is crucial because of how the scheduling cycle processes Pods one by one. If the scheduler had to constantly retry placing unschedulable Pods instead of offloading those
+Pods to the Unschedulable Pod Pool, multiple scheduling cycles would be wasted on those Pods.
+
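As a rough, self-contained illustration of these two cycles, the sketch below runs invented stand-in filter plugins for a single Pod against each node, then either proceeds to a binding step or sends the Pod back toward the queue. The `filterPlugin` type and `scheduleOne` function are made up for this example and are not the scheduling framework's real plugin API:

```go
package main

import "fmt"

// Pod and Node are simplified stand-ins for the real API objects.
type Pod struct{ Name string }
type Node struct{ Name string }

// filterPlugin stands in for the filtering logic every scheduling plugin runs
// during the scheduling cycle.
type filterPlugin func(p Pod, n Node) bool

// scheduleOne runs one scheduling cycle for a single Pod: it checks every node
// against every plugin and returns the chosen node, or false if the Pod is
// unschedulable right now.
func scheduleOne(p Pod, nodes []Node, plugins []filterPlugin) (Node, bool) {
	for _, n := range nodes {
		feasible := true
		for _, plugin := range plugins {
			if !plugin(p, n) {
				feasible = false
				break
			}
		}
		if feasible {
			return n, true
		}
	}
	return Node{}, false
}

func main() {
	nodes := []Node{{Name: "node-1"}, {Name: "node-2"}}
	onlyNode2 := func(p Pod, n Node) bool { return n.Name == "node-2" }

	if node, ok := scheduleOne(Pod{Name: "pod-a"}, nodes, []filterPlugin{onlyNode2}); ok {
		// Binding cycle: tell the API server about the placement decision.
		fmt.Printf("binding pod-a to %s\n", node.Name)
	} else {
		// The Pod goes back to the scheduling queue (unschedulable pod pool).
		fmt.Println("pod-a is unschedulable for now")
	}
}
```

A real scheduling cycle also scores the feasible nodes and picks the best one; this sketch stops at the first feasible node for brevity.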
+## Improvements to retrying Pod scheduling with QueueingHint
+
+Unschedulable Pods only move back into the ActiveQ or BackoffQ components of the scheduler
+queue if changes in the cluster might allow the scheduler to place those Pods on nodes.
+
+Prior to v1.32, each plugin registered, with `EnqueueExtensions` (`EventsToRegister`), which cluster changes
+could solve its failures: an object creation, update, or deletion in the cluster (called _cluster events_).
+The scheduling queue then retried a Pod whenever an event registered by a plugin that had rejected that Pod
+in a previous scheduling cycle occurred.
+
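Below is a minimal sketch of that pre-v1.32 behaviour. The `ClusterEvent` and `registration` types are invented for this example and only approximate the `EnqueueExtensions` / `EventsToRegister` mechanism: any registered event simply requeues every Pod that the plugin rejected, whether or not the event can actually help those Pods.

```go
package main

import "fmt"

// ClusterEvent is an illustrative model of an object creation, update, or
// deletion in the cluster.
type ClusterEvent struct {
	Resource string // e.g. "Pod", "Node"
	Action   string // e.g. "Add", "Update", "Delete"
}

// registration models a plugin declaring which events may resolve its failures.
// Before QueueingHint, a registration carried no per-event decision logic.
type registration struct {
	plugin string
	events []ClusterEvent
}

func main() {
	interPodAffinity := registration{
		plugin: "InterPodAffinity",
		events: []ClusterEvent{{Resource: "Pod", Action: "Add"}, {Resource: "Pod", Action: "Update"}},
	}

	incoming := ClusterEvent{Resource: "Pod", Action: "Add"}
	for _, ev := range interPodAffinity.events {
		if ev == incoming {
			// Every Pod rejected by InterPodAffinity is requeued here, even if
			// the new Pod's labels cannot satisfy any of their affinity rules.
			fmt.Println("requeue all Pods rejected by", interPodAffinity.plugin)
		}
	}
}
```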
+Additionally, we had an internal feature called `preCheck`, which helped filter events further for efficiency,
+based on Kubernetes core scheduling constraints; for example, `preCheck` could filter out node-related events
+when the node status is `NotReady`.
+
+However, these approaches had two issues:
+- Requeueing with events was too broad and could lead to scheduling retries for no reason.
+  - For example, a newly scheduled Pod _might_ resolve an `InterPodAffinity` failure, but not every one does:
+    if the new Pod has no label matching the unschedulable Pod's `InterPodAffinity`, that Pod still can't be scheduled.
+- `preCheck` relied on the logic of in-tree plugins and caused some issues for custom plugins,
+  like in issue [#110175](https://github.com/kubernetes/kubernetes/issues/110175).
+
+Here QueueingHints come into play;
+a QueueingHint subscribes to a particular kind of cluster event, and makes a decision about whether each incoming event could make the Pod schedulable.
+
+For example, consider a Pod named `pod-a` that has a required Pod affinity. `pod-a` was rejected in
+the scheduling cycle by the `InterPodAffinity` plugin because no node had an existing Pod that matched
+the Pod affinity specification for `pod-a`.
+
+![pod-a got rejected by InterPodAffinity](./queueinghint1.svg)
+
+`pod-a` moves into the Unschedulable Pod Pool. The scheduling queue records which plugin caused
+the scheduling failure for the Pod. For `pod-a`, the scheduling queue records that the `InterPodAffinity`
+plugin rejected the Pod.
+
+`pod-a` will never be schedulable until the InterPodAffinity failure is resolved. The `InterPodAffinity` plugin's
+`QueueingHint` callback function checks every Pod update that occurs in the cluster. If, for example,
+a Pod gets a label update that matches the Pod affinity requirement of `pod-a`, the `InterPodAffinity`
+plugin's `QueueingHint` prompts the scheduling queue to move `pod-a` back into the ActiveQ or
+the BackoffQ component.
+
+![pod-a is moved by InterPodAffinity QueueingHint](./queueinghint2.svg)
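The following Go sketch shows the shape of such a hint for the `pod-a` example. The types and names are simplified stand-ins rather than the real scheduler framework API (the real callback also receives a logger and the old and new versions of the changed object, and returns an error alongside the hint):

```go
package main

import "fmt"

// Pod is a simplified stand-in for the real Pod object.
type Pod struct {
	Name   string
	Labels map[string]string
}

// QueueingHint is the decision a hint callback returns for an incoming event.
type QueueingHint int

const (
	QueueSkip QueueingHint = iota // the event cannot help; keep the Pod parked
	Queue                         // the event may make the Pod schedulable; retry it
)

// isSchedulableAfterPodChange is the hint an InterPodAffinity-like plugin
// registers for Pod update events. It compares the old and new state of the
// updated Pod against the label the unschedulable Pod's affinity requires.
func isSchedulableAfterPodChange(requiredLabel, requiredValue string, oldPod, newPod Pod) QueueingHint {
	hadLabel := oldPod.Labels[requiredLabel] == requiredValue
	hasLabel := newPod.Labels[requiredLabel] == requiredValue
	if !hadLabel && hasLabel {
		// A label matching pod-a's required Pod affinity was just added,
		// so retrying pod-a could now succeed.
		return Queue
	}
	return QueueSkip
}

func main() {
	oldPod := Pod{Name: "existing-pod", Labels: map[string]string{}}
	newPod := Pod{Name: "existing-pod", Labels: map[string]string{"app": "backend"}}

	hint := isSchedulableAfterPodChange("app", "backend", oldPod, newPod)
	if hint == Queue {
		// The scheduling queue acts on the hint and requeues pod-a.
		fmt.Println("move pod-a back to activeQ/backoffQ")
	}
}
```

When the hint returns `Queue`, the scheduling queue requeues `pod-a`; when it returns `QueueSkip`, `pod-a` stays in the Unschedulable Pod Pool and no scheduling cycle is wasted on it.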

## QueueingHint's history and what's new in v1.32

Within SIG Scheduling, we have been working on the development of QueueingHint since
Kubernetes v1.28.

-QueueingHint is not something user-facing, but we implemented a feature gate (`SchedulerQueueingHints`) as a safety net,
-...which actually saved our life soon.
-
-In v1.28, we implemented QueueingHints with a few in-tree plugins experimentally,
-and made the feature gate enabled by default.
-
-But, users reported the memory leak issue, and consequently we disabled the feature gate in the patch release of v1.28.
+While QueueingHint isn't user-facing, we implemented the `SchedulerQueueingHints` feature gate as a
+safety measure when we originally added this feature. In v1.28, we implemented QueueingHints with a
+few in-tree plugins experimentally, and made the feature gate enabled by default.

-In v1.28 - v1.31, we kept working on the QueueingHint implementation within the rest of in-tree plugins,
-and having bug fixes.
+However, users reported a memory leak issue, and consequently we disabled the feature gate in a
+patch release of v1.28. From v1.28 until v1.31, we kept working on the QueueingHint implementation
+within the rest of the in-tree plugins and fixing bugs.

-And, at v1.32, we will make this feature enabled by default again;
-we finished implementing QueueingHints with all plugins,
-and also identified the cause of the memory leak, finally!
+In v1.32, we will make this feature enabled by default again. We finished implementing QueueingHints
+in all plugins and also identified the cause of the memory leak!

## Getting involved
