The scheduler stores all unscheduled Pods in an internal component called the _scheduling queue_.

The scheduling queue consists of the following data structures:

- **ActiveQ**: holds newly created Pods or Pods that are ready to be retried for scheduling.
- **BackoffQ**: holds Pods that are ready to be retried but are waiting for a backoff period to end. The backoff period depends on the number of unsuccessful scheduling attempts performed by the scheduler on that Pod (see the sketch after this list).
- **Unschedulable Pod Pool**: holds Pods that the scheduler won't attempt to schedule for one of the following reasons:
  - The scheduler previously attempted and was unable to schedule the Pods. Since that attempt, the cluster hasn't changed in a way that could make those Pods schedulable.
  - The Pods are blocked from entering the scheduling cycles by PreEnqueue Plugins, for example, they have a [scheduling gate](/docs/concepts/scheduling-eviction/pod-scheduling-readiness/#configuring-pod-schedulinggates), and get blocked by the scheduling gate plugin.
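
To make these components concrete, here is a minimal Go sketch of the queue and its backoff calculation. It is illustrative only, not the actual kube-scheduler implementation; the field names are invented, and the durations mirror the scheduler's defaults (1s initial backoff, 10s maximum) as an assumption of this sketch.

```go
// An illustrative model of the scheduling queue, not the real
// kube-scheduler code (see pkg/scheduler in kubernetes/kubernetes for that).
package main

import (
	"time"

	v1 "k8s.io/api/core/v1"
)

type schedulingQueue struct {
	activeQ           []*v1.Pod          // Pods ready to be scheduled or retried right away.
	backoffQ          []*v1.Pod          // Pods ready to be retried, waiting for their backoff to expire.
	unschedulablePods map[string]*v1.Pod // Pods parked until a cluster event might make them schedulable.

	initialBackoff time.Duration // e.g. 1 * time.Second
	maxBackoff     time.Duration // e.g. 10 * time.Second
}

// backoffDuration doubles with every unsuccessful scheduling attempt on the
// Pod, capped at maxBackoff.
func (q *schedulingQueue) backoffDuration(attempts int) time.Duration {
	d := q.initialBackoff
	for i := 1; i < attempts; i++ {
		d *= 2
		if d >= q.maxBackoff {
			return q.maxBackoff
		}
	}
	return d
}
```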

The scheduler processes pending Pods in phases called _cycles_ as follows:

1. **Scheduling cycle**: the scheduler picks pending Pods from the ActiveQ component one by one and runs the scheduling plugins to decide on a node placement for each Pod.

   If the scheduler decides that a Pod can't be scheduled, that Pod enters the Unschedulable Pod Pool component of the scheduling queue. However, if the scheduler decides to place the Pod on a node, the Pod goes to the binding cycle.

1. **Binding cycle**: the scheduler communicates the node placement decision to the Kubernetes API server. This operation binds the Pod to the selected node. Both cycles are sketched in Go below.
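
Here is a rough sketch of how the two cycles connect, reusing the `schedulingQueue` type from the earlier snippet. The `schedule` and `bind` callbacks stand in for the real plugin-driven logic and are assumptions of this sketch, not the scheduler's actual function signatures.

```go
// scheduleOne runs one Pod through the scheduling cycle and, on success,
// hands it off to an asynchronous binding cycle.
func (q *schedulingQueue) scheduleOne(
	schedule func(*v1.Pod) (nodeName string, err error),
	bind func(*v1.Pod, string) error,
) {
	// The scheduling cycle handles one Pod at a time.
	pod := q.activeQ[0]
	q.activeQ = q.activeQ[1:]

	// Scheduling cycle: decide on a node placement.
	nodeName, err := schedule(pod)
	if err != nil {
		// Rejected: park the Pod in the Unschedulable Pod Pool instead of
		// retrying it immediately.
		q.unschedulablePods[pod.Name] = pod
		return
	}

	// Binding cycle: report the placement to the API server. Running it in
	// a goroutine lets the next scheduling cycle start right away.
	go func() {
		if err := bind(pod, nodeName); err != nil {
			q.unschedulablePods[pod.Name] = pod // retry after a failed binding
		}
	}()
}
```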

Aside from some exceptions, most unscheduled Pods enter the Unschedulable Pod Pool after each scheduling cycle. The Unschedulable Pod Pool component is crucial because of how the scheduling cycle processes Pods one by one. If the scheduler had to constantly retry placing unschedulable Pods, instead of offloading those Pods to the Unschedulable Pod Pool, multiple scheduling cycles would be wasted on those Pods.

## Improvements to retrying Pod scheduling with QueueingHint

Before QueueingHints, the scheduler retried unschedulable Pods based on broad cluster events, with an internal `preCheck` helper filtering out some events up front. For example, `preCheck` could filter out node-related events when the node status is `NotReady`.

However, we had two issues with those approaches:

- Requeueing with events was too broad and could lead to scheduling retries for no reason (see the sketch after this list).
  - A newly scheduled Pod _might_ resolve an `InterPodAffinity` failure, but not every new Pod does. For example, if a new Pod is created without a label matching the `InterPodAffinity` requirement of the unschedulable Pod, that Pod would still be unschedulable.
- `preCheck` relied on the logic of in-tree plugins and was not extensible to custom plugins, like in issue [#110175](https://github.com/kubernetes/kubernetes/issues/110175).
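
The first issue can be shown with a deliberately simplified sketch. The types below are made up for illustration, not the framework's actual API: any event a plugin registered for moves every Pod that plugin rejected, whether or not the event can help them.

```go
// clusterEvent is a simplified stand-in for the framework's event type.
type clusterEvent struct {
	resource string // e.g. "Pod", "Node"
	action   string // e.g. "Add", "Update"
}

// moveOnEvent models pre-QueueingHint requeueing: when an event matches a
// plugin's registration, every Pod that plugin rejected moves back to the
// ActiveQ, even Pods the event cannot possibly help. Those Pods then waste
// scheduling cycles just to be rejected again.
func (q *schedulingQueue) moveOnEvent(
	ev clusterEvent,
	registrations map[string][]clusterEvent, // plugin name -> registered events
	rejectedBy map[string][]*v1.Pod, // plugin name -> Pods it rejected
) {
	for plugin, events := range registrations {
		for _, e := range events {
			if e == ev {
				q.activeQ = append(q.activeQ, rejectedBy[plugin]...)
				rejectedBy[plugin] = nil
			}
		}
	}
}
```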

Here QueueingHints come into play: a QueueingHint is a per-plugin callback that examines each incoming cluster event and decides whether that event could make the plugin's rejected Pods schedulable.

For example, consider a Pod named `pod-a` that has a required Pod affinity. `pod-a` was rejected in the scheduling cycle by the `InterPodAffinity` plugin because no node had an existing Pod that matched the Pod affinity specification for `pod-a`.

{{< figure src="queueinghint1.svg" alt="A diagram showing the scheduling queue and pod-a rejected by InterPodAffinity plugin" caption="A diagram showing the scheduling queue and pod-a rejected by InterPodAffinity plugin" >}}

`pod-a` moves into the Unschedulable Pod Pool. The scheduling queue records which plugin caused the scheduling failure for the Pod. For `pod-a`, the scheduling queue records that the `InterPodAffinity` plugin rejected it.

Then, if a Pod gets a label update that matches the Pod affinity requirement of `pod-a`, the `InterPodAffinity` plugin's `QueueingHint` prompts the scheduling queue to move `pod-a` back into the ActiveQ or the BackoffQ component.

{{< figure src="queueinghint2.svg" alt="A diagram showing the scheduling queue and pod-a being moved by InterPodAffinity QueueingHint" caption="A diagram showing the scheduling queue and pod-a being moved by InterPodAffinity QueueingHint" >}}

## QueueingHint's history and what's new in v1.32

At SIG Scheduling, we have been working on the development of QueueingHint since Kubernetes v1.28.

While QueueingHint isn't user-facing, we implemented the `SchedulerQueueingHints` feature gate as a safety measure when we originally added this feature. In v1.28, we implemented QueueingHints in a few in-tree plugins experimentally, and enabled the feature gate by default.

However, users reported a memory leak, and consequently we disabled the feature gate in a patch release of v1.28. From v1.28 until v1.31, we kept working on the QueueingHint implementation within the rest of the in-tree plugins and on fixing bugs.

In v1.32, we enabled this feature by default again. We finished implementing QueueingHints in all plugins and also identified the cause of the memory leak!

We thank all the contributors who participated in the development of this feature and those who reported and investigated the earlier issues.

## Getting involved

These features are managed by Kubernetes [SIG Scheduling](https://github.com/kubernetes/community/tree/master/sig-scheduling).