# Scheduling queue in kube-scheduler

The queueing mechanism is an integral part of the scheduler. It allows the scheduler
to pick the most suitable pod for the next scheduling cycle. Given that a pod can
specify various conditions that have to be met at the time of scheduling,
such as the existence of a persistent volume, compliance with pod anti-affinity rules,
or toleration of node taints, the mechanism needs to be able to postpone
the scheduling action until the cluster can meet all the conditions for
successful scheduling. The mechanism relies on three queues (see the sketch after this list):
- active ([activeQ](https://github.com/kubernetes/kubernetes/blob/4cc1127e9251fff364d5c77e2a9a9c3ad42383ab/pkg/scheduler/internal/queue/scheduling_queue.go#L130)): providing pods for immediate scheduling
- unschedulable ([unschedulableQ](https://github.com/kubernetes/kubernetes/blob/4cc1127e9251fff364d5c77e2a9a9c3ad42383ab/pkg/scheduler/internal/queue/scheduling_queue.go#L135)): parking pods which are waiting for certain condition(s) to happen
- backoff ([podBackoffQ](https://github.com/kubernetes/kubernetes/blob/4cc1127e9251fff364d5c77e2a9a9c3ad42383ab/pkg/scheduler/internal/queue/scheduling_queue.go#L133)): exponentially postponing pods which failed
to be scheduled (e.g. a volume is still being created) but are expected to get scheduled eventually.
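
The Go sketch below shows how the three queues fit together structurally. It is only an
illustration: the field names mirror the linked `scheduling_queue.go`, while the `QueuedPod`
type and everything else here are simplified stand-ins, not the scheduler’s actual types.

```go
package schedqueue

import (
	"sync"
	"time"
)

// QueuedPod is a simplified stand-in for the scheduler's internal pod wrapper:
// the pod's identity plus the bookkeeping the queues need.
type QueuedPod struct {
	Name      string
	Priority  int32
	Attempts  int       // failed scheduling attempts so far
	Timestamp time.Time // when the pod was added to a queue
}

// PriorityQueue sketches the three internal queues described above.
type PriorityQueue struct {
	lock sync.RWMutex

	// activeQ: a heap ordered by the QueueSort plugin (highest-priority pod
	// at the top by default); the scheduler pops one pod per scheduling cycle.
	activeQ []*QueuedPod

	// podBackoffQ: a heap ordered by backoff expiration; pods whose backoff
	// has completed are flushed back to activeQ.
	podBackoffQ []*QueuedPod

	// unschedulableQ: a map keyed by pod name; pods wait here until a move
	// request (or the periodic flush) picks them up.
	unschedulableQ map[string]*QueuedPod

	// moveRequestCycle records the scheduling cycle at which the latest move
	// request was issued (see the "Move request" section below).
	moveRequestCycle int64
}
```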

In addition, the scheduling queue mechanism has two periodic flushing goroutines
running in the background that are responsible for moving pods to the active queue
(see the sketch after this list):
- [flushUnschedulableQLeftover](https://github.com/kubernetes/kubernetes/blob/4cc1127e9251fff364d5c77e2a9a9c3ad42383ab/pkg/scheduler/internal/queue/scheduling_queue.go#L350): runs every 30 seconds, moving pods out of the unschedulable
queue so that unschedulable pods that were not moved by any event
get retried again. A pod has to stay in the queue for at least 30 seconds to get moved.
In the worst case it can take up to 60 seconds for a pod to be moved.
- [flushBackoffQCompleted](https://github.com/kubernetes/kubernetes/blob/4cc1127e9251fff364d5c77e2a9a9c3ad42383ab/pkg/scheduler/internal/queue/scheduling_queue.go#L324): runs every second, moving pods that have been backed off
long enough to the active queue.
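
A minimal illustration of this flushing pattern, continuing the `PriorityQueue` sketch above
and using only the standard library; the real scheduler uses its own helpers, and the flush
methods here are placeholders.

```go
// runFlushLoops starts the two background flushers described above.
func (p *PriorityQueue) runFlushLoops(stop <-chan struct{}) {
	// Retry long-parked unschedulable pods every 30 seconds.
	go p.flushEvery(stop, 30*time.Second, p.flushUnschedulableQLeftover)
	// Move pods whose backoff has completed to the active queue every second.
	go p.flushEvery(stop, time.Second, p.flushBackoffQCompleted)
}

func (p *PriorityQueue) flushEvery(stop <-chan struct{}, period time.Duration, flush func()) {
	ticker := time.NewTicker(period)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			flush()
		case <-stop:
			return
		}
	}
}

// Placeholder bodies; the real functions move eligible pods into activeQ
// (or podBackoffQ) while holding the queue lock.
func (p *PriorityQueue) flushUnschedulableQLeftover() { /* move pods parked for at least 30s */ }
func (p *PriorityQueue) flushBackoffQCompleted()      { /* move pods whose backoff expired */ }
```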

Both retry periods for the goroutines are fixed and non-configurable.
Also, in response to certain events, the scheduler
moves pods from the unschedulable queue to the active or backoff queue (by invoking [MoveAllToActiveOrBackoffQueue](https://github.com/kubernetes/kubernetes/blob/4cc1127e9251fff364d5c77e2a9a9c3ad42383ab/pkg/scheduler/internal/queue/scheduling_queue.go#L493)).
Example events include a node addition or update, an existing pod being deleted, etc.

![Pods moving between queues](scheduling_queues.png "Pods moving between queues")

## Active queue (heap)

A queue with the highest-priority pod at the top by default; the ordering
can be customized via the QueueSort extension point. Newly created pods, with an empty `.spec.nodeName`,
are added to the queue as they come. In each scheduling cycle the scheduler takes
one pod from the queue and tries to schedule it. In case the scheduling algorithm
fails (e.g. a plugin error or a binding error), the pod is moved to the unschedulable queue,
or to the backoff queue if a move request was issued at the same or a newer time.
A move request signals moving pods out of the unschedulable queue into the active or backoff queue (see below).
If a pod is scheduled without an error, it is removed from all queues.
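
As an illustration of the default ordering, here is a sketch of a less function over the
`QueuedPod` stand-in from above: higher priority wins, with ties broken by how long a pod has
been waiting. The default QueueSort plugin behaves along these lines, but any configured
QueueSort plugin can replace this ordering.

```go
// lessByPriority sketches the default active-queue ordering: pods with higher
// priority come first; for equal priorities, the pod that has waited longer wins.
func lessByPriority(a, b *QueuedPod) bool {
	if a.Priority != b.Priority {
		return a.Priority > b.Priority
	}
	return a.Timestamp.Before(b.Timestamp)
}
```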

## Backoff queue (heap)

A queue keeping pods in a waiting state to avoid continuous retries. The queue ordering
keeps the pod with the shortest backoff timeout at the top. The more times a pod gets
backed off, the longer it takes for the pod to re-enter the active queue. The backoff
timeout grows exponentially with each failed scheduling attempt until it reaches its maximum.
The scheduler allows configuring the initial backoff (1 second by default) and the maximum
backoff (10 seconds by default). A pod can get to the backoff queue
when a move request (see below) is issued.

As an example, a pod with 3 failed attempts gets its target backoff timeout
set to curTime + 2^3 s (8s). With 5 failed attempts the timeout gets set to curTime + 2^5 s (32s).
In case the maximum backoff is too low (e.g. the default 10s), a pod can get to the active
queue too often. So it’s recommended to configure the maximum backoff to fit the workloads,
so that pods stay in the backoff queue long enough to avoid flooding the active queue
with pods that repeatedly fail to be scheduled.
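
A sketch of the backoff computation matching the worked example above: the timeout doubles with
every failed attempt and is capped at the configured maximum. The function is illustrative, not
the scheduler’s exact code. Note that with the default 10-second maximum the 32s value would be
capped at 10s, so the example implicitly assumes a larger configured maximum.

```go
// backoffDuration returns how long a pod should stay in the backoff queue
// after the given number of failed attempts, capped at maxBackoff.
func backoffDuration(attempts int, initial, maxBackoff time.Duration) time.Duration {
	d := initial
	for i := 0; i < attempts; i++ {
		d *= 2
		if d >= maxBackoff {
			return maxBackoff
		}
	}
	return d
}
```

For instance, `backoffDuration(3, time.Second, time.Hour)` yields 8s and
`backoffDuration(5, time.Second, time.Hour)` yields 32s; with the default 10-second maximum
the latter would be capped at 10s.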

## Unschedulable queue (map)

A queue keeping all pods that failed to be scheduled and were not subject to a move request.
Pods are kept in the queue until a move request is issued.

## Move request

A move request is an operation that moves pods from the unschedulable queue
to either the active or the backoff queue. Different cluster
events can asynchronously trigger a move request and make unschedulable
pods (that were tried before) schedulable again. The events currently include
changes in pods, nodes, services, PVs, PVCs, storage classes and CSI nodes.

It’s possible that a move request gets issued while a pod is failing to be scheduled.
Thanks to the event behind that request, the pod might now be schedulable, and the following
mechanism allows such a pod to be retried quickly. Every move request operation stores the current
scheduling cycle in the [moveRequestCycle](https://github.com/kubernetes/kubernetes/blob/4cc1127e9251fff364d5c77e2a9a9c3ad42383ab/pkg/scheduler/internal/queue/scheduling_queue.go#L523) variable. After a pod fails scheduling,
it is normally put in the unschedulable queue, unless moveRequestCycle
is at or after the cycle in which the pod was being scheduled, in which case the pod takes
a shortcut and gets moved straight to the backoff queue.
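
The shortcut can be sketched as follows, continuing the `PriorityQueue` stand-in from above;
the method name and the exact comparison are illustrative, and the real logic lives behind
the linked `moveRequestCycle`.

```go
// addUnschedulableIfNotPresent sketches where a pod lands after a failed
// scheduling attempt, depending on whether a move request was observed.
func (p *PriorityQueue) addUnschedulableIfNotPresent(pod *QueuedPod, podSchedulingCycle int64) {
	p.lock.Lock()
	defer p.lock.Unlock()
	if p.moveRequestCycle >= podSchedulingCycle {
		// A move request was issued during (or after) this pod's scheduling
		// cycle: skip the unschedulable queue and back the pod off instead.
		p.podBackoffQ = append(p.podBackoffQ, pod)
		return
	}
	// No relevant move request: park the pod until one arrives or the
	// 30-second flush retries it.
	p.unschedulableQ[pod.Name] = pod
}
```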

**Examples**:
- When a pod is scheduled, some pods in the unschedulable queue with matching
affinity can become schedulable. If matching affinity is the only remaining
condition for scheduling, issuing a move request for those pods will allow
them to finally get scheduled.
- A pod is being processed by the filter plugins, which leave no feasible nodes for scheduling.
Meanwhile, an asynchronous move request gets issued as a reaction to a new node event.
Moving the pod to the backoff queue allows it to get back into the active
queue sooner and check whether the new node is eligible for scheduling.

## Metrics

The scheduling queue populates two metrics:
[pending_pods](https://github.com/kubernetes/kubernetes/blob/4cc1127e9251fff364d5c77e2a9a9c3ad42383ab/pkg/scheduler/metrics/metrics.go#L83-L89) and
[queue_incoming_pods_total](https://github.com/kubernetes/kubernetes/blob/4cc1127e9251fff364d5c77e2a9a9c3ad42383ab/pkg/scheduler/metrics/metrics.go#L141-L147).
Together they count how many pods are pending in each of the three queues and how many
times a pod was enqueued into each queue, including which event was responsible
for the enqueueing. The events can include a failed scheduling attempt,
a pod finishing its backoff, a node being added, a service being updated, etc. The metrics let us
see how many pods are present in each queue, how often pods
are unschedulable, what the scheduler throughput is, and which events most often move
pods from one queue to another.
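
For illustration, the shape of these metrics could be expressed with `client_golang` roughly as
below; the variable names, label names and help strings are assumptions based on the description
above, not the actual definitions in the linked `metrics.go`.

```go
package schedqueue

import "github.com/prometheus/client_golang/prometheus"

var (
	// pendingPods: gauge of pods currently waiting, split per queue
	// (active, backoff, unschedulable).
	pendingPods = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "pending_pods",
			Help: "Number of pending pods, by queue.",
		},
		[]string{"queue"},
	)

	// queueIncomingPodsTotal: counter of enqueue operations, split per queue
	// and per triggering event (e.g. a failed scheduling attempt, backoff
	// completion, a node addition, a service update).
	queueIncomingPodsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "queue_incoming_pods_total",
			Help: "Number of pods added to a scheduling queue, by queue and event.",
		},
		[]string{"queue", "event"},
	)
)
```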