|
| 1 | +# Scheduling queue in kube-scheduler |
| 2 | + |
| 3 | +The queueing mechanism is an integral part of the scheduler. It allows the scheduler |
| 4 | +to pick the most suitable pod for the next scheduling cycle. Given that a pod can |
| 5 | +specify various conditions that have to be met at the time of scheduling, |
| 6 | +such as the existence of a persistent volume, compliance with pod anti-affinity rules |
| 7 | +or toleration of node taints, the mechanism needs to be able to postpone |
| 8 | +the scheduling action until the cluster can meet all the conditions for |
| 9 | +successful scheduling. The mechanism relies on three queues: |
| 10 | +- active ([activeQ](https://github.com/kubernetes/kubernetes/blob/4cc1127e9251fff364d5c77e2a9a9c3ad42383ab/pkg/scheduler/internal/queue/scheduling_queue.go#L130)): providing pods for immediate scheduling |
| 11 | +- unschedulable ([unschedulableQ](https://github.com/kubernetes/kubernetes/blob/4cc1127e9251fff364d5c77e2a9a9c3ad42383ab/pkg/scheduler/internal/queue/scheduling_queue.go#L135)): for parking pods which are waiting for certain condition(s) to happen |
| 12 | +- backoff ([podBackoffQ](https://github.com/kubernetes/kubernetes/blob/4cc1127e9251fff364d5c77e2a9a9c3ad42383ab/pkg/scheduler/internal/queue/scheduling_queue.go#L133)): exponentially postponing pods which failed |
| 13 | + to be scheduled (e.g. volume still getting created) but are expected to get scheduled eventually. |
| 14 | + |
| 15 | +In addition, the scheduling queue mechanism has two periodic flushing goroutines |
| 16 | +running in the background that are responsible for moving pods to the active queue: |
| 17 | +- [flushUnschedulableQLeftover](https://github.com/kubernetes/kubernetes/blob/4cc1127e9251fff364d5c77e2a9a9c3ad42383ab/pkg/scheduler/internal/queue/scheduling_queue.go#L350): runs every 30 seconds and moves pods out of the unschedulable |
| 18 | + queue so that unschedulable pods that were not moved by any event |
| 19 | + get retried. A pod has to stay in the queue for at least 30 seconds to get moved. |
| 20 | + In the worst case it can take up to 60 seconds to have a pod moved. |
| 21 | +- [flushBackoffQCompleted](https://github.com/kubernetes/kubernetes/blob/4cc1127e9251fff364d5c77e2a9a9c3ad42383ab/pkg/scheduler/internal/queue/scheduling_queue.go#L324): runs every second and moves pods that were backed off |
| 22 | + long enough to the active queue. |
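
The backoff flush can be pictured with a small sketch. It is not the scheduler's Go code: the function name, the `(expiry, pod)` tuple layout and the `<=` comparison are assumptions made for illustration. It only relies on the backoff queue being a heap ordered by backoff expiry.

```python
import heapq


def flush_backoff_completed(backoff_heap, active, now):
    """Move every pod whose backoff has expired (expiry <= now) to `active`.

    `backoff_heap` is a heapq-managed list of (expiry, pod_name) tuples,
    so the pod with the earliest expiry is always at index 0.
    """
    while backoff_heap and backoff_heap[0][0] <= now:
        _, pod = heapq.heappop(backoff_heap)
        active.append(pod)


# Usage: three pods backed off until t=5, t=1 and t=3; the flush runs at t=3.
backoff = []
for expiry, pod in [(5.0, "pod-a"), (1.0, "pod-b"), (3.0, "pod-c")]:
    heapq.heappush(backoff, (expiry, pod))

active = []
flush_backoff_completed(backoff, active, now=3.0)
```

Because the heap keeps the earliest expiry on top, the loop can stop at the first pod that is still backing off without scanning the rest.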
| 23 | + |
| 24 | +Both retry periods for the goroutines are fixed and non-configurable. |
| 25 | +Also, in response to certain events, the scheduler |
| 26 | +moves pods from the unschedulable queue to the active or the backoff queue (by invoking [MoveAllToActiveOrBackoffQueue](https://github.com/kubernetes/kubernetes/blob/4cc1127e9251fff364d5c77e2a9a9c3ad42383ab/pkg/scheduler/internal/queue/scheduling_queue.go#L493)). |
| 27 | +Example events include a node addition or update, an existing pod being deleted, etc. |
| 28 | + |
| 29 | + |
| 30 | + |
| 31 | +## Active queue (heap) |
| 32 | + |
| 33 | +A heap that by default keeps the highest priority pod at the top. The ordering |
| 34 | +can be customized via the QueueSort extension point. Newly created pods with an empty `.spec.nodeName` |
| 35 | +are added to the queue as they come. In each scheduling cycle the scheduler takes |
| 36 | +one pod from the queue and tries to schedule it. If the scheduling attempt |
| 37 | +fails (e.g. plugin error, binding error), the pod is moved to the unschedulable queue, |
| 38 | +or to the backoff queue if a move request was issued in the same or a newer scheduling cycle. |
| 39 | +A move request signals that pods should be moved from the unschedulable queue to the active or the backoff queue. |
| 40 | +If a pod is scheduled without an error, it is removed from all queues. |
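
As a rough illustration of the default ordering, the sketch below models the active queue as a priority heap. The class and method names are hypothetical, and breaking ties by arrival order is an assumption of this sketch rather than something stated above.

```python
import heapq
import itertools


class ActiveQueue:
    """Toy active queue: highest priority first, arrival order among equals."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # monotonically increasing tie-breaker

    def add(self, pod_name, priority):
        # heapq is a min-heap, so the priority is negated to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._arrival), pod_name))

    def pop(self):
        # Returns the pod the next scheduling cycle would work on.
        _, _, pod_name = heapq.heappop(self._heap)
        return pod_name

    def __len__(self):
        return len(self._heap)


# Usage: the high-priority pod jumps ahead of pods that arrived earlier.
q = ActiveQueue()
q.add("batch-1", priority=0)
q.add("critical", priority=100)
q.add("batch-2", priority=0)
order = [q.pop() for _ in range(len(q))]
```

A custom QueueSort plugin corresponds to swapping the tuple used as the heap key for a different comparison.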
| 41 | + |
| 42 | +## Backoff queue (heap) |
| 43 | +A queue keeping pods in a waiting state to avoid continuous retries. The queue ordering |
| 44 | +keeps the pod with the shortest backoff timeout at the top. The more times a pod gets |
| 45 | +backed off, the longer it takes for the pod to re-enter the active queue. The backoff |
| 46 | +timeout grows exponentially with each failed scheduling attempt until it reaches its maximum. |
| 47 | +The scheduler allows configuring the initial backoff (set to 1 second by default) and the maximum |
| 48 | +backoff (set to 10 seconds by default). A pod can get to the backoff queue |
| 49 | +when a move request (see below) is issued. |
| 50 | + |
| 51 | +As an example, with the default initial backoff of 1s doubling on every retry, a pod with 3 failed attempts gets the target backoff timeout |
| 52 | +set to curTime + 1s * 2^2 (4s). With 5 failed attempts the uncapped value would be 1s * 2^4 (16s), so the timeout is set to curTime + 10s, the default maximum. |
| 53 | +In case the maximum backoff is too low (e.g. the default 10s), a pod can get to the active |
| 54 | +queue too often. It’s therefore recommended to configure the maximum backoff to fit the workloads |
| 55 | +so that pods stay in the backoff queue long enough to avoid flooding the active queue |
| 56 | +with pods that repeatedly fail to be scheduled. |
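
The growth of the backoff can be sketched as a capped doubling. This is a minimal sketch with a hypothetical function name. It assumes the timeout starts at the initial backoff on the first failed attempt and doubles with each further attempt, so the exact figures depend on that starting point.

```python
def backoff_duration(attempts, initial=1.0, maximum=10.0):
    """Return the backoff in seconds after `attempts` failed scheduling attempts.

    Assumption: the first failed attempt waits `initial` seconds and each
    further attempt doubles the wait until `maximum` is reached.
    """
    duration = initial
    for _ in range(1, attempts):
        duration *= 2
        if duration > maximum:
            return maximum
    return duration


# Usage: with the defaults the cap is hit on the fifth attempt.
defaults = [backoff_duration(n) for n in range(1, 7)]

# A larger maximum keeps repeatedly failing pods out of the active queue longer.
relaxed = [backoff_duration(n, maximum=60.0) for n in range(1, 7)]
```

Raising `maximum` is the knob the paragraph above recommends tuning to the workload.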
| 57 | + |
| 58 | +## Unschedulable queue (map) |
| 59 | +A map keeping all pods that failed to be scheduled and were not subject to a move request. |
| 60 | +Pods are kept in the queue until a move request is issued or the periodic flush moves them out. |
| 61 | + |
| 62 | +## Moving request |
| 63 | + |
| 64 | +A moving request triggers an event responsible for moving pods from |
| 65 | +the unschedulable queue to either the active or the backoff queue. Different cluster |
| 66 | +events can asynchronously trigger a moving request and make unschedulable |
| 67 | +pods (that were tried before) schedulable again. The events currently include |
| 68 | +changes in pods, nodes, services, PVs, PVCs, storage classes and CSI nodes. |
| 69 | + |
| 70 | +It’s possible that a pod fails to be scheduled while a moving request gets issued. |
| 71 | +Due to the triggering event, the pod might now be schedulable, and the following mechanism |
| 72 | +allows such a pod to be retried. Every moving request operation stores the current |
| 73 | +scheduling cycle in the [moveRequestCycle](https://github.com/kubernetes/kubernetes/blob/4cc1127e9251fff364d5c77e2a9a9c3ad42383ab/pkg/scheduler/internal/queue/scheduling_queue.go#L523) variable. After a pod fails scheduling, |
| 74 | +it is normally put in the unschedulable queue, unless moveRequestCycle |
| 75 | +is the current scheduling cycle, in which case the pod takes a shortcut |
| 76 | +and gets moved straight to the backoff queue. |
| 77 | + |
| 78 | +**Examples**: |
| 79 | +- When a pod is scheduled, some pods in the unschedulable queue with matching |
| 80 | + affinity can be made schedulable. If matching affinity is the only required |
| 81 | + condition for scheduling, issuing a moving request for those pods will allow |
| 82 | + them to finally get scheduled. |
| 83 | +- A pod is being processed by filter plugins, which leave no nodes eligible for scheduling. |
| 84 | + Meanwhile, an asynchronous moving request gets issued in reaction to a new node event. |
| 85 | + Moving the pod to the backoff queue allows the pod to be moved sooner |
| 86 | + into the active queue to check whether the new node is eligible for scheduling. |
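
The shortcut described in this section can be sketched as follows. All names are hypothetical. The `>=` comparison reflects the "same or newer" wording used for the active queue, and moving every parked pod to the backoff queue is a simplification, since a real move request may place pods in either the active or the backoff queue.

```python
class ToySchedulingQueue:
    """Toy model of how a move request redirects a failed pod."""

    def __init__(self):
        self.scheduling_cycle = 0
        self.move_request_cycle = -1  # no move request seen yet
        self.unschedulable = {}       # pod name -> cycle in which it failed
        self.backoff = []             # pod names, ordering ignored here

    def start_cycle(self):
        # Each scheduling attempt runs in its own cycle.
        self.scheduling_cycle += 1
        return self.scheduling_cycle

    def move_request(self):
        # Remember the cycle and release all parked pods (simplified: to backoff).
        self.move_request_cycle = self.scheduling_cycle
        self.backoff.extend(self.unschedulable)
        self.unschedulable.clear()

    def add_failed(self, pod, pod_cycle):
        # A move request issued during or after the pod's cycle means the
        # cluster changed while the pod was being tried, so retry it sooner.
        if self.move_request_cycle >= pod_cycle:
            self.backoff.append(pod)
        else:
            self.unschedulable[pod] = pod_cycle


# Usage: pod-b fails in the same cycle in which a node event arrives.
q = ToySchedulingQueue()
c1 = q.start_cycle()
q.add_failed("pod-a", c1)   # no move request yet: parked as unschedulable
c2 = q.start_cycle()
q.move_request()            # e.g. a node was added while pod-b was being tried
q.add_failed("pod-b", c2)   # takes the shortcut to the backoff queue
c3 = q.start_cycle()
q.add_failed("pod-c", c3)   # no newer move request: parked again
```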
| 87 | + |
| 88 | +## Metrics |
| 89 | + |
| 90 | +The scheduling queue populates two metrics: |
| 91 | +[pending_pods](https://github.com/kubernetes/kubernetes/blob/4cc1127e9251fff364d5c77e2a9a9c3ad42383ab/pkg/scheduler/metrics/metrics.go#L83-L89) and |
| 92 | +[queue_incoming_pods_total](https://github.com/kubernetes/kubernetes/blob/4cc1127e9251fff364d5c77e2a9a9c3ad42383ab/pkg/scheduler/metrics/metrics.go#L141-L147). |
| 93 | +Together they report, for each of the three queues, how many pods are pending and how many |
| 94 | +times a pod was enqueued, including which event was responsible |
| 95 | +for the enqueueing. The events can include failed scheduling attempts, |
| 96 | +a pod finishing backoff, a node being added, a service being updated, etc. The metrics allow us |
| 97 | +to see how many pods are present in each queue, how often pods |
| 98 | +are unschedulable, what the scheduler throughput is, and which events move |
| 99 | +pods from one queue to another most often. |