It won't be done again when moving pods from backoffQ to activeQ.
### Risks and Mitigations
#### A tiny delay on the first scheduling attempts for newly created pods
While the scheduler handles a pod popped directly from backoffQ, another pod that should be scheduled before it may appear in activeQ.
However, in the real world, if the scheduling latency is short enough, there won't be a visible downgrade in throughput.
This will only happen when there are no pods in activeQ, so it can be mitigated by an appropriate rate of pod creation.
#### Backoff won't be working as natural rate limiter in case of errors
In case of API call errors (e.g., network issues), backoffQ limits the number of retries in the short term.
This proposal will take those pods earlier, losing this mechanism.

After merging [kubernetes#128748](https://github.com/kubernetes/kubernetes/pull/128748),
it will be possible to distinguish pods backing off because of errors from those backing off because of an unschedulable attempt.
This information could be used when popping, by filtering only the pods that come from an unschedulable attempt, or even by splitting backoffQ.
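The filtering idea above can be sketched as follows. This is a minimal illustration, not the scheduler's code: the `backoffPod` type and its `errorBackoff` field are hypothetical stand-ins for whatever marker kubernetes#128748 introduces.

```go
// Sketch of the mitigation: once a pod records whether its backoff came from
// an error or from an unschedulable attempt, pop-from-backoffQ can skip the
// error-backoff pods so they keep their natural rate limiting.
// All names here are illustrative, not the real kube-scheduler API.
package main

import "fmt"

type backoffPod struct {
	name         string
	errorBackoff bool // true when backing off due to an API/network error
}

// nextPoppable returns the first pod whose backoff is not error-induced.
func nextPoppable(backoffQ []backoffPod) (backoffPod, bool) {
	for _, p := range backoffQ {
		if !p.errorBackoff {
			return p, true
		}
	}
	return backoffPod{}, false
}

func main() {
	q := []backoffPod{
		{name: "pod-err", errorBackoff: true},
		{name: "pod-unsched", errorBackoff: false},
	}
	p, ok := nextPoppable(q)
	fmt.Println(p.name, ok) // the error-backoff pod is skipped
}
```

Splitting backoffQ into two heaps (error vs. unschedulable) would achieve the same effect without a linear scan.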
This has to be resolved before the beta is released.

#### One pod in backoffQ could starve the others

The head of backoffQ is the pod with the closest backoff expiration,
and the backoff time is calculated based on the number of scheduling failures that the pod has experienced.
If one pod has a smaller attempt counter than others,
could the scheduler keep popping this pod ahead of other pods because its backoff expires faster?
Actually, that won't happen, because the scheduler will increment the attempt counter of pods popped from backoffQ as well,
which makes a pod's backoff time grow after each scheduling attempt,
so the pod that started with a smaller attempt number eventually won't be popped ahead of the others.
## Design Details
165
169
166
170
### Popping from backoffQ in activeQ's pop()
167
171
168
172
To achieve the goal, activeQ's `pop()` method needs to be changed:
1. If activeQ is empty, then instead of waiting for a pod to arrive at activeQ, popping from backoffQ is tried.
2. If backoffQ is empty, then `pop()` waits for a pod as previously.
3. If backoffQ is not empty, then the pod is processed as if it had been taken from activeQ, including incrementing its attempts number.

It is popping from a heap data structure, so it should be fast enough not to cause any performance troubles.
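The three steps above can be sketched as follows. This is a simplified, self-contained illustration of the control flow, assuming slice-backed queues and illustrative type names (`schedulingQueue`, `podInfo`) rather than the real heap-backed scheduler structures.

```go
// Sketch of activeQ's pop() falling back to backoffQ when activeQ is empty.
// Step 1: activeQ empty -> try backoffQ instead of waiting.
// Step 2: backoffQ also empty -> wait on the condition as before.
// Step 3: backoffQ non-empty -> treat the pod as if it came from activeQ,
//         including incrementing its attempts counter.
package main

import (
	"fmt"
	"sync"
)

type podInfo struct {
	name     string
	attempts int // incremented on every pop, from activeQ or backoffQ
}

type schedulingQueue struct {
	lock     sync.Mutex
	cond     *sync.Cond
	activeQ  []*podInfo // stand-in for the real activeQ heap
	backoffQ []*podInfo // stand-in for the real backoffQ heap
	closed   bool
}

func newQueue() *schedulingQueue {
	q := &schedulingQueue{}
	q.cond = sync.NewCond(&q.lock)
	return q
}

func (q *schedulingQueue) pop() (*podInfo, bool) {
	q.lock.Lock()
	defer q.lock.Unlock()
	for len(q.activeQ) == 0 {
		if len(q.backoffQ) > 0 { // step 1: fall back to backoffQ
			p := q.backoffQ[0]
			q.backoffQ = q.backoffQ[1:]
			p.attempts++ // step 3: same accounting as an activeQ pop
			return p, true
		}
		if q.closed {
			return nil, false
		}
		q.cond.Wait() // step 2: both queues empty, wait as before
	}
	p := q.activeQ[0]
	q.activeQ = q.activeQ[1:]
	p.attempts++
	return p, true
}

func main() {
	q := newQueue()
	q.backoffQ = append(q.backoffQ, &podInfo{name: "pod-a"})
	p, _ := q.pop() // activeQ is empty, so pod-a is taken from backoffQ
	fmt.Println(p.name, p.attempts)
}
```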
To support monitoring, when popping from backoffQ,
the `scheduler_queue_incoming_pods_total` metric with an `activeQ` queue label and a new `PopFromBackoffQ` event label will be incremented.
### Notifying activeQ condition when new pod appears in backoffQ
Pods might appear in backoffQ while `pop()` is hanging on point 2.
The whole feature should already be covered by integration tests.
- Gather feedback from users and fix reported bugs.
- Change the feature flag to be enabled by default.
- Make sure [backoff in case of error](#backoff-wont-be-working-as-natural-rate-limiter-in-case-of-errors) is not skipped.
#### GA
**Upgrade**
During the alpha period, users have to enable the feature gate `SchedulerPopFromBackoffQ` to opt into this feature.
This is a purely in-memory feature for kube-scheduler, so no special actions are required outside the scheduler.
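Opting in uses the standard Kubernetes feature-gate mechanism, for example (flag shown directly on the kube-scheduler binary; adjust to however your control plane is deployed, e.g. a static pod manifest):

```shell
# Enable the alpha feature gate on kube-scheduler.
kube-scheduler --feature-gates=SchedulerPopFromBackoffQ=true
```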
**Downgrade**
###### How can this feature be enabled / disabled in a live cluster?
- [x] Feature gate (also fill in values in `kep.yaml`)