|
| 1 | +# KEP-1923: Prefer Nominated Node |
| 2 | + |
| 3 | +<!-- toc --> |
| 4 | +- [Release Signoff Checklist](#release-signoff-checklist) |
| 5 | +- [Summary](#summary) |
| 6 | +- [Motivation](#motivation) |
| 7 | + - [Goals](#goals) |
| 8 | +- [Proposal](#proposal) |
| 9 | + - [User Stories (Optional)](#user-stories-optional) |
| 10 | + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) |
| 11 | +- [Design Details](#design-details) |
| 12 | + - [Implementation Details](#implementation-details) |
| 13 | + - [Test Plan](#test-plan) |
| 14 | + - [Graduation Criteria](#graduation-criteria) |
| 15 | + - [Alpha (v1.21):](#alpha-v121) |
| 16 | +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) |
| 17 | + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) |
| 18 | +- [Implementation History](#implementation-history) |
| 19 | +<!-- /toc --> |
| 20 | + |
| 21 | +## Release Signoff Checklist |
| 22 | + |
| 23 | +Items marked with (R) are required *prior to targeting to a milestone / release*. |
| 24 | + |
| 25 | +- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) |
| 26 | +- [ ] (R) KEP approvers have approved the KEP status as `implementable` |
| 27 | +- [x] (R) Design details are appropriately documented |
| 28 | +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input |
| 29 | +- [ ] (R) Graduation criteria is in place |
| 30 | +- [ ] (R) Production readiness review completed |
| 31 | +- [ ] Production readiness review approved |
| 32 | +- [ ] "Implementation History" section is up-to-date for milestone |
| 33 | +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] |
| 34 | +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes |
| 35 | + |
| 36 | +<!-- |
| 37 | +**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone. |
| 38 | +--> |
| 39 | + |
| 40 | +[kubernetes.io]: https://kubernetes.io/ |
| 41 | +[kubernetes/enhancements]: https://git.k8s.io/enhancements |
| 42 | +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes |
| 43 | +[kubernetes/website]: https://git.k8s.io/website |
| 44 | + |
| 45 | +## Summary |
| 46 | + |
| 47 | +This KEP proposes changing the scheduling cycle so that a pod's nominated node is evaluated first, |
| 48 | +and the pod is scheduled on that node if it fits. Only if the nominated node doesn't fit the pod does the |
| 49 | +scheduling cycle continue with the standard logic of evaluating the rest of the nodes in the cluster. |
| 50 | + |
| 51 | +## Motivation |
| 52 | + |
| 53 | +If the scheduler fails to fit an incoming pod on any node, it will try to preempt lower-priority pods |
| 54 | +running on a selected node to make room for the pod. The name of this node will be set in the |
| 55 | +pod's `.status.nominatedNodeName`. |
| 56 | + |
| 57 | +The Node is called *Nominated* to indicate the intent for the Pod to be scheduled on it once preemption |
| 58 | +of other Pods finishes. However, the Pod's `.status.nominatedNodeName` information is not fully utilized |
| 59 | +in the Pod's following scheduling attempts. |
| 60 | + |
| 61 | +Pod scheduling is split into two phases: the scheduling cycle and the binding cycle. The scheduling cycle |
| 62 | +primarily includes filtering and scoring. |
| 63 | + |
| 64 | +When preemption happens in a previous scheduling cycle, there is a high chance that the nominated node is |
| 65 | +the *only* node that satisfies the filters for the unscheduled Pod that triggered preemption. |
| 66 | + |
| 67 | +In real production environments, pods can have different priorities due to business needs, and preemption |
| 68 | +can happen to make sure higher-priority pods get scheduled. |
| 69 | + |
| 70 | +In a cluster with a large number of computing nodes, evaluating all nodes when scheduling a pod is time-consuming. |
| 71 | + |
| 72 | +### Goals |
| 73 | + |
| 74 | +Prefer scheduling a pod to its `.status.nominatedNodeName` if set. If the nominated node doesn't fit the pod, |
| 75 | +the scheduling cycle will continue to evaluate the rest of the nodes in the cluster just like it does today. |
| 76 | + |
| 77 | + |
| 78 | +## Proposal |
| 79 | + |
| 80 | +### User Stories (Optional) |
| 81 | + |
| 82 | +Users want faster scheduling. Since it is highly likely the pod will only fit on the nominated node, the improvement |
| 83 | +in scheduling latency will come at negligible cost (the cost being placing the pod on a less optimal node). |
| 84 | + |
| 85 | +### Notes/Constraints/Caveats (Optional) |
| 86 | + |
| 87 | +When this feature is enabled, the preemptor Pod might not be dispatched to the best candidate node in some corner cases, |
| 88 | +e.g. when another node releases resources and becomes the best candidate while the victim pods are being removed from the |
| 89 | +nominated node. |
| 90 | + |
| 91 | +## Design Details |
| 92 | + |
| 93 | +### Implementation Details |
| 94 | + |
| 95 | +1. In the filtering phase, which is currently implemented in the `findNodesThatFitPod` method, check the nominated node |
| 96 | + first if the incoming pod has `.status.nominatedNodeName` set and the feature gate is enabled. |
| 97 | + |
| 98 | +2. If the nominated node no longer suits the incoming pod, an `err` is returned from `findNodesThatPassFilters`, where |
| 99 | + `NominatedNode` is evaluated first. The `err` will be padded with more information to indicate that the scheduler was |
| 100 | + evaluating the feasibility of `NominatedNode` and failed on that node. |
| 101 | + |
| 102 | + If no error is returned but `NominatedNode` does not pass all the filters, a possible cause is that resources which |
| 103 | + were claimed to be removed have not been fully released yet. |
| 104 | + |
| 105 | + In both of the above cases, the scheduler will continue to evaluate the rest of the nodes to check whether any node is |
| 106 | + already available for the incoming pod. |
| 107 | + |
| 108 | + The scheduler will retry until one of the following cases occurs: |
| 109 | + - `NominatedNode` eventually releases all the resources and the preemptor pod can be scheduled on that node. |
| 110 | + - Another node in the cluster releases enough resources and the pod gets scheduled on that node instead. |
| 111 | + [Discuss] Should the scheduler clear the `NominatedNode` in this case? |
| 112 | + - Resources cannot be released on the `NominatedNode` and no other candidate node can be found in the cluster; this will |
| 113 | + be covered by [issue 95752](https://github.com/kubernetes/kubernetes/issues/95752). |
| 114 | + |
| 115 | + |
| 116 | +### Test Plan |
| 117 | + |
| 118 | +The following tests will be covered or considered: |
| 119 | + |
| 120 | +- **Unit Tests**: All core changes must be covered by unit tests. |
| 121 | +- **Integration Tests**: Integration tests will be provided if necessary, for example: |
| 122 | + - enable the feature |
| 123 | + - preempt the victim pods on the nominated node |
| 124 | + - check that the pod is scheduled on the nominated node |
| 125 | +- **Benchmark Tests**: A benchmark test that compares the performance before and after the change. |
| 126 | + The performance improvement is visible in the benchmark of `scheduling_algorithm_predicate_evaluation_seconds`. |
| 127 | + Other benchmarks will be created on demand during the code review process. |
| 128 | + |
| 129 | + |
| 130 | +### Graduation Criteria |
| 131 | + |
| 132 | +#### Alpha (v1.21): |
| 133 | + |
| 134 | +- [ ] New feature gate proposed to enable the feature. |
| 135 | +- [ ] Implementation of the new feature in scheduling framework. |
| 136 | +- [ ] Test cases mentioned in the [Test Plan](#test-plan). |
| 137 | + |
| 138 | +## Production Readiness Review Questionnaire |
| 139 | + |
| 140 | +### Feature Enablement and Rollback |
| 141 | + |
| 142 | +_This section must be completed when targeting alpha to a release._ |
| 143 | + |
| 144 | +* **How can this feature be enabled / disabled in a live cluster?** |
| 145 | + - [x] Feature gate (also fill in values in `kep.yaml`) |
| 146 | + - Feature gate name: PreferNominatedNode |
| 147 | + - Components depending on the feature gate: kube-scheduler |
| 148 | + |
| 149 | +* **Are there any tests for feature enablement/disablement?** |
| 150 | + Unit tests will cover this. |
| 151 | + |
| 152 | + |
| 153 | +## Implementation History |
| 154 | + |
| 155 | +- 2020-09-29: Initial KEP sent out for review https://github.com/kubernetes/enhancements/pull/2026 |
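
The nominated-node-first filtering described under "Implementation Details" can be sketched roughly as follows. This is a minimal illustration only, not the scheduler's actual code: kube-scheduler is written in Go, and the `Node` and `Pod` types, the single `fits_resources` filter, and the `prefer_nominated_node` flag (standing in for the `PreferNominatedNode` feature gate) are all hypothetical simplifications. The two function names loosely mirror `findNodesThatFitPod` and `findNodesThatPassFilters`, but the error path from step 2 is not modelled.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    # Hypothetical, simplified node: only free CPU is tracked.
    name: str
    free_cpu: int


@dataclass
class Pod:
    # Hypothetical, simplified pod; nominated_node_name stands in for
    # .status.nominatedNodeName.
    name: str
    cpu_request: int
    nominated_node_name: Optional[str] = None


Filter = Callable[[Pod, Node], bool]


def fits_resources(pod: Pod, node: Node) -> bool:
    """Toy filter: the node must have enough free CPU for the pod."""
    return node.free_cpu >= pod.cpu_request


def find_nodes_that_pass_filters(
    pod: Pod, nodes: List[Node], filters: List[Filter]
) -> List[Node]:
    """Return the nodes on which every filter accepts the pod."""
    return [n for n in nodes if all(f(pod, n) for f in filters)]


def find_nodes_that_fit_pod(
    pod: Pod,
    nodes: List[Node],
    filters: List[Filter],
    prefer_nominated_node: bool,
) -> List[Node]:
    """Evaluate the nominated node first; fall back to all nodes if it does not fit."""
    if prefer_nominated_node and pod.nominated_node_name:
        nominated = [n for n in nodes if n.name == pod.nominated_node_name]
        feasible = find_nodes_that_pass_filters(pod, nominated, filters)
        if feasible:
            # The nominated node fits, so the other nodes are never evaluated.
            return feasible
    # Standard logic: evaluate every node in the cluster.
    return find_nodes_that_pass_filters(pod, nodes, filters)
```

With the flag enabled and a nominated node that fits, only that node is returned, and the remaining nodes are skipped. If the nominated node does not fit (for example, because victim pods have not yet released their resources), the function falls back to evaluating the full node list, as it does when the flag is disabled.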