Commit 35bcbc9

Merge pull request kubernetes#2026 from chendave/tryNominatedNodeFirst
Add KEP 1923 - try nominated node first
2 parents da0fd16 + de6d84f commit 35bcbc9

2 files changed

+182
-0
lines changed
Lines changed: 155 additions & 0 deletions
@@ -0,0 +1,155 @@
1+
# KEP-1923: Prefer Nominated Node
2+
3+
<!-- toc -->
4+
- [Release Signoff Checklist](#release-signoff-checklist)
5+
- [Summary](#summary)
6+
- [Motivation](#motivation)
7+
- [Goals](#goals)
8+
- [Proposal](#proposal)
9+
- [User Stories (Optional)](#user-stories-optional)
10+
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
11+
- [Design Details](#design-details)
12+
- [Implementation Details](#implementation-details)
13+
- [Test Plan](#test-plan)
14+
- [Graduation Criteria](#graduation-criteria)
15+
- [Alpha (v1.21):](#alpha-v121)
16+
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
17+
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
18+
- [Implementation History](#implementation-history)
19+
<!-- /toc -->
20+
21+
## Release Signoff Checklist
22+
23+
Items marked with (R) are required *prior to targeting to a milestone / release*.
24+
25+
- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
26+
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
27+
- [x] (R) Design details are appropriately documented
28+
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
29+
- [ ] (R) Graduation criteria is in place
30+
- [ ] (R) Production readiness review completed
31+
- [ ] Production readiness review approved
32+
- [ ] "Implementation History" section is up-to-date for milestone
33+
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
34+
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
35+
36+
<!--
37+
**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
38+
-->
39+
40+
[kubernetes.io]: https://kubernetes.io/
41+
[kubernetes/enhancements]: https://git.k8s.io/enhancements
42+
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
43+
[kubernetes/website]: https://git.k8s.io/website
44+
45+
## Summary
46+
47+
This KEP proposes changing the scheduling cycle so that the nominated node of a pod is evaluated first
48+
and the pod is scheduled on that node if it fits. Only if the nominated node doesn't fit the pod does the
49+
scheduling cycle continue with the standard logic of evaluating the rest of the nodes in the cluster.
50+
51+
## Motivation
52+
53+
If the scheduler fails to fit an incoming pod on any node, it will try to preempt lower priority pods
54+
running on a selected node to make room for the pod. The name of this node will be set in the
55+
pod's `.status.nominatedNodeName`.
56+
57+
The Node is called *Nominated* to indicate the intent for the Pod to be scheduled on it once preemption
58+
of other Pods finishes. However, the Pod's `.status.nominatedNodeName` information is not fully utilized
59+
in the Pod's following scheduling attempts.
60+
61+
Pod scheduling is split into two phases: the scheduling cycle and the binding cycle. The scheduling cycle
62+
primarily consists of filtering and scoring.
63+
64+
When preemption happens in a previous scheduling cycle, there is a high chance that the nominated node is
65+
the *only* node that satisfies the filters for the unscheduled Pod that triggered preemption.
66+
67+
In real production environments, pods can have different priorities due to business needs, and preemption
68+
may be triggered to make sure that higher priority pods get scheduled.
69+
70+
In a cluster with a large number of nodes, evaluating all nodes when scheduling a pod is time consuming.
71+
72+
### Goals
73+
74+
Prefer scheduling a pod to the node named in its `.status.nominatedNodeName`, if set. If the nominated node doesn't fit the pod,
75+
the scheduling cycle will continue to evaluate the rest of the nodes in the cluster just like we do today.
76+
77+
78+
## Proposal
79+
80+
### User Stories (Optional)
81+
82+
Users want faster scheduling. Since it is highly likely the pod will only fit on the nominated node, the improvement
83+
in scheduling latency will come at negligible cost (the cost being placing the pod on a less optimal node).
84+
85+
### Notes/Constraints/Caveats (Optional)
86+
87+
When this feature is enabled, the preemptor Pod might not be dispatched to the best candidate node in some corner cases,
88+
e.g. another node releases resources and becomes the best candidate while the victim pods are being removed from the
89+
nominated node.
90+
91+
## Design Details
92+
93+
### Implementation Details
94+
95+
1. In the filtering phase, which is currently implemented in the `findNodesThatFitPod` method, check the nominated node
96+
first if the incoming pod has `.status.nominatedNodeName` set and the feature gate is enabled.
97+
98+
2. If the nominated node no longer suits the incoming pod, `findNodesThatPassFilters`, where `NominatedNode` is
99+
evaluated first, may return an `err`. The `err` will be wrapped with more information to indicate that the scheduler was evaluating
100+
the feasibility of `NominatedNode` and failed on that node.
101+
102+
If no error is returned but `NominatedNode` cannot pass all the filters, this is possibly because the resources that
103+
were claimed to be removed have not been fully released yet.
104+
105+
In both of the above cases, the scheduler will continue to evaluate the rest of the nodes to check whether any node is already
106+
available for the incoming pod.
107+
108+
The scheduler will retry until one of the following cases occurs:
109+
- `NominatedNode` eventually releases all the resources and the preemptor pod can be scheduled on that node.
110+
- Another node in the cluster releases enough resources and the pod gets scheduled on that node instead.
111+
[Discuss] Should the scheduler clear the `NominatedNode` in this case?
112+
- Resources cannot be released on the `NominatedNode` and no other candidate node can be found in the cluster. This will
113+
be covered by [issue 95752](https://github.com/kubernetes/kubernetes/issues/95752).
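
The filtering flow described in the steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the scheduler's actual code: `Pod`, `Node`, `FilterFunc`, `findFeasibleNodes`, and `cpuFilter` are hypothetical stand-ins for the real API types and for running the filter plugins, and the feature gate is reduced to a plain boolean.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// Pod and Node are simplified, hypothetical stand-ins for the real API types.
type Pod struct {
	Name              string
	NominatedNodeName string // mirrors .status.nominatedNodeName
	CPURequest        int
}

type Node struct {
	Name    string
	FreeCPU int
}

// FilterFunc is a hypothetical stand-in for running all filters on one node.
type FilterFunc func(pod Pod, node Node) (bool, error)

// findFeasibleNodes evaluates the nominated node first when the gate is on.
// If that node fits, it is returned as the only candidate. If it does not fit,
// or its evaluation fails, the remaining nodes are evaluated as usual.
func findFeasibleNodes(pod Pod, nodes []Node, filter FilterFunc, gateEnabled bool) ([]Node, error) {
	if gateEnabled && pod.NominatedNodeName != "" {
		for _, n := range nodes {
			if n.Name != pod.NominatedNodeName {
				continue
			}
			fits, err := filter(pod, n)
			if err != nil {
				// Add context so it is clear the failure happened on the nominated node.
				log.Printf("evaluating nominated node %q for pod %q: %v", n.Name, pod.Name, err)
			} else if fits {
				return []Node{n}, nil
			}
			break
		}
	}

	// Standard path: evaluate every node (the nominated node is checked again here).
	var feasible []Node
	for _, n := range nodes {
		fits, err := filter(pod, n)
		if err != nil {
			return nil, fmt.Errorf("filtering node %q: %w", n.Name, err)
		}
		if fits {
			feasible = append(feasible, n)
		}
	}
	return feasible, nil
}

// cpuFilter is a toy filter: the pod fits if the node has enough free CPU.
func cpuFilter(pod Pod, node Node) (bool, error) {
	if node.FreeCPU < 0 {
		return false, errors.New("node reports negative free CPU")
	}
	return node.FreeCPU >= pod.CPURequest, nil
}

func main() {
	nodes := []Node{
		{Name: "node-a", FreeCPU: 1},
		{Name: "node-b", FreeCPU: 4},
		{Name: "node-c", FreeCPU: 8},
	}
	pod := Pod{Name: "preemptor", NominatedNodeName: "node-b", CPURequest: 2}

	feasible, err := findFeasibleNodes(pod, nodes, cpuFilter, true)
	fmt.Println(feasible, err)
}
```

In this sketch an error on the nominated node is only logged so that the fallback pass still runs, which matches the intent stated above that the scheduler continues with the rest of the nodes in both failure cases.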
114+
115+
116+
### Test Plan
117+
118+
The following tests will be covered or considered:
119+
120+
- **Unit Tests**: All core changes must be covered by unit tests.
121+
- **Integration Tests**: Integration tests will be provided if necessary, for example:
122+
- enable the feature
123+
- preempt the victim pods on the nominated node
124+
- check that the pod is scheduled on the nominated node
125+
- **Benchmark Tests**: A benchmark test which compares the performance before and after the change.
126+
The performance improvement should be visible in the benchmark of `scheduling_algorithm_predicate_evaluation_seconds`.
127+
Other benchmarks will be created on demand during the code review process.
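
To give a rough feel for what a before/after comparison measures, the sketch below counts filter evaluations for a single scheduling attempt with and without the nominated-node shortcut. It is a toy model under stated assumptions, not a real scheduler benchmark: `node`, `countEvaluations`, and `makeCluster` are hypothetical names, the filter is a simple CPU check, and optimizations in the real scheduler (such as stopping early once enough feasible nodes are found) are ignored.

```go
package main

import "fmt"

// node is a hypothetical, simplified cluster node.
type node struct {
	name    string
	freeCPU int
}

// countEvaluations returns how many filter calls one scheduling attempt makes.
func countEvaluations(nodes []node, cpuRequest int, nominated string, preferNominated bool) int {
	calls := 0
	fits := func(n node) bool {
		calls++
		return n.freeCPU >= cpuRequest
	}

	// Shortcut: evaluate the nominated node first and stop if it fits.
	if preferNominated && nominated != "" {
		for _, n := range nodes {
			if n.name == nominated {
				if fits(n) {
					return calls
				}
				break
			}
		}
	}

	// Standard path: evaluate every node in the cluster.
	for _, n := range nodes {
		fits(n)
	}
	return calls
}

// makeCluster builds a cluster where only the nominated node has free CPU,
// modelling the situation right after preemption has made room on it.
func makeCluster(size int, nominated string) []node {
	nodes := make([]node, 0, size)
	for i := 0; i < size; i++ {
		n := node{name: fmt.Sprintf("node-%d", i)}
		if n.name == nominated {
			n.freeCPU = 4
		}
		nodes = append(nodes, n)
	}
	return nodes
}

func main() {
	cluster := makeCluster(1000, "node-500")
	fmt.Println("without shortcut:", countEvaluations(cluster, 2, "node-500", false))
	fmt.Println("with shortcut:   ", countEvaluations(cluster, 2, "node-500", true))
	fmt.Println("nominated misfit:", countEvaluations(cluster, 8, "node-500", true))
}
```

In this model the shortcut reduces 1000 evaluations to 1 when the nominated node fits, and costs one extra evaluation (1001) when it does not, which is the trade-off a real benchmark would need to quantify.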
128+
129+
130+
### Graduation Criteria
131+
132+
#### Alpha (v1.21):
133+
134+
- [ ] New feature gate proposed to enable the feature.
135+
- [ ] Implementation of the new feature in scheduling framework.
136+
- [ ] Test cases mentioned in the [Test Plan](#test-plan).
137+
138+
## Production Readiness Review Questionnaire
139+
140+
### Feature Enablement and Rollback
141+
142+
_This section must be completed when targeting alpha to a release._
143+
144+
* **How can this feature be enabled / disabled in a live cluster?**
145+
- [x] Feature gate (also fill in values in `kep.yaml`)
146+
- Feature gate name: PreferNominatedNode
147+
- Components depending on the feature gate: kube-scheduler
148+
149+
* **Are there any tests for feature enablement/disablement?**
150+
Unit tests will cover this.
151+
152+
153+
## Implementation History
154+
155+
- 2020-09-29: Initial KEP sent out for review https://github.com/kubernetes/enhancements/pull/2026
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
1+
title: Prefer Nominated Node
2+
kep-number: 1923
3+
authors:
4+
- "@chendave"
5+
owning-sig: sig-scheduling
6+
participating-sigs:
7+
- sig-scheduling
8+
status: provisional
9+
creation-date: 2020-09-29
10+
reviewers:
11+
- "@alculquicondor"
12+
- "@Huang-Wei"
13+
- "@ahg-g"
14+
approvers:
15+
- "@Huang-Wei"
16+
- "@ahg-g"
17+
stage: alpha
18+
latest-milestone: "v1.21"
19+
milestone:
20+
alpha: "v1.21"
21+
beta: "v1.22"
22+
stable: "v1.24"
23+
feature-gates:
24+
- name: PreferNominatedNode
25+
components:
26+
- kube-scheduler
27+
disable-supported: true
