Commit 1a66671

Merge pull request kubernetes#2764 from chendave/beta
Graduate prefer nominated node to beta
2 parents c090d28 + 083b901 commit 1a66671

3 files changed: +108 -57 lines changed
Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
 kep-number: 1923
 alpha:
   approver: "@wojtek-t"
+beta:
+  approver: "@wojtek-t"

keps/sig-scheduling/1923-prefer-nominated-node/README.md

Lines changed: 92 additions & 55 deletions
@@ -12,7 +12,8 @@
 - [Implementation Details](#implementation-details)
 - [Test Plan](#test-plan)
 - [Graduation Criteria](#graduation-criteria)
-- [Alpha (v1.21):](#alpha-v121)
+- [Alpha](#alpha)
+- [Beta](#beta)
 - [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
 - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
 - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
@@ -34,7 +35,7 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 - [x] (R) Graduation criteria is in place
 - [x] (R) Production readiness review completed
 - [x] (R) Production readiness review approved
-- [ ] "Implementation History" section is up-to-date for milestone
+- [x] "Implementation History" section is up-to-date for milestone
 - [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
 - [ ] Supporting documentation, e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

@@ -120,17 +121,82 @@ Following tests will be covered or considered:
 - preempt the victim pods on the nominated node
 - check pod will be scheduled on the nominated node
 - **Benchmark Tests**: A benchmark test which compares the performance before and after the change.
-The performance improvement is visible by benchmark of `scheduling_algorithm_predicate_evaluation_seconds`.
-Other benchmark will be created on-demand along with the code review process.
+The performance improvement is visible by benchmark of `scheduler_framework_extension_point_duration_seconds{extension_point="Filter"}` in a large cluster
+where preemption is expected to happen frequently.


 ### Graduation Criteria

-#### Alpha (v1.21):
+<!--
+**Note:** *Not required until targeted at a release.*
+
+Define graduation milestones.
+
+These may be defined in terms of API maturity, or as something else. The KEP
+should keep this high-level with a focus on what signals will be looked at to
+determine graduation.
+
+Consider the following in developing the graduation criteria for this enhancement:
+- [Maturity levels (`alpha`, `beta`, `stable`)][maturity-levels]
+- [Deprecation policy][deprecation-policy]
+
+Clearly define what graduation means by either linking to the [API doc
+definition](https://kubernetes.io/docs/concepts/overview/kubernetes-api/#api-versioning)
+or by redefining what graduation means.
+
+In general we try to use the same stages (alpha, beta, GA), regardless of how the
+functionality is accessed.
+
+[maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions
+[deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/
+
+Below are some examples to consider, in addition to the aforementioned [maturity levels][maturity-levels].
+
+#### Alpha
+
+- Feature implemented behind a feature flag
+- Initial e2e tests completed and enabled
+
+#### Beta
+
+- Gather feedback from developers and surveys
+- Complete features A, B, C
+- Additional tests are in Testgrid and linked in KEP
+
+#### GA
+
+- N examples of real-world usage
+- N installs
+- More rigorous forms of testing, e.g., downgrade tests and scalability tests
+- Allowing time for feedback

-- [x] New feature gate proposed to enable the feature.
-- [x] Implementation of the new feature in scheduling framework.
-- [x] Test cases mentioned in the [Test Plan](#test-plan).
+**Note:** Generally we also wait at least two releases between beta and
+GA/stable, because there's no opportunity for user feedback, or even bug reports,
+in back-to-back releases.
+
+**For non-optional features moving to GA, the graduation criteria must include
+[conformance tests].**
+
+[conformance tests]: https://git.k8s.io/community/contributors/devel/sig-architecture/conformance-tests.md
+
+#### Deprecation
+
+- Announce deprecation and support policy of the existing flag
+- Two versions passed since introducing the functionality that deprecates the flag (to address version skew)
+- Address feedback on usage/changed behavior, provided on GitHub issues
+- Deprecate the flag
+-->
+
+#### Alpha
+
+- New feature gate proposed to enable the feature.
+- Implementation of the new feature in scheduling framework.
+- Test cases mentioned in the [Test Plan](#test-plan).
+
+#### Beta
+
+- Gather feedback from developers and surveys.
+- The feature is guarded by a feature flag, and will be enabled by default in beta.

 ## Production Readiness Review Questionnaire

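For readers of this diff, the behaviour being graduated can be sketched in a few lines. The Python below is an illustrative model only, not the kube-scheduler code (which is written in Go): `find_feasible_nodes`, `passes_filters`, and the node names are hypothetical. It assumes the feature works as the Test Plan bullets in the hunk above imply: the pod's nominated node is run through the Filter step first, and the remaining nodes are evaluated only if that node no longer fits.

```python
from typing import Callable, Iterable, List, Optional


def find_feasible_nodes(
    nominated_node: Optional[str],
    all_nodes: Iterable[str],
    passes_filters: Callable[[str], bool],
    prefer_nominated_node: bool = True,  # stands in for the PreferNominatedNode gate
) -> List[str]:
    """Return feasible nodes, trying the nominated node first when the gate is on."""
    if prefer_nominated_node and nominated_node:
        # Fast path: a single filter evaluation instead of one per node.
        if passes_filters(nominated_node):
            return [nominated_node]
    # Fallback: no nominated node, or it no longer fits, so evaluate every node.
    return [node for node in all_nodes if passes_filters(node)]


# Hypothetical usage: record which nodes the filter step actually visits.
calls: List[str] = []


def passes(node: str) -> bool:
    calls.append(node)
    return node != "node-b"


nodes = ["node-a", "node-b", "node-c"]
result = find_feasible_nodes("node-c", nodes, passes)
```

In this model the fast path touches one node rather than the whole cluster, which is the kind of saving the Filter extension-point latency benchmark is meant to surface when preemption is frequent.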
@@ -163,69 +229,49 @@ _This section must be completed when targeting alpha to a release._
 _This section must be completed when targeting beta graduation to a release._

 * **How can a rollout fail? Can it impact already running workloads?**
-Try to be as paranoid as possible - e.g., what if some components will restart
-mid-rollout?
+The rollout can always fail (e.g. if a bug causes the scheduler to crash-loop under certain conditions).
+It's a scheduler feature, so it doesn't affect already running workloads.

 * **What specific metrics should inform a rollback?**
+A noticeable and sustained increase in the `scheduler_framework_extension_point_duration_seconds{extension_point="Filter"}`
+latency metric.

 * **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
-Describe manual testing that was done and the outcomes.
-Longer term, we may want to require automated upgrade/rollback tests, but we
-are missing a bunch of machinery and tooling and can't do that now.
+Manually tested successfully.

 * **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
 fields of API types, flags, etc.?**
-Even if applying deprecation policies, they may still surprise some users.
+No.

 ### Monitoring Requirements

 _This section must be completed when targeting beta graduation to a release._

 * **How can an operator determine if the feature is in use by workloads?**
-Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
-checking if there are objects with field X set) may be a last resort. Avoid
-logs or events for this purpose.
+N/A

 * **What are the SLIs (Service Level Indicators) an operator can use to determine
 the health of the service?**
-- [ ] Metrics
-- Metric name:
-- [Optional] Aggregation method:
-- Components exposing the metric:
+- [x] Metrics
+- Metric name: `scheduler_framework_extension_point_duration_seconds{extension_point="Filter"}`
+- Components exposing the metric: kube-scheduler
 - [ ] Other (treat as last resort)
 - Details:

 * **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
-At a high level, this usually will be in the form of "high percentile of SLI
-per day <= X". It's impossible to provide comprehensive guidance, but at the very
-high level (needs more precise definitions) those may be things like:
-- per-day percentage of API calls finishing with 5XX errors <= 1%
-- 99% percentile over day of absolute value from (job creation time minus expected
-job creation time) for cron job <= 10%
-- 99,9% of /health requests per day finish with 200 code
+- 99% of filter latency for the pod scheduling is within x seconds.
+

 * **Are there any missing metrics that would be useful to have to improve observability
 of this feature?**
-Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
-implementation difficulties, etc.).
+No.

 ### Dependencies

 _This section must be completed when targeting beta graduation to a release._

 * **Does this feature depend on any specific services running in the cluster?**
-Think about both cluster-level services (e.g. metrics-server) as well
-as node-level agents (e.g. specific version of CRI). Focus on external or
-optional services that are needed. For example, if this feature depends on
-a cloud provider API, or upon an external software-defined storage or network
-control plane.
-
-For each of these, fill in the following, thinking about running existing user workloads
-and creating new ones, as well as about cluster-level services (e.g. DNS):
-- [Dependency name]
-- Usage description:
-- Impact of its outage on the feature:
-- Impact of its degraded performance or high-error rates on the feature:
+No.

 ### Scalability

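The SLO added in the hunk above leaves its threshold as `x`. As a rough illustration of how an operator might check it, the Python sketch below estimates a quantile from cumulative histogram buckets such as those behind `scheduler_framework_extension_point_duration_seconds{extension_point="Filter"}`. The function name, the sample bucket counts, and the 0.5 s threshold are all assumptions made for the example and are not taken from the KEP. The estimate interpolates linearly inside a bucket, in the style of Prometheus's `histogram_quantile`.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound_seconds, count) buckets."""
    ordered = sorted(buckets)
    total = ordered[-1][1]
    if total == 0:
        return math.nan  # no observations, so no meaningful quantile
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in ordered:
        if count >= rank:
            in_bucket = count - prev_count
            if math.isinf(bound) or in_bucket == 0:
                # Quantile lands in the +Inf bucket (or an empty one):
                # report the last finite upper bound instead of extrapolating.
                return prev_bound
            # Linear interpolation between the bucket's lower and upper bounds.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative bucket counts for the Filter extension point.
sample_buckets = [(0.001, 50), (0.01, 90), (0.1, 99), (1.0, 100), (math.inf, 100)]
p99 = histogram_quantile(0.99, sample_buckets)

slo_threshold_seconds = 0.5  # assumed value standing in for the KEP's unspecified "x"
slo_met = p99 <= slo_threshold_seconds
```

With these sample counts the 99th percentile comes out at roughly 0.1 s, so the assumed 0.5 s objective would be met. A sustained rise in the same estimate is also the rollback signal named earlier in the hunk.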
@@ -270,25 +316,16 @@ details). For now, we leave it here.
 _This section must be completed when targeting beta graduation to a release._

 * **How does this feature react if the API server and/or etcd is unavailable?**
+No impact since the pod is already in the internal Queue.

 * **What are other known failure modes?**
-For each of them, fill in the following information by copying the below template:
-- [Failure mode brief description]
-- Detection: How can it be detected via metrics? Stated another way:
-how can an operator troubleshoot without logging into a master or worker node?
-- Mitigations: What can be done to stop the bleeding, especially for already
-running user workloads?
-- Diagnostics: What are the useful log messages and their required logging
-levels that could help debug the issue?
-Not required until feature graduated to beta.
-- Testing: Are there any tests for failure mode? If not, describe why.
+N/A

 * **What steps should be taken if SLOs are not being met to determine the problem?**
-
-[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
-[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
+N/A

 ## Implementation History

 - 2020-09-29: Initial KEP sent out for review https://github.com/kubernetes/enhancements/pull/2026
 - 2020-12-17: Mark the KEP as implementable
+- 2021-05-21: Graduate the feature to Beta

keps/sig-scheduling/1923-prefer-nominated-node/kep.yaml

Lines changed: 14 additions & 2 deletions
@@ -17,14 +17,26 @@ approvers:
 prr-approvers:
   - "@wojtek-t"

-stage: alpha
-latest-milestone: "v1.21"
+# The target maturity stage in the current dev cycle for this KEP.
+stage: beta
+
+# The most recent milestone for which work toward delivery of this KEP has been
+# done. This can be the current (upcoming) milestone, if it is being actively
+# worked on.
+latest-milestone: "v1.22"
+
+# The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
   alpha: "v1.21"
   beta: "v1.22"
   stable: "v1.24"
+
 feature-gates:
   - name: PreferNominatedNode
     components:
       - kube-scheduler
 disable-supported: true
+
+# The following PRR answers are required at beta release
+metrics:
+  - scheduler_framework_extension_point_duration_seconds{extension_point="Filter"}
