- [Implementation Details](#implementation-details)
- [Test Plan](#test-plan)
- [Graduation Criteria](#graduation-criteria)
- - [Alpha (v1.21):](#alpha-v121)
+ - [Alpha](#alpha)
+ - [Beta](#beta)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
@@ -34,7 +35,7 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
- [x] (R) Graduation criteria is in place
- [x] (R) Production readiness review completed
- [x] (R) Production readiness review approved
- - [ ] "Implementation History" section is up-to-date for milestone
+ - [x] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
@@ -120,17 +121,82 @@ Following tests will be covered or considered:
- preempt the victim pods on the nominated node
- check pod will be scheduled on the nominated node
**Benchmark Tests**: A benchmark test which compares the performance before and after the change.
- The performance improvement is visible by benchmark of `scheduling_algorithm_predicate_evaluation_seconds`.
- Other benchmark will be created on-demand along with the code review process.
+ The performance improvement is visible in the `scheduler_framework_extension_point_duration_seconds{extension_point="Filter"}` benchmark in a large cluster
+ where preemption is expected to happen frequently.
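As a reading aid (not part of the KEP itself), a before/after comparison of this histogram metric can be sketched in PromQL; the quantile and time window below are illustrative assumptions, not values prescribed by the KEP:

```promql
# Illustrative sketch: 95th-percentile Filter extension-point latency
# over a 5-minute window, computed from the metric's histogram buckets.
histogram_quantile(0.95,
  sum(rate(scheduler_framework_extension_point_duration_seconds_bucket{extension_point="Filter"}[5m])) by (le))
```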

### Graduation Criteria

- #### Alpha (v1.21):
+ <!--
+ **Note:** *Not required until targeted at a release.*
+
+ Define graduation milestones.
+
+ These may be defined in terms of API maturity, or as something else. The KEP
+ should keep this high-level with a focus on what signals will be looked at to
+ determine graduation.
+
+ Consider the following in developing the graduation criteria for this enhancement:
+ - [Maturity levels (`alpha`, `beta`, `stable`)][maturity-levels]
+ - [Deprecation policy][deprecation-policy]
+
+ Clearly define what graduation means by either linking to the [API doc
+ definition](https://kubernetes.io/docs/concepts/overview/kubernetes-api/#api-versioning)
+ or by redefining what graduation means.
+
+ In general we try to use the same stages (alpha, beta, GA), regardless of how the
+ functionality is accessed.
+
+ [maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions
+ [deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/
+
+ Below are some examples to consider, in addition to the aforementioned [maturity levels][maturity-levels].
+
+ #### Alpha
+
+ - Feature implemented behind a feature flag
+ - Initial e2e tests completed and enabled
+
+ #### Beta
+
+ - Gather feedback from developers and surveys
+ - Complete features A, B, C
+ - Additional tests are in Testgrid and linked in KEP
+
+ #### GA
+
+ - N examples of real-world usage
+ - N installs
+ - More rigorous forms of testing—e.g., downgrade tests and scalability tests
+ - Allowing time for feedback
- - [x] New feature gate proposed to enable the feature.
- - [x] Implementation of the new feature in scheduling framework.
- - [x] Test cases mentioned in the [Test Plan](#test-plan).
+ **Note:** Generally we also wait at least two releases between beta and
+ GA/stable, because there's no opportunity for user feedback, or even bug reports,
+ in back-to-back releases.
+
+ **For non-optional features moving to GA, the graduation criteria must include
+ [conformance tests].**
+
+ [conformance tests]: https://git.k8s.io/community/contributors/devel/sig-architecture/conformance-tests.md
+
+ #### Deprecation
+
+ - Announce deprecation and support policy of the existing flag
+ - Two versions passed since introducing the functionality that deprecates the flag (to address version skew)
+ - Address feedback on usage/changed behavior, provided on GitHub issues
+ - Deprecate the flag
+ -->
+
+ #### Alpha
+
+ - New feature gate proposed to enable the feature.
+ - Implementation of the new feature in scheduling framework.
+ - Test cases mentioned in the [Test Plan](#test-plan).
+
+ #### Beta
+
+ - Gather feedback from developers and surveys.
+ - The feature is guarded by a feature flag, and will be enabled by default in beta.

## Production Readiness Review Questionnaire

@@ -163,69 +229,49 @@ _This section must be completed when targeting alpha to a release._
_This section must be completed when targeting beta graduation to a release._

* **How can a rollout fail? Can it impact already running workloads?**
- Try to be as paranoid as possible - e.g., what if some components will restart
- mid-rollout?
+ The rollout can always fail (e.g. if there is a bug and the scheduler starts crashlooping under certain conditions).
+ It's a scheduler feature, so it doesn't affect already running workloads.

* **What specific metrics should inform a rollback?**
+ A noticeable and sustained increase in the `scheduler_framework_extension_point_duration_seconds{extension_point="Filter"}`
+ latency metric.
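For illustration only (the KEP does not prescribe a query), a sustained rise in mean Filter latency could be watched with a PromQL expression over the histogram's sum and count series; the 10-minute window is an assumption:

```promql
# Illustrative sketch: mean Filter extension-point latency; compare this
# value before and after the rollout to spot a regression worth a rollback.
  sum(rate(scheduler_framework_extension_point_duration_seconds_sum{extension_point="Filter"}[10m]))
/
  sum(rate(scheduler_framework_extension_point_duration_seconds_count{extension_point="Filter"}[10m]))
```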

* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
- Describe manual testing that was done and the outcomes.
- Longer term, we may want to require automated upgrade/rollback tests, but we
- are missing a bunch of machinery and tooling and can't do that now.
+ Manually tested successfully.

* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
fields of API types, flags, etc.?**
- Even if applying deprecation policies, they may still surprise some users.
+ No.

### Monitoring Requirements

_This section must be completed when targeting beta graduation to a release._

* **How can an operator determine if the feature is in use by workloads?**
- Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
- checking if there are objects with field X set) may be a last resort. Avoid
- logs or events for this purpose.
+ N/A

* **What are the SLIs (Service Level Indicators) an operator can use to determine
the health of the service?**
- - [ ] Metrics
- - Metric name:
- - [Optional] Aggregation method:
- - Components exposing the metric:
+ - [x] Metrics
+ - Metric name: `scheduler_framework_extension_point_duration_seconds{extension_point="Filter"}`
+ - Components exposing the metric: kube-scheduler
- [ ] Other (treat as last resort)
- Details:

* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
- At a high level, this usually will be in the form of "high percentile of SLI
- per day <= X". It's impossible to provide comprehensive guidance, but at the very
- high level (needs more precise definitions) those may be things like:
- - per-day percentage of API calls finishing with 5XX errors <= 1%
- - 99% percentile over day of absolute value from (job creation time minus expected
- job creation time) for cron job <= 10%
- - 99.9% of /health requests per day finish with 200 code
+ - 99% of filter latency for the pod scheduling is within x seconds.
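A hedged sketch of how such an SLO could be evaluated in PromQL (the one-day window is an illustrative choice, and `x` stays unspecified as in the text above):

```promql
# Illustrative sketch: 99th-percentile Filter latency over one day;
# the objective holds if this stays below the chosen x seconds.
histogram_quantile(0.99,
  sum(rate(scheduler_framework_extension_point_duration_seconds_bucket{extension_point="Filter"}[1d])) by (le))
```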
+

* **Are there any missing metrics that would be useful to have to improve observability
of this feature?**
- Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
- implementation difficulties, etc.).
+ No.

### Dependencies

_This section must be completed when targeting beta graduation to a release._

* **Does this feature depend on any specific services running in the cluster?**
- Think about both cluster-level services (e.g. metrics-server) as well
- as node-level agents (e.g. specific version of CRI). Focus on external or
- optional services that are needed. For example, if this feature depends on
- a cloud provider API, or upon an external software-defined storage or network
- control plane.
-
- For each of these, fill in the following—thinking about running existing user workloads
- and creating new ones, as well as about cluster-level services (e.g. DNS):
- - [Dependency name]
- - Usage description:
- - Impact of its outage on the feature:
- - Impact of its degraded performance or high-error rates on the feature:
+ No.

### Scalability

@@ -270,25 +316,16 @@ details). For now, we leave it here.
_This section must be completed when targeting beta graduation to a release._

* **How does this feature react if the API server and/or etcd is unavailable?**
+ No impact, since the pod is already in the scheduler's internal queue.

* **What are other known failure modes?**
- For each of them, fill in the following information by copying the below template:
- - [Failure mode brief description]
- - Detection: How can it be detected via metrics? Stated another way:
- how can an operator troubleshoot without logging into a master or worker node?
- - Mitigations: What can be done to stop the bleeding, especially for already
- running user workloads?
- - Diagnostics: What are the useful log messages and their required logging
- levels that could help debug the issue?
- Not required until feature graduated to beta.
- - Testing: Are there any tests for failure mode? If not, describe why.
+ N/A

* **What steps should be taken if SLOs are not being met to determine the problem?**
-
- [supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
- [existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
+ N/A

## Implementation History

- 2020-09-29: Initial KEP sent out for review https://github.com/kubernetes/enhancements/pull/2026
- 2020-12-17: Mark the KEP as implementable
+ - 2021-05-21: Graduate the feature to Beta