- [Alpha (v1.21):](#alpha-v121)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
+ - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
+ - [Monitoring Requirements](#monitoring-requirements)
+ - [Dependencies](#dependencies)
+ - [Scalability](#scalability)
+ - [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
<!-- /toc -->

Items marked with (R) are required *prior to targeting to a milestone / release*.

- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- - [ ] (R) KEP approvers have approved the KEP status as `implementable`
+ - [x] (R) KEP approvers have approved the KEP status as `implementable`
- [x] (R) Design details are appropriately documented
- - [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- - [ ] (R) Graduation criteria is in place
- - [ ] (R) Production readiness review completed
- - [ ] Production readiness review approved
+ - [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
+ - [x] (R) Graduation criteria is in place
+ - [x] (R) Production readiness review completed
+ - [x] (R) Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
@@ -105,14 +110,6 @@ nominated node.
For both of the above cases, the scheduler will continue to evaluate the rest of the nodes to check whether any node is
already available for the incoming pod.
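
The fallback described above can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not the kube-scheduler implementation: the `Node` type, the `fits` check (a stand-in for the filter phase that only compares free CPU), and `pickNode` are all hypothetical names introduced for this example.

```go
package main

import "fmt"

// Node is a hypothetical, simplified stand-in for a cluster node.
type Node struct {
	Name    string
	FreeCPU int // free CPU in millicores
}

// fits is a toy replacement for the scheduler's filter phase:
// a node is feasible when it has enough free CPU for the request.
func fits(n Node, requestCPU int) bool {
	return n.FreeCPU >= requestCPU
}

// pickNode evaluates the nominated node first. If it is unset, unknown,
// or not feasible, the remaining nodes are evaluated in order.
func pickNode(nominated string, nodes []Node, requestCPU int) (string, bool) {
	if nominated != "" {
		for _, n := range nodes {
			if n.Name == nominated && fits(n, requestCPU) {
				return n.Name, true
			}
		}
	}
	for _, n := range nodes {
		if n.Name == nominated {
			continue // already evaluated above
		}
		if fits(n, requestCPU) {
			return n.Name, true
		}
	}
	return "", false
}

func main() {
	nodes := []Node{
		{Name: "node-a", FreeCPU: 500},
		{Name: "node-b", FreeCPU: 100},
		{Name: "node-c", FreeCPU: 2000},
	}
	// node-c is nominated and has room, so it is chosen even though node-a also fits.
	fmt.Println(pickNode("node-c", nodes, 400))
	// node-b is nominated but still too full, so the search falls back to node-a.
	fmt.Println(pickNode("node-b", nodes, 400))
}
```

The second call shows why the scheduler keeps evaluating other nodes: a pod whose nominated node has not yet released its resources can still land on a node that is already available.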

- Scheduler will retry until matching either of the following cases,
- - `NominatedNode` eventually released all the resource and the preemptor pod can be scheduled on that node.
- - Another node in the cluster released enough resources and pod get scheduled on that node instead.
- [Discuss] Should scheduler clear the `NominatedNode` in this case?
- - Resource cannot be released on the `NominatedNode` and no other candidate node could be found in the cluster, this will
- be covered by [issue 95752](https://github.com/kubernetes/kubernetes/issues/95752).
-
-
### Test Plan

Following tests will be covered or considered:
@@ -131,9 +128,9 @@ Following tests will be covered or considered:

#### Alpha (v1.21):

- - [ ] New feature gate proposed to enable the feature.
- - [ ] Implementation of the new feature in scheduling framework.
- - [ ] Test cases mentioned in the [Test Plan](#test-plan).
+ - [x] New feature gate proposed to enable the feature.
+ - [x] Implementation of the new feature in scheduling framework.
+ - [x] Test cases mentioned in the [Test Plan](#test-plan).

## Production Readiness Review Questionnaire

@@ -146,10 +143,152 @@ _This section must be completed when targeting alpha to a release._
- Feature gate name: PreferNominatedNode
- Components depending on the feature gate: kube-scheduler

+ * **Does enabling the feature change any default behavior?**
+ Yes. If the incoming pod has a nominated node set, the nominated node is evaluated first in every
+ scheduling cycle. This only changes the scheduler's internal processing order; end users will not,
+ and need not, be aware of any difference.
+
+ * **Can the feature be disabled once it has been enabled (i.e. can we roll back
+ the enablement)?**
+ Yes. It can be disabled by restarting the scheduler with the feature gate turned off.
+
+ * **What happens if we reenable the feature if it was previously rolled back?**
+ The feature will start working again when scheduling pods.
+
* **Are there any tests for feature enablement/disablement?**
- unittest will cover this.
+ Unit tests will toggle the feature gate manually and compare the scheduling behavior with the feature
+ enabled and disabled.
+
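
The enablement check described in the last answer can be sketched as a small Go program. This is a hedged sketch and not the unit test from the PR: `featureGates` is a hypothetical in-memory map, and `evaluationOrder` is a hypothetical helper that only models the order in which nodes are evaluated. In a real cluster the gate is set through the kube-scheduler `--feature-gates` flag.

```go
package main

import "fmt"

// preferNominatedNode is the feature gate name given in this KEP.
const preferNominatedNode = "PreferNominatedNode"

// featureGates is a hypothetical in-memory gate map used only in this sketch.
type featureGates map[string]bool

// evaluationOrder returns the order in which nodes would be evaluated.
// With the gate on, the nominated node (if present) moves to the front;
// with the gate off, the original order is kept.
func evaluationOrder(gates featureGates, nominated string, nodes []string) []string {
	order := make([]string, 0, len(nodes))
	if gates[preferNominatedNode] {
		for _, n := range nodes {
			if n == nominated {
				order = append(order, n)
			}
		}
	}
	for _, n := range nodes {
		if gates[preferNominatedNode] && n == nominated {
			continue // already placed at the front
		}
		order = append(order, n)
	}
	return order
}

func main() {
	nodes := []string{"node-a", "node-b", "node-c"}
	// Toggle the gate and compare the resulting evaluation order.
	for _, enabled := range []bool{false, true} {
		gates := featureGates{preferNominatedNode: enabled}
		fmt.Printf("gate=%v order=%v\n", enabled, evaluationOrder(gates, "node-c", nodes))
	}
}
```

Comparing the two orders is the essence of the enable/disable test: the same inputs must produce the legacy behavior with the gate off and the nominated-node-first behavior with it on.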
+ ### Rollout, Upgrade and Rollback Planning
+
+ _This section must be completed when targeting beta graduation to a release._
+
+ * **How can a rollout fail? Can it impact already running workloads?**
+ Try to be as paranoid as possible - e.g., what if some components will restart
+ mid-rollout?
+
+ * **What specific metrics should inform a rollback?**
+
+ * **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
+ Describe manual testing that was done and the outcomes.
+ Longer term, we may want to require automated upgrade/rollback tests, but we
+ are missing a bunch of machinery and tooling and can't do that now.
+
+ * **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
+ fields of API types, flags, etc.?**
+ Even if applying deprecation policies, they may still surprise some users.
+
+ ### Monitoring Requirements
+
+ _This section must be completed when targeting beta graduation to a release._
+
+ * **How can an operator determine if the feature is in use by workloads?**
+ Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
+ checking if there are objects with field X set) may be a last resort. Avoid
+ logs or events for this purpose.
+
+ * **What are the SLIs (Service Level Indicators) an operator can use to determine
+ the health of the service?**
+   - [ ] Metrics
+     - Metric name:
+     - [Optional] Aggregation method:
+     - Components exposing the metric:
+   - [ ] Other (treat as last resort)
+     - Details:
+
+ * **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
+ At a high level, this usually will be in the form of "high percentile of SLI
+ per day <= X". It's impossible to provide comprehensive guidance, but at the very
+ high level (needs more precise definitions) those may be things like:
+   - per-day percentage of API calls finishing with 5XX errors <= 1%
+   - 99th percentile over day of absolute value from (job creation time minus expected
+     job creation time) for cron job <= 10%
+   - 99.9% of /health requests per day finish with 200 code
+
+ * **Are there any missing metrics that would be useful to have to improve observability
+ of this feature?**
+ Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
+ implementation difficulties, etc.).
+
+ ### Dependencies
+
+ _This section must be completed when targeting beta graduation to a release._
+
+ * **Does this feature depend on any specific services running in the cluster?**
+ Think about both cluster-level services (e.g. metrics-server) as well
+ as node-level agents (e.g. specific version of CRI). Focus on external or
+ optional services that are needed. For example, if this feature depends on
+ a cloud provider API, or upon an external software-defined storage or network
+ control plane.
+
+ For each of these, fill in the following, thinking about running existing user workloads
+ and creating new ones, as well as about cluster-level services (e.g. DNS):
+   - [Dependency name]
+     - Usage description:
+     - Impact of its outage on the feature:
+     - Impact of its degraded performance or high-error rates on the feature:
+
+ ### Scalability
+
+ _For alpha, this section is encouraged: reviewers should consider these questions
+ and attempt to answer them._
+
+ _For beta, this section is required: reviewers must answer these questions._
+
+ _For GA, this section is required: approvers should be able to confirm the
+ previous answers based on experience in the field._
+
+ * **Will enabling / using this feature result in any new API calls?**
+ No.
+
+ * **Will enabling / using this feature result in introducing new API types?**
+ No.
+
+ * **Will enabling / using this feature result in any new calls to the cloud
+ provider?**
+ No.
+
+ * **Will enabling / using this feature result in increasing size or count of
+ the existing API objects?**
+ No.
+
+ * **Will enabling / using this feature result in increasing time taken by any
+ operations covered by [existing SLIs/SLOs]?**
+ No.
+
+ * **Will enabling / using this feature result in non-negligible increase of
+ resource usage (CPU, RAM, disk, IO, ...) in any components?**
+ No.
+
+ ### Troubleshooting
+
+ The Troubleshooting section currently serves the `Playbook` role. We may consider
+ splitting it into a dedicated `Playbook` document (potentially with some monitoring
+ details). For now, we leave it here.
+
+ _This section must be completed when targeting beta graduation to a release._
+
+ * **How does this feature react if the API server and/or etcd is unavailable?**

+ * **What are other known failure modes?**
+ For each of them, fill in the following information by copying the below template:
+   - [Failure mode brief description]
+     - Detection: How can it be detected via metrics? Stated another way:
+       how can an operator troubleshoot without logging into a master or worker node?
+     - Mitigations: What can be done to stop the bleeding, especially for already
+       running user workloads?
+     - Diagnostics: What are the useful log messages and their required logging
+       levels that could help debug the issue?
+       Not required until feature graduated to beta.
+     - Testing: Are there any tests for failure mode? If not, describe why.
+
+ * **What steps should be taken if SLOs are not being met to determine the problem?**
+
+ [supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
+ [existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos

## Implementation History

- 2020-09-29: Initial KEP sent out for review https://github.com/kubernetes/enhancements/pull/2026
+ - 2020-12-17: Mark the KEP as implementable