10
10
- [Proposal](#proposal)
11
11
- [User Stories (Optional)](#user-stories-optional)
12
12
- [Risks and Mitigations](#risks-and-mitigations)
13
+ - [Test Plan](#test-plan)
13
14
- [Graduation Criteria](#graduation-criteria)
14
15
- [Alpha](#alpha)
15
16
- [Beta](#beta)
16
17
- [GA](#ga)
18
+ - [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
19
+ - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
20
+ - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
21
+ - [Monitoring Requirements](#monitoring-requirements)
22
+ - [Dependencies](#dependencies)
23
+ - [Scalability](#scalability)
24
+ - [Troubleshooting](#troubleshooting)
17
25
- [Drawbacks](#drawbacks)
18
26
- [Alternatives](#alternatives)
19
27
- [Implementation History](#implementation-history)
20
28
<!-- /toc -->
21
29
22
30
## Release Signoff Checklist
23
31
24
- <!--
25
- **ACTION REQUIRED:** In order to merge code into a release, there must be an
26
- issue in [kubernetes/enhancements] referencing this KEP and targeting a release
27
- milestone **before the [Enhancement Freeze](https://git.k8s.io/sig-release/releases)
28
- of the targeted release**.
29
-
30
- For enhancements that make changes to code or processes/procedures in core
31
- Kubernetes—i.e., [kubernetes/kubernetes], we require the following Release
32
- Signoff checklist to be completed.
33
-
34
- Check these off as they are completed for the Release Team to track. These
35
- checklist items _must_ be updated for the enhancement to be released.
36
- -->
37
-
38
32
Items marked with (R) are required _prior to targeting to a milestone / release_.
39
33
40
- - [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
41
- - [ ] (R) KEP approvers have approved the KEP status as `implementable`
42
- - [ ] (R) Design details are appropriately documented
43
- - [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
34
+ - [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
35
+ - [x] (R) KEP approvers have approved the KEP status as `implementable`
36
+ - [x] (R) Design details are appropriately documented
37
+ - [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
44
38
- [ ] e2e Tests for all Beta API Operations (endpoints)
45
39
- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
46
40
- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
47
- - [ ] (R) Graduation criteria is in place
41
+ - [x] (R) Graduation criteria is in place
48
42
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
49
- - [ ] (R) Production readiness review completed
50
- - [ ] (R) Production readiness review approved
51
- - [ ] "Implementation History" section is up-to-date for milestone
43
+ - [x] (R) Production readiness review completed
44
+ - [x] (R) Production readiness review approved
45
+ - [x] "Implementation History" section is up-to-date for milestone
52
46
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
53
47
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
54
48
55
- <!--
56
- **Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
57
- -->
58
-
59
49
[kubernetes.io]: https://kubernetes.io/
60
50
[kubernetes/enhancements]: https://git.k8s.io/enhancements
61
51
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
@@ -133,13 +123,27 @@ KEP focuses more on the "What" aspects rather than the "How".
133
123
signing](https://github.com/sigstore/cosign/blob/3f83940/KEYLESS.md) to
134
124
minimize the attack surface of the supply chain.
135
125
126
+ ### Test Plan
127
+
128
+ Testing of the lower-level signing implementation will be done by writing unit tests
129
+ as well as integration tests within the
130
+ [release-sdk](https://github.com/kubernetes-sigs/release-sdk) repository. This
131
+ implementation is going to be used by
132
+ [krel](https://github.com/kubernetes/release/blob/master/docs/krel/README.md)
133
+ during the release creation process, which is tested separately. The overall
134
+ integration into krel can be tested manually by the Release Managers as well,
135
+ while we use the pre-releases of v1.24 as the first instance for full end-to-end
136
+ feedback.
137
+
136
138
### Graduation Criteria
137
139
138
140
#### Alpha
139
141
140
142
- Outline and integrate an example process for signing Kubernetes release
141
143
artifacts.
142
144
145
+ Tracking issue: https://github.com/kubernetes/release/issues/2383
146
+
143
147
#### Beta
144
148
145
149
- Standard Kubernetes release artifacts (binaries and container images) are
@@ -150,6 +154,272 @@ KEP focuses more on the "What" aspects rather than the "How".
150
154
- All Kubernetes artifacts are signed. This does exclude everything which gets
151
155
built outside of the main Kubernetes repository.
152
156
157
+ ## Production Readiness Review Questionnaire
158
+
159
+ ### Feature Enablement and Rollback
160
+
161
+ ###### How can this feature be enabled / disabled in a live cluster?
162
+
163
+ Signed images do not have to be verified, so they do not interfere with a running
164
+ cluster at all. They can be verified manually or by using the tooling provided
165
+ by our documentation.
166
+
167
+ ###### Does enabling the feature change any default behavior?
168
+
169
+ Not when verification is done manually. If the cluster changes its
170
+ configuration to only accept signed images, then invalid signatures will cause
171
+ the container runtime to refuse the image pull. The same behavior could be
172
+ achieved by using an admission webhook which verifies the signature.
173
+
174
+ ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
175
+
176
+ Yes, depending on how the signature verification is done.
177
+
178
+ ###### What happens if we reenable the feature if it was previously rolled back?
179
+
180
+ It will behave the same way as when it was initially enabled.
181
+
182
+ ###### Are there any tests for feature enablement/disablement?
183
+
184
+ No, not on a cluster level. We test the signatures during the release process.
185
+
186
+ ### Rollout, Upgrade and Rollback Planning
187
+
188
+ <!--
189
+ This section must be completed when targeting beta to a release.
190
+ -->
191
+
192
+ ###### How can a rollout or rollback fail? Can it impact already running workloads?
193
+
194
+ <!--
195
+ Try to be as paranoid as possible - e.g., what if some components will restart
196
+ mid-rollout?
197
+
198
+ Be sure to consider highly-available clusters, where, for example,
199
+ feature flags will be enabled on some API servers and not others during the
200
+ rollout. Similarly, consider large clusters and how enablement/disablement
201
+ will roll out across nodes.
202
+ -->
203
+
204
+ ###### What specific metrics should inform a rollback?
205
+
206
+ <!--
207
+ What signals should users be paying attention to when the feature is young
208
+ that might indicate a serious problem?
209
+ -->
210
+
211
+ ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
212
+
213
+ <!--
214
+ Describe manual testing that was done and the outcomes.
215
+ Longer term, we may want to require automated upgrade/rollback tests, but we
216
+ are missing a bunch of machinery and tooling and can't do that now.
217
+ -->
218
+
219
+ ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
220
+
221
+ <!--
222
+ Even if applying deprecation policies, they may still surprise some users.
223
+ -->
224
+
225
+ ### Monitoring Requirements
226
+
227
+ <!--
228
+ This section must be completed when targeting beta to a release.
229
+ -->
230
+
231
+ ###### How can an operator determine if the feature is in use by workloads?
232
+
233
+ <!--
234
+ Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
235
+ checking if there are objects with field X set) may be a last resort. Avoid
236
+ logs or events for this purpose.
237
+ -->
238
+
239
+ ###### How can someone using this feature know that it is working for their instance?
240
+
241
+ <!--
242
+ For instance, if this is a pod-related feature, it should be possible to determine if the feature is functioning properly
243
+ for each individual pod.
244
+ Pick one or more of these and delete the rest.
245
+ Please describe all items visible to end users below with sufficient detail so that they can verify correct enablement
246
+ and operation of this feature.
247
+ Recall that end users cannot usually observe component logs or access metrics.
248
+ -->
249
+
250
+ - [ ] Events
251
+ - Event Reason:
252
+ - [ ] API .status
253
+ - Condition name:
254
+ - Other field:
255
+ - [ ] Other (treat as last resort)
256
+ - Details:
257
+
258
+ ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
259
+
260
+ <!--
261
+ This is your opportunity to define what "normal" quality of service looks like
262
+ for a feature.
263
+
264
+ It's impossible to provide comprehensive guidance, but at the very
265
+ high level (needs more precise definitions) those may be things like:
266
+ - per-day percentage of API calls finishing with 5XX errors <= 1%
267
+ - 99% percentile over day of absolute value from (job creation time minus expected
268
+ job creation time) for cron job <= 10%
269
+ - 99.9% of /health requests per day finish with 200 code
270
+
271
+ These goals will help you determine what you need to measure (SLIs) in the next
272
+ question.
273
+ -->
274
+
275
+ ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
276
+
277
+ <!--
278
+ Pick one or more of these and delete the rest.
279
+ -->
280
+
281
+ - [ ] Metrics
282
+ - Metric name:
283
+ - [Optional] Aggregation method:
284
+ - Components exposing the metric:
285
+ - [ ] Other (treat as last resort)
286
+ - Details:
287
+
288
+ ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
289
+
290
+ <!--
291
+ Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
292
+ implementation difficulties, etc.).
293
+ -->
294
+
295
+ ### Dependencies
296
+
297
+ <!--
298
+ This section must be completed when targeting beta to a release.
299
+ -->
300
+
301
+ ###### Does this feature depend on any specific services running in the cluster?
302
+
303
+ <!--
304
+ Think about both cluster-level services (e.g. metrics-server) as well
305
+ as node-level agents (e.g. specific version of CRI). Focus on external or
306
+ optional services that are needed. For example, if this feature depends on
307
+ a cloud provider API, or upon an external software-defined storage or network
308
+ control plane.
309
+
310
+ For each of these, fill in the following—thinking about running existing user workloads
311
+ and creating new ones, as well as about cluster-level services (e.g. DNS):
312
+ - [Dependency name]
313
+ - Usage description:
314
+ - Impact of its outage on the feature:
315
+ - Impact of its degraded performance or high-error rates on the feature:
316
+ -->
317
+
318
+ ### Scalability
319
+
320
+ <!--
321
+ For alpha, this section is encouraged: reviewers should consider these questions
322
+ and attempt to answer them.
323
+
324
+ For beta, this section is required: reviewers must answer these questions.
325
+
326
+ For GA, this section is required: approvers should be able to confirm the
327
+ previous answers based on experience in the field.
328
+ -->
329
+
330
+ ###### Will enabling / using this feature result in any new API calls?
331
+
332
+ <!--
333
+ Describe them, providing:
334
+ - API call type (e.g. PATCH pods)
335
+ - estimated throughput
336
+ - originating component(s) (e.g. Kubelet, Feature-X-controller)
337
+ Focusing mostly on:
338
+ - components listing and/or watching resources they didn't before
339
+ - API calls that may be triggered by changes of some Kubernetes resources
340
+ (e.g. update of object X triggers new updates of object Y)
341
+ - periodic API calls to reconcile state (e.g. periodic fetching state,
342
+ heartbeats, leader election, etc.)
343
+ -->
344
+
345
+ ###### Will enabling / using this feature result in introducing new API types?
346
+
347
+ <!--
348
+ Describe them, providing:
349
+ - API type
350
+ - Supported number of objects per cluster
351
+ - Supported number of objects per namespace (for namespace-scoped objects)
352
+ -->
353
+
354
+ ###### Will enabling / using this feature result in any new calls to the cloud provider?
355
+
356
+ <!--
357
+ Describe them, providing:
358
+ - Which API(s):
359
+ - Estimated increase:
360
+ -->
361
+
362
+ ###### Will enabling / using this feature result in increasing size or count of the existing API objects?
363
+
364
+ <!--
365
+ Describe them, providing:
366
+ - API type(s):
367
+ - Estimated increase in size: (e.g., new annotation of size 32B)
368
+ - Estimated amount of new objects: (e.g., new Object X for every existing Pod)
369
+ -->
370
+
371
+ ###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
372
+
373
+ <!--
374
+ Look at the [existing SLIs/SLOs].
375
+
376
+ Think about adding additional work or introducing new steps in between
377
+ (e.g. need to do X to start a container), etc. Please describe the details.
378
+
379
+ [existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
380
+ -->
381
+
382
+ ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
383
+
384
+ <!--
385
+ Things to keep in mind include: additional in-memory state, additional
386
+ non-trivial computations, excessive access to disks (including increased log
387
+ volume), significant amount of data sent and/or received over network, etc.
388
+ Think through this both in small and large cases, again with respect to the
389
+ [supported limits].
390
+
391
+ [supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
392
+ -->
393
+
394
+ ### Troubleshooting
395
+
396
+ <!--
397
+ This section must be completed when targeting beta to a release.
398
+
399
+ The Troubleshooting section currently serves the `Playbook` role. We may consider
400
+ splitting it into a dedicated `Playbook` document (potentially with some monitoring
401
+ details). For now, we leave it here.
402
+ -->
403
+
404
+ ###### How does this feature react if the API server and/or etcd is unavailable?
405
+
406
+ ###### What are other known failure modes?
407
+
408
+ <!--
409
+ For each of them, fill in the following information by copying the below template:
410
+ - [Failure mode brief description]
411
+ - Detection: How can it be detected via metrics? Stated another way:
412
+ how can an operator troubleshoot without logging into a master or worker node?
413
+ - Mitigations: What can be done to stop the bleeding, especially for already
414
+ running user workloads?
415
+ - Diagnostics: What are the useful log messages and their required logging
416
+ levels that could help debug the issue?
417
+ Not required until feature graduated to beta.
418
+ - Testing: Are there any tests for failure mode? If not, describe why.
419
+ -->
420
+
421
+ ###### What steps should be taken if SLOs are not being met to determine the problem?
422
+
153
423
## Drawbacks
154
424
155
425
- The initial implementation effort from the release engineering perspective
@@ -162,4 +432,5 @@ KEP focuses more on the "What" aspects rather than the "How".
162
432
163
433
## Implementation History
164
434
435
+ - 2022-01-27 Updated to contain test plan and correct milestones
165
436
- 2021-11-29 Initial Draft