Commit 7f9e29a

Merge pull request kubernetes#3191 from saschagrunert/signing-update-and-test-plan
Update container image signing KEP
2 parents 9ebdce8 + 6ca1c76 commit 7f9e29a

3 files changed

+305
-29
lines changed

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
kep-number: 3031
2+
alpha:
3+
approver: "@ehashman"

keps/sig-release/3031-signing-release-artifacts/README.md

Lines changed: 297 additions & 26 deletions
Original file line numberDiff line numberDiff line change
@@ -10,52 +10,42 @@
1010
- [Proposal](#proposal)
1111
- [User Stories (Optional)](#user-stories-optional)
1212
- [Risks and Mitigations](#risks-and-mitigations)
13+
- [Test Plan](#test-plan)
1314
- [Graduation Criteria](#graduation-criteria)
1415
- [Alpha](#alpha)
1516
- [Beta](#beta)
1617
- [GA](#ga)
18+
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
19+
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
20+
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
21+
- [Monitoring Requirements](#monitoring-requirements)
22+
- [Dependencies](#dependencies)
23+
- [Scalability](#scalability)
24+
- [Troubleshooting](#troubleshooting)
1725
- [Drawbacks](#drawbacks)
1826
- [Alternatives](#alternatives)
1927
- [Implementation History](#implementation-history)
2028
<!-- /toc -->
2129

2230
## Release Signoff Checklist
2331

24-
<!--
25-
**ACTION REQUIRED:** In order to merge code into a release, there must be an
26-
issue in [kubernetes/enhancements] referencing this KEP and targeting a release
27-
milestone **before the [Enhancement Freeze](https://git.k8s.io/sig-release/releases)
28-
of the targeted release**.
29-
30-
For enhancements that make changes to code or processes/procedures in core
31-
Kubernetes—i.e., [kubernetes/kubernetes], we require the following Release
32-
Signoff checklist to be completed.
33-
34-
Check these off as they are completed for the Release Team to track. These
35-
checklist items _must_ be updated for the enhancement to be released.
36-
-->
37-
3832
Items marked with (R) are required _prior to targeting to a milestone / release_.
3933

40-
- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
41-
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
42-
- [ ] (R) Design details are appropriately documented
43-
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
34+
- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
35+
- [x] (R) KEP approvers have approved the KEP status as `implementable`
36+
- [x] (R) Design details are appropriately documented
37+
- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
4438
- [ ] e2e Tests for all Beta API Operations (endpoints)
4539
- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
4640
- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
47-
- [ ] (R) Graduation criteria is in place
41+
- [x] (R) Graduation criteria is in place
4842
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
49-
- [ ] (R) Production readiness review completed
50-
- [ ] (R) Production readiness review approved
51-
- [ ] "Implementation History" section is up-to-date for milestone
43+
- [x] (R) Production readiness review completed
44+
- [x] (R) Production readiness review approved
45+
- [x] "Implementation History" section is up-to-date for milestone
5246
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
5347
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
5448

55-
<!--
56-
**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
57-
-->
58-
5949
[kubernetes.io]: https://kubernetes.io/
6050
[kubernetes/enhancements]: https://git.k8s.io/enhancements
6151
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
@@ -133,13 +123,27 @@ KEP focuses more on the "What" aspects rather than the "How".
133123
signing](https://github.com/sigstore/cosign/blob/3f83940/KEYLESS.md) to
134124
minimize the attack surface of the supply chain.
135125

126+
### Test Plan
127+
128+
Testing of the lower-level signing implementation will be done by writing unit tests
129+
as well as integration tests within the
130+
[release-sdk](https://github.com/kubernetes-sigs/release-sdk) repository. This
131+
implementation is going to be used by
132+
[krel](https://github.com/kubernetes/release/blob/master/docs/krel/README.md)
133+
during the release creation process, which is tested separately. The overall
134+
integration into krel can be tested manually by the Release Managers as well,
135+
while we use the pre-releases of v1.24 as the first instance for full end-to-end
136+
feedback.
137+
136138
### Graduation Criteria
137139

138140
#### Alpha
139141

140142
- Outline and integrate an example process for signing Kubernetes release
141143
artifacts.
142144

145+
Tracking issue: https://github.com/kubernetes/release/issues/2383
146+
143147
#### Beta
144148

145149
- Standard Kubernetes release artifacts (binaries and container images) are
@@ -150,6 +154,272 @@ KEP focuses more on the "What" aspects rather than the "How".
150154
- All Kubernetes artifacts are signed. This excludes everything that gets
151155
built outside of the main Kubernetes repository.
152156

157+
## Production Readiness Review Questionnaire
158+
159+
### Feature Enablement and Rollback
160+
161+
###### How can this feature be enabled / disabled in a live cluster?
162+
163+
Signed images do not have to be verified, so they do not interfere with a running
164+
cluster at all. They can be verified manually or by using the tooling provided
165+
by our documentation.
166+
167+
###### Does enabling the feature change any default behavior?
168+
169+
Not when verification is done manually. If the cluster changes its
170+
configuration to only accept signed images, then invalid signatures will cause
171+
the container runtime to refuse the image pull. The same behavior could be
172+
achieved by using an admission webhook which verifies the signature.
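
A minimal sketch of the admission webhook variant described above. The `is_signed` callable is a hypothetical stand-in for real signature verification (for example, a wrapper around cosign), and `pod_images` and `review` are illustrative names that are not part of any Kubernetes or sigstore API. Only the request and response field names follow the `admission.k8s.io/v1` AdmissionReview format.

```python
from typing import Any, Callable, Dict, List


def pod_images(pod: Dict[str, Any]) -> List[str]:
    """Collect the image references from every container list of a Pod object."""
    spec = pod.get("spec", {})
    images: List[str] = []
    for key in ("initContainers", "containers", "ephemeralContainers"):
        for container in spec.get(key) or []:
            images.append(container["image"])
    return images


def review(
    admission_review: Dict[str, Any],
    is_signed: Callable[[str], bool],  # hypothetical verifier, supplied by the caller
) -> Dict[str, Any]:
    """Build an AdmissionReview response that rejects Pods using unverified images."""
    request = admission_review["request"]
    unsigned = [img for img in pod_images(request["object"]) if not is_signed(img)]

    response: Dict[str, Any] = {"uid": request["uid"], "allowed": not unsigned}
    if unsigned:
        # Surface the offending images so the user can see why the Pod was refused.
        response["status"] = {
            "message": "unverified image signature: " + ", ".join(unsigned)
        }
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }
```

Keeping the verifier as an injected callable means the admit/reject logic can be exercised without a registry or a signing backend; a real webhook would additionally need TLS serving and a `ValidatingWebhookConfiguration`, which are out of scope for this sketch.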
173+
174+
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
175+
176+
Yes, depending on how the signature verification is done.
177+
178+
###### What happens if we reenable the feature if it was previously rolled back?
179+
180+
It will behave the same way as when it was initially enabled.
181+
182+
###### Are there any tests for feature enablement/disablement?
183+
184+
No, not on a cluster level. We test the signatures during the release process.
185+
186+
### Rollout, Upgrade and Rollback Planning
187+
188+
<!--
189+
This section must be completed when targeting beta to a release.
190+
-->
191+
192+
###### How can a rollout or rollback fail? Can it impact already running workloads?
193+
194+
<!--
195+
Try to be as paranoid as possible - e.g., what if some components will restart
196+
mid-rollout?
197+
198+
Be sure to consider highly-available clusters, where, for example,
199+
feature flags will be enabled on some API servers and not others during the
200+
rollout. Similarly, consider large clusters and how enablement/disablement
201+
will rollout across nodes.
202+
-->
203+
204+
###### What specific metrics should inform a rollback?
205+
206+
<!--
207+
What signals should users be paying attention to when the feature is young
208+
that might indicate a serious problem?
209+
-->
210+
211+
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
212+
213+
<!--
214+
Describe manual testing that was done and the outcomes.
215+
Longer term, we may want to require automated upgrade/rollback tests, but we
216+
are missing a bunch of machinery and tooling and can't do that now.
217+
-->
218+
219+
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
220+
221+
<!--
222+
Even if applying deprecation policies, they may still surprise some users.
223+
-->
224+
225+
### Monitoring Requirements
226+
227+
<!--
228+
This section must be completed when targeting beta to a release.
229+
-->
230+
231+
###### How can an operator determine if the feature is in use by workloads?
232+
233+
<!--
234+
Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
235+
checking if there are objects with field X set) may be a last resort. Avoid
236+
logs or events for this purpose.
237+
-->
238+
239+
###### How can someone using this feature know that it is working for their instance?
240+
241+
<!--
242+
For instance, if this is a pod-related feature, it should be possible to determine if the feature is functioning properly
243+
for each individual pod.
244+
Pick one or more of these and delete the rest.
245+
Please describe all items visible to end users below with sufficient detail so that they can verify correct enablement
246+
and operation of this feature.
247+
Recall that end users cannot usually observe component logs or access metrics.
248+
-->
249+
250+
- [ ] Events
251+
- Event Reason:
252+
- [ ] API .status
253+
- Condition name:
254+
- Other field:
255+
- [ ] Other (treat as last resort)
256+
- Details:
257+
258+
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
259+
260+
<!--
261+
This is your opportunity to define what "normal" quality of service looks like
262+
for a feature.
263+
264+
It's impossible to provide comprehensive guidance, but at the very
265+
high level (needs more precise definitions) those may be things like:
266+
- per-day percentage of API calls finishing with 5XX errors <= 1%
267+
- 99% percentile over day of absolute value from (job creation time minus expected
268+
job creation time) for cron job <= 10%
269+
- 99.9% of /health requests per day finish with 200 code
270+
271+
These goals will help you determine what you need to measure (SLIs) in the next
272+
question.
273+
-->
274+
275+
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
276+
277+
<!--
278+
Pick one or more of these and delete the rest.
279+
-->
280+
281+
- [ ] Metrics
282+
- Metric name:
283+
- [Optional] Aggregation method:
284+
- Components exposing the metric:
285+
- [ ] Other (treat as last resort)
286+
- Details:
287+
288+
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
289+
290+
<!--
291+
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
292+
implementation difficulties, etc.).
293+
-->
294+
295+
### Dependencies
296+
297+
<!--
298+
This section must be completed when targeting beta to a release.
299+
-->
300+
301+
###### Does this feature depend on any specific services running in the cluster?
302+
303+
<!--
304+
Think about both cluster-level services (e.g. metrics-server) as well
305+
as node-level agents (e.g. specific version of CRI). Focus on external or
306+
optional services that are needed. For example, if this feature depends on
307+
a cloud provider API, or upon an external software-defined storage or network
308+
control plane.
309+
310+
For each of these, fill in the following—thinking about running existing user workloads
311+
and creating new ones, as well as about cluster-level services (e.g. DNS):
312+
- [Dependency name]
313+
- Usage description:
314+
- Impact of its outage on the feature:
315+
- Impact of its degraded performance or high-error rates on the feature:
316+
-->
317+
318+
### Scalability
319+
320+
<!--
321+
For alpha, this section is encouraged: reviewers should consider these questions
322+
and attempt to answer them.
323+
324+
For beta, this section is required: reviewers must answer these questions.
325+
326+
For GA, this section is required: approvers should be able to confirm the
327+
previous answers based on experience in the field.
328+
-->
329+
330+
###### Will enabling / using this feature result in any new API calls?
331+
332+
<!--
333+
Describe them, providing:
334+
- API call type (e.g. PATCH pods)
335+
- estimated throughput
336+
- originating component(s) (e.g. Kubelet, Feature-X-controller)
337+
Focusing mostly on:
338+
- components listing and/or watching resources they didn't before
339+
- API calls that may be triggered by changes of some Kubernetes resources
340+
(e.g. update of object X triggers new updates of object Y)
341+
- periodic API calls to reconcile state (e.g. periodic fetching state,
342+
heartbeats, leader election, etc.)
343+
-->
344+
345+
###### Will enabling / using this feature result in introducing new API types?
346+
347+
<!--
348+
Describe them, providing:
349+
- API type
350+
- Supported number of objects per cluster
351+
- Supported number of objects per namespace (for namespace-scoped objects)
352+
-->
353+
354+
###### Will enabling / using this feature result in any new calls to the cloud provider?
355+
356+
<!--
357+
Describe them, providing:
358+
- Which API(s):
359+
- Estimated increase:
360+
-->
361+
362+
###### Will enabling / using this feature result in increasing size or count of the existing API objects?
363+
364+
<!--
365+
Describe them, providing:
366+
- API type(s):
367+
- Estimated increase in size: (e.g., new annotation of size 32B)
368+
- Estimated amount of new objects: (e.g., new Object X for every existing Pod)
369+
-->
370+
371+
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
372+
373+
<!--
374+
Look at the [existing SLIs/SLOs].
375+
376+
Think about adding additional work or introducing new steps in between
377+
(e.g. need to do X to start a container), etc. Please describe the details.
378+
379+
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
380+
-->
381+
382+
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
383+
384+
<!--
385+
Things to keep in mind include: additional in-memory state, additional
386+
non-trivial computations, excessive access to disks (including increased log
387+
volume), significant amount of data sent and/or received over network, etc.
388+
Think through this both in small and large cases, again with respect to the
389+
[supported limits].
390+
391+
[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
392+
-->
393+
394+
### Troubleshooting
395+
396+
<!--
397+
This section must be completed when targeting beta to a release.
398+
399+
The Troubleshooting section currently serves the `Playbook` role. We may consider
400+
splitting it into a dedicated `Playbook` document (potentially with some monitoring
401+
details). For now, we leave it here.
402+
-->
403+
404+
###### How does this feature react if the API server and/or etcd is unavailable?
405+
406+
###### What are other known failure modes?
407+
408+
<!--
409+
For each of them, fill in the following information by copying the below template:
410+
- [Failure mode brief description]
411+
- Detection: How can it be detected via metrics? Stated another way:
412+
how can an operator troubleshoot without logging into a master or worker node?
413+
- Mitigations: What can be done to stop the bleeding, especially for already
414+
running user workloads?
415+
- Diagnostics: What are the useful log messages and their required logging
416+
levels that could help debug the issue?
417+
Not required until feature graduated to beta.
418+
- Testing: Are there any tests for failure mode? If not, describe why.
419+
-->
420+
421+
###### What steps should be taken if SLOs are not being met to determine the problem?
422+
153423
## Drawbacks
154424

155425
- The initial implementation effort from the release engineering perspective
@@ -162,4 +432,5 @@ KEP focuses more on the "What" aspects rather than the "How".
162432

163433
## Implementation History
164434

435+
- 2022-01-27 Updated to contain test plan and correct milestones
165436
- 2021-11-29 Initial Draft
