Skip to content

Commit 459359d

Browse files
authored
Merge pull request kubernetes#1608 from mikebrow/ensure-secret-pulled-image
ensure secret pulled images
2 parents bc9d6c9 + 64a17e7 commit 459359d

File tree

3 files changed

+421
-0
lines changed

3 files changed

+421
-0
lines changed
Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
kep-number: 2535
2+
alpha:
3+
approver: "@johnbelamaric"
Lines changed: 387 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,387 @@
1+
# KEP-2535: Ensure Secret Pulled Images
2+
3+
## Table of Contents
4+
5+
<!-- toc -->
6+
- [Release Signoff Checklist](#release-signoff-checklist)
7+
- [Summary](#summary)
8+
- [Motivation](#motivation)
9+
- [Goals](#goals)
10+
- [Non-Goals](#non-goals)
11+
- [Proposal](#proposal)
12+
- [User Stories](#user-stories)
13+
- [Story 1](#story-1)
14+
- [Story 2](#story-2)
15+
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
16+
- [Risks and Mitigations](#risks-and-mitigations)
17+
- [Design Details](#design-details)
18+
- [Test Plan](#test-plan)
19+
- [Graduation Criteria](#graduation-criteria)
20+
- [Alpha](#alpha)
21+
- [Deprecation](#deprecation)
22+
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
23+
- [Version Skew Strategy](#version-skew-strategy)
24+
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
25+
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
26+
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
27+
- [Monitoring Requirements](#monitoring-requirements)
28+
- [Dependencies](#dependencies)
29+
- [Scalability](#scalability)
30+
- [Troubleshooting](#troubleshooting)
31+
- [Implementation History](#implementation-history)
32+
- [Drawbacks [optional]](#drawbacks-optional)
33+
- [Alternatives [optional]](#alternatives-optional)
34+
- [Infrastructure Needed [optional]](#infrastructure-needed-optional)
35+
<!-- /toc -->
36+
37+
## Release Signoff Checklist
38+
39+
Items marked with (R) are required *prior to targeting to a milestone / release*.
40+
41+
- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
42+
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
43+
- [ ] (R) Design details are appropriately documented
44+
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
45+
- [ ] e2e Tests for all Beta API Operations (endpoints)
46+
- [ ] (R) Ensure GA e2e tests for meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
47+
- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
48+
- [ ] (R) Graduation criteria is in place
49+
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
50+
- [ ] (R) Production readiness review completed
51+
- [ ] (R) Production readiness review approved
52+
- [ ] "Implementation History" section is up-to-date for milestone
53+
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
54+
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
55+
56+
57+
[kubernetes.io]: https://kubernetes.io/
58+
[kubernetes/enhancements]: https://git.k8s.io/enhancements
59+
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
60+
[kubernetes/website]: https://git.k8s.io/website
61+
62+
63+
## Summary
64+
65+
We will add support in kubelet for the pullIfNotPresent image pull policy, for
66+
ensuring images pulled with pod imagePullSecrets are re-authenticated for other
67+
pods that do not have the same imagePullSecret/auths used to successfully pull
68+
the images in the first place.
69+
70+
This policy change will have no affect on the `pull always` image pull
71+
policy or for images that are preloaded.
72+
73+
However, for the `pull never` policy if a first pod successfully pulled an image
74+
with credential and then a second pod with pull never tried to use the image,
75+
when the feature gate is on the second pod will receive an error message, where
76+
before and with the feature gate off the second pod would be able to use the image
77+
pulled with credentials by the first pod.
78+
79+
This new feature will be enabled with a feature gate in alpha. This feature
80+
improves the security posture for privacy/security of image contents by forcing
81+
images pulled with an imagePullSecret/auth of a first pod to be re-authenticated
82+
for a second pod even if the image is already present through the secure pull of
83+
the first pod.
84+
85+
The new behavior means that if a first pod results in an image pulled with
86+
imagePullSecrets a second pod would have to also have rights to the image in
87+
order to use a present image.
88+
89+
This means that the image pull policy alwaysPull would no longer be required in
90+
every scenario to ensure image access rights by pods.
91+
92+
## Motivation
93+
94+
There have been customer requests for improving upon kubernetes' ability to
95+
secure images pulled with auth. on a node. Issue
96+
[#18787](https://github.com/kubernetes/kubernetes/issues/18787) has been around
97+
for a while.
98+
99+
To secure images one currently needs to inject `AllwaysPullImages` into pod
100+
specs via an admission plugin. As @liggitt [notes](https://github.com/kubernetes/kubernetes/issues/18787#issuecomment-532280931)
101+
the `pull` does not re-pull already-pulled layers of the image, but simply
102+
resolves/verifies the image manifest has not changed in the registry (which
103+
incidentally requires authenticating to private registries, which enforces the
104+
image access). That means in the normal case (where the image has not changed
105+
since the last pull), the request size is O(kb). However, the `pull` does put
106+
the registry in the critical path of starting a container, since an unavailable
107+
registry will fail the pull image manifest check (with or without proper
108+
authentication.)
109+
110+
### Goals
111+
112+
Modify the current pullIfNotPresent policy management enforced by `kubelet` to
113+
ensure the images pulled with a secret by `kubelet` since boot. During the
114+
EnsureImagesExist step `kubelet` will require authentication of present images
115+
pulled with auth since boot.
116+
117+
Optimize to only force re-authentication for a pod container image when the
118+
secret used to pull the container image is not present. IOW if an image is
119+
pulled with authentication for a first pod, subsequent pods that have the same
120+
authentication information should not need to re-authenticate.
121+
122+
Images already present at boot or loaded externally to `kubelet` or successfully
123+
pulled through `kubelet` with no imagePullSecret/authentication required will
124+
not require authentication.
125+
126+
### Non-Goals
127+
128+
Out of scope for this KEP is an image caching policy that would direct container
129+
runtimes through the CRI wrt. how they should treat the caching of images on a
130+
node. Such as store for public use but only if encrypted. Or Store for private
131+
use un-encrypted...
132+
133+
## Proposal
134+
135+
For alpha `kubelet` will keep a list, since boot, of container images that required
136+
authentication and a list of the authentications that successfully pulled the image.
137+
For beta the list will be persisted across reboot of host, and restart of kubelet.
138+
Additionally, an API will be considered to manage the ensure metadata.
139+
140+
`kubelet` will ensure any image in the list is always pulled if an authentication
141+
used is not present, thus enforcing authentication / re-authentication.
142+
143+
144+
### User Stories
145+
146+
#### Story 1
147+
User with multiple tenants will be able to support all image pull policies without
148+
concern that one tenant will gain access to an image that they don't have rights to.
149+
150+
#### Story 2
151+
User will will no longer have to inject the Pull Always Image Pull Policy to
152+
ensure all tenants have rights to the images that are already present on a host.
153+
154+
155+
### Notes/Constraints/Caveats (Optional)
156+
157+
With the default of the feature gate being off, users / cloud providers will have
158+
to set the feature gate to true to gain these this Secure by Default benefit.
159+
160+
### Risks and Mitigations
161+
162+
Image authentications with a registry may expire. To mitigate expirations a
163+
a timeout could be used to force re-authentication. The timeout could be a
164+
container runtime feature or a `kubelet` feature. If at the container runtime,
165+
images would not be present during the EnsureImagesExist step, thus would have
166+
to be pulled and authenticated if necessary. This timeout feature will be
167+
implemented in beta.
168+
169+
Since images can be pre-loaded, loaded outside the `kubelet` process, and
170+
garbage collected.. the list of images that required authentication in `kubelet`
171+
will not be a source of truth for how all images were pulled that are in the
172+
container runtime cache. To mitigate, images can be garbage collected at boot.
173+
And for beta, we will persist ensure metadata across reboot of host, and restart
174+
of kubelet, and possibly look at a way to add ensure metadata for images loaded
175+
outside of kubelet. In beta we will add a switch to enable re-auth on boot for
176+
admins seeking that instead of having to garbage collect where they do not use
177+
or expect preloaded images since boot.
178+
179+
180+
## Design Details
181+
182+
Kubelet will track, in memory, a hash map for the credentials that were successfully used to pull an image. The hash map
183+
will not be persisted to disk, in alpha. For alpha explicitly, we will not reuse or add other state manager concepts to kubelet.
184+
185+
See PR for detailed design / behavior documentation.
186+
187+
### Test Plan
188+
189+
For alpha, exhaustive Kubelet unit tests will be provided. Functions affected by the feature gate will be run with the feature gate on and with the feature gate off. Unit buckets will be provided for:
190+
- HashAuth - (new, small) returns a hash code for a CRI pull image auth [link](https://github.com/kubernetes/kubernetes/pull/94899/files#diff-ca08601dfd2fdf846f066d0338dc332beddd5602ab3a71b8fac95b419842da63R704-R751)
191+
- shouldPullImage - (modified, large sized change) determines if image should be pulled based on presence, and image pull policy, and now with the feature gate on if the image has been pulled/ensured by a secret. A unit test bucket did not exist for this function. The unit bucket will cover a matrix for:
192+
```
193+
pullIfNotPresent := &v1.Container{
194+
..
195+
ImagePullPolicy: v1.PullIfNotPresent,
196+
}
197+
pullNever := &v1.Container{
198+
..
199+
ImagePullPolicy: v1.PullNever,
200+
}
201+
pullAlways := &v1.Container{
202+
..
203+
ImagePullPolicy: v1.PullAlways,
204+
}
205+
tests := []struct {
206+
description string
207+
container *v1.Container
208+
imagePresent bool
209+
pulledBySecret bool
210+
ensuredBySecret bool
211+
expectedWithFGOff bool
212+
expectedWithFGOn bool
213+
}
214+
```
215+
[TestShouldPullImage link](https://github.com/kubernetes/kubernetes/pull/94899/files#diff-7297f08c72da9bf6479e80c03b45e24ea92ccb11c0031549e51b51f88a91f813R311-R438)
216+
217+
At beta we should revisit if integration buckets are warranted for e2e node and/or cri-tools/critest, and after gathering feedback.
218+
219+
### Graduation Criteria
220+
221+
#### Alpha
222+
223+
- Feature implemented behind a feature flag - KubeletEnsureSecretPulledImages
224+
- Initial e2e tests completed and enabled - No additional e2e identified as yet
225+
226+
#### Deprecation
227+
228+
N/A in alpha
229+
230+
### Upgrade / Downgrade Strategy
231+
232+
### Version Skew Strategy
233+
234+
N/A for alpha
235+
236+
## Production Readiness Review Questionnaire
237+
238+
### Feature Enablement and Rollback
239+
240+
###### How can this feature be enabled / disabled in a live cluster?
241+
242+
- [x] Feature gate (also fill in values in `kep.yaml`)
243+
- Feature gate name: KubeletEnsureSecretPulledImages
244+
- Components depending on the feature gate: kubelet
245+
246+
247+
###### Does enabling the feature change any default behavior?
248+
249+
Yes, see discussions above.
250+
251+
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
252+
253+
Yes.
254+
255+
###### What happens if we reenable the feature if it was previously rolled back?
256+
257+
Will go back to working as designed.
258+
259+
###### Are there any tests for feature enablement/disablement?
260+
261+
Yes, tests run both enabled and disabled.
262+
263+
### Rollout, Upgrade and Rollback Planning
264+
N/A
265+
266+
###### How can a rollout or rollback fail? Can it impact already running workloads?
267+
268+
N/A
269+
270+
###### What specific metrics should inform a rollback?
271+
272+
N/A
273+
274+
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
275+
276+
N/A
277+
278+
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
279+
280+
N/A
281+
282+
### Monitoring Requirements
283+
284+
N/A
285+
286+
###### How can an operator determine if the feature is in use by workloads?
287+
288+
Can check if images pulled with credentials by a first pod, are also pulled with credentials by a second pod that is
289+
using the pull if not present image pull policy. Will show up as network events. Though only the manifests will be
290+
revalidated against the container image repository, large contents will not be pulled. Thus one could monitor traffic
291+
to the registry.
292+
293+
###### How can someone using this feature know that it is working for their instance?
294+
295+
Can test for an image pull failure event coming from a second pod that does not have credentials to pull the image
296+
where the image is present and the image pull policy is if not present.
297+
298+
- [x] Events
299+
- Event Reason: "kubelet Failed to pull image" ... "unexpected status code [manifests ...]: 401 Unauthorized"
300+
301+
302+
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
303+
304+
N/A
305+
306+
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
307+
308+
N/A
309+
310+
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
311+
312+
N/A
313+
314+
### Dependencies
315+
316+
N/A for alpha
317+
318+
###### Does this feature depend on any specific services running in the cluster?
319+
320+
No.
321+
322+
### Scalability
323+
324+
N/A
325+
326+
###### Will enabling / using this feature result in any new API calls?
327+
328+
No.
329+
330+
###### Will enabling / using this feature result in introducing new API types?
331+
332+
No.
333+
334+
###### Will enabling / using this feature result in any new calls to the cloud provider?
335+
336+
No.
337+
338+
###### Will enabling / using this feature result in increasing size or count of the existing API objects?
339+
340+
Yes. When enabled, and when container images have been pulled with image pull secrets (credentials), subsequent image
341+
pulls for pods that do not contain the image pull secret that successfully pulled the image will have to authenticate
342+
by trying to pull the image manifests from the registry. The image layers do not have to be re-pulled, just the
343+
manifests for authentication purposes.
344+
345+
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
346+
347+
When switched on see above.
348+
349+
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
350+
351+
When switched on see above.
352+
353+
### Troubleshooting
354+
355+
N/A
356+
357+
###### How does this feature react if the API server and/or etcd is unavailable?
358+
359+
N/A
360+
361+
###### What are other known failure modes?
362+
363+
N/A
364+
365+
###### What steps should be taken if SLOs are not being met to determine the problem?
366+
367+
Check logs.
368+
369+
## Implementation History
370+
371+
tbd
372+
373+
## Drawbacks [optional]
374+
375+
Why should this KEP _not_ be implemented. N/A
376+
377+
## Alternatives [optional]
378+
379+
- Make the behavior change enabled by default by changing the feature gate to true by default instead of false by default.
380+
- Discussions went back and forth on whether this should go directly to GA as a fix or alpha as a feature gate. It seems this should be the default security posture for pullIfNotPresent as it is not clear to admins/users that an image pulled by a first pod with authentication can be used by a second pod without authentication. The performance cost should be minimal as only the manifest needs to be re-authenticated. But after further review and discussion with MrunalP we'll go ahead and have a kubelet feature gate with default off for alpha in v1.23.
381+
- Set the flag at some other scope e.g. pod spec (doing it at the pod spec was rejected by SIG-Node).
382+
- For beta/ga we may revisit/replace the in memory hash map in kubelet design, with an extension to the CRI API for having the container runtime
383+
ensure the image instead of kubelet.
384+
385+
## Infrastructure Needed [optional]
386+
387+
tbd

0 commit comments

Comments
 (0)