@@ -62,7 +63,7 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 
 ## Summary
 
-We will add support in kubelet for the pullIfNotPresent image pull policy, for
+We will add support in kubelet for the `pullIfNotPresent` image pull policy, for
 ensuring images pulled with pod imagePullSecrets are re-authenticated for other
 pods that do not have the same imagePullSecret/auths used to successfully pull
 the images in the first place.
@@ -89,6 +90,8 @@ order to use a present image.
 This means that the image pull policy alwaysPull would no longer be required in
 every scenario to ensure image access rights by pods.
 
+*** The issue, and how these changes improve the security posture without forcing pull always, will be documented in the Kubernetes image pull policy documentation. The new feature gate should also be documented in the release notes. ***
+
 ## Motivation
 
 There have been customer requests for improving upon kubernetes' ability to
@@ -132,10 +135,10 @@ use un-encrypted...
 
 ## Proposal
 
-For alpha `kubelet` will keep a list, since boot, of container images that required
-authentication and a list of the authentications that successfully pulled the image.
-For beta the list will be persisted across reboot of host, and restart of kubelet.
-Additionally, an API will be considered to manage the ensure metadata.
+For alpha `kubelet` will keep a list, across reboots of host and restart of
+kubelet, of container images that required authentication and a list of the
+authentications that successfully pulled the image.
+For beta an API will be considered to manage the ensure metadata.
 
 `kubelet` will ensure any image in the list is always pulled if an authentication
 used is not present, thus enforcing authentication / re-authentication.
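
Review note: the rule in this hunk (a present image is pulled again when none of the pod's authentications is among those that already pulled it) might be sketched as below. This is only an illustration under assumptions: `ensuredImages`, `recordPull`, and `mustPull` are hypothetical names, the auth values are opaque hash strings, and images that never required authentication are treated as untracked. It is not the code in the PR.

```go
package main

import "fmt"

// ensuredImages is a hypothetical stand-in for the list kubelet would keep:
// image reference -> set of auth hashes that successfully pulled that image.
type ensuredImages map[string]map[string]bool

// recordPull notes that authHash successfully pulled image.
func (e ensuredImages) recordPull(image, authHash string) {
	if e[image] == nil {
		e[image] = map[string]bool{}
	}
	e[image][authHash] = true
}

// mustPull reports whether an image has to be pulled (and therefore
// re-authenticated) for a pod that offers the given auth hashes.
func (e ensuredImages) mustPull(image string, present bool, authHashes []string) bool {
	if !present {
		return true // nothing cached locally, pull as usual
	}
	known, tracked := e[image]
	if !tracked {
		return false // image is not in the list: it never required authentication
	}
	for _, h := range authHashes {
		if known[h] {
			return false // the pod holds an auth that already pulled this image
		}
	}
	return true // present, but none of the pod's auths is known for it
}

func main() {
	e := ensuredImages{}
	e.recordPull("registry.example/private:1", "hash-a")

	fmt.Println(e.mustPull("registry.example/private:1", true, []string{"hash-a"})) // same auth
	fmt.Println(e.mustPull("registry.example/private:1", true, nil))                // no auth
	fmt.Println(e.mustPull("registry.example/public:1", true, nil))                 // untracked image
}
```

In this sketch the second pod, which has no matching authentication, is the only one forced through a pull even though the image is already present.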
@@ -160,17 +163,17 @@ to set the feature gate to true to gain these this Secure by Default benefit.
 ### Risks and Mitigations
 
 Image authentications with a registry may expire. To mitigate expirations a
-a timeout could be used to force re-authentication. The timeout could be a
+timeout will be used to force re-authentication. The timeout could be a
 container runtime feature or a `kubelet` feature. If at the container runtime,
 images would not be present during the EnsureImagesExist step, thus would have
 to be pulled and authenticated if necessary. This timeout feature will be
-implemented in beta.
+implemented in alpha.
 
 Since images can be pre-loaded, loaded outside the `kubelet` process, and
 garbage collected.. the list of images that required authentication in `kubelet`
 will not be a source of truth for how all images were pulled that are in the
 container runtime cache. To mitigate, images can be garbage collected at boot.
-And for beta, we will persist ensure metadata across reboot of host, and restart
+And we will persist ensure metadata across reboot of host, and restart
 of kubelet, and possibly look at a way to add ensure metadata for images loaded
 outside of kubelet. In beta we will add a switch to enable re-auth on boot for
 admins seeking that instead of having to garbage collect where they do not use
@@ -179,15 +182,47 @@ or expect preloaded images since boot.
 
 ## Design Details
 
-Kubelet will track, in memory, a hash map for the credentials that were successfully used to pull an image. The hash map
-will not be persisted to disk, in alpha. For alpha explicitly, we will not reuse or add other state manager concepts to kubelet.
+Kubelet will track, in memory, a hash map of the credentials that were successfully used to pull an image. In alpha, the hash map will also be persisted to disk.
+
+The persisted "cache" will undergo periodic cleanup (by default, once an hour).
+
+The on-disk cache is persisted mainly so that it survives kubelet restarts and node reboots.
+
+The max size of the cache will scale with the number of unique cache entries * the number of unique images that have not been garbage collected. It is not expected that this will be a significant number of bytes. This will be verified by actual use in Alpha and subsequent metrics in Beta.
+
+See `/var/lib/kubelet/image_manager_state` in [kubernetes/kubernetes#114847](https://github.com/kubernetes/kubernetes/pull/114847)
-See PR for detailed design / behavior documentation.
+See PR linked above for detailed design / behavior documentation.
 
 ### Test Plan
 
+[x] I/we understand the owners of the involved components may require updates to
+existing tests to make this code solid enough prior to committing the changes
+necessary to implement this enhancement.
+
+##### Prerequisite testing updates
+
+
+##### Unit tests
+
 For alpha, exhaustive Kubelet unit tests will be provided. Functions affected by the feature gate will be run with the feature gate on and with the feature gate off. Unit buckets will be provided for:
-- HashAuth - (new, small) returns a hash code for a CRI pull image auth [link](https://github.com/kubernetes/kubernetes/pull/94899/files#diff-ca08601dfd2fdf846f066d0338dc332beddd5602ab3a71b8fac95b419842da63R704-R751)
+- HashAuth - (new, small) returns a hash code for a CRI pull image auth [link](https://github.com/kubernetes/kubernetes/pull/94899/files#diff-ca08601dfd2fdf846f066d0338dc332beddd5602ab3a71b8fac95b419842da63R704-R751) ** per review comment will use SHA256 **
 - shouldPullImage - (modified, large sized change) determines if image should be pulled based on presence, and image pull policy, and now with the feature gate on if the image has been pulled/ensured by a secret. A unit test bucket did not exist for this function. The unit bucket will cover a matrix for:
 ```
 pullIfNotPresent := &v1.Container{
@@ -214,7 +249,48 @@ For alpha, exhaustive Kubelet unit tests will be provided. Functions affected by
-At beta we should revisit if integration buckets are warranted for e2e node and/or cri-tools/critest, and after gathering feedback.
+PersistHashMeta() ** will be persisting SHA256 entries vs hash **
+
+Additionally, for Alpha we will update this readme with an enumeration of the core packages being touched by the PR to implement this enhancement and provide the current unit coverage for those in the form of:
+We expect no non-infra related flakes in the last month as a GA graduation criteria.
+-->
+At beta we will revisit if e2e buckets are warranted for e2e node, and after gathering feedback.
+
+- <test>: <link to test coverage> (TBD)
 
 ### Graduation Criteria
 
@@ -226,16 +302,21 @@ At beta we should revisit if integration buckets are warranted for e2e node and/
 #### Deprecation
 
 N/A in alpha
+TBD subsequent to alpha
 
 ### Upgrade / Downgrade Strategy
 
 ### Version Skew Strategy
 
 N/A for alpha
+TBD subsequent to alpha
 
 ## Production Readiness Review Questionnaire
 
 ### Feature Enablement and Rollback
+- At Alpha this feature will be disabled by default with a feature gate.
+- At Beta this feature will be enabled by default with the feature gate.
+- At GA the ability to gate the feature will be removed leaving the feature enabled.
 
 ###### How can this feature be enabled / disabled in a live cluster?
 
@@ -261,35 +342,37 @@ Will go back to working as designed.
 Yes, tests run both enabled and disabled.
 
 ### Rollout, Upgrade and Rollback Planning
-N/A
+TBD
 
 ###### How can a rollout or rollback fail? Can it impact already running workloads?
 
-N/A
+TBD
 
 ###### What specific metrics should inform a rollback?
 
-N/A
+TBD needed for Beta
 
 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
 
-N/A
+TBD
 
 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
 
-N/A
+TBD
 
 ### Monitoring Requirements
 
-N/A
+TBD
 
 ###### How can an operator determine if the feature is in use by workloads?
 
-Can check if images pulled with credentials by a first pod, are also pulled with credentials by a second pod that is
+For alpha, an operator can check if images pulled with credentials by a first pod are also pulled with credentials by a second pod that is
 using the pull if not present image pull policy. Will show up as network events. Though only the manifests will be
 revalidated against the container image repository, large contents will not be pulled. Thus one could monitor traffic
 to the registry.
 
+For beta, metrics will be added allowing an admin to determine how often an image has been reauthenticated to an image registry because of cache expiration or due to reuse across pods that have different authentication information. Success metrics will also be provided highlighting cache hits.
+
 ###### How can someone using this feature know that it is working for their instance?
 
 Can test for an image pull failure event coming from a second pod that does not have credentials to pull the image
@@ -301,27 +384,27 @@ where the image is present and the image pull policy is if not present.
 
 ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
 
-N/A
+TBD
 
 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
 
-N/A
+TBD
 
 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
 
-N/A
+TBD needed for Beta
 
 ### Dependencies
 
-N/A for alpha
+TBD
 
 ###### Does this feature depend on any specific services running in the cluster?
 
 No.
 
 ### Scalability
 
-N/A
+TBD
 
 ###### Will enabling / using this feature result in any new API calls?
 
@@ -352,27 +435,27 @@ When switched on see above.
 
 ### Troubleshooting
 
-N/A
+TBD
 
 ###### How does this feature react if the API server and/or etcd is unavailable?
 
-N/A
+TBD
 
 ###### What are other known failure modes?
 
-N/A
+TBD
 
 ###### What steps should be taken if SLOs are not being met to determine the problem?
 
 Check logs.
 
 ## Implementation History
 
-tbd
+TBD
 
 ## Drawbacks [optional]
 
-Why should this KEP _not_ be implemented. N/A
+Why should this KEP _not_ be implemented. TBD
 
 ## Alternatives [optional]
 
@@ -384,4 +467,4 @@ ensure the image instead of kubelet.