keps/sig-node/2535-ensure-secret-pulled-images/README.md
41 additions & 24 deletions
@@ -61,7 +61,7 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
## Summary

-We will add support in kubelet for the pullIfNotPresent image pull policy, for
+We will add support in kubelet for the `pullIfNotPresent` image pull policy, for
ensuring images pulled with pod imagePullSecrets are re-authenticated for other
pods that do not have the same imagePullSecret/auths used to successfully pull
the images in the first place.
@@ -88,6 +88,8 @@ order to use a present image.
This means that the image pull policy alwaysPull would no longer be required in
every scenario to ensure image access rights by pods.

+*** The issue, and these changes that improve the security posture without forcing an always-pull policy, will be documented in the Kubernetes image pull policy documentation. The new feature gate should also be documented in the release notes. ***
+
## Motivation

There have been customer requests for improving upon kubernetes' ability to
@@ -159,11 +161,11 @@ to set the feature gate to true to gain these this Secure by Default benefit.
### Risks and Mitigations

Image authentications with a registry may expire. To mitigate expirations,
-a timeout could be used to force re-authentication. The timeout could be a
+a timeout will be used to force re-authentication. The timeout could be a
container runtime feature or a `kubelet` feature. If at the container runtime,
images would not be present during the EnsureImagesExist step, thus would have
to be pulled and authenticated if necessary. This timeout feature will be
-implemented in beta.
+implemented in alpha.

Since images can be pre-loaded, loaded outside the `kubelet` process, and
garbage collected, the list of images that required authentication in `kubelet`
@@ -180,13 +182,19 @@ or expect preloaded images since boot.
Kubelet will track, in memory, a hash map for the credentials that were successfully used to pull an image. It has been decided that the hash map will be persisted to disk, in alpha.
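As a rough illustration of the bookkeeping described above, the following Go sketch hashes a set of pull credentials with SHA-256 (the review notes in this diff ask for SHA256 rather than a plain hash) and records, per image, which credential hashes have pulled it successfully. The names `authConfig`, `hashAuth`, `ensuredImages`, `record`, and `needsReauth` are hypothetical and chosen only for this example; they are not the kubelet's actual types or functions, and the decision rule is an assumption about the intended behavior.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// authConfig is a hypothetical stand-in for the credentials sent with an
// image pull; it is not the real CRI type.
type authConfig struct {
	Username      string `json:"username"`
	Password      string `json:"password"`
	ServerAddress string `json:"serverAddress"`
}

// hashAuth returns a hex-encoded SHA-256 digest of the credentials.
// A nil auth (anonymous pull) hashes to the empty string.
func hashAuth(a *authConfig) string {
	if a == nil {
		return ""
	}
	b, err := json.Marshal(a) // struct fields marshal in declaration order
	if err != nil {
		return ""
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

// ensuredImages maps an image reference to the set of credential hashes
// that successfully pulled that image.
type ensuredImages map[string]map[string]bool

// record notes that the image was pulled successfully with these credentials.
func (e ensuredImages) record(image string, a *authConfig) {
	if e[image] == nil {
		e[image] = map[string]bool{}
	}
	e[image][hashAuth(a)] = true
}

// needsReauth reports whether an already present image must be
// re-authenticated: it was pulled with tracked credentials, but not with
// these. Untracked images (for example preloaded ones) are assumed here to
// need no check.
func (e ensuredImages) needsReauth(image string, a *authConfig) bool {
	hashes, tracked := e[image]
	if !tracked {
		return false
	}
	return !hashes[hashAuth(a)]
}

func main() {
	cache := ensuredImages{}
	podA := &authConfig{Username: "alice", Password: "s3cret", ServerAddress: "registry.example.com"}
	podB := &authConfig{Username: "bob", Password: "other", ServerAddress: "registry.example.com"}

	cache.record("registry.example.com/app:v1", podA)
	fmt.Println("pod A needs re-auth:", cache.needsReauth("registry.example.com/app:v1", podA))
	fmt.Println("pod B needs re-auth:", cache.needsReauth("registry.example.com/app:v1", podB))
}
```

Storing only a digest means the map never holds the secret itself, which matters once the map is written to disk.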

+The persisted "cache" will undergo cleanup operations on a regular schedule (by default once an hour).
+
+The on-disk cache is persisted mainly so that its state survives a kubelet restart and/or a node reboot.
+
+The maximum size of the cache will scale with the number of unique cache entries multiplied by the number of unique images that have not been garbage collected. This is not expected to be a significant number of bytes; it will be verified by actual use in alpha and by subsequent metrics in beta.
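The expiry, cleanup, and persistence behavior described above can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not the kubelet implementation: the type and function names (`entry`, `imageState`, `pullCache`, `record`, `cleanup`, `marshal`, `unmarshalCache`), the `defaultTTL` value, and the `demoStart` timestamp are all hypothetical. Only the `authHash`, `ensured`, and `dueDate` field names are taken from the JSON excerpt in this KEP.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// defaultTTL is an assumed entry lifetime; the KEP does not fix a value.
const defaultTTL = time.Hour

// demoStart is an arbitrary fixed timestamp used only by this example.
var demoStart = time.Date(2023, 5, 30, 5, 0, 0, 0, time.UTC)

// entry mirrors the "ensured" and "dueDate" fields of the JSON excerpt.
type entry struct {
	Ensured bool      `json:"ensured"`
	DueDate time.Time `json:"dueDate"`
}

// imageState holds the credential hashes recorded for one image.
type imageState struct {
	AuthHash map[string]entry `json:"authHash"`
}

// pullCache maps an image reference to its state.
type pullCache map[string]*imageState

// record stores a successful pull and the time after which the
// credentials must be re-authenticated.
func (c pullCache) record(image, authHash string, now time.Time) {
	st, ok := c[image]
	if !ok {
		st = &imageState{AuthHash: map[string]entry{}}
		c[image] = st
	}
	st.AuthHash[authHash] = entry{Ensured: true, DueDate: now.Add(defaultTTL)}
}

// cleanup removes entries whose due date has passed, drops images that have
// no entries left, and returns the number of entries removed.
func (c pullCache) cleanup(now time.Time) int {
	removed := 0
	for image, st := range c {
		for hash, e := range st.AuthHash {
			if !now.Before(e.DueDate) {
				delete(st.AuthHash, hash)
				removed++
			}
		}
		if len(st.AuthHash) == 0 {
			delete(c, image)
		}
	}
	return removed
}

// marshal encodes the cache as JSON, ready to be written to disk.
func (c pullCache) marshal() ([]byte, error) {
	return json.Marshal(c)
}

// unmarshalCache rebuilds a cache from previously persisted JSON.
func unmarshalCache(data []byte) (pullCache, error) {
	c := pullCache{}
	if err := json.Unmarshal(data, &c); err != nil {
		return nil, err
	}
	return c, nil
}

func main() {
	c := pullCache{}
	c.record("registry.example.com/app:v1", "115b8808c3e7f073", demoStart)

	data, err := c.marshal()
	if err != nil {
		panic(err)
	}
	fmt.Println(string(data))

	// Simulate a kubelet restart: reload the state, then expire it.
	restored, err := unmarshalCache(data)
	if err != nil {
		panic(err)
	}
	removed := restored.cleanup(demoStart.Add(2 * defaultTTL))
	fmt.Println("removed:", removed, "images left:", len(restored))
}
```

In a long-running process, a periodic caller (for example one driven by a `time.Ticker`) would invoke `cleanup` and then persist the result, so the cache stays bounded by the number of live image and credential pairs.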
See `/var/lib/kubelet/image_manager_state` in [kubernetes/kubernetes#114847](https://github.com/kubernetes/kubernetes/pull/114847)
> "authHash": { ** per review comment use SHA256 here vs hash **
> "115b8808c3e7f073": {
> "ensured": true,
> "dueDate": "2023-05-30T05:26:53.76740982+08:00"
@@ -203,7 +211,7 @@ See PR linked above for detailed design / behavior documentation.
### Test Plan

For alpha, exhaustive Kubelet unit tests will be provided. Functions affected by the feature gate will be run with the feature gate on and with the feature gate off. Unit buckets will be provided for:
-- HashAuth - (new, small) returns a hash code for a CRI pull image auth [link](https://github.com/kubernetes/kubernetes/pull/94899/files#diff-ca08601dfd2fdf846f066d0338dc332beddd5602ab3a71b8fac95b419842da63R704-R751)
+- HashAuth - (new, small) returns a hash code for a CRI pull image auth [link](https://github.com/kubernetes/kubernetes/pull/94899/files#diff-ca08601dfd2fdf846f066d0338dc332beddd5602ab3a71b8fac95b419842da63R704-R751) ** per review comment will use SHA256 **
- shouldPullImage - (modified, large sized change) determines if image should be pulled based on presence, and image pull policy, and now with the feature gate on if the image has been pulled/ensured by a secret. A unit test bucket did not exist for this function. The unit bucket will cover a matrix for:
```
pullIfNotPresent := &v1.Container{
@@ -230,7 +238,7 @@ For alpha, exhaustive Kubelet unit tests will be provided. Functions affected by
PersistHashMeta() ** will be persisting SHA256 entries vs hash **
At beta we should revisit if integration buckets are warranted for e2e node and/or cri-tools/critest, and after gathering feedback.
@@ -244,16 +252,21 @@ At beta we should revisit if integration buckets are warranted for e2e node and/
#### Deprecation

N/A in alpha
+TBD subsequent to alpha

### Upgrade / Downgrade Strategy

### Version Skew Strategy

N/A for alpha
+TBD subsequent to alpha

## Production Readiness Review Questionnaire

### Feature Enablement and Rollback
+- At Alpha this feature will be disabled by default with a feature gate.
+- At Beta this feature will be enabled by default with the feature gate.
+- At GA the ability to gate the feature will be removed leaving the feature enabled.

###### How can this feature be enabled / disabled in a live cluster?
@@ -274,40 +287,44 @@ Yes.
Will go back to working as designed.

+enj comment: Admin would need to go back to whatever old way they were using to enforce this image pull auth check. And also, as the feature is rolling out to kubelets (which is slow), they need to retain any API server based checks until rollout has completed.
+
###### Are there any tests for feature enablement/disablement?

Yes, tests run both enabled and disabled.

### Rollout, Upgrade and Rollback Planning
-N/A
+TBD

###### How can a rollout or rollback fail? Can it impact already running workloads?

-N/A
+TBD

###### What specific metrics should inform a rollback?

-N/A
+TBD needed for Beta

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

-N/A
+TBD

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

-N/A
+TBD

### Monitoring Requirements

-N/A
+TBD

###### How can an operator determine if the feature is in use by workloads?

-Can check if images pulled with credentials by a first pod, are also pulled with credentials by a second pod that is
+For alpha, one can check whether images pulled with credentials by a first pod are also pulled with credentials by a second pod that is
using the pull if not present image pull policy. Will show up as network events. Though only the manifests will be
revalidated against the container image repository, large contents will not be pulled. Thus one could monitor traffic
to the registry.

+For beta, metrics will be added that allow an admin to determine how often an image has been reauthenticated to an image registry, whether because of cache expiration or because of reuse across pods that have different authentication information. Success metrics highlighting cache hits will also be provided.
+
###### How can someone using this feature know that it is working for their instance?

Can test for an image pull failure event coming from a second pod that does not have credentials to pull the image
@@ -319,27 +336,27 @@ where the image is present and the image pull policy is if not present.

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

-N/A
+TBD

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

-N/A
+TBD

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

-N/A
+TBD needed for Beta

### Dependencies

-N/A for alpha
+TBD

###### Does this feature depend on any specific services running in the cluster?

No.

### Scalability

-N/A
+TBD

###### Will enabling / using this feature result in any new API calls?
@@ -370,27 +387,27 @@ When switched on see above.

### Troubleshooting

-N/A
+TBD

###### How does this feature react if the API server and/or etcd is unavailable?

-N/A
+TBD

###### What are other known failure modes?

-N/A
+TBD

###### What steps should be taken if SLOs are not being met to determine the problem?

Check logs.

## Implementation History

-tbd
+TBD

## Drawbacks [optional]

-Why should this KEP _not_ be implemented. N/A
+Why should this KEP _not_ be implemented. TBD

## Alternatives [optional]
@@ -402,4 +419,4 @@ ensure the image instead of kubelet.