- Components depending on the feature gate: kubelet
338
-
333
+
- Components depending on the feature gate: Kubelet
334
+
- [x] Other
335
+
- Describe the mechanism: Kubelet configuration field `pullImageSecretRecheck`
336
+
- Will enabling / disabling the feature require downtime of the control
337
+
plane?
338
+
- No, only a restart of the kubelet
339
+
- Will enabling / disabling the feature require downtime or reprovisioning
340
+
of a node?
341
+
- Yes, as the kubelet must be restarted.
339
342
340
343
###### Does enabling the feature change any default behavior?
341
344
342
-
Yes, see discussions above.
345
+
Yes. The behavior of `IfNotPresent` and `Never` pull policies will change, incurring more image pulls and pod creation failures, respectively.
343
346
344
347
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
345
348
346
349
Yes.
347
350
348
351
###### What happens if we reenable the feature if it was previously rolled back?
349
352
350
-
Will go back to working as designed.
353
+
Images pulled during the period the feature was disabled will not be present in the cache, and thus could incur redundant pulls/container creation failures.
354
+
However, the cache may still be present, and thus it will retain information from when it was previously enabled.
351
355
352
356
###### Are there any tests for feature enablement/disablement?
353
357
354
358
Yes, tests run both enabled and disabled.
355
359
356
360
### Rollout, Upgrade and Rollback Planning
357
361
358
-
TBD
359
-
360
362
###### How can a rollout or rollback fail? Can it impact already running workloads?
361
363
362
-
TBD
364
+
Rollout can fail if the registry isn't available when the kubelet is starting and attempting to create pods, because with this feature
365
+
the kubelet is generally more sensitive to registry downtime.
366
+
367
+
Rollback should not fail for this feature specifically. The kubelet will no longer use the cache to determine whether credentials were used, and
368
+
the behavior of the pull policies will revert to the previous behavior.
363
369
364
370
###### What specific metrics should inform a rollback?
365
371
366
-
TBD needed for Beta
372
+
If the feature gate is enabled, but the kubelet configuration field is not enabled, the kubelet will gather metrics `image_pull_secret_recheck_miss` and
373
+
`image_pull_secret_recheck_hit`, which will both be histograms counting the number of images that had a cache miss or hit, respectively (despite the image potentially being present).
374
+
375
+
This will allow an admin to see how many images would have reauthorization checks done.
376
+
377
+
A histogram was chosen to allow an admin to compare registry uptime with cache misses, as the main failure scenario is that registry unavailability
378
+
could cause pods not to come up, because the kubelet doesn't have credentials cached.
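
As a rough illustration of how an admin might reason about these metrics before turning the configuration field on, the sketch below tallies cache hits and misses and reports the share of image lookups that would have triggered a reauthorization check. It is a minimal sketch, not the kubelet's implementation: the `recheckStats` type and its methods are hypothetical, and plain counters stand in for the histograms the KEP describes.

```go
package main

import "fmt"

// recheckStats is a hypothetical stand-in for the two metrics named in the
// KEP (image_pull_secret_recheck_hit and image_pull_secret_recheck_miss).
// Plain counters are used here instead of histograms to keep the sketch small.
type recheckStats struct {
	hits   int
	misses int
}

// observe records one cache lookup without changing pull behavior,
// mirroring the "gate on, configuration field off" observation mode.
func (s *recheckStats) observe(cacheHit bool) {
	if cacheHit {
		s.hits++
		return
	}
	s.misses++
}

// missRatio returns the fraction of lookups that missed the cache, i.e. the
// share of image uses that would need a registry round-trip once enforced.
func (s *recheckStats) missRatio() float64 {
	total := s.hits + s.misses
	if total == 0 {
		return 0
	}
	return float64(s.misses) / float64(total)
}

func main() {
	stats := &recheckStats{}
	for _, hit := range []bool{true, true, false, true} {
		stats.observe(hit)
	}
	fmt.Printf("hits=%d misses=%d miss ratio=%.2f\n", stats.hits, stats.misses, stats.missRatio())
}
```

A high miss ratio during a period of registry downtime would suggest that enabling enforcement could leave pods unable to start.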
367
379
368
380
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
369
381
370
-
TBD
382
+
They can be. The presence of a feature gate and kubelet configuration will make this path safe. Plus, there are no API objects that could cause issues.
371
383
372
384
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
373
385
374
-
TBD
386
+
No
375
387
376
388
### Monitoring Requirements
377
389
378
-
TBD
379
-
380
390
###### How can an operator determine if the feature is in use by workloads?
381
391
382
-
For alpha can check if images pulled with credentials by a first pod, are also pulled with credentials by a second pod that is
383
-
using the pull if not present image pull policy. Will show up as network events. Though only the manifests will be
384
-
revalidated against the container image repository, large contents will not be pulled. Thus one could monitor traffic
385
-
to the registry.
392
+
When the feature gate is enabled, the kubelet will emit the metrics `image_pull_secret_recheck_miss` and `image_pull_secret_recheck_hit`, recorded on a cache miss and a cache hit, respectively.
393
+
This will happen regardless of whether the feature is enabled in the kubelet via its configuration flag.
394
+
395
+
To determine whether the feature is actually working, an operator will have to check manually.
396
+
397
+
A user could check whether images pulled with credentials by a first pod are also pulled with credentials by a second pod that is
398
+
using the `IfNotPresent` image pull policy.
386
399
387
-
For beta will add metrics allowing an admin to determine how often an image has been reauthenticated to an image registry because of cache expiration or due to reuse across pods that have different authentication information. Success metrics will also be provided highlighting cache hits.
400
+
It will also show up as network events. Only the manifests will be revalidated against the container image repository;
401
+
large contents will not be pulled. Thus one could monitor traffic to the registry.
388
402
389
403
###### How can someone using this feature know that it is working for their instance?
390
404
@@ -433,34 +447,36 @@ No.
433
447
434
448
###### Will enabling / using this feature result in increasing size or count of the existing API objects?
435
449
450
+
No. Existing API objects will be unchanged.
451
+
452
+
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
453
+
436
454
Yes. When enabled, and when container images have been pulled with image pull secrets (credentials), subsequent image
437
455
pulls for pods that do not contain the image pull secret that successfully pulled the image will have to authenticate
438
456
by trying to pull the image manifests from the registry. The image layers do not have to be re-pulled, just the
439
457
manifests for authentication purposes.
440
458
441
-
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
442
-
443
-
When switched on see above.
459
+
However, this registry round-trip will slow down the pod creation process. This slowdown is the cost of the added security this feature provides.
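
The decision described above can be sketched roughly as follows. This is a simplified model under stated assumptions, not the kubelet's actual code: the `credentialCache` type, the `decide` function, and the `Action` values are all hypothetical, and the cache is modeled as a set of (image, credential hash) pairs that previously pulled successfully. The `Never` branch follows the KEP's statement that this policy will see more pod creation failures.

```go
package main

import "fmt"

// PullPolicy mirrors the three Kubernetes image pull policies.
type PullPolicy string

const (
	PullAlways       PullPolicy = "Always"
	PullIfNotPresent PullPolicy = "IfNotPresent"
	PullNever        PullPolicy = "Never"
)

// Action is a hypothetical label for what the kubelet would do next.
type Action string

const (
	UseLocal   Action = "use-local-image"
	PullImage  Action = "pull-from-registry"
	Reverify   Action = "reverify-manifest-with-registry"
	FailCreate Action = "fail-container-creation"
)

// credentialCache is a hypothetical record of which credential hashes have
// successfully pulled each image on this node.
type credentialCache map[string]map[string]bool

// record notes that credHash successfully pulled image.
func (c credentialCache) record(image, credHash string) {
	if c[image] == nil {
		c[image] = map[string]bool{}
	}
	c[image][credHash] = true
}

// hit reports whether credHash is already known to have pulled image.
func (c credentialCache) hit(image, credHash string) bool {
	return c[image][credHash]
}

// decide models the recheck: a present image is reused only on a cache hit.
// On a miss, IfNotPresent goes back to the registry for the manifest, and
// Never fails because it is not allowed to contact the registry.
func decide(policy PullPolicy, present bool, cache credentialCache, image, credHash string) Action {
	if policy == PullAlways {
		return PullImage
	}
	if !present {
		if policy == PullNever {
			return FailCreate
		}
		return PullImage
	}
	if cache.hit(image, credHash) {
		return UseLocal
	}
	if policy == PullNever {
		return FailCreate
	}
	return Reverify
}

func main() {
	cache := credentialCache{}
	cache.record("registry.example/app:v1", "hash-of-secret-a")

	// Same credentials as the original pull: no registry round-trip.
	fmt.Println(decide(PullIfNotPresent, true, cache, "registry.example/app:v1", "hash-of-secret-a"))
	// Different credentials: the manifest must be revalidated.
	fmt.Println(decide(PullIfNotPresent, true, cache, "registry.example/app:v1", "hash-of-secret-b"))
	// Different credentials with Never: container creation fails.
	fmt.Println(decide(PullNever, true, cache, "registry.example/app:v1", "hash-of-secret-b"))
}
```

Only the second and third calls in `main` differ from pre-feature behavior, which is where the extra latency and the additional failures come from.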
444
460
445
461
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
446
462
447
-
When switched on see above.
463
+
Similar to the increased time above, there will also be a CPU/memory/IO cost when the kubelet instructs the CRI implementation to repull the image
464
+
redundantly.
448
465
449
466
### Troubleshooting
450
467
451
-
TBD
452
-
453
468
###### How does this feature react if the API server and/or etcd is unavailable?
454
469
455
-
TBD
470
+
This feature doesn't interact with the API server or etcd.
456
471
457
472
###### What are other known failure modes?
458
473
459
-
TBD
474
+
A registry being unavailable is going to be a common failure mode for this feature. Unfortunately, this is the cost of this feature. The kubelet
475
+
needs to go through the authentication process redundantly, and that will mean the cluster will be more sensitive to registry downtime.
460
476
461
477
###### What steps should be taken if SLOs are not being met to determine the problem?
462
478
463
-
Check logs.
479
+
Reduce the number of cache misses (as seen through the metrics) by ensuring similar credentials are shared among images.