Commit 7abe3dd

KEP-2535: fill out PRR more
Signed-off-by: Peter Hunt <[email protected]>
1 parent f580413 commit 7abe3dd

1 file changed

+46
-30
lines changed
  • keps/sig-node/2535-ensure-secret-pulled-images

keps/sig-node/2535-ensure-secret-pulled-images/README.md

Lines changed: 46 additions & 30 deletions
@@ -326,65 +326,79 @@ TBD subsequent to alpha
326326

327327
### Feature Enablement and Rollback
328328

329-
- At Alpha this feature will be disabled by default with a feature gate.
330-
- At Beta this feature will be enabled by default with the feature gate.
331-
- At GA the ability to gate the feature will be removed leaving the feature enabled.
332-
333329
###### How can this feature be enabled / disabled in a live cluster?
334330

335331
- [x] Feature gate (also fill in values in `kep.yaml`)
336332
- Feature gate name: KubeletEnsureSecretPulledImages
337-
- Components depending on the feature gate: kubelet
338-
333+
- Components depending on the feature gate: Kubelet
334+
- [x] Other
335+
- Describe the mechanism: Kubelet configuration field `pullImageSecretRecheck`
336+
- Will enabling / disabling the feature require downtime of the control
337+
plane?
338+
- No, only a restart of the kubelet
339+
- Will enabling / disabling the feature require downtime or reprovisioning
340+
of a node?
341+
- Yes, as the kubelet must be restarted.
339342

340343
###### Does enabling the feature change any default behavior?
341344

342-
Yes, see discussions above.
345+
Yes. The behavior of `IfNotPresent` and `Never` pull policies will change, incurring more image pulls and pod creation failures, respectively.
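The change in pull-policy behavior described above can be sketched roughly as follows. This is a minimal illustration, not the kubelet's implementation: the `decide` helper, the `Action` type, and the `authorized` flag (meaning "this pod's credentials are already known to have pulled the image") are hypothetical names introduced only for this example.

```go
package main

import "fmt"

// PullPolicy mirrors the three Kubernetes image pull policies.
type PullPolicy string

const (
	PullAlways       PullPolicy = "Always"
	PullIfNotPresent PullPolicy = "IfNotPresent"
	PullNever        PullPolicy = "Never"
)

// Action is a hypothetical outcome type used only in this sketch.
type Action string

const (
	UseLocal   Action = "use-local-image"
	PullImage  Action = "pull-from-registry"
	FailCreate Action = "fail-pod-creation"
)

// decide is a hypothetical helper showing how the outcome changes when the
// feature is enabled. present: the image is already on the node.
// authorized: the pod's credentials are known to have pulled this image.
func decide(policy PullPolicy, present, authorized, featureEnabled bool) Action {
	needsRecheck := featureEnabled && !authorized
	switch policy {
	case PullAlways:
		// Always contacts the registry; the feature does not change this.
		return PullImage
	case PullIfNotPresent:
		// With the feature on, a present but unverified image is pulled again.
		if !present || needsRecheck {
			return PullImage
		}
		return UseLocal
	case PullNever:
		// With the feature on, a present but unverified image cannot be used.
		if !present || needsRecheck {
			return FailCreate
		}
		return UseLocal
	}
	return FailCreate
}

func main() {
	for _, enabled := range []bool{false, true} {
		fmt.Println("feature enabled:", enabled)
		fmt.Println("  IfNotPresent:", decide(PullIfNotPresent, true, false, enabled))
		fmt.Println("  Never:       ", decide(PullNever, true, false, enabled))
	}
}
```

In this sketch, only the case where the image is present but the pod's credentials are unverified differs between the enabled and disabled runs.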
343346

344347
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
345348

346349
Yes.
347350

348351
###### What happens if we reenable the feature if it was previously rolled back?
349352

350-
Will go back to working as designed.
353+
Images pulled during the period the feature was disabled will not be present in the cache, and thus could incur redundant pulls/container creation failures.
354+
However, the cache may still be present, and thus it will retain information from when it was previously enabled.
351355

352356
###### Are there any tests for feature enablement/disablement?
353357

354358
Yes, tests run both enabled and disabled.
355359

356360
### Rollout, Upgrade and Rollback Planning
357361

358-
TBD
359-
360362
###### How can a rollout or rollback fail? Can it impact already running workloads?
361363

362-
TBD
364+
Rollout can fail if the registry is unavailable when the kubelet starts and attempts to create pods, because with this feature enabled
365+
the kubelet is generally more sensitive to registry downtime.
366+
367+
Rollback should not fail for this feature specifically. The kubelet will no longer use the cache to determine whether credentials were used, and
368+
the behavior of the pull policies will revert to the previous behavior.
363369

364370
###### What specific metrics should inform a rollback?
365371

366-
TBD needed for Beta
372+
If the feature gate is enabled, but the kubelet configuration field is not enabled, the kubelet will gather metrics `image_pull_secret_recheck_miss` and
373+
`image_pull_secret_recheck_hit`, which will both be histograms counting the number of images that had a cache miss or a cache hit, respectively (a miss can occur despite the image potentially being present).
374+
375+
This will allow an admin to see how many images would have reauthorization checks done.
376+
377+
A histogram was chosen to allow an admin to compare registry uptime with cache misses, as the main failure scenario is that registry unavailability
378+
could cause pods not to come up, because the kubelet doesn't have credentials cached.
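To make the hit/miss accounting concrete, here is a minimal sketch of a credential cache that records which secrets have successfully pulled an image and counts lookups. The `recheckCache` type, its methods, and the use of a SHA-256 hash as the cache key are assumptions for illustration only. Plain integer counters stand in for the histograms named above.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// recheckCache is a hypothetical in-memory structure: for each image it keeps
// the set of credential hashes that have successfully pulled that image.
type recheckCache struct {
	pulled map[string]map[string]struct{}
	hits   int // stand-in for image_pull_secret_recheck_hit
	misses int // stand-in for image_pull_secret_recheck_miss
}

func newRecheckCache() *recheckCache {
	return &recheckCache{pulled: make(map[string]map[string]struct{})}
}

// hashSecret avoids storing the raw credential in the cache.
func hashSecret(secret string) string {
	sum := sha256.Sum256([]byte(secret))
	return hex.EncodeToString(sum[:])
}

// record notes that the given secret successfully pulled the image.
func (c *recheckCache) record(image, secret string) {
	if c.pulled[image] == nil {
		c.pulled[image] = make(map[string]struct{})
	}
	c.pulled[image][hashSecret(secret)] = struct{}{}
}

// check reports whether the secret is already known for the image and
// updates the hit or miss counter accordingly.
func (c *recheckCache) check(image, secret string) bool {
	_, ok := c.pulled[image][hashSecret(secret)]
	if ok {
		c.hits++
	} else {
		c.misses++
	}
	return ok
}

func main() {
	c := newRecheckCache()
	c.record("registry.example/app:v1", "secret-a")

	c.check("registry.example/app:v1", "secret-a") // known credential: hit
	c.check("registry.example/app:v1", "secret-b") // image present, unknown credential: miss

	fmt.Printf("hits=%d misses=%d\n", c.hits, c.misses)
}
```

The second lookup illustrates a miss "despite the image potentially being present": the image is on the node, but the cache has no record of that credential pulling it.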
367379

368380
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
369381

370-
TBD
382+
They can be. The presence of a feature gate and kubelet configuration will make this path safe. Plus, there are no API objects that could cause issues.
371383

372384
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
373385

374-
TBD
386+
No
375387

376388
### Monitoring Requirements
377389

378-
TBD
379-
380390
###### How can an operator determine if the feature is in use by workloads?
381391

382-
For alpha can check if images pulled with credentials by a first pod, are also pulled with credentials by a second pod that is
383-
using the pull if not present image pull policy. Will show up as network events. Though only the manifests will be
384-
revalidated against the container image repository, large contents will not be pulled. Thus one could monitor traffic
385-
to the registry.
392+
When the feature is enabled, the kubelet will emit the metrics `image_pull_secret_recheck_miss` and `image_pull_secret_recheck_hit`, recorded when a cache miss or a cache hit happens, respectively.
393+
This will happen regardless of whether the feature is enabled in the kubelet via its configuration flag.
394+
395+
To determine if the feature is actually working, an operator will have to check manually.
396+
397+
A user could check whether images pulled with credentials by a first pod are also pulled with credentials by a second pod that is
398+
using the `IfNotPresent` image pull policy.
386399

387-
For beta will add metrics allowing an admin to determine how often an image has been reauthenticated to an image registry because of cache expiration or due to reuse across pods that have different authentication information. Success metrics will also be provided highlighting cache hits.
400+
It will also show up as network events. Only the manifests will be revalidated against the container image repository;
401+
large contents will not be pulled. Thus one could monitor traffic to the registry.
388402

389403
###### How can someone using this feature know that it is working for their instance?
390404

@@ -433,34 +447,36 @@ No.
433447

434448
###### Will enabling / using this feature result in increasing size or count of the existing API objects?
435449

450+
No, existing API objects will be unchanged.
451+
452+
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
453+
436454
Yes. When enabled, and when container images have been pulled with image pull secrets (credentials), subsequent image
437455
pulls for pods that do not contain the image pull secret that successfully pulled the image will have to authenticate
438456
by trying to pull the image manifests from the registry. The image layers do not have to be re-pulled, just the
439457
manifests for authentication purposes.
440458

441-
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
442-
443-
When switched on see above.
459+
However, this registry round-trip will slow down the pod creation process. This slowdown is the cost of the added security this feature provides.
444460

445461
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
446462

447-
When switched on see above.
463+
Similar to the increased time above, there will also be a CPU/memory/IO cost when the kubelet instructs the CRI implementation to repull the image
464+
redundantly.
448465

449466
### Troubleshooting
450467

451-
TBD
452-
453468
###### How does this feature react if the API server and/or etcd is unavailable?
454469

455-
TBD
470+
This feature doesn't interact with the API server or etcd.
456471

457472
###### What are other known failure modes?
458473

459-
TBD
474+
A registry being unavailable is going to be a common failure mode for this feature. Unfortunately, this is the cost of this feature. The kubelet
475+
needs to go through the authentication process redundantly, and that will mean the cluster will be more sensitive to registry downtime.
460476

461477
###### What steps should be taken if SLOs are not being met to determine the problem?
462478

463-
Check logs.
479+
Reduce the number of cache misses (as seen through the metrics) by ensuring similar credentials are shared among images.
464480

465481
## Implementation History
466482
