 Items marked with (R) are required *prior to targeting to a milestone / release*.
 
-- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
-- [ ] (R) KEP approvers have approved the KEP status as `implementable`
-- [ ] (R) Design details are appropriately documented
-- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
+- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
+- [x] (R) KEP approvers have approved the KEP status as `implementable`
+- [x] (R) Design details are appropriately documented
+- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
 - [ ] e2e Tests for all Beta API Operations (endpoints)
 - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
 - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
-- [ ] (R) Graduation criteria is in place
+- [x] (R) Graduation criteria is in place
 - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
 - [ ] (R) Production readiness review completed
 - [ ] (R) Production readiness review approved
-- [ ] "Implementation History" section is up-to-date for milestone
-- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
-- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+- [x] "Implementation History" section is up-to-date for milestone
+- [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [x] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
@@ -671,17 +670,15 @@ At beta we will revisit if e2e buckets are warranted for e2e node, and after gat
 - Kubelet cache is duplicated in memory and on disk. This is more resilient to I/O
   operation errors when reading from disk.
 
-#### Deprecation
-
-N/A in alpha
-TBD subsequent to alpha
-
 ### Upgrade / Downgrade Strategy
 
 ### Version Skew Strategy
 
-N/A for alpha
-TBD subsequent to alpha
+Version skew between the versions of the on-disk cache records should be handled
+during initialization of the cache by performing a storage migration (e.g. migrating
+the records from v1alpha1 to v1beta1).
+
+The kubelet should be able to read cache records within the supported kubelet skew.
 
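The storage migration added above can be illustrated with a small sketch. The record types and field names below are hypothetical stand-ins (the kubelet's real on-disk cache schema may differ); the sketch only shows the shape of a v1alpha1-to-v1beta1 record upgrade performed while the cache initializes:

```go
package main

import "fmt"

// Hypothetical record schemas; the kubelet's actual cache record types
// may differ in name and shape.
type pulledRecordV1alpha1 struct {
	APIVersion string // "v1alpha1"
	Image      string
}

type pulledRecordV1beta1 struct {
	APIVersion string // "v1beta1"
	ImageRef   string // assumed rename of the Image field in the newer schema
}

// migrateRecord upgrades a single cache record to the newer schema. The
// kubelet would run this over all records during cache initialization so
// that records written by an older kubelet remain readable.
func migrateRecord(old pulledRecordV1alpha1) (pulledRecordV1beta1, error) {
	if old.APIVersion != "v1alpha1" {
		return pulledRecordV1beta1{}, fmt.Errorf("unsupported record version %q", old.APIVersion)
	}
	return pulledRecordV1beta1{APIVersion: "v1beta1", ImageRef: old.Image}, nil
}

func main() {
	rec, err := migrateRecord(pulledRecordV1alpha1{APIVersion: "v1alpha1", Image: "registry.example/app:1.0"})
	if err != nil {
		panic(err)
	}
	fmt.Println(rec.APIVersion, rec.ImageRef)
}
```

Records already at the newest version would be passed through unchanged; only unknown versions are rejected.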
 ## Production Readiness Review Questionnaire
@@ -738,13 +735,32 @@ the behavior of the pull policies will revert to the previous behavior.
 
 ###### What specific metrics should inform a rollback?
 
-If the feature gate is enabled, the kubelet will gather metrics `image_pull_secret_recheck_miss` and
-`image_pull_secret_recheck_hit`, which are both histograms counting the number of images that had a cache miss/hit.
-
-This will allow an admin to see how many images have authorization checks done.
-
-A histogram was chosen to allow an admin to compare registry uptime with cache misses, as the main failure scenario is that registry unavailability
-could cause pods not to come up, because the kubelet doesn't have credentials cached.
+Enabling the feature will make the kubelet expose several metrics:
+- each caching layer should provide the following:
+  - `<cache_name>_<pullintents/pulledrecords>_total` gauge for the number of records (e.g. files, keys of a map)
+    that are kept in the cache
+- `image_mustpull_checks_total` counter vector tracks how many times the check for
+  image pull credentials was called
+  - labels:
+    - `result`:
+      - "credentialPolicyAllowed" for cases where the kubelet's credential verification pull policy allows
+        access to the image (e.g. via allowlist or to an image pre-pulled to the node)
+      - "credentialRecordFound" when a matching credential record was found in the cache, positively verifying access to an image
+      - "mustAuthenticate" when an additional check is required to verify access to the image, normally done by
+        authenticating and verifying manifest access at a container registry for the image when the layers are already local
+      - "error" in error cases
+- the `ensure_image_requests_total` counter vector tracks how many times EnsureImageExists()
+  was called
+  - labels:
+    - `pull_policy` - either of "never", "ifnotpresent" or "always", according to the container image pull policy
+    - `present_locally` - "true" if the image is present on the node per the container runtime, "false" if it isn't,
+      "unknown" if there was an error before this was determined
+    - `pull_required` - "true" if an image was requested to be pulled (e.g. by the credential reverification
+      mechanism or by the ImagePullPolicy from the Pod), "false" if not,
+      "unknown" if there was an error before this was determined
+
+This will allow an admin to see how many reverification checks are being requested for existing images and how
+many requests make it all the way to the persistent cache.
 
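The `result` label values above imply a check order: the credential policy is consulted first, then the credential cache, and only then must the registry be asked. The sketch below mirrors that order; `mustPullResult` and its boolean inputs are a hypothetical simplification, not the kubelet's actual code:

```go
package main

import "fmt"

// Possible values of the "result" label on the image_mustpull_checks_total
// counter vector, as described in the KEP text.
const (
	credentialPolicyAllowed = "credentialPolicyAllowed"
	credentialRecordFound   = "credentialRecordFound"
	mustAuthenticate        = "mustAuthenticate"
)

// mustPullResult mirrors the described decision order: policy allowlist
// first, then the credential cache, otherwise the registry must be
// consulted. (Hypothetical simplification; error handling omitted.)
func mustPullResult(policyAllows, recordInCache bool) string {
	switch {
	case policyAllows:
		return credentialPolicyAllowed
	case recordInCache:
		return credentialRecordFound
	default:
		return mustAuthenticate
	}
}

func main() {
	counts := map[string]int{} // stand-in for the Prometheus counter vector
	checks := []struct{ policy, cached bool }{
		{true, false}, {false, true}, {false, false}, {false, false},
	}
	for _, c := range checks {
		counts[mustPullResult(c.policy, c.cached)]++
	}
	fmt.Println(counts[credentialPolicyAllowed], counts[credentialRecordFound], counts[mustAuthenticate]) // 1 1 2
}
```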
 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
@@ -758,37 +774,58 @@ No
 
 ###### How can an operator determine if the feature is in use by workloads?
 
-When the feature is enabled, the kubelet will emit the metrics `image_pull_secret_recheck_miss` and `image_pull_secret_recheck_hit` whenever a cache miss or hit happens.
-This will happen regardless of whether the feature is enabled in the kubelet via its configuration flag.
-
-To determine if the feature is actually working, they will have to check manually.
+When the feature is enabled, the kubelet will emit a metric named `image_mustpull_checks_total`.
 
-A user could check if images pulled with credentials by a first pod are also pulled with credentials by a second pod that is
-using the pull-if-not-present image pull policy.
+Admins can also check the node's filesystem, where a directory `image_manager` with subdirectories
+`pulling` and `pulled` should be present in the kubelet's main directory.
 
-It also will show up as network events. Though only the manifests will be revalidated against the container image repository,
-large contents will not be pulled. Thus one could monitor traffic to the registry.
+If the feature was used by at least one workload that successfully started and is running a container with
+a non-preloaded image (based on the policy), they should be able to find a file with a matching record of a
+pulled image in the `<kubelet dir>/image_manager/pulled` directory at the node that
+is running the workload's pod. The filename structure for these directories is described
+in [Cache Directory Structure](#cache-directory-structure).
 
 ###### How can someone using this feature know that it is working for their instance?
 
-Can test for an image pull failure event coming from a second pod that does not have credentials to pull the image
-where the image is present and the image pull policy is if not present.
+Users are able to observe events in case workloads in their namespaces
+were successfully able to retrieve an image that was previously pulled to a node.
 
 - [x] Events
-  - Event Reason: "kubelet Failed to pull image" ... "unexpected status code [manifests ...]: 401 Unauthorized"
-
+  - Event message: "Container image ... already present on machine **and can be accessed by the pod**"
 
 ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
 
-TBD
+Since the kubelet does not provide image pull success rate at the time of writing,
+the SLO for the feature should be based on the remote registries' image pull success
+rate.
+
+The use of the feature may result in an increased rate of image pull requests
+when compared to the old use of the "IfNotPresent" pod image pull policy. Given
+how image pulling works, depending on the container registry authorization implementation,
+this might mean an increase of 401 (Unauthorized), 403 (Forbidden) or 404 (NotFound) errors,
+but these should be directly proportional to the number of successful pulls.
+
+A disproportionate increase in the number of unsuccessful pulls would suggest misuse
+of pods' "IfNotPresent" image pull policy within the cluster.
 
 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
 
-TBD
+These metrics might spike when the node is getting initialized and so
+it makes sense to observe these in a later stage of the node's lifetime.
+should usually be around 0, indicates the number of tracked in-flight image pulls.
+- Metric name: if `kubelet_imagemanager_ensure_image_requests_total{pull_policy!="always", present_locally="true", pull_required="true"}` (i.e. "reverifies requested") gets
+  too close to `kubelet_imagemanager_ensure_image_requests_total{pull_policy!="always", present_locally!="unknown", pull_required!="unknown"}` (i.e. "all non-error image requests"),
+  this might suggest either a) credentials are not very commonly shared within the cluster;
+  or b) there is a problem with the credential-tracking cache. To consider the values
+  too close, "reverifies requested" / "all non-error image requests" would be above 0.9.
+- Components exposing the metric: kubelet
 
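The "too close" heuristic above amounts to a simple ratio check over two counter samples. A hypothetical helper (the function name and the 0.9 cutoff as a parameterized threshold are illustrative, not part of the kubelet):

```go
package main

import "fmt"

// reverifyRatio computes "reverifies requested" / "all non-error image
// requests" from two counter samples and reports whether the ratio exceeds
// the alerting threshold (0.9 in the heuristic above).
func reverifyRatio(reverifies, nonErrorRequests float64) (float64, bool) {
	if nonErrorRequests == 0 {
		return 0, false // no traffic yet, nothing to conclude
	}
	r := reverifies / nonErrorRequests
	return r, r > 0.9
}

func main() {
	ratio, tooClose := reverifyRatio(950, 1000)
	fmt.Printf("%.2f %v\n", ratio, tooClose)
}
```

In practice an operator would evaluate this over a time window (e.g. with a Prometheus rate query) rather than over raw counter totals, to avoid the node-initialization spike mentioned above.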
 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
 
-TBD needed for Beta
+No.
 
 ### Dependencies
@@ -845,7 +882,8 @@ Reduce the number of cache misses (as seen through the metrics) by ensuring simi
 
 ## Implementation History
 
-TBD
+- [x] 2024-10-08 - Reworked version of the KEP merged, accepted as implementable
@@ -860,7 +898,9 @@ ensure the image instead of kubelet.
 
 - For beta, we may want to consider deleting cached credentials upon Kubernetes secret / namespace deletion.
 - Discussions went back and forth as to whether to persist the cache across reboots. It was decided to do so.
 - `Never` could always be allowed to use an image on the node, regardless of its presence on the node. However, this would functionally disable this feature from a security standpoint.
+- Consider incorporating the ":latest" or a missing tag as a special case of why an image was requested
+  to be pulled in the `kubelet_imagemanager_ensure_image_requests_total` metric.