keps/sig-node/2535-ensure-secret-pulled-images/README.md
Lines changed: 67 additions & 24 deletions
@@ -738,13 +738,32 @@ the behavior of the pull policies will revert to the previous behavior.

 ###### What specific metrics should inform a rollback?

-If the feature gate is enabled, the kubelet will gather metrics `image_pull_secret_recheck_miss` and
-`image_pull_secret_recheck_hit` which are both histograms counting the number of images that had a cache miss/hit.
-
-This will allow an admin to see how many images have authorization checks done.
-
-A histogram was chosen to allow an admin to compare registry uptime with cache misses, as the main failure scenario is registry unavailability
-could cause pods not to come up, because the kubelet doesn't have credentials cached.
+Enabling the feature will make the kubelet expose several metrics:
+- each caching layer should provide the following:
+  - `<cache_name>_<pullintents/pulledrecords>_total` gauge for the number of records (e.g. files, keys of a map)
+    that are kept in the cache
+- `image_mustpull_checks_total` counter vector tracks how many times the check for
+  image pull credentials was called
+  - labels:
+    - `result`:
+      - "credentialPolicyAllowed" for cases where the kubelet's credential verification pull policy allows
+        access to the image (e.g. via allowlist or to an image pre-pulled to the node)
+      - "credentialRecordFound" when a matching credential record was found in the cache positively verifying access to an image
+      - "mustAuthenticate" when an additional check is required to verify access to the image, normally done by
+        authentication and verifying manifest access at a container registry for the image when the layers are already local
+      - "error" in error cases
+- the `ensure_image_requests_total` counter vector tracks how many times EnsureImageExists()
+  was called
+  - labels:
+    - `pull_policy` - either of "never", "ifnotpresent" or "always", according to container image pull policy
+    - `present_locally` - "true" if the image is present on the node per container runtime, "false" if it isn't,
+      "unknown" if there was an error before this was determined
+    - `pull_required` - "true" if an image was requested to be pulled (e.g. from credential reverification
+      mechanism or by the ImagePullPolicy from the Pod), "false" if not,
+      "unknown" if there was an error before this was determined
+
+This will allow an admin to see how many reverification checks are being requested for existing images and how
+many requests make it all the way to the persistent cache.

 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

@@ -758,37 +777,58 @@ No

 ###### How can an operator determine if the feature is in use by workloads?

-When the feature is enabled, the kubelet will emit a metric `image_pull_secret_recheck_miss` and `image_pull_secret_recheck_hit` that will happen when a cache miss happens.
-This will happen regardless of whether the feature is enabled in the kubelet via its configuration flag.
+When the feature is enabled, the kubelet will emit a metric named `image_mustpull_checks_total`.

-To determine if the feature is actually working, they will have to check manually.
+Admins can also check the node's filesystem, where a directory `image_manager` with subdirectories
+`pulling` and `pulled` should be present in the kubelet's main directory.

-A user could check if images pulled with credentials by a first pod, are also pulled with credentials by a second pod that is
-using the pull if not present image pull policy.
-
-It also will show up as network events. Though only the manifests will be revalidated against the container image repository,
-large contents will not be pulled. Thus one could monitor traffic to the registry.
+If the feature was used by at least one workload that successfully started and is running a container with
+a non-preloaded image (based on the policy), they should be able to find a file with a matching record of a
+pulled image in the `<kubelet dir>/image_manager/pulled` directory at the node that
+is running the workload's pod. The filename structure for these directories is described
+in [Cache Directory Structure](#cache-directory-structure).
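
The filesystem check described in the added lines can be scripted. The following is a minimal sketch, not part of the KEP: the helper name `inspect_image_manager` and the shape of its return value are hypothetical, and the kubelet directory has to be supplied by the caller (it is commonly `/var/lib/kubelet`, but that is an assumption for any given node). It only reports whether the `pulling` and `pulled` subdirectories exist and how many files each holds; it makes no assumption about the record filename structure.

```python
import tempfile
from pathlib import Path


def inspect_image_manager(kubelet_dir):
    """Report, per subdirectory, the number of files found or None if it is missing.

    Hypothetical helper: only the directory names `image_manager`, `pulling`
    and `pulled` are taken from the KEP text.
    """
    base = Path(kubelet_dir) / "image_manager"
    report = {}
    for name in ("pulling", "pulled"):
        sub = base / name
        if sub.is_dir():
            # Count regular files only; each file is assumed to be one record.
            report[name] = sum(1 for entry in sub.iterdir() if entry.is_file())
        else:
            report[name] = None  # directory absent: feature likely not in use
    return report


if __name__ == "__main__":
    # Demonstration against a throwaway directory instead of a real node.
    with tempfile.TemporaryDirectory() as root:
        (Path(root) / "image_manager" / "pulling").mkdir(parents=True)
        pulled = Path(root) / "image_manager" / "pulled"
        pulled.mkdir()
        (pulled / "example-record").write_text("{}")
        print(inspect_image_manager(root))
```

A `None` entry suggests the feature has not created its directories on that node, while a non-zero count under `pulled` suggests at least one workload has had an image pull recorded.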

 ###### How can someone using this feature know that it is working for their instance?

-Can test for an image pull failure event coming from a second pod that does not have credentials to pull the image
-where the image is present and the image pull policy is if not present.
+Users are able to observe events in case workloads in their namespaces
+were successfully able to retrieve an image that was previously pulled to a node.

 - [x] Events
-  - Event Reason: "kubelet Failed to pull image" ... "unexpected status code [manifests ...]: 401 Unauthorized"
-
+  - Event message: "Container image ... already present on machine **and can be accessed by the pod**"

 ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

-TBD
+Since the kubelet does not provide image pull success rate at the time of writing,
+the SLO for the feature should be based on the remote registries' image pull success
+rate.
+
+The use of the feature may result in an increased rate of image pull requests
+when compared to the old use of the "IfNotPresent" pod image pull policy. Given
+how image pulling works, depending on the container registry authorization implementation,
+this might mean an increase of 401 (Unauthorized), 403 (Forbidden) or 404 (NotFound) errors
+but these should be directly proportional to the number of successful pulls.
+
+A disproportionate increase in the number of unsuccessful pulls would suggest misuse
+of pods' "IfNotPresent" image pull policy within the cluster.

 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

-TBD
+These metrics might spike when the node is getting initialized and so
+it makes sense to observe these in a later stage of the node's lifetime.
+should usually be around 0, indicates the number of tracked in-flight image pulls.
+- Metric name: if `kubelet_imagemanager_ensure_image_requests_total{pull_policy!="always", present_locally="true",pull_required="true"}` (i.e. "reverifies requested") gets
+  too close to `kubelet_imagemanager_ensure_image_requests_total{pull_policy!="always", present_locally!="unknown",pull_required!="unknown"}` (i.e. "all non-error image requests")
+  this might suggest either a) credentials are not very commonly shared within the cluster;
+  or b) there is a problem with the credential-tracking cache. To consider the values
+  too close, "reverifies requested" / "all non-error image requests" would be above 0.9.
+- Components exposing the metric: kubelet
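
The ratio in the SLI above can be worked through on sample data. This is a minimal sketch, not taken from the KEP: the function name `reverify_ratio`, the `(labels, value)` input shape, and the sample numbers are all hypothetical. The label selectors and the 0.9 threshold mirror the text. Values are assumed to be counter increases over the same observation window.

```python
def reverify_ratio(samples, threshold=0.9):
    """Return (ratio, too_close) for "reverifies requested" / "all non-error image requests".

    `samples` is an iterable of (labels, value) pairs, where `labels` is a dict
    with the keys pull_policy, present_locally and pull_required.
    Returns (None, False) when there are no non-error requests to compare against.
    """
    reverifies = 0.0
    non_error = 0.0
    for labels, value in samples:
        if labels["pull_policy"] == "always":
            continue  # both selectors exclude pull_policy="always"
        if labels["present_locally"] != "unknown" and labels["pull_required"] != "unknown":
            non_error += value
        if labels["present_locally"] == "true" and labels["pull_required"] == "true":
            reverifies += value
    if non_error == 0:
        return None, False
    ratio = reverifies / non_error
    return ratio, ratio > threshold


if __name__ == "__main__":
    # Made-up sample values for illustration only.
    samples = [
        ({"pull_policy": "ifnotpresent", "present_locally": "true", "pull_required": "true"}, 95),
        ({"pull_policy": "ifnotpresent", "present_locally": "true", "pull_required": "false"}, 5),
        ({"pull_policy": "always", "present_locally": "true", "pull_required": "true"}, 40),
        ({"pull_policy": "never", "present_locally": "unknown", "pull_required": "unknown"}, 3),
    ]
    print(reverify_ratio(samples))  # (0.95, True)
```

With these sample values, 95 of 100 non-error requests asked for reverification, so the ratio is 0.95 and exceeds the 0.9 threshold. The `always` and `unknown` series are excluded, as the selectors require.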

 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?

-TBD needed for Beta
+No.

 ### Dependencies

@@ -845,7 +885,8 @@ Reduce the number of cache misses (as seen through the metrics) by ensuring simi

 ## Implementation History

-TBD
+- [x] 2024-10-08 - Reworked version of the KEP merged, accepted as implementable
@@ -860,7 +901,9 @@ ensure the image instead of kubelet.
 - For beta, we may want to consider deleting cached credentials upon Kubernetes secret / namespace deletion.
 - Discussions went back and forth as to whether to persist the cache across reboots. It was decided to do so.
 - `Never` could be always allowed to use an image on the node, regardless of its presence on the node. However, this would functionally disable this feature from a security standpoint.
+- Consider incorporating the ":latest" or a missing tag as a special case of why an image was requested
+  to be pulled in the `kubelet_imagemanager_ensure_image_requests_total` metric.