Commit c8bb504

kep-2535: discuss metrics and PRR topics
Signed-off-by: Stanislav Láznička <[email protected]>
1 parent b0de622 commit c8bb504

2 files changed

+73
-24
lines changed

keps/sig-node/2535-ensure-secret-pulled-images/README.md

Lines changed: 67 additions & 24 deletions
Original file line number | Diff line number | Diff line change
@@ -738,13 +738,32 @@ the behavior of the pull policies will revert to the previous behavior.
738738

739739
###### What specific metrics should inform a rollback?
740740

741-
If the feature gate is enabled, the kubelet will gather metrics `image_pull_secret_recheck_miss` and
742-
`image_pull_secret_recheck_hit` which are both histograms counting the number of images that had a cache miss/hit.
743-
744-
This will allow an admin to see how many images have authorization checks done.
745-
746-
A histogram was chosen to allow an admin to compare registry uptime with cache misses, as the main failure scenerio is registry unavailability
747-
could cause pods not to come up, because the kubelet doesn't have credentials cached.
741+
Enabling the feature will make the kubelet expose several metrics:
742+
- each caching layers should provide the following:
743+
- `<cache_name>_<pullintents/pulledrecords>_total` gauge for the number of records (e.g. files, keys of a map)
744+
that are kept in the cache
745+
- `image_mustpull_checks_total` counter vector tracks how many times the check for
746+
image pull credentials was called
747+
- labels:
748+
- `result`:
749+
- "credentialPolicyAllowed" for cases where the kubelet's credential verification pull policy allows
750+
access to the image (e.g. via allowlist or to an image pre-pulled to the node)
751+
- "credentialRecordFound" when a matching credential record was found in the cache positively verifying access to an image
752+
- "mustAuthenticate" when an additional check is required to verify access to the image, normally done by
753+
authentication and verifying manifest access at a container registry for the image when the layers are already local
754+
- "error" in error cases
755+
- the `ensure_image_requests_total` counter vector tracks how many times EnsureImageExists()
756+
was called
757+
- labels:
758+
- `pull_policy` - either of "never", "ifnotpresent" or "always", according to container image pull policy
759+
- `present_locally` - "true" if the image is present on the node per container runtime, "false" if it isn't,
760+
"unknown" if there was an error before this was determined
761+
- `pull_required` - "true" if an image was requested to be pulled (e.g. from credential reverification
762+
mechanism or by the ImagePullPolicy from the Pod), "false" if not,
763+
"unknown" if there was an error before this was determined
764+
765+
This will allow an admin to see how many reverification checks are being requested for existing images and how
766+
many requests make it all the way to the persistent cache.
748767

749768
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
750769

@@ -758,37 +777,58 @@ No
758777

759778
###### How can an operator determine if the feature is in use by workloads?
760779

761-
When the feature is enabled, the kubelet will emit a metric `image_pull_secret_recheck_miss` and `image_pull_secret_recheck_hit` that will happen when a cache miss happens.
762-
This will happen regardless of whether the feature is enabled in the kubelet via its configuration flag.
780+
When the feature is enabled, the kubelet will emit a metric named `image_mustpull_checks_total`.
763781

764-
To determine if the feature is actually working, they will have to check manually.
782+
Admins can also check the node's filesystem, where a directory `image_manager` with subdirectories
783+
`pulling` and `pulled` should be present in the kubelet's main directory.
765784

766-
A user could check if images pulled with credentials by a first pod, are also pulled with credentials by a second pod that is
767-
using the pull if not present image pull policy.
768-
769-
It also will show up as network events. Though only the manifests will be revalidated against the container image repository,
770-
large contents will not be pulled. Thus one could monitor traffic to the registry.
785+
If the feature was used by at least one workload that successfully started and is running a container with
786+
a non-preloaded image (based on the policy), they should be able to find a file with a matching record of a
787+
pulled image in the `<kubelet dir>/image_manager/pulled` directory at the node that
788+
is running the workload's pod. The filename structure for these directories is described
789+
in [Cache Directory Structure](#cache-directory-structure).
771790

772791
###### How can someone using this feature know that it is working for their instance?
773792

774-
Can test for an image pull failure event coming from a second pod that does not have credentials to pull the image
775-
where the image is present and the image pull policy is if not present.
793+
Users are able to observe events in case workloads in their namespaces
794+
were successfully able to retrieve an image that was previously pulled to a node.
776795

777796
- [x] Events
778-
- Event Reason: "kubelet Failed to pull image" ... "unexpected status code [manifests ...]: 401 Unauthorized"
779-
797+
- Event message: "Container image ... already present on machine **and can be accessed by the pod**"
780798

781799
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
782800

783-
TBD
801+
Since the kubelet does not provide image pull success rate at the time of writing,
802+
the SLO for the feature should be based on the remote registries image pull success
803+
rate.
804+
805+
The use of the feature may result in an increased rate of image pull requests
806+
when compared to the old use of the "IfNotPresent" pod image pull policy. Given
807+
how image pulling works, depending on the container registry authorization implementation,
808+
this might mean an increase of 401 (Unauthorized), 403 (Forbidden) or 404 (NotFound) errors
809+
but these should be directly proportional to the number of successful pulls.
810+
811+
A disproportionate increase in the number of unsuccessful pulls would suggest misuse
812+
of pods' "IfNotPresent" image pull policy within the cluster.
784813

785814
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
786815

787-
TBD
816+
These metrics might spike when the node is getting initialized and so
817+
it makes sense to observe these in a later stage of the node's lifetime.
818+
819+
- [x] Metrics
820+
- Metric name: `kubelet_imagemanager_ondisk_pullintents_total`, `kubelet_imagemanager_inmemory_pullintents_total`
821+
should usually be around 0, indicates the number of tracked in-flight image pulls.
822+
- Metric name: if `kubelet_imagemanager_ensure_image_requests_total{pull_policy!="always", present_locally="true",pull_required="true"}` (i.e. "reverifies requested") gets
823+
too close to `kubelet_imagemanager_ensure_image_requests_total{pull_policy!="always", present_locally!="unknown",pull_required!="unknown"}` (i.e. "all non-error image requests")
824+
this might suggest either a) credentials are not very commonly shared within the cluster;
825+
or b) there is a problem with the credential-tracking cache. To consider the values
826+
too close, "reverifies requested" / "all non-error image requests" would be above 0.9.
827+
- Components exposing the metric: kubelet
788828

789829
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
790830

791-
TBD needed for Beta
831+
No.
792832

793833
### Dependencies
794834

@@ -845,7 +885,8 @@ Reduce the number of cache misses (as seen through the metrics) by ensuring simi
845885

846886
## Implementation History
847887

848-
TBD
888+
- [x] 2024-10-08 - Reworked version of the KEP merged, accepted as implementable
889+
- [x] 2025-03-17 - Alpha implementation merged - https://github.com/kubernetes/kubernetes/pull/128152
849890

850891
## Drawbacks [optional]
851892

@@ -860,7 +901,9 @@ ensure the image instead of kubelet.
860901
- For beta, we may want to consider deleting cached credentials upon Kubernetes secret / namespace deletion.
861902
- Discussions went back and forth as to whether to persist the cache across reboots. It was decided to do so.
862903
- `Never` could be always allowed to use an image on the node, regardless of its presence on the node. However, this would functionally disable this feature from a security standpoint.
904+
- Consider incorporating the ":latest" or a missing tag as a special case of why an image was requested
905+
to be pulled in the `kubelet_imagemanager_ensure_image_requests_total` metric.
863906

864907
## Infrastructure Needed [optional]
865908

866-
TBD
909+
--

keps/sig-node/2535-ensure-secret-pulled-images/kep.yaml

Lines changed: 6 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -27,3 +27,9 @@ feature-gates:
2727
- kubelet
2828
disable-supported: true
2929
metrics:
30+
- kubelet_imagemanager_ondisk_pullintents_total
31+
- kubelet_imagemanager_ondisk_pulledrecords_total
32+
- kubelet_imagemanager_inmemory_pullintents_total
33+
- kubelet_imagemanager_inmemory_pulledrecords_total
34+
- kubelet_imagemanager_ensure_image_requests_total{pull_policy,present_locally,pull_required}
35+
- kubelet_imagemanager_image_mustpull_checks_total{result}
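
The SLI added to the README compares "reverifies requested" with "all non-error image requests" and treats a ratio above 0.9 as too close. The following is a minimal sketch of that arithmetic only, assuming the counter values have already been scraped into plain records. The `Sample` type, the `reverify_ratio` helper, and the sample values used with it are hypothetical and not part of the kubelet or the KEP; the label names and the 0.9 threshold come from the README text.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Threshold taken from the README text: a ratio above 0.9 is "too close".
TOO_CLOSE_THRESHOLD = 0.9


@dataclass(frozen=True)
class Sample:
    """One hypothetical scraped series of
    kubelet_imagemanager_ensure_image_requests_total."""
    pull_policy: str       # "never", "ifnotpresent" or "always"
    present_locally: str   # "true", "false" or "unknown"
    pull_required: str     # "true", "false" or "unknown"
    value: float           # counter value (or its increase over a window)


def reverify_ratio(samples: Iterable[Sample]) -> Optional[float]:
    """Return "reverifies requested" / "all non-error image requests".

    Returns None when there are no non-error requests to divide by.
    """
    reverifies = 0.0
    non_error = 0.0
    for s in samples:
        # Both selectors in the README exclude pull_policy="always".
        if s.pull_policy == "always":
            continue
        # "unknown" marks an error before the value was determined.
        if s.present_locally == "unknown" or s.pull_required == "unknown":
            continue
        non_error += s.value
        if s.present_locally == "true" and s.pull_required == "true":
            reverifies += s.value
    if non_error == 0:
        return None
    return reverifies / non_error


def is_too_close(ratio: Optional[float]) -> bool:
    """Apply the 0.9 threshold; no data is not treated as a problem."""
    return ratio is not None and ratio > TOO_CLOSE_THRESHOLD
```

Because the underlying metric is a counter, an operator would more likely feed this calculation the increase over a time window than the raw cumulative values; that choice is an assumption here and is not specified in the KEP text.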
