Skip to content

Commit e740dad

Browse files
authored
Merge pull request #5371 from stlaz/ensure-metrics
KEP-2535: (Ensure Secret Pulled Images): add metrics, bump to beta in 1.34
2 parents 849e2af + 38bf786 commit e740dad

File tree

3 files changed

+93
-42
lines changed

3 files changed

+93
-42
lines changed
Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,5 @@
11
kep-number: 2535
22
alpha:
33
approver: "@jpbetz"
4+
beta:
5+
approver: "@jpbetz"

keps/sig-node/2535-ensure-secret-pulled-images/README.md

Lines changed: 80 additions & 40 deletions
Original file line numberDiff line numberDiff line change
@@ -31,7 +31,6 @@
3131
- [Graduation Criteria](#graduation-criteria)
3232
- [Alpha](#alpha)
3333
- [Beta](#beta)
34-
- [Deprecation](#deprecation)
3534
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
3635
- [Version Skew Strategy](#version-skew-strategy)
3736
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
@@ -51,20 +50,20 @@
5150

5251
Items marked with (R) are required *prior to targeting to a milestone / release*.
5352

54-
- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
55-
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
56-
- [ ] (R) Design details are appropriately documented
57-
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
53+
- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
54+
- [x] (R) KEP approvers have approved the KEP status as `implementable`
55+
- [x] (R) Design details are appropriately documented
56+
- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
5857
- [ ] e2e Tests for all Beta API Operations (endpoints)
5958
- [ ] (R) Ensure GA e2e tests for meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
6059
- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
61-
- [ ] (R) Graduation criteria is in place
60+
- [x] (R) Graduation criteria is in place
6261
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
6362
- [ ] (R) Production readiness review completed
6463
- [ ] (R) Production readiness review approved
65-
- [ ] "Implementation History" section is up-to-date for milestone
66-
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
67-
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
64+
- [x] "Implementation History" section is up-to-date for milestone
65+
- [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
66+
- [x] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
6867

6968
[kubernetes.io]: https://kubernetes.io/
7069
[kubernetes/enhancements]: https://git.k8s.io/enhancements
@@ -671,17 +670,15 @@ At beta we will revisit if e2e buckets are warranted for e2e node, and after gat
671670
- Kubelet cache is duplicated in memory and on disk. This is more resilient to I/O
672671
operation errors when reading from disk.
673672

674-
#### Deprecation
675-
676-
N/A in alpha
677-
TBD subsequent to alpha
678-
679673
### Upgrade / Downgrade Strategy
680674

681675
### Version Skew Strategy
682676

683-
N/A for alpha
684-
TBD subsequent to alpha
677+
Version skew between the versions of the on-disk cache records should be handled
678+
during initialization of the cache by performing a storage migration (e.g. migrating
679+
the records from v1alpha1 to v1beta1).
680+
681+
The kubelet should be able to read cache records within the supported kubelet skew.
685682

686683
## Production Readiness Review Questionnaire
687684

@@ -738,13 +735,32 @@ the behavior of the pull policies will revert to the previous behavior.
738735

739736
###### What specific metrics should inform a rollback?
740737

741-
If the feature gate is enabled, the kubelet will gather metrics `image_pull_secret_recheck_miss` and
742-
`image_pull_secret_recheck_hit` which are both histograms counting the number of images that had a cache miss/hit.
743-
744-
This will allow an admin to see how many images have authorization checks done.
745-
746-
A histogram was chosen to allow an admin to compare registry uptime with cache misses, as the main failure scenerio is registry unavailability
747-
could cause pods not to come up, because the kubelet doesn't have credentials cached.
738+
Enabling the feature will make the kubelet expose several metrics:
739+
- each caching layers should provide the following:
740+
- `<cache_name>_<pullintents/pulledrecords>_total` gauge for the number of records (e.g. files, keys of a map)
741+
that are kept in the cache
742+
- `image_mustpull_checks_total` counter vector tracks how many times the check for
743+
image pull credentials was called
744+
- labels:
745+
- `result`:
746+
- "credentialPolicyAllowed" for cases where the kubelet's credential verification pull policy allows
747+
access to the image (e.g. via allowlist or to an image pre-pulled to the node)
748+
- "credentialRecordFound" when a matching credential record was found in the cache positively verifying access to an image
749+
- "mustAuthenticate" when an additional check is required to verify access to the image, normally done by
750+
authentication and verifying manifest access at a container registry for the image when the layers are already local
751+
- "error" in error cases
752+
- the `ensure_image_requests_total` counter vector tracks how many times EnsureImageExists()
753+
was called
754+
- labels:
755+
- `pull_policy` - either of "never", "ifnotpresent" or "always", according to container image pull policy
756+
- `present_locally` - "true" if the image is present on the node per container runtime, "false" if it isn't,
757+
"unknown" if there was an error before this was determined
758+
- `pull_required` - "true" if an image was requested to be pulled (e.g. from credential reverification
759+
mechanism or by the ImagePullPolicy from the Pod), "false" if not,
760+
"unknown" if there was an error before this was determined
761+
762+
This will allow an admin to see how many reverification checks are being requested for existing images and how
763+
many requests make it all the way to the persistent cache.
748764

749765
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
750766

@@ -758,37 +774,58 @@ No
758774

759775
###### How can an operator determine if the feature is in use by workloads?
760776

761-
When the feature is enabled, the kubelet will emit a metric `image_pull_secret_recheck_miss` and `image_pull_secret_recheck_hit` that will happen when a cache miss happens.
762-
This will happen regardless of whether the feature is enabled in the kubelet via its configuration flag.
763-
764-
To determine if the feature is actually working, they will have to check manually.
777+
When the feature is enabled, the kubelet will emit a metric named `image_mustpull_checks_total`.
765778

766-
A user could check if images pulled with credentials by a first pod, are also pulled with credentials by a second pod that is
767-
using the pull if not present image pull policy.
779+
Admins can also check the node's filesystem, where a directory `image_manager` with subdirectories
780+
`pulling` and `pulled` should be present in the kubelet's main directory.
768781

769-
It also will show up as network events. Though only the manifests will be revalidated against the container image repository,
770-
large contents will not be pulled. Thus one could monitor traffic to the registry.
782+
If the feature was used by at least one workload that successfully started and is running a container with
783+
a non-preloaded image (based on the policy), they should be able to find a file with a matching record of a
784+
pulled image in the `<kubelet dir>/image_manager/pulled` directory at the node that
785+
is running the workload's pod. The filename structure for these directories is described
786+
in [Cache Directory Structure](#cache-directory-structure).
771787

772788
###### How can someone using this feature know that it is working for their instance?
773789

774-
Can test for an image pull failure event coming from a second pod that does not have credentials to pull the image
775-
where the image is present and the image pull policy is if not present.
790+
Users are able to observe events in case workloads in their namespaces
791+
were successfully able to retrieve an image that was previously pulled to a node.
776792

777793
- [x] Events
778-
- Event Reason: "kubelet Failed to pull image" ... "unexpected status code [manifests ...]: 401 Unauthorized"
779-
794+
- Event message: "Container image ... already present on machine **and can be accessed by the pod**"
780795

781796
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
782797

783-
TBD
798+
Since the kubelet does not provide image pull success rate at the time of writing,
799+
the SLO for the feature should be based on the remote registries image pull success
800+
rate.
801+
802+
The use of the feature may result in an increased rate of image pull requests
803+
when compared to the old use of the "IfNotPresent" pod image pull policy. Given
804+
how image pulling works, depending on the container registry authorization implementation,
805+
this might mean an increase of 401 (Unauthorized), 403 (Forbidden) or 404 (NotFound) errors
806+
but these should be directly proportional to the number of successful pulls.
807+
808+
A disproportionate increase in the number of unsuccessful pulls would suggest misuse
809+
of pods' "IfNotPresent" image pull policy within the cluster.
784810

785811
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
786812

787-
TBD
813+
These metrics might spike when the node is getting initialized and so
814+
it makes sense to observe these in a later stage of the node's lifetime.
815+
816+
- [x] Metrics
817+
- Metric name: `kubelet_imagemanager_ondisk_pullintents_total`, `kubelet_imagemanager_inmemory_pullintents_total`
818+
should usually be around 0, indicates the number of tracked in-flight image pulls.
819+
- Metric name: if `kubelet_imagemanager_ensure_image_requests_total{pull_policy!="always", present_locally="true",pull_required="true"}` (i.e. "reverifies requested") gets
820+
too close to `kubelet_imagemanager_ensure_image_requests_total{pull_policy!="always", present_locally!="unknown",pull_required!="unknown"}` (i.e. "all non-error image requests")
821+
this might suggest either a) credentials are not very commonly shared within the cluster;
822+
or b) there is a problem with the credential-tracking cache. To consider the values
823+
too close, "reverifies requested" / "all non-error image requests" would be above 0.9.
824+
- Components exposing the metric: kubelet
788825

789826
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
790827

791-
TBD needed for Beta
828+
No.
792829

793830
### Dependencies
794831

@@ -845,7 +882,8 @@ Reduce the number of cache misses (as seen through the metrics) by ensuring simi
845882

846883
## Implementation History
847884

848-
TBD
885+
- [x] 2024-10-08 - Reworked version of the KEP merged, accepted as implementable
886+
- [x] 2025-03-17 - Alpha implementation merged - https://github.com/kubernetes/kubernetes/pull/128152
849887

850888
## Drawbacks [optional]
851889

@@ -860,7 +898,9 @@ ensure the image instead of kubelet.
860898
- For beta, we may want to consider deleting cached credentials upon Kubernetes secret / namespace deletion.
861899
- Discussions went back and forth as to whether to persist the cache across reboots. It was decided to do so.
862900
- `Never` could be always allowed to use an image on the node, regardless of its presence on the node. However, this would functionally disable this feature from a security standpoint.
901+
- Consider incorporating the ":latest" or a missing tag as a special case of why an image was requested
902+
to be pulled in the `kubelet_imagemanager_ensure_image_requests_total` metric.
863903

864904
## Infrastructure Needed [optional]
865905

866-
TBD
906+
--

keps/sig-node/2535-ensure-secret-pulled-images/kep.yaml

Lines changed: 11 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,8 @@ authors:
66
- "@haircommander"
77
- "@stlaz"
88
owning-sig: sig-node
9+
participating-sigs:
10+
- sig-auth
911
status: implementable
1012
creation-date: 2021-03-10
1113
reviewers:
@@ -17,13 +19,20 @@ reviewers:
1719
approvers:
1820
- "@dchen1107"
1921
- "@derekwaynecarr"
20-
stage: alpha
21-
latest-milestone: "v1.33"
22+
stage: beta
23+
latest-milestone: "v1.34"
2224
milestone:
2325
alpha: "v1.33"
26+
beta: "v1.34"
2427
feature-gates:
2528
- name: KubeletEnsureSecretPulledImages
2629
components:
2730
- kubelet
2831
disable-supported: true
2932
metrics:
33+
- kubelet_imagemanager_ondisk_pullintents_total
34+
- kubelet_imagemanager_ondisk_pulledrecords_total
35+
- kubelet_imagemanager_inmemory_pullintents_total
36+
- kubelet_imagemanager_inmemory_pulledrecords_total
37+
- kubelet_imagemanager_ensure_image_requests_total{pull_policy,present_locally,pull_required}
38+
- kubelet_imagemanager_image_mustpull_checks_total{result}

0 commit comments

Comments
 (0)