Commit 6193b90

Merge pull request #5374 from aramase/aramase/d/kep_4412_beta_v1.34
KEP-4412: promote wi for image pulls to beta in v1.34
2 parents 8dadcec + be7f7d1 commit 6193b90

File tree

3 files changed: +86 -39 lines changed
  • keps
    • prod-readiness/sig-auth
    • sig-auth/4412-projected-service-account-tokens-for-kubelet-image-credential-providers

Lines changed: 2 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -1,3 +1,5 @@
11
kep-number: 4412
22
alpha:
33
approver: "deads2k"
4+
beta:
5+
approver: "deads2k"

keps/sig-auth/4412-projected-service-account-tokens-for-kubelet-image-credential-providers/README.md

Lines changed: 79 additions & 35 deletions
@@ -104,7 +104,6 @@ tags, and then generate with `hack/update-toc.sh`.
104104
- [e2e tests](#e2e-tests)
105105
- [Graduation Criteria](#graduation-criteria)
106106
- [Alpha](#alpha)
107-
- [Post Alpha](#post-alpha)
108107
- [Beta](#beta)
109108
- [GA](#ga)
110109
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
@@ -141,10 +140,10 @@ checklist items _must_ be updated for the enhancement to be released.
141140

142141
Items marked with (R) are required *prior to targeting to a milestone / release*.
143142

144-
- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
145-
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
146-
- [ ] (R) Design details are appropriately documented
147-
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
143+
- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
144+
- [x] (R) KEP approvers have approved the KEP status as `implementable`
145+
- [x] (R) Design details are appropriately documented
146+
- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
148147
- [ ] e2e Tests for all Beta API Operations (endpoints)
149148
- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
150149
- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
@@ -687,6 +686,13 @@ For Beta and GA, add links to added tests together with links to k8s-triage for
687686
https://storage.googleapis.com/k8s-triage/index.html
688687
-->
689688

689+
This kubelet feature is fully tested with unit and e2e tests.
690+
691+
For the node audience restriction changes in KAS, integration tests were added as part of the [implementation in the v1.32 release](https://github.com/kubernetes/kubernetes/pull/128077).
692+
693+
- [test/integration/auth/node_test.go](https://github.com/kubernetes/kubernetes/blob/master/test/integration/auth/node_test.go)
694+
- [triage history](https://storage.googleapis.com/k8s-triage/index.html?text=TestNodeRestrictionServiceAccountAudience&test=test%2Fintegration%2Fauth)
695+
690696
##### e2e tests
691697

692698
<!--
@@ -699,6 +705,18 @@ https://storage.googleapis.com/k8s-triage/index.html
699705
We expect no non-infra related flakes in the last month as a GA graduation criteria.
700706
-->
701707

708+
There is an existing e2e test for kubelet credential providers that uses the GCP credential provider.
709+
710+
- test/e2e_node/image_credential_provider.go: https://testgrid.k8s.io/sig-node-kubelet#kubelet-credential-provider
711+
712+
As part of the alpha implementation, the [e2e test has been updated](https://github.com/kubernetes/kubernetes/commit/2090a01e0a495301432276216bbf9af102fc431c) to cover the new credential provider configuration and the new behavior of the kubelet when the `TokenAttributes` field is set.
713+
714+
We created a symlink, under a different name, to the existing GCP credential provider executable, and use it to test service account tokens for credential providers. The credential provider has been updated to validate the following when the plugin runs in service account token mode:
715+
716+
1. Check the required annotations are sent as part of the `CredentialProviderRequest.ServiceAccountAnnotations` field.
717+
2. Check the service account token is sent as part of the `CredentialProviderRequest.ServiceAccountToken` field.
718+
3. Extract the claims from the service account token and validate the audience claim matches the `ServiceAccountTokenAudience` field in the kubelet's credential provider configuration.
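The three checks above can be sketched in Go roughly as follows. This is a minimal illustration, not the code of the e2e test plugin: the `request` struct is a hypothetical stand-in that only mirrors the two `CredentialProviderRequest` field names mentioned above, the annotation key and audience value are made up, and the token payload is decoded without signature verification.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// request is a hypothetical stand-in for the two CredentialProviderRequest
// fields named in the list above; it is not the real kubelet API type.
type request struct {
	ServiceAccountToken       string
	ServiceAccountAnnotations map[string]string
}

// audiences decodes the JWT payload WITHOUT verifying the signature and
// returns the "aud" claim, accepting either a string or a list of strings.
func audiences(token string) ([]string, error) {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return nil, errors.New("token is not a three-part JWT")
	}
	payload, err := base64.RawURLEncoding.DecodeString(parts[1])
	if err != nil {
		return nil, fmt.Errorf("decoding payload: %w", err)
	}
	var claims struct {
		Aud json.RawMessage `json:"aud"`
	}
	if err := json.Unmarshal(payload, &claims); err != nil {
		return nil, fmt.Errorf("parsing claims: %w", err)
	}
	var list []string
	if err := json.Unmarshal(claims.Aud, &list); err == nil {
		return list, nil
	}
	var single string
	if err := json.Unmarshal(claims.Aud, &single); err != nil {
		return nil, errors.New("aud claim is neither a string nor a list")
	}
	return []string{single}, nil
}

// validate runs the three checks: annotations present, token present,
// and the expected audience contained in the token's "aud" claim.
func validate(req request, requiredAnnotations []string, wantAudience string) error {
	for _, key := range requiredAnnotations {
		if _, ok := req.ServiceAccountAnnotations[key]; !ok {
			return fmt.Errorf("missing required annotation %q", key)
		}
	}
	if req.ServiceAccountToken == "" {
		return errors.New("service account token is empty")
	}
	auds, err := audiences(req.ServiceAccountToken)
	if err != nil {
		return err
	}
	for _, a := range auds {
		if a == wantAudience {
			return nil
		}
	}
	return fmt.Errorf("audience %q not in token audiences %v", wantAudience, auds)
}

// fakeToken builds an unsigned, JWT-shaped string for the demo below.
func fakeToken(aud string) string {
	enc := base64.RawURLEncoding.EncodeToString
	payload, _ := json.Marshal(map[string][]string{"aud": {aud}})
	return enc([]byte(`{"alg":"none"}`)) + "." + enc(payload) + ".sig"
}

func main() {
	req := request{
		ServiceAccountToken:       fakeToken("registry.example.com"),
		ServiceAccountAnnotations: map[string]string{"example.com/registry-id": "123"},
	}
	err := validate(req, []string{"example.com/registry-id"}, "registry.example.com")
	fmt.Println("validation error:", err)
}
```

A real plugin would normally hand the token to the registry or cloud provider, which verifies the signature itself; decoding only the payload is enough for a test that asserts what the kubelet sent.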
719+
702720
### Graduation Criteria
703721

704722
<!--
@@ -773,15 +791,13 @@ in back-to-back releases.
773791
- `ServiceAccountNodeAudienceRestriction` feature gate implemented in KAS as a beta feature
774792
- Audience validation is enabled by default for service account tokens requested by the kubelet
775793

776-
#### Post Alpha
777-
778-
- Make sure the feature is compatible with the [Ensure secret pull images KEP](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2535-ensure-secret-pulled-images).
779-
780794
#### Beta
781795

782-
- The implementation works well with the Ensure secret pull images KEP and supports pod image pull policy set to any value.
796+
- Make the feature compatible with the [Ensure secret pull images KEP](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2535-ensure-secret-pulled-images).
783797
- `ServiceAccountNodeAudienceRestriction` feature gate is beta in KAS and enabled by default. This feature needs to be beta/enabled by default at least one release before this KEP goes to beta. This is critical to support downgrade use cases.
784-
- Add metrics
798+
- Cache KSA tokens per pod and service account (pod-sa) to avoid regenerating tokens in a hot loop or for pods with multiple containers that pull images.
799+
- Some indication of whether the credentials are SA or SA+pod-scoped
800+
- whether that's indicated in the config or in the plugin-returned content, and what the default is if unspecified (defaulting to pod is less performant, defaulting to SA risks incorrect cross-pod caching)
785801

786802
#### GA
787803

@@ -875,28 +891,22 @@ well as the [existing list] of feature gates.
875891
-->
876892

877893
- [x] Feature gate (also fill in values in `kep.yaml`)
878-
- Feature gate name: `ServiceAccountTokenForKubeletCredentialProviders`
894+
- Feature gate name: `KubeletServiceAccountTokenForCredentialProviders`
879895
- Components depending on the feature gate: kubelet
880896

881-
```go
882-
FeatureSpec{
883-
Default: false,
884-
LockToDefault: false,
885-
PreRelease: featuregate.Alpha,
886-
}
887-
```
888-
889897
- [x] Feature gate (also fill in values in `kep.yaml`)
890898
- Feature gate name: `ServiceAccountNodeAudienceRestriction`
891899
- Components depending on the feature gate: kube-apiserver
892900

893-
```go
894-
FeatureSpec{
895-
Default: true,
896-
LockToDefault: false,
897-
PreRelease: featuregate.Beta,
898-
}
899-
```
901+
The purpose of the two feature gates is different, which is why they weren't named similarly.
902+
903+
The `KubeletServiceAccountTokenForCredentialProviders` feature gate is used to enable the kubelet to use service account tokens for image pull in the kubelet credential provider.
904+
905+
The `ServiceAccountNodeAudienceRestriction` feature gate is used to enable the kube-apiserver to validate the audience of the service account token requested by the kubelet. The feature gate in the Kubernetes API Server (KAS) was introduced to strictly enforce which audiences the kubelet can request tokens for. Before this change, the kubelet could request a token with any audience. With the feature gate enabled, the API server starts validating the requested audience.
906+
907+
The KAS feature gate doesn't need to be enabled for the kubelet feature to work. It graduated to beta in v1.32 and is enabled by default. The two are unrelated in functionality, but the KAS gate was necessary to ensure strict enforcement of the allowed audiences the kubelet can request tokens for.
908+
909+
If the KAS feature gate is not enabled, there will be no validation of the audience requested by the kubelet, and the kubelet will be able to request tokens for any audience. This is not recommended.
900910

901911
###### Does enabling the feature change any default behavior?
902912

@@ -933,7 +943,8 @@ Steps to disable the feature:
933943
3. Restart the kubelet.
934944

935945
These steps need to be performed on all nodes in the cluster.
936-
After restarting the kubelet on all nodes, remove the audiences used by kubelet from the KAS `--allowed-kubelet-audiences` flag.
946+
After restarting the kubelet on all nodes, revoke the audiences for which the kubelet is allowed to request service account tokens for image pulls in KAS by
947+
removing the previously created `ClusterRole` or `Role` that grants the `request-serviceaccounts-token-audience` verb, along with the corresponding `ClusterRoleBinding` or `RoleBinding` that binds the role to the kubelet.
937948

938949
###### What happens if we reenable the feature if it was previously rolled back?
939950

@@ -974,13 +985,18 @@ rollout. Similarly, consider large clusters and how enablement/disablement
974985
will rollout across nodes.
975986
-->
976987

988+
The feature is enabled but the exec plugin does not properly fetch and return credentials to the kubelet.
989+
The impact is that the kubelet cannot authenticate to those registries and pull images from them.
990+
977991
###### What specific metrics should inform a rollback?
978992

979993
<!--
980994
What signals should users be paying attention to when the feature is young
981995
that might indicate a serious problem?
982996
-->
983997

998+
High error rates from `kubelet_credential_provider_plugin_error` and long durations from `kubelet_credential_provider_plugin_duration`.
999+
9841000
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
9851001

9861002
<!--
@@ -989,12 +1005,16 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
9891005
are missing a bunch of machinery and tooling and can't do that now.
9901006
-->
9911007

1008+
No, the upgrade->downgrade->upgrade path was not tested. Manual validation will be done prior to promoting this feature to beta in v1.34.
1009+
9921010
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
9931011

9941012
<!--
9951013
Even if applying deprecation policies, they may still surprise some users.
9961014
-->
9971015

1016+
No.
1017+
9981018
### Monitoring Requirements
9991019

10001020
<!--
@@ -1004,6 +1024,10 @@ For GA, this section is required: approvers should be able to confirm the
10041024
previous answers based on experience in the field.
10051025
-->
10061026

1027+
New metrics:
1028+
1029+
- `kubelet_credential_provider_config_hash` indicates the hash of the kubelet credential provider configuration file. This metric can be used by operators to determine if the kubelet credential provider configuration has changed.
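As an illustration of how a configuration hash lets an operator detect drift, the following Go sketch hashes two versions of a config and compares the results. The use of SHA-256 and the sample config contents are assumptions made for this example; the KEP text does not say how the kubelet computes the value it exposes.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// configHash returns a hex-encoded SHA-256 digest of the raw config bytes.
// The choice of algorithm is an assumption for illustration only.
func configHash(config []byte) string {
	sum := sha256.Sum256(config)
	return hex.EncodeToString(sum[:])
}

func main() {
	// Hypothetical config contents, not a real credential provider config.
	before := []byte("providers:\n- name: example-provider\n")
	after := []byte("providers:\n- name: example-provider-v2\n")

	fmt.Println("unchanged:", configHash(before) == configHash(before))
	fmt.Println("changed:  ", configHash(before) != configHash(after))
}
```

Comparing the hash reported by each node against the expected value is one way to spot nodes that are still running an old configuration.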
1030+
10071031
###### How can an operator determine if the feature is in use by workloads?
10081032

10091033
<!--
@@ -1012,6 +1036,8 @@ checking if there are objects with field X set) may be a last resort. Avoid
10121036
logs or events for this purpose.
10131037
-->
10141038

1039+
Operators can use the `kubelet_credential_provider_config_hash` metric to determine if the kubelet credential provider configuration has changed. If the hash of the configuration file changes, it indicates that the kubelet credential provider configuration has been updated.
1040+
10151041
###### How can someone using this feature know that it is working for their instance?
10161042

10171043
<!--
@@ -1023,13 +1049,12 @@ and operation of this feature.
10231049
Recall that end users cannot usually observe component logs or access metrics.
10241050
-->
10251051

1026-
- [ ] Events
1027-
- Event Reason:
1028-
- [ ] API .status
1029-
- Condition name:
1030-
- Other field:
1031-
- [ ] Other (treat as last resort)
1032-
- Details:
1052+
Users can observe events for successful image pulls that use the service account token for image pull.
1053+
1054+
- [x] Events
1055+
- Event Reason: "Successfully pulled image "xxx" in 11.877s (11.877s including waiting). Image size: xxx bytes."
1056+
1057+
For registries or images configured to be pulled using a credential provider with a service account, a successful image pull seems to be the only way to confirm that it's working. If the credential provider is misbehaving, the kubelet will not be able to authenticate to the registry and pull images, which will result in image pull errors.
10331058

10341059
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
10351060

@@ -1048,6 +1073,15 @@ These goals will help you determine what you need to measure (SLIs) in the next
10481073
question.
10491074
-->
10501075

1076+
On failure to fetch credentials from an exec plugin, the kubelet will retry after some period and invoke the plugin again.
1077+
The kubelet will retry whenever it attempts to pull an image, but until then, kubelet will not be able to authenticate to
1078+
the registry and pull images. The SLO for successfully invoking exec plugins should be based on the SLO for successfully
1079+
pulling images for the container registry in question.
1080+
1081+
The SLOs defined in [Pod startup latency SLI/SLO details](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/pod_startup_latency.md)
1082+
don't apply to this feature because image pull SLI is explicitly excluded from the pod startup latency SLI/SLO. However, if the kubelet is unable to
1083+
pull images due to misconfiguration of the credential provider plugin, it will result in pod startup failures.
1084+
10511085
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
10521086

10531087
<!--
@@ -1093,6 +1127,8 @@ and creating new ones, as well as about cluster-level services (e.g. DNS):
10931127
- Impact of its degraded performance or high-error rates on the feature:
10941128
-->
10951129

1130+
This feature depends on the existence of a credential provider plugin binary on the host and a configuration file for the plugin to be read by the kubelet.
1131+
10961132
### Scalability
10971133

10981134
<!--
@@ -1222,6 +1258,8 @@ details). For now, we leave it here.
12221258

12231259
###### How does this feature react if the API server and/or etcd is unavailable?
12241260

1261+
If the API server is unavailable, kubelet will not be able to fetch service account tokens for image pull. The kubelet will retry fetching the token after some period, but until then, kubelet will not be able to authenticate to the registry and pull images that rely on the credential provider plugin using service account tokens for image pull.
1262+
12251263
###### What are other known failure modes?
12261264

12271265
<!--
@@ -1239,6 +1277,9 @@ For each of them, fill in the following information by copying the below templat
12391277

12401278
###### What steps should be taken if SLOs are not being met to determine the problem?
12411279

1280+
- Check the kubelet logs.
1281+
- Check the availability of the container registries used by the cluster.
1282+
12421283
## Implementation History
12431284

12441285
<!--
@@ -1252,6 +1293,9 @@ Major milestones might include:
12521293
- when the KEP was retired or superseded
12531294
-->
12541295

1296+
1.33: Alpha release
1297+
1.34: Beta release
1298+
12551299
## Drawbacks
12561300

12571301
<!--

keps/sig-auth/4412-projected-service-account-tokens-for-kubelet-image-credential-providers/kep.yaml

Lines changed: 5 additions & 4 deletions
@@ -16,17 +16,18 @@ see-also:
1616
- "/keps/sig-node/2535-ensure-secret-pulled-images"
1717
creation-date: "2024-09-09"
1818
status: implementable
19-
stage: alpha
20-
latest-milestone: "v1.33"
19+
stage: beta
20+
latest-milestone: "v1.34"
2121
milestone:
2222
alpha: "v1.33"
23+
beta: "v1.34"
2324
feature-gates:
24-
- name: ServiceAccountTokenForKubeletCredentialProviders
25+
- name: KubeletServiceAccountTokenForCredentialProviders
2526
components:
2627
- kubelet
2728
- name: ServiceAccountNodeAudienceRestriction
2829
components:
2930
- kube-apiserver
3031
disable-supported: true
3132
metrics:
32-
- "TODO"
33+
- kubelet_credential_provider_config_hash
