This kubelet feature is fully tested with unit and e2e tests.

For the node audience restriction changes in KAS, integration tests were added as part of the [implementation in the v1.32 release](https://github.com/kubernetes/kubernetes/pull/128077).
As part of the alpha implementation, the [e2e test has been updated](https://github.com/kubernetes/kubernetes/commit/2090a01e0a495301432276216bbf9af102fc431c) to cover the new credential provider configuration and the new behavior of the kubelet when the `TokenAttributes` field is set.

We created a symlink to the existing GCP credential provider executable under a different name, for use when testing service account tokens with the credential provider. The credential provider has been updated to validate the following when the plugin runs in service account token mode:

1. Check that the required annotations are sent in the `CredentialProviderRequest.ServiceAccountAnnotations` field.
2. Check that the service account token is sent in the `CredentialProviderRequest.ServiceAccountToken` field.
3. Extract the claims from the service account token and validate that the audience claim matches the `ServiceAccountTokenAudience` field in the kubelet's credential provider configuration.
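The three checks above can be sketched in Go as follows. This is a minimal illustration and not the test plugin's actual code. The `credentialProviderRequest` struct mirrors only the fields named above, and its JSON tags, the `example.com/client-id` annotation key and the `registry.example` audience are assumptions made for this sketch. The token signature is deliberately not verified, because these checks only inspect the claims. The `makeUnsignedToken` helper is hypothetical and builds a JWT-shaped string for the demo.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// credentialProviderRequest mirrors only the request fields named above.
// The JSON tags are assumptions made for this sketch.
type credentialProviderRequest struct {
	Image                     string            `json:"image"`
	ServiceAccountToken       string            `json:"serviceAccountToken"`
	ServiceAccountAnnotations map[string]string `json:"serviceAccountAnnotations"`
}

// makeUnsignedToken builds a JWT-shaped string for demos and tests only; it is not signed.
func makeUnsignedToken(claimsJSON string) string {
	enc := base64.RawURLEncoding
	return enc.EncodeToString([]byte(`{"alg":"none"}`)) + "." +
		enc.EncodeToString([]byte(claimsJSON)) + ".unsigned"
}

// tokenAudiences decodes the JWT payload and returns the "aud" claim,
// which may be encoded either as a single string or as a list of strings.
func tokenAudiences(token string) ([]string, error) {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return nil, errors.New("token is not a three-part JWT")
	}
	payload, err := base64.RawURLEncoding.DecodeString(parts[1])
	if err != nil {
		return nil, fmt.Errorf("decoding payload: %w", err)
	}
	var claims struct {
		Aud json.RawMessage `json:"aud"`
	}
	if err := json.Unmarshal(payload, &claims); err != nil {
		return nil, fmt.Errorf("parsing claims: %w", err)
	}
	var single string
	if json.Unmarshal(claims.Aud, &single) == nil {
		return []string{single}, nil
	}
	var list []string
	if err := json.Unmarshal(claims.Aud, &list); err != nil {
		return nil, errors.New(`"aud" claim is missing or malformed`)
	}
	return list, nil
}

// validateRequest applies checks 1-3. The token signature is not verified here.
func validateRequest(req credentialProviderRequest, requiredAnnotations []string, wantAudience string) error {
	for _, key := range requiredAnnotations {
		if _, ok := req.ServiceAccountAnnotations[key]; !ok {
			return fmt.Errorf("check 1: missing required annotation %q", key)
		}
	}
	if req.ServiceAccountToken == "" {
		return errors.New("check 2: service account token was not sent")
	}
	auds, err := tokenAudiences(req.ServiceAccountToken)
	if err != nil {
		return fmt.Errorf("check 3: %w", err)
	}
	for _, aud := range auds {
		if aud == wantAudience {
			return nil
		}
	}
	return fmt.Errorf("check 3: audiences %v do not include %q", auds, wantAudience)
}

func main() {
	req := credentialProviderRequest{
		Image:                     "registry.example/app:v1",
		ServiceAccountToken:       makeUnsignedToken(`{"aud":["registry.example"]}`),
		ServiceAccountAnnotations: map[string]string{"example.com/client-id": "abc"},
	}
	fmt.Println(validateRequest(req, []string{"example.com/client-id"}, "registry.example"))
}
```

The sketch accepts both encodings of `aud` because JWTs may carry either a single string or an array. A real plugin would read the request as JSON from stdin rather than build it in code.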
### Graduation Criteria

<!--
- `ServiceAccountNodeAudienceRestriction` feature gate implemented in KAS as a beta feature
- Audience validation is enabled by default for service account tokens requested by the kubelet

#### Beta

- Make the feature compatible with the [Ensure secret pull images KEP](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2535-ensure-secret-pulled-images).
- `ServiceAccountNodeAudienceRestriction` feature gate is beta in KAS and enabled by default. This gate needs to be beta and enabled by default for at least one release before this KEP goes to beta. This is critical to support downgrade use cases.
- Cache KSA tokens per pod and service account to avoid generating a new token in a hot loop or for each of multiple containers with images.
- Provide some indication of whether the credentials are scoped to the service account (SA) or to the SA and pod.
  - Decide whether that is indicated in the config or in the plugin-returned content, and what the default is if unspecified (defaulting to pod scope is less performant; defaulting to SA scope risks incorrect cross-pod caching).
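To make the trade-off in the last Beta item concrete, here is a minimal, hypothetical sketch of a credential cache key whose shape depends on the chosen scope. The `cacheScope` type, its two values and the key format are illustrative assumptions and are not defined by this KEP.

```go
package main

import "fmt"

// cacheScope is a hypothetical setting for how widely cached credentials may be reused.
type cacheScope int

const (
	scopeServiceAccount cacheScope = iota // reuse across pods that share a service account
	scopePod                              // never reuse across pods
)

// cacheKey builds a hypothetical cache key for credentials obtained with a KSA token.
func cacheKey(scope cacheScope, image, namespace, serviceAccount, podUID string) string {
	key := fmt.Sprintf("%s|%s/%s", image, namespace, serviceAccount)
	if scope == scopePod {
		// Pod scope adds the pod UID, so every pod needs its own token and plugin call.
		key += "|" + podUID
	}
	return key
}

func main() {
	for _, scope := range []cacheScope{scopeServiceAccount, scopePod} {
		a := cacheKey(scope, "registry.example/app", "team-a", "builder", "pod-uid-1")
		b := cacheKey(scope, "registry.example/app", "team-a", "builder", "pod-uid-2")
		fmt.Printf("scope=%d sharedAcrossPods=%v\n", scope, a == b)
	}
}
```

With SA scope, two pods that use the same service account share one cache entry, which saves plugin invocations. If the plugin actually returned pod-specific credentials, that sharing would be the incorrect cross-pod caching the item warns about. Pod scope avoids the risk at the cost of one plugin call per pod.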
#### GA

-->

- [x] Feature gate (also fill in values in `kep.yaml`)
  - Components depending on the feature gate: kube-apiserver

The two feature gates serve different purposes, which is why they are not named similarly.

The `KubeletServiceAccountTokenForCredentialProviders` feature gate enables the kubelet to use service account tokens for image pulls through the kubelet credential provider.

The `ServiceAccountNodeAudienceRestriction` feature gate enables the kube-apiserver (KAS) to validate the audience of service account tokens requested by the kubelet. It was introduced to strictly enforce which audiences the kubelet can request tokens for. Before this change, the kubelet could request a token with any audience. With the feature gate enabled, the API server validates the requested audience.

The KAS feature gate doesn't need to be enabled for the kubelet feature to work. It graduated to beta in v1.32 and is enabled by default. The two gates are functionally independent, but the KAS gate is necessary to strictly enforce which audiences the kubelet can request tokens for.

If the KAS feature gate is not enabled, the API server does not validate the audiences requested by the kubelet, and the kubelet can request tokens for any audience. This is not recommended.
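The following Go function is a simplified model of how the KAS gate changes the outcome of a kubelet token request. It is not kube-apiserver code. `kasGateEnabled` stands for `ServiceAccountNodeAudienceRestriction`, and `allowedAudiences` is a hypothetical stand-in for whatever set of audiences the kubelet has been authorized to request.

```go
package main

import "fmt"

// kubeletMayRequest reports whether, in this simplified model, the API server accepts a
// kubelet request for a service account token with the given audience.
func kubeletMayRequest(kasGateEnabled bool, allowedAudiences map[string]bool, audience string) bool {
	if !kasGateEnabled {
		// Without the KAS gate, the requested audience is not validated at all.
		return true
	}
	// With the gate enabled, only explicitly allowed audiences are accepted.
	return allowedAudiences[audience]
}

func main() {
	allowed := map[string]bool{"registry.example": true}
	for _, gate := range []bool{false, true} {
		for _, aud := range []string{"registry.example", "unrelated.example"} {
			fmt.Printf("gate=%v audience=%s allowed=%v\n", gate, aud, kubeletMayRequest(gate, allowed, aud))
		}
	}
}
```

The kubelet-side gate, `KubeletServiceAccountTokenForCredentialProviders`, does not appear in this function. It controls whether the kubelet requests these tokens at all, which is independent of whether the API server validates the audience.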
###### Does enabling the feature change any default behavior?

3. Restart the kubelet.

These steps need to be performed on all nodes in the cluster.
After restarting the kubelet on all nodes, revoke the kubelet's permission to request service account tokens for the image pull audiences in KAS by
removing the previously created `ClusterRole` or `Role` that grants the `request-serviceaccounts-token-audience` verb, along with the corresponding `ClusterRoleBinding` or `RoleBinding` that binds the role to the kubelet.

###### What happens if we reenable the feature if it was previously rolled back?

will rollout across nodes.
-->

The feature is enabled, but the exec plugin does not properly fetch and return credentials to the kubelet.
The impact is that the kubelet cannot authenticate to those registries and pull images from them.

###### What specific metrics should inform a rollback?

<!--
What signals should users be paying attention to when the feature is young
that might indicate a serious problem?
-->

High error rates reported by `kubelet_credential_provider_plugin_error` and long invocation durations reported by `kubelet_credential_provider_plugin_duration`.
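One way to turn these two metrics into a rollback signal is to compare two scrapes of a node's metrics. The sketch below is illustrative only. The sample values and the 5% error-ratio and 2s average-duration thresholds are hypothetical. It assumes the duration histogram exposes the usual Prometheus `_count` and `_sum` series, and that every plugin invocation, failed or not, is observed by that histogram.

```go
package main

import "fmt"

// scrape holds cumulative values read from one kubelet's metrics at one point in time.
type scrape struct {
	pluginErrors  float64 // kubelet_credential_provider_plugin_error counter
	durationCount float64 // _count series of the kubelet_credential_provider_plugin_duration histogram
	durationSum   float64 // _sum series of the same histogram, in seconds
}

// rollbackSignal compares two scrapes and flags a high error ratio or slow plugin calls.
// It assumes every invocation, failed or not, is observed by the duration histogram,
// and it does not handle counter resets such as a kubelet restart between scrapes.
func rollbackSignal(prev, cur scrape, maxErrorRatio, maxAvgSeconds float64) (errorRatio, avgSeconds float64, alert bool) {
	calls := cur.durationCount - prev.durationCount
	if calls <= 0 {
		return 0, 0, false // no invocations in this window, so nothing to judge
	}
	errorRatio = (cur.pluginErrors - prev.pluginErrors) / calls
	avgSeconds = (cur.durationSum - prev.durationSum) / calls
	alert = errorRatio > maxErrorRatio || avgSeconds > maxAvgSeconds
	return errorRatio, avgSeconds, alert
}

func main() {
	prev := scrape{pluginErrors: 2, durationCount: 100, durationSum: 30}
	cur := scrape{pluginErrors: 14, durationCount: 140, durationSum: 50}
	ratio, avg, alert := rollbackSignal(prev, cur, 0.05, 2.0)
	fmt.Printf("error ratio %.2f, average duration %.2fs, alert=%v\n", ratio, avg, alert)
}
```

Using deltas between scrapes, rather than the raw cumulative values, keeps an old burst of errors from masking the current behavior of the plugin.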
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

<!--
are missing a bunch of machinery and tooling and can't do that now.
-->

No, the upgrade->downgrade->upgrade path was not tested. Manual validation will be done before promoting this feature to beta in v1.34.

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

<!--
Even if applying deprecation policies, they may still surprise some users.
-->

No.

### Monitoring Requirements

<!--
previous answers based on experience in the field.
-->

New metrics:

- `kubelet_credential_provider_config_hash` indicates the hash of the kubelet credential provider configuration file. Operators can use this metric to determine whether the kubelet credential provider configuration has changed.

###### How can an operator determine if the feature is in use by workloads?

<!--
logs or events for this purpose.
-->

Operators can use the `kubelet_credential_provider_config_hash` metric to determine whether the kubelet credential provider configuration has changed. A change in the hash indicates that the configuration file has been updated.
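A hash value on its own is opaque, so it is most useful when compared over time or across nodes. The Go sketch below groups nodes by the hash each one reported and lists the nodes whose configuration differs from the most common one. How the values are collected from each kubelet is left out, and the node names and hash strings are made up for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// mostCommonHash returns the config hash reported by the most nodes (ties go to the
// lexicographically smaller hash) and the sorted names of nodes reporting a different one.
func mostCommonHash(hashByNode map[string]string) (common string, differing []string) {
	counts := map[string]int{}
	for _, h := range hashByNode {
		counts[h]++
	}
	best := 0
	for h, n := range counts {
		if n > best || (n == best && h < common) {
			common, best = h, n
		}
	}
	for node, h := range hashByNode {
		if h != common {
			differing = append(differing, node)
		}
	}
	sort.Strings(differing)
	return common, differing
}

func main() {
	// Hypothetical hash values collected from each node's kubelet metrics.
	reported := map[string]string{
		"node-a": "3f9c",
		"node-b": "3f9c",
		"node-c": "a71e",
	}
	common, differing := mostCommonHash(reported)
	fmt.Printf("expected hash %s, nodes with a different config: %v\n", common, differing)
}
```

Breaking ties deterministically keeps the result stable across runs, because Go map iteration order is randomized.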
###### How can someone using this feature know that it is working for their instance?

<!--
Recall that end users cannot usually observe component logs or access metrics.
-->

Users can observe events for successful image pulls that use a service account token for authentication.

- [x] Events
  - Event Reason: `Pulled`, with a message such as `Successfully pulled image "xxx" in 11.877s (11.877s including waiting). Image size: xxx bytes.`

For registries or images configured to be pulled through a credential provider that uses service account tokens, a successful image pull appears to be the only way to confirm that the feature is working. If the credential provider is misbehaving, the kubelet cannot authenticate to the registry and pull images, which results in image pull errors.

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

question.
-->

On failure to fetch credentials from an exec plugin, the kubelet retries after some period and invokes the plugin again.
The kubelet retries whenever it attempts to pull an image, but until a retry succeeds, the kubelet cannot authenticate to
the registry and pull images. The SLO for successfully invoking exec plugins should be based on the SLO for successfully
pulling images from the container registry in question.

The SLOs defined in [Pod startup latency SLI/SLO details](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/pod_startup_latency.md)
don't apply to this feature because image pull time is explicitly excluded from the pod startup latency SLI. However, if the kubelet is unable to
pull images because the credential provider plugin is misconfigured, pods will fail to start.

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

<!--
- Impact of its degraded performance or high-error rates on the feature:
-->

This feature depends on the existence of a credential provider plugin binary on the host and a configuration file for the plugin that is read by the kubelet.

### Scalability

<!--

###### How does this feature react if the API server and/or etcd is unavailable?

If the API server is unavailable, the kubelet cannot fetch service account tokens for image pulls. The kubelet will retry fetching the token after some period, but until then, it cannot authenticate to the registry and pull images that rely on a credential provider plugin using service account tokens.

###### What are other known failure modes?

<!--

###### What steps should be taken if SLOs are not being met to determine the problem?

- Check the kubelet logs.
- Check the availability of the container registries used by the cluster.

## Implementation History

<!--