Commit 4b252b5

add missing PRR sections
1 parent aacf9db commit 4b252b5

3 files changed

+144
-4
lines changed

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
kep-number: 1205
2+
beta:
3+
approver: "@deads2k"

keps/sig-auth/1205-bound-service-account-tokens/README.md

Lines changed: 140 additions & 3 deletions
@@ -35,7 +35,11 @@
3535
- [Beta -> GA Graduation](#beta---ga-graduation)
3636
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
3737
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
38+
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
39+
- [Monitoring Requirements](#monitoring-requirements)
40+
- [Dependencies](#dependencies)
3841
- [Scalability](#scalability)
42+
- [Troubleshooting](#troubleshooting)
3943
<!-- /toc -->
4044

4145
## Summary
@@ -435,15 +439,16 @@ are properly reloading tokens by:
435439
https://github.com/kubernetes/kubernetes/issues/68164
436440
- [x] Pods running as non root may not access the service account token.
437441
- Fixed in https://github.com/kubernetes/kubernetes/pull/89193
442+
- [ ] Dynamic clientbuilder does not invalidate token.
438443

439-
- [x] Tests passing
444+
* [x] Tests passing
440445

441446
- [x] Upgrade test
442447
[sig-auth-serviceaccount-admission-controller-migration](https://k8s-testgrid.appspot.com/sig-auth-gce#upgrade-tests)
443448

444-
- [x] TokenRequest/TokenRequestProjection GA
449+
* [x] TokenRequest/TokenRequestProjection GA
445450

446-
- [x] RootCAConfigMap GA
451+
* [x] RootCAConfigMap GA
447452

448453
##### Beta -> GA Graduation
449454

@@ -482,6 +487,79 @@ are properly reloading tokens by:
482487
- upgrade test:
483488
test/e2e/upgrades/serviceaccount_admission_controller_migration.go
484489

490+
### Rollout, Upgrade and Rollback Planning
491+
492+
- **How can a rollout fail? Can it impact already running workloads?**
493+
494+
1. creation of CA configmap can fail due to permission / quota / admission
495+
errors.
496+
2. newly issued tokens could fail to be recognized by skewed API servers
497+
not configured with the bound token signing key/issuer.
498+
499+
- **What specific metrics should inform a rollback?**
500+
501+
1. creation of CA configmap,
502+
- `root_ca_cert_publisher_rate_limiter_use`
503+
2. authentication errors in (n-1) API servers,
504+
- `authentication_attempts`
505+
- `authentication_duration_seconds`
506+
507+
* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path
508+
tested?**
509+
For upgrade, an e2e test is set up and running here:
510+
https://k8s-testgrid.appspot.com/sig-auth-gce#upgrade-tests&width=5
511+
512+
For downgrade, we manually verified that a workload continues to
513+
authenticate successfully.
514+
515+
- **Is the rollout accompanied by any deprecations and/or removals of
516+
features, APIs, fields of API types, flags, etc.?** no
517+
518+
### Monitoring Requirements
519+
520+
- **How can an operator determine if the feature is in use by workloads?**
521+
522+
Check whether TokenRequest is in use:
523+
524+
- `serviceaccount_valid_tokens_total`: cumulative valid projected service
525+
account tokens used
526+
- `serviceaccount_stale_tokens_total`: cumulative stale projected service
527+
account tokens used
528+
- `apiserver_request_total`: with labels `group="",version="v1",resource="serviceaccounts",subresource="token"`
529+
- `apiserver_request_duration_seconds`: with labels `group="",version="v1",resource="serviceaccounts",subresource="token"`
530+
531+
- **What are the SLIs (Service Level Indicators) an operator can use to
532+
determine the health of the service?**
533+
534+
- [x] Metrics
535+
- Metric name: apiserver_request_total
536+
- Aggregation method: group="",version="v1",resource="serviceaccounts",subresource="token"
537+
- Components exposing the metric: kube-apiserver
538+
539+
- **What are the reasonable SLOs (Service Level Objectives) for the above
540+
SLIs?**
541+
542+
- per-day percentage of API calls finishing with 5XX errors <= 1%
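
As a rough illustration of the SLO above, the following sketch computes the share of 5XX responses from per-day increases of `apiserver_request_total`. The input shape (a list of `(labels, count)` pairs already filtered to `resource="serviceaccounts",subresource="token"`) and the helper names `five_xx_ratio` and `meets_slo` are assumptions made for this example, not part of the KEP.

```python
def five_xx_ratio(samples):
    """Return the fraction of requests that finished with a 5XX status code.

    `samples` is a hypothetical list of (labels, count) pairs. `labels` is a
    dict carrying the metric's "code" label, and `count` is the per-day
    increase of apiserver_request_total for that label set.
    """
    total = sum(count for _, count in samples)
    if total == 0:
        # No traffic in the window: report no errors rather than dividing by zero.
        return 0.0
    errors = sum(
        count
        for labels, count in samples
        if labels.get("code", "").startswith("5")
    )
    return errors / total


def meets_slo(samples, threshold=0.01):
    """Check the per-day SLO: 5XX responses make up at most 1% of calls."""
    return five_xx_ratio(samples) <= threshold
```

In practice the same ratio would more likely be computed in the monitoring system itself; the function only spells out the arithmetic behind the 1% threshold.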
543+
544+
- **Are there any missing metrics that would be useful to have to improve
545+
observability of this feature?**
546+
547+
- add granularity to `storage_operation_duration_seconds` to distinguish
548+
projected volumes (configmap, secret, token, etc.), or add new metrics
549+
so that we can know the usage of projected tokens.
550+
551+
### Dependencies
552+
553+
- **Does this feature depend on any specific services running in the
554+
cluster?** There are no new components required, but specific versions of
555+
kubelet and kube-controller-manager are required
556+
557+
TokenRequest depends on kubelets >= 1.12
558+
559+
BoundServiceAccountTokenVolume depends on kubelets >= 1.12 with TokenRequest
560+
enabled (default since 1.12) and kube-controller-manager >= 1.12 with
561+
RootCAConfigMap feature enabled (default since 1.20)
562+
485563
### Scalability
486564

487565
- **Will enabling / using this feature result in any new API calls?**
@@ -511,3 +589,62 @@ are properly reloading tokens by:
511589
- **Will enabling / using this feature result in non-negligible increase of
512590
resource usage (CPU, RAM, disk, IO, ...) in any components?** it adds a
513591
token minting operation in the API server every ~48 minutes for every pod.
592+
593+
### Troubleshooting
594+
595+
The Troubleshooting section currently serves the `Playbook` role. We may
596+
consider splitting it into a dedicated `Playbook` document (potentially with
597+
some monitoring details). For now, we leave it here.
598+
599+
- **How does this feature react if the API server and/or etcd is
600+
unavailable?**
601+
602+
- TokenRequest API is unavailable
603+
- configmap containing API server CA bundle cannot be created or fetched
604+
605+
* **What are other known failure modes?**
606+
607+
- failure to issue token via token subresource
608+
609+
- Detection: check `apiserver_request_total` with labels
610+
`group="",version="v1",resource="serviceaccounts",subresource="token"`
611+
- Mitigations: disable the BoundServiceAccountTokenVolume feature gate in
612+
the kube-apiserver and recreate pods.
613+
- Diagnostics: "failed to generate token" in kube-apiserver log.
614+
- Testing: [e2e test](https://k8s-testgrid.appspot.com/sig-auth-gce#gce&width=5&include-filter-by-regex=ServiceAccounts%20should%20mount%20projected%20service%20account%20token)
615+
616+
- failure to create root CA config map
617+
618+
- Detection: check `root_ca_cert_publisher_sync_total` from
619+
kube-controller-manager. (available in 1.21+)
620+
- Mitigations: disable the BoundServiceAccountTokenVolume feature gate in
621+
the kube-apiserver and recreate pods.
622+
- Diagnostics: "syncing [namespace]/[configmap name] failed" in
623+
kube-controller-manager log.
624+
- Testing: [e2e test](https://k8s-testgrid.appspot.com/sig-auth-gce#gce&width=5&include-filter-by-regex=ServiceAccounts%20should%20guarantee%20kube-root-ca.crt%20exist%20in%20any%20namespace)
625+
626+
- kubelet fails to renew token
627+
628+
- Detection: check `apiserver_request_total` with labels
629+
`group="",version="v1",resource="serviceaccounts",subresource="token"` to
630+
see whether requests for a new token are failing; check the kubelet log.
631+
- Mitigations: disable the BoundServiceAccountTokenVolume feature gate in
632+
the kube-apiserver and recreate pods.
633+
- Diagnostics: "token [namespace]/[token name] expired and refresh failed"
634+
in kubelet log.
635+
- Testing: [e2e test](https://k8s-testgrid.appspot.com/sig-auth-gce#gce-slow&width=5)
636+
637+
- workload fails to refresh token from disk
638+
639+
- Detection: `serviceaccount_stale_tokens_total` emitted by kube-apiserver
640+
- Mitigations: update client library to newer version.
641+
- Diagnostics: look for `authentication.k8s.io/stale-token` in audit log if
642+
`--service-account-extend-token-expiration=true`, or check authentication
643+
error in kube-apiserver log.
644+
- Testing: covered in all client libraries' unittests.
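
For the last failure mode, a workload avoids stale tokens by re-reading the projected token file instead of caching it for the life of the process. The sketch below shows one way a client could do that. The class name `ReloadingToken`, the 60-second reload interval, and the injectable `clock` parameter are assumptions for illustration and do not describe any particular client library.

```python
import time


class ReloadingToken:
    """Serve a token from disk, re-reading the file once it is older than max_age_seconds."""

    def __init__(self, path, max_age_seconds=60.0, clock=time.monotonic):
        self._path = path
        self._max_age = max_age_seconds  # assumed interval, tune per workload
        self._clock = clock              # injectable so the logic can be tested
        self._token = None
        self._loaded_at = None

    def get(self):
        now = self._clock()
        # Reload on first use, or when the cached copy is older than the interval.
        if self._token is None or now - self._loaded_at >= self._max_age:
            with open(self._path, encoding="utf-8") as f:
                self._token = f.read().strip()
            self._loaded_at = now
        return self._token
```

A workload would typically point this at the mounted token file, for example `/var/run/secrets/kubernetes.io/serviceaccount/token`, and call `get()` before each request so that a token rotated by the kubelet is picked up within the reload interval.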
645+
646+
* **What steps should be taken if SLOs are not being met to determine the
647+
problem?** Check kube-apiserver, kube-controller-manager and kubelet logs.
648+
649+
[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
650+
[existing slis/slos]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos

keps/sig-auth/1205-bound-service-account-tokens/kep.yaml

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ reviewers:
1111
approvers:
1212
- "@liggitt"
1313
creation-date: 2019-08-06
14-
last-updated: 2021-01-13
14+
last-updated: 2021-02-08
1515
status: implemented
1616
stage: beta
1717
latest-milestone: "v1.21"
