- [Beta -> GA Graduation](#beta---ga-graduation)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
  - [Monitoring Requirements](#monitoring-requirements)
  - [Dependencies](#dependencies)
  - [Scalability](#scalability)
  - [Troubleshooting](#troubleshooting)
<!-- /toc -->

## Summary

https://github.com/kubernetes/kubernetes/issues/68164

- [x] Pods running as non-root may not access the service account token.
  - Fixed in https://github.com/kubernetes/kubernetes/pull/89193
- [ ] Dynamic clientbuilder does not invalidate tokens.

* [x] Tests passing

  - [x] Upgrade test:
    [sig-auth-serviceaccount-admission-controller-migration](https://k8s-testgrid.appspot.com/sig-auth-gce#upgrade-tests)

* [x] TokenRequest/TokenRequestProjection GA

* [x] RootCAConfigMap GA
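
Since TokenRequest and TokenRequestProjection are GA, the token subresource
can be exercised directly as a quick sanity check. A minimal sketch, assuming
kubectl v1.24+ (which added `kubectl create token`) and the `default` service
account in the current namespace:

```sh
# Mint a short-lived bound token for the "default" service account; older
# clients would instead POST a TokenRequest to the
# /api/v1/namespaces/<ns>/serviceaccounts/<name>/token subresource.
kubectl create token default --duration=1h
```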

##### Beta -> GA Graduation

- upgrade test:
  test/e2e/upgrades/serviceaccount_admission_controller_migration.go

### Rollout, Upgrade and Rollback Planning

- **How can a rollout fail? Can it impact already running workloads?**

  1. Creation of the CA configmap can fail due to permission, quota, or
     admission errors.
  2. Newly issued tokens could fail to be recognized by skewed API servers
     that are not configured with the bound token signing key and issuer.

- **What specific metrics should inform a rollback?**

  1. Creation of the CA configmap:
     - `root_ca_cert_publisher_rate_limiter_use`
  2. Authentication errors in (n-1) API servers:
     - `authentication_attempts`
     - `authentication_duration_seconds`

  (One way to check these metrics by hand is sketched at the end of this
  section.)

- **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade
  path tested?**

  For upgrade, we have an e2e test running here:
  https://k8s-testgrid.appspot.com/sig-auth-gce#upgrade-tests&width=5

  For downgrade, we have manually verified that a workload continues to
  authenticate successfully after the rollback.

- **Is the rollout accompanied by any deprecations and/or removals of
  features, APIs, fields of API types, flags, etc.?** No.
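
A minimal sketch of that manual metrics check, assuming kubectl access to the
kube-apiserver `/metrics` endpoint and a kube-controller-manager secure port
(default 10257) reachable with an authorized bearer token (`$TOKEN` below is
a placeholder):

```sh
# Authentication results on this API server; a sustained rise in failures
# on skewed (n-1) API servers is a rollback signal.
kubectl get --raw /metrics | grep '^authentication_attempts'

# Root CA publisher rate limiter usage, served by kube-controller-manager
# (endpoint and authentication vary by cluster setup).
curl -sk -H "Authorization: Bearer ${TOKEN}" \
  https://127.0.0.1:10257/metrics | grep root_ca_cert_publisher
```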

### Monitoring Requirements

- **How can an operator determine if the feature is in use by workloads?**

  Check that TokenRequest is in use (see the sketch at the end of this
  section):

  - `serviceaccount_valid_tokens_total`: cumulative count of valid projected
    service account tokens used
  - `serviceaccount_stale_tokens_total`: cumulative count of stale projected
    service account tokens used
  - `apiserver_request_total`: with labels
    `group="",version="v1",resource="serviceaccounts",subresource="token"`
  - `apiserver_request_duration_seconds`: with labels
    `group="",version="v1",resource="serviceaccounts",subresource="token"`

- **What are the SLIs (Service Level Indicators) an operator can use to
  determine the health of the service?**

  - [x] Metrics
    - Metric name: `apiserver_request_total`
    - Aggregation method:
      `group="",version="v1",resource="serviceaccounts",subresource="token"`
    - Components exposing the metric: kube-apiserver

- **What are the reasonable SLOs (Service Level Objectives) for the above
  SLIs?**

  - per-day percentage of API calls finishing with 5XX errors <= 1%

- **Are there any missing metrics that would be useful to have to improve
  observability of this feature?**

  - Add granularity to `storage_operation_duration_seconds` to distinguish
    the projected volume types (configmap, secret, token, etc.), or add new
    metrics, so that usage of projected tokens can be tracked.
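
One way to run the in-use check above; a sketch assuming standard kubectl
access plus `jq` on the operator's machine:

```sh
# Pods that mount a projected serviceAccountToken volume source, i.e.
# workloads actually consuming bound tokens.
kubectl get pods -A -o json \
  | jq -r '.items[]
      | select(.spec.volumes[]?.projected.sources[]? | has("serviceAccountToken"))
      | "\(.metadata.namespace)/\(.metadata.name)"' \
  | sort -u

# Token issuance and stale-token usage reported by the kube-apiserver.
kubectl get --raw /metrics | grep -E 'serviceaccount_(valid|stale)_tokens_total'
kubectl get --raw /metrics | grep 'subresource="token"'
```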

### Dependencies

- **Does this feature depend on any specific services running in the
  cluster?** No new components are required, but specific minimum versions
  of kubelet and kube-controller-manager are:

  TokenRequest depends on kubelets >= 1.12.

  BoundServiceAccountTokenVolume depends on kubelets >= 1.12 with TokenRequest
  enabled (default since 1.12) and kube-controller-manager >= 1.12 with the
  RootCAConfigMap feature enabled (default since 1.20).
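
The kubelet prerequisite can be verified across nodes with standard kubectl;
`.status.nodeInfo.kubeletVersion` is part of the core Node API:

```sh
# All nodes should report a kubelet >= v1.12 before relying on
# BoundServiceAccountTokenVolume.
kubectl get nodes \
  -o custom-columns=NAME:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion
```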

### Scalability

- **Will enabling / using this feature result in any new API calls?**

- **Will enabling / using this feature result in non-negligible increase of
  resource usage (CPU, RAM, disk, IO, ...) in any components?** It adds a
  token minting operation in the API server every ~48 minutes for every pod.
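
As a back-of-envelope check of that load: one token mint per pod per ~48
minutes (~2880 seconds) works out as below. The 5,000-pod figure is an
arbitrary example, not a number from this KEP:

```sh
# Extra TokenRequest load ~= pods / 2880 requests per second.
PODS=5000
awk -v pods="$PODS" 'BEGIN { printf "~%.2f extra requests/sec\n", pods / 2880 }'
# => ~1.74 extra requests/sec for 5,000 pods
```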

### Troubleshooting

The Troubleshooting section currently serves the `Playbook` role. We may
consider splitting it into a dedicated `Playbook` document (potentially with
some monitoring details). For now, we leave it here.

- **How does this feature react if the API server and/or etcd is
  unavailable?**

  - The TokenRequest API is unavailable.
  - The configmap containing the API server CA bundle cannot be created or
    fetched.

- **What are other known failure modes?**

  - Failure to issue a token via the token subresource.

    - Detection: check `apiserver_request_total` with labels
      `group="",version="v1",resource="serviceaccounts",subresource="token"`.
    - Mitigations: disable the BoundServiceAccountTokenVolume feature gate in
      the kube-apiserver and recreate pods (sketched at the end of this
      section).
    - Diagnostics: "failed to generate token" in the kube-apiserver log.
    - Testing: [e2e test](https://k8s-testgrid.appspot.com/sig-auth-gce#gce&width=5&include-filter-by-regex=ServiceAccounts%20should%20mount%20projected%20service%20account%20token)

  - Failure to create the root CA configmap.

    - Detection: check `root_ca_cert_publisher_sync_total` from
      kube-controller-manager (available in 1.21+).
    - Mitigations: disable the BoundServiceAccountTokenVolume feature gate in
      the kube-apiserver and recreate pods.
    - Diagnostics: "syncing [namespace]/[configmap name] failed" in the
      kube-controller-manager log.
    - Testing: [e2e test](https://k8s-testgrid.appspot.com/sig-auth-gce#gce&width=5&include-filter-by-regex=ServiceAccounts%20should%20guarantee%20kube-root-ca.crt%20exist%20in%20any%20namespace)

  - Kubelet fails to renew a token.

    - Detection: check `apiserver_request_total` with labels
      `group="",version="v1",resource="serviceaccounts",subresource="token"`
      to see whether requests for new tokens are failing; check the kubelet
      log.
    - Mitigations: disable the BoundServiceAccountTokenVolume feature gate in
      the kube-apiserver and recreate pods.
    - Diagnostics: "token [namespace]/[token name] expired and refresh failed"
      in the kubelet log.
    - Testing: [e2e test](https://k8s-testgrid.appspot.com/sig-auth-gce#gce-slow&width=5)

  - Workload fails to refresh the token from disk.

    - Detection: `serviceaccount_stale_tokens_total` emitted by the
      kube-apiserver.
    - Mitigations: update the client library to a newer version.
    - Diagnostics: look for `authentication.k8s.io/stale-token` in the audit
      log if `--service-account-extend-token-expiration=true`, or check for
      authentication errors in the kube-apiserver log.
    - Testing: covered by all client libraries' unit tests.

- **What steps should be taken if SLOs are not being met to determine the
  problem?** Check the kube-apiserver, kube-controller-manager, and kubelet
  logs.
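
A sketch of the rollback mitigation and log diagnostics referenced above,
assuming you can edit the kube-apiserver flags (e.g. via its static pod
manifest); the log file paths are illustrative and vary by distribution:

```sh
# Mitigation: disable the feature gate on the kube-apiserver and recreate
# pods so they revert to secret-based token volumes.
kube-apiserver --feature-gates=BoundServiceAccountTokenVolume=false ...

# Diagnostics: search component logs for the failure signatures quoted above.
grep 'failed to generate token' /var/log/kube-apiserver.log
grep 'syncing' /var/log/kube-controller-manager.log | grep 'failed'
grep 'expired and refresh failed' /var/log/kubelet.log
```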

[supported limits]: https://git.k8s.io/community/sig-scalability/configs-and-limits/thresholds.md
[existing slis/slos]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos