- [e2e tests](#e2e-tests)
- [Graduation Criteria](#graduation-criteria)
- [Alpha (1.29)](#alpha-129)
- - [Future Alpha versions](#future-alpha-versions)
- [Beta](#beta)
- [GA](#ga)
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
@@ -460,54 +459,100 @@ while the feature is in alpha and beta.

### Monitoring

- We will add the following 4 metrics:
+ We will add the following metrics:

1. `apiserver_authorization_decisions_total`

- This will be incremented on round-trip of an authorizer. It will track total
- authorization decision invocations across the following labels.
+ This will be incremented when an authorizer makes a terminal decision (allow or deny).
+ It will track total authorization decision invocations across the following labels.

Labels {along with possible values}:
- - `mode` {<authorizer_name>} # when authorizer is a webhook, prepend `webhook_`
+ - `type` {<authorizer_type>}
+   - value matches the configuration `type` field, e.g. `RBAC`, `ABAC`, `Node`, `Webhook`
+ - `name` {<authorizer_name>}
+   - value matches the configuration `name` field, e.g. `rbac`, `node`, `abac`, `<webhook name>`
- `decision` {Allow, Deny}

- **Note:** Some examples of <authorizer_name>: `RBAC`, `Node`, `ABAC`, `webhook_<name>`.
-
2. `apiserver_authorization_webhook_evaluations_total`

- This will be incremented on round-trip of an authorization webhook. It will track
- total invocation counts across the following labels.
+ This will be incremented on round-trip of an authorization webhook.
+ It will track total invocation counts across the following labels.

- - `name`
- - `code` {"incomplete_request", "bad_response"}
+ - `name` {<authorizer_name>}
+   - value matches the configuration `name` field
+ - `result` {canceled, timeout, error, success}
+   - `canceled`: the call invoking the webhook request was canceled
+   - `timeout`: the webhook request timed out
+   - `error`: the webhook response completed and was invalid
+   - `success`: the webhook response completed and was well-formed

3. `apiserver_authorization_webhook_duration_seconds`

This is a Histogram metric that will track the total round trip time of the requests to the webhook.

Labels {along with possible values}:
- - `name`
- - `code` {"incomplete_request", "bad_response", "ok"}
+ - `name` {<authorizer_name>}
+   - value matches the configuration `name` field
+ - `result` {canceled, timeout, error, success}
+   - `canceled`: the call invoking the webhook request was canceled
+   - `timeout`: the webhook request timed out
+   - `error`: the webhook response completed and was invalid
+   - `success`: the webhook response completed and was well-formed
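
As a rough illustration of how the four `result` label values relate to the outcome of a single webhook call, the sketch below maps an outcome to a label value. The function name `classify_result` and its parameters are hypothetical and do not come from the KEP or from kube-apiserver. Treating any other failure as `error` is an assumption of this sketch.

```python
from concurrent.futures import CancelledError


def classify_result(error, response_well_formed):
    """Return the `result` label value for one webhook call (hypothetical helper).

    error: the exception raised by the call, or None if a response arrived.
    response_well_formed: whether the received response parsed as valid.
    """
    if isinstance(error, CancelledError):
        # The call invoking the webhook request was canceled.
        return "canceled"
    if isinstance(error, TimeoutError):
        # The webhook request timed out.
        return "timeout"
    if error is not None or not response_well_formed:
        # Assumption: any other failure, or an invalid response, counts as an error.
        return "error"
    return "success"
```

Under these assumptions, the same value would be attached to both the evaluations counter and the duration histogram for a given call.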

4. `apiserver_authorization_webhook_evaluations_fail_open_total`

- This metric will be incremented when a webhook returns `code != errAuthzWebhookOKCode` and
- decision on error is not set to `deny`.
+ This metric will be incremented when a webhook request times out or
+ returns an invalid response, and the failurePolicy is set to `NoOpinion`.

Labels {along with possible values}:

- - `name`
- - `code` {"incomplete_request", "bad_response"}
+ - `name` {<authorizer_name>}
+   - value matches the configuration `name` field
+ - `result` {timeout, error}
+   - `timeout`: the webhook request timed out
+   - `error`: the webhook response completed and was invalid
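
The fail-open condition described for this metric can be sketched as a small predicate plus a counter. This is only an illustration: `record_fail_open` and the in-memory `fail_open_total` counter are hypothetical stand-ins, not the API server's metric registration code.

```python
from collections import Counter

# Hypothetical in-memory stand-in for the counter, keyed by (name, result).
fail_open_total = Counter()


def record_fail_open(name, result, failure_policy):
    """Increment the fail-open counter when a webhook failure is ignored.

    Counts only `timeout` and `error` results, and only when the webhook's
    failurePolicy is `NoOpinion`. Returns True if the counter was incremented.
    """
    if result in ("timeout", "error") and failure_policy == "NoOpinion":
        fail_open_total[(name, result)] += 1
        return True
    return False
```

A `canceled` or `success` result, or any other failurePolicy value, leaves the counter unchanged in this sketch.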

5. `apiserver_authorization_config_controller_automatic_reload_last_timestamp_seconds`

- This Gauge metric will record last time in seconds when an authorization reload was performed, partitioned by apiserver_id_hash.
+ This Gauge metric will record the last time, in seconds, when an authorization reload was performed, partitioned by apiserver_id_hash and status.
- `apiserver_id_hash`
+ - `status` (`success` or `failure`)

- 6. `apiserver_authorization_config_controller_automatic_reload_failures_total` and `apiserver_authorization_config_controller_automatic_reload_success_total`
+ 6. `apiserver_authorization_config_controller_automatic_reloads_total`

- These Counter metrics record the total number of reload successes and failures, partitioned by API server apiserver_id_hash.
+ This Counter metric records the total number of reload successes and failures, partitioned by API server apiserver_id_hash and status.
- `apiserver_id_hash`
+ - `status` (`success` or `failure`)
+
+ 7. `apiserver_authorization_match_condition_evaluation_errors_total`
+
+ This will be incremented when an authorization webhook encounters a match condition error.
+
+ Labels {along with possible values}:
+ - `type` {<authorizer_type>}
+   - Currently only `Webhook` authorizers support match conditions
+ - `name` {<authorizer_name>}
+   - value matches the configuration `name` field
+
+ 8. `apiserver_authorization_match_condition_exclusions_total`
+
+ This will be incremented when an authorization webhook is skipped because match conditions exclude it.
+
+ Labels {along with possible values}:
+ - `type` {<authorizer_type>}
+   - Currently only `Webhook` authorizers support match conditions
+ - `name` {<authorizer_name>}
+   - value matches the configuration `name` field
+
+ 9. `apiserver_authorization_match_condition_evaluation_seconds`
+
+ Authorization match condition evaluation time in seconds.
+
+ Labels {along with possible values}:
+ - `type` {<authorizer_type>}
+   - Currently only `Webhook` authorizers support match conditions
+ - `name` {<authorizer_name>}
+   - value matches the configuration `name` field

### Test Plan

@@ -566,12 +611,10 @@ the scenarios.
- Add feature flag for gating usage
- Unit tests and Integration tests to be written

- #### Future Alpha versions
-
- - Revisit on the items mentioned in Non-Goals and see if any needs to be implemented
-
#### Beta

+ - Observability and metrics complete
+ - File reloading functionality complete
- Address user reviews and iterate (if needed, keep in Alpha until changes stabilize)
- Feature flag will be turned on by default

@@ -581,17 +624,24 @@ the scenarios.

### Upgrade / Downgrade Strategy

- While the feature is in Alpha, there is no change if cluster administrators want to
- keep on using command line flags.
+ There is no change if cluster administrators want to keep on using command line flags.

- When the feature goes to Beta/GA or the cluster administrators want to configure
- authorizers using a config file, they need to make sure the config file exists before
- upgrading the cluster. Similarly when downgrading clusters, they would need to add
- the flags back to their bootstrap mechanism.
+ If cluster administrators want to configure authorizers using a config file,
+ they need to make sure the config file exists before upgrading the cluster.
+ When downgrading clusters, they would need to switch their invocation back to use flags.
+
+ In clusters with multiple API servers, rippling out authorization configuration changes
+ using a rolling strategy is recommended, verifying the change is effective and functional
+ on one API server before proceeding to the next API server.
+
+ The recommended strategy to switch from command line flags to a config file is to:
+
+ 1. Switch from command line flags to a config file that expresses an identical configuration
+ 2. Once all servers are successfully operating with the config file, roll out config modifications
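
The rolling strategy described above can be sketched as a loop that changes one API server at a time and stops at the first server that fails verification. This is a minimal illustration only: `roll_out`, `apply_change`, and `is_healthy` are hypothetical names, and how a change is applied or verified is left to the operator's own tooling.

```python
def roll_out(servers, apply_change, is_healthy):
    """Apply a configuration change to API servers one at a time.

    servers: ordered list of server identifiers.
    apply_change: callable that applies the change to one server (hypothetical).
    is_healthy: callable that verifies the change is effective and functional
        on one server (hypothetical).
    Returns (updated, failed), where failed is None if every server passed.
    """
    updated = []
    for server in servers:
        apply_change(server)
        if not is_healthy(server):
            # Stop before touching the remaining servers so the change
            # can be reverted on this one.
            return updated, server
        updated.append(server)
    return updated, None
```

Following the two recommended steps, an operator would run such a loop once for the switch to an identical config file, and again for each later modification.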

### Version Skew Strategy

- Not applicable.
+ Not applicable; authorizers are configured per API server.

## Production Readiness Review Questionnaire

@@ -603,6 +653,8 @@ Not applicable.
- Feature gate name: `StructuredAuthorizationConfiguration`
- Components depending on the feature gate:
  - kube-apiserver
+ - [x] Other
+   - `kube-apiserver` command-line flag: `--authorization-config`

###### Does enabling the feature change any default behavior?

@@ -630,8 +682,6 @@ command line flags should return an error.

### Rollout, Upgrade and Rollback Planning

- > Note: This section is required when targeting Beta to a release.
-
###### How can a rollout or rollback fail? Can it impact already running workloads?

A rollout can fail when the authorization configuration file being passed doesn't

### Monitoring Requirements

- <!--
- This section must be completed when targeting beta to a release.
-
- For GA, this section is required: approvers should be able to confirm the
- previous answers based on experience in the field.
- -->
-
- > Note: To be elaborated more during Beta graduation since this section
- must be completed when targeting beta to a release.
-
###### How can an operator determine if the feature is in use by workloads?

The cluster administrators can check the flags passed to the `kube-apiserver` if

### Scalability

- > Note: This section is good-to-have for Alpha.
-
###### Will enabling / using this feature result in any new API calls?

No. No additional calls will be made to the Kubernetes API Server.
@@ -755,7 +793,7 @@ cluster administrator defines multiple webhooks.

**Note**: This is a result of the intended feature.
If multiple webhooks are defined and one or more of them are unreachable, the
- request latency would get a hit but this is upto the configuration made by the
+ request latency would get a hit but this is up to the configuration made by the
user. The feature implementation itself doesn't introduce any change to the
existing SLIs/SLOs.

@@ -781,41 +819,28 @@ For use-cases where the CEL filters would pre-filter requests even before the ne
be dispatched to a webhook, there would be a performance improvement due to lower
number of network calls.

- ### Troubleshooting
+ ###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

- <!--
- This section must be completed when targeting beta to a release.
-
- For GA, this section is required: approvers should be able to confirm the
- previous answers based on experience in the field.
-
- The Troubleshooting section currently serves the `Playbook` role. We may consider
- splitting it into a dedicated `Playbook` document (potentially with some monitoring
- details). For now, we leave it here.
- -->
+ No. This feature exists only in the API server.

- > Note: To be elaborated more during Beta graduation since this section
- must be completed when targeting beta to a release.
+ ### Troubleshooting

###### How does this feature react if the API server and/or etcd is unavailable?

No effect.

###### What are other known failure modes?

- <!--
- For each of them, fill in the following information by copying the below template:
- - [Failure mode brief description]
- - Detection: How can it be detected via metrics? Stated another way:
- how can an operator troubleshoot without logging into a master or worker node?
- - Mitigations: What can be done to stop the bleeding, especially for already
- running user workloads?
- - Diagnostics: What are the useful log messages and their required logging
- levels that could help debug the issue?
- Not required until feature graduated to beta.
- - Testing: Are there any tests for failure mode? If not, describe why.
- -->
-
+ - Configuration file cannot be loaded at server start
+   - Detection: API server process exits
+   - Mitigation: Revert to the previous successful invocation or configuration
+   - Diagnostics: Configuration validation errors are logged at default verbosity.
+   - Testing: Configuration file loading and validation is unit tested
+ - Configuration file cannot be reloaded while server is running
+   - Detection: `apiserver_authorization_config_controller_automatic_reload_last_timestamp_seconds` metric
+     indicates the `failure` status timestamp is most recent.
+   - Mitigation: Revert to the previous successful invocation or configuration
+   - Diagnostics: Configuration validation errors are logged at default verbosity.
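
The reload-failure detection rule above compares the two `status` series of the last-timestamp gauge. A minimal sketch of that comparison follows, assuming the samples for one `apiserver_id_hash` have already been collected into a dictionary. The function name `last_reload_failed` and the input shape are hypothetical.

```python
def last_reload_failed(last_timestamps):
    """Report whether the most recent reload attempt failed.

    last_timestamps: maps a `status` label value ("success" or "failure")
    to the gauge value in Unix seconds for a single apiserver_id_hash.
    A missing key means that status has not been recorded yet.
    """
    failure = last_timestamps.get("failure")
    success = last_timestamps.get("success")
    if failure is None:
        # No failed reload has ever been recorded.
        return False
    # A failure with no success at all, or one newer than the last success.
    return success is None or failure > success
```

An alert built on the same comparison would fire only while the `failure` timestamp remains the more recent of the two.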

###### What steps should be taken if SLOs are not being met to determine the problem?
