Commit de1f922

Merge pull request #4485 from liggitt/authz-config
KEP-3221: Add PRR, update metrics
2 parents 144af03 + 0926385 commit de1f922

3 files changed, +108 -75 lines changed

Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
 kep-number: 3221
 alpha:
   approver: "@deads2k"
+beta:
+  approver: "@jpbetz"

keps/sig-auth/3221-structured-authorization-configuration/README.md

Lines changed: 97 additions & 72 deletions
@@ -24,7 +24,6 @@
 - [e2e tests](#e2e-tests)
 - [Graduation Criteria](#graduation-criteria)
 - [Alpha (1.29)](#alpha-129)
-- [Future Alpha versions](#future-alpha-versions)
 - [Beta](#beta)
 - [GA](#ga)
 - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
@@ -460,54 +459,100 @@ while the feature is in alpha and beta.

 ### Monitoring

-We will add the following 4 metrics:
+We will add the following metrics:

 1. `apiserver_authorization_decisions_total`

-This will be incremented on round-trip of an authorizer. It will track total
-authorization decision invocations across the following labels.
+This will be incremented when an authorizer makes a terminal decision (allow or deny).
+It will track total authorization decision invocations across the following labels.

 Labels {along with possible values}:
-- `mode` {<authorizer_name>} # when authorizer is a webhook, prepend `webhook_`
+- `type` {<authorizer_type>}
+  - value matches the configuration `type` field, e.g. `RBAC`, `ABAC`, `Node`, `Webhook`
+- `name` {<authorizer_name>}
+  - value matches the configuration `name` field, e.g. `rbac`, `node`, `abac`, `<webhook name>`
 - `decision` {Allow, Deny}

-**Note:** Some examples of <authorizer_name>: `RBAC`, `Node`, `ABAC`, `webhook_<name>`.
-
 2. `apiserver_authorization_webhook_evaluations_total`

-This will be incremented on round-trip of an authorization webhook. It will track
-total invocation counts across the following labels.
+This will be incremented on round-trip of an authorization webhook.
+It will track total invocation counts across the following labels.

-- `name`
-- `code` {"incomplete_request", "bad_response"}
+- `name` {<authorizer_name>}
+  - value matches the configuration `name` field
+- `result` {canceled, timeout, error, success}
+  - `canceled`: the call invoking the webhook request was canceled
+  - `timeout`: the webhook request timed out
+  - `error`: the webhook response completed and was invalid
+  - `success`: the webhook response completed and was well-formed

 3. `apiserver_authorization_webhook_duration_seconds`

 This is a Histogram metric that will track the total round trip time of the requests to the webhook.

 Labels {along with possible values}:
-- `name`
-- `code` {"incomplete_request", "bad_response", "ok"}
+- `name` {<authorizer_name>}
+  - value matches the configuration `name` field
+- `result` {canceled, timeout, error, success}
+  - `canceled`: the call invoking the webhook request was canceled
+  - `timeout`: the webhook request timed out
+  - `error`: the webhook response completed and was invalid
+  - `success`: the webhook response completed and was well-formed

 4. `apiserver_authorization_webhook_evaluations_fail_open_total`

-This metric will be incremented when a webhook returns `code != errAuthzWebhookOKCode` and
-decision on error is not set to `deny`.
+This metric will be incremented when a webhook request times out or
+returns an invalid response, and the failurePolicy is set to `NoOpinion`.

 Labels {along with possible values}:

-- `name`
-- `code` {"incomplete_request", "bad_response"}
+- `name` {<authorizer_name>}
+  - value matches the configuration `name` field
+- `result` {timeout, error}
+  - `timeout`: the webhook request timed out
+  - `error`: the webhook response completed and was invalid

 5. `apiserver_authorization_config_controller_automatic_reload_last_timestamp_seconds`

-This Gauge metric will record last time in seconds when an authorization reload was performed, partitioned by apiserver_id_hash.
+This Gauge metric will record last time in seconds when an authorization reload was performed, partitioned by apiserver_id_hash and status.
 - `apiserver_id_hash`
+- `status` (`success` or `failure`)

-6. `apiserver_authorization_config_controller_automatic_reload_failures_total` and `apiserver_authorization_config_controller_automatic_reload_success_total`
+6. `apiserver_authorization_config_controller_automatic_reloads_total`

-These Counter metrics record the total number of reload successes and failures, partitioned by API server apiserver_id_hash.
+This Counter metric records the total number of reload successes and failures, partitioned by API server apiserver_id_hash and status.
 - `apiserver_id_hash`
+- `status` (`success` or `failure`)
+
+7. `apiserver_authorization_match_condition_evaluation_errors_total`
+
+This will be incremented when an authorization webhook encounters a match condition error.
+
+Labels {along with possible values}:
+- `type` {<authorizer_type>}
+  - Currently only `Webhook` authorizers support match conditions
+- `name` {<authorizer_name>}
+  - value matches the configuration `name` field
+
+8. `apiserver_authorization_match_condition_exclusions_total`
+
+This will be incremented when an authorization webhook is skipped because match conditions exclude it.
+
+Labels {along with possible values}:
+- `type` {<authorizer_type>}
+  - Currently only `Webhook` authorizers support match conditions
+- `name` {<authorizer_name>}
+  - value matches the configuration `name` field
+
+9. `apiserver_authorization_match_condition_evaluation_seconds`
+
+Authorization match condition evaluation time in seconds.
+
+Labels {along with possible values}:
+- `type` {<authorizer_type>}
+  - Currently only `Webhook` authorizers support match conditions
+- `name` {<authorizer_name>}
+  - value matches the configuration `name` field

 ### Test Plan

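Reviewer note on the Monitoring hunk: the new `result` label and the fail-open condition can be read as a small decision procedure. The Python sketch below is only an illustration of that reading. `WebhookOutcome`, `classify_result` and `counts_as_fail_open` are hypothetical names, and the order in which the checks are made (canceled before timeout before invalid response) is an assumption. None of it is taken from the kube-apiserver code.

```python
from dataclasses import dataclass


@dataclass
class WebhookOutcome:
    """Hypothetical summary of one webhook round trip (not a real API type)."""
    canceled: bool = False       # the call invoking the webhook request was canceled
    timed_out: bool = False      # the webhook request timed out
    response_valid: bool = True  # the response completed and was well-formed


def classify_result(outcome: WebhookOutcome) -> str:
    """Map an outcome to one of the `result` label values from the KEP text.

    The precedence of the checks is an assumption made for this sketch.
    """
    if outcome.canceled:
        return "canceled"
    if outcome.timed_out:
        return "timeout"
    if not outcome.response_valid:
        return "error"
    return "success"


def counts_as_fail_open(result: str, failure_policy: str) -> bool:
    """True when the fail-open counter would be incremented under the KEP wording:
    the request timed out or returned an invalid response, and failurePolicy is `NoOpinion`."""
    return result in ("timeout", "error") and failure_policy == "NoOpinion"
```

For example, `counts_as_fail_open(classify_result(WebhookOutcome(timed_out=True)), "NoOpinion")` evaluates to `True`, while a `success` result never counts as fail-open regardless of the policy.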
@@ -566,12 +611,10 @@ the scenarios.
 - Add feature flag for gating usage
 - Unit tests and Integration tests to be written

-#### Future Alpha versions
-
-- Revisit on the items mentioned in Non-Goals and see if any needs to be implemented
-
 #### Beta

+- Observability and metrics complete
+- File reloading functionality complete
 - Address user reviews and iterate (if needed, keep in Alpha until changes stabilize)
 - Feature flag will be turned on by default

@@ -581,17 +624,24 @@ the scenarios.

 ### Upgrade / Downgrade Strategy

-While the feature is in Alpha, there is no change if cluster administrators want to
-keep on using command line flags.
+There is no change if cluster administrators want to keep on using command line flags.

-When the feature goes to Beta/GA or the cluster administrators want to configure
-authorizers using a config file, they need to make sure the config file exists before
-upgrading the cluster. Similarly when downgrading clusters, they would need to add
-the flags back to their bootstrap mechanism.
+If cluster administrators want to configure authorizers using a config file,
+they need to make sure the config file exists before upgrading the cluster.
+When downgrading clusters, they would need to switch their invocation back to use flags.
+
+In clusters with multiple API servers, rippling out authorization configuration changes
+using a rolling strategy is recommended, verifying the change is effective and functional
+on one API server before proceeding to the next API server.
+
+The recommended strategy to switch from command line flags to a config file is to:
+
+1. Switch from command line flags to a config file that expresses an identical configuration
+2. Once all servers are successfully operating with the config file, roll out config modifications

 ### Version Skew Strategy

-Not applicable.
+Not applicable, authorizers are configured per API server.

 ## Production Readiness Review Questionnaire

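Reviewer note on the Upgrade / Downgrade hunk: the rolling strategy added here (change one API server, verify, then move on) can be sketched as a simple loop. The Python below is a minimal illustration under stated assumptions. `rolling_rollout`, `apply_change` and `verify` are hypothetical names supplied by the caller and do not correspond to any Kubernetes tooling.

```python
from typing import Callable, Iterable, List


def rolling_rollout(
    servers: Iterable[str],
    apply_change: Callable[[str], None],
    verify: Callable[[str], bool],
) -> List[str]:
    """Apply a change to one API server at a time and stop at the first failed check.

    Returns the servers that were changed and verified successfully.
    """
    updated: List[str] = []
    for server in servers:
        apply_change(server)
        if not verify(server):
            # Stop here so the remaining servers keep the known-good configuration.
            break
        updated.append(server)
    return updated
```

Under the two-step recommendation in the hunk, such a loop would run twice: once where `apply_change` switches a server from flags to an equivalent config file, and again, only after every server passed, where it rolls out the config modifications.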
@@ -603,6 +653,8 @@ Not applicable.
   - Feature gate name: `StructuredAuthorizationConfiguration`
   - Components depending on the feature gate:
     - kube-apiserver
+- [x] Other
+  - `kube-apiserver` command-line flag: `--authorization-config`

 ###### Does enabling the feature change any default behavior?

@@ -630,8 +682,6 @@ command line flags should return an error.

 ### Rollout, Upgrade and Rollback Planning

-> Note: This section is required when targeting Beta to a release.
-
 ###### How can a rollout or rollback fail? Can it impact already running workloads?

 A rollout can fail when the authorization configuration file being passed doesn't
@@ -658,16 +708,6 @@ No.

 ### Monitoring Requirements

-<!--
-This section must be completed when targeting beta to a release.
-
-For GA, this section is required: approvers should be able to confirm the
-previous answers based on experience in the field.
--->
-
-> Note: To be elaborated more during Beta graduation since this section
-must be completed when targeting beta to a release.
-
 ###### How can an operator determine if the feature is in use by workloads?

 The cluster administrators can check the flags passed to the `kube-apiserver` if
@@ -730,8 +770,6 @@ None

 ### Scalability

-> Note: This section is good-to-have for Alpha.
-
 ###### Will enabling / using this feature result in any new API calls?

 No. No additional calls will be made to the Kubernetes API Server.
@@ -755,7 +793,7 @@ cluster administrator defines multiple webhooks.

 **Note**: This is a result of the intended feature.
 If multiple webhooks are defined and one or more of them are unreachable, the
-request latency would get a hit but this is upto the configuration made by the
+request latency would get a hit but this is up to the configuration made by the
 user. The feature implementation itself doesn't introduce any change to the
 existing SLIs/SLOs.

@@ -781,41 +819,28 @@ For use-cases where the CEL filters would pre-filter requests even before the ne
 be dispatched to a webhook, there would be a performance improvement due to lower
 number of network calls.

-### Troubleshooting
+###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

-<!--
-This section must be completed when targeting beta to a release.
-
-For GA, this section is required: approvers should be able to confirm the
-previous answers based on experience in the field.
-
-The Troubleshooting section currently serves the `Playbook` role. We may consider
-splitting it into a dedicated `Playbook` document (potentially with some monitoring
-details). For now, we leave it here.
--->
+No. This feature exists only in the API server.

-> Note: To be elaborated more during Beta graduation since this section
-must be completed when targeting beta to a release.
+### Troubleshooting

 ###### How does this feature react if the API server and/or etcd is unavailable?

 No effect.

 ###### What are other known failure modes?

-<!--
-For each of them, fill in the following information by copying the below template:
-  - [Failure mode brief description]
-    - Detection: How can it be detected via metrics? Stated another way:
-      how can an operator troubleshoot without logging into a master or worker node?
-    - Mitigations: What can be done to stop the bleeding, especially for already
-      running user workloads?
-    - Diagnostics: What are the useful log messages and their required logging
-      levels that could help debug the issue?
-      Not required until feature graduated to beta.
-    - Testing: Are there any tests for failure mode? If not, describe why.
--->
-
+- Configuration file cannot be loaded at server start
+  - Detection: API server process exits
+  - Mitigation: Revert to the previous successful invocation or configuration
+  - Diagnostics: Configuration validation errors are logged at default verbosity.
+  - Testing: Configuration file loading and validation is unit tested
+- Configuration file cannot be reloaded while server is running
+  - Detection: `apiserver_authorization_config_controller_automatic_reload_last_timestamp_seconds` metric
+    indicates the `failure` status timestamp is most recent.
+  - Mitigation: Revert to the previous successful invocation or configuration
+  - Diagnostics: Configuration validation errors are logged at default verbosity.

 ###### What steps should be taken if SLOs are not being met to determine the problem?

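Reviewer note on the reload failure mode: the detection rule ("the `failure` status timestamp is most recent") amounts to comparing the two gauge samples for one API server. The Python sketch below shows that comparison. `last_reload_status` is a hypothetical helper, the input is assumed to be a mapping from the `status` label value to the gauge value in seconds, and treating a value of 0 as "never recorded" is also an assumption.

```python
from typing import Mapping, Optional


def last_reload_status(timestamps: Mapping[str, float]) -> Optional[str]:
    """Return the `status` whose reload timestamp is most recent.

    `timestamps` maps a status label value ("success" or "failure") to the
    gauge value in seconds. Returns None when no reload has been recorded.
    """
    # Assumption: a zero or missing sample means that status was never recorded.
    recorded = {status: ts for status, ts in timestamps.items() if ts > 0}
    if not recorded:
        return None
    return max(recorded, key=lambda status: recorded[status])
```

An operator-side check could then alert when `last_reload_status(samples) == "failure"` for any `apiserver_id_hash`.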
keps/sig-auth/3221-structured-authorization-configuration/kep.yaml

Lines changed: 9 additions & 3 deletions
@@ -27,6 +27,12 @@ feature-gates:
       - kube-apiserver
 disable-supported: true
 metrics:
-  - apiserver_authorization_config_controller_automatic_reload_last_timestamp_seconds
-  - apiserver_authorization_config_controller_automatic_reload_success_total{apiserver_id_hash}
-  - apiserver_authorization_config_controller_automatic_reload_failures_total{apiserver_id_hash}
+  - apiserver_authorization_decisions_total{type, name, decision}
+  - apiserver_authorization_webhook_duration_seconds{name, result}
+  - apiserver_authorization_webhook_evaluations_total{name, result}
+  - apiserver_authorization_webhook_evaluations_fail_open_total{name, result}
+  - apiserver_authorization_config_controller_automatic_reload_last_timestamp_seconds{apiserver_id_hash, status}
+  - apiserver_authorization_config_controller_automatic_reloads_total{apiserver_id_hash, status}
+  - apiserver_authorization_match_condition_evaluation_errors_total{type, name}
+  - apiserver_authorization_match_condition_exclusions_total{type, name}
+  - apiserver_authorization_match_condition_evaluation_seconds{type, name}
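
Reviewer note on the `metrics` entries: each entry follows a `name{label, label}` shape, which makes it straightforward to cross-check the label sets against the README. The Python sketch below splits one entry into its name and labels. `parse_metric_entry` is a hypothetical helper written for this note, and the accepted name pattern is an assumption based on the entries shown in the hunk.

```python
import re
from typing import List, Tuple

# Assumed shape: a metric name, optionally followed by a comma-separated label list in braces.
_ENTRY = re.compile(r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?$")


def parse_metric_entry(entry: str) -> Tuple[str, List[str]]:
    """Split a `name{label, ...}` entry into the metric name and its label names."""
    match = _ENTRY.match(entry.strip())
    if match is None:
        raise ValueError(f"unrecognized metric entry: {entry!r}")
    labels = match.group("labels") or ""
    return match.group("name"), [part.strip() for part in labels.split(",") if part.strip()]
```

Applied to the first added entry, this yields the name `apiserver_authorization_decisions_total` with labels `type`, `name` and `decision`, matching the README hunk.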
