@@ -798,30 +798,38 @@ _This section must be completed when targeting alpha to a release._
### Rollout, Upgrade and Rollback Planning
- _This section must be completed when targeting beta graduation to a release._
-
* **How can a rollout fail? Can it impact already running workloads?**
- Try to be as paranoid as possible - e.g., what if some components will restart
- mid-rollout?
+
+ If `pod-security.kubernetes.io/enforce` labels are already present on namespaces,
+ upgrading to enable the feature could prevent new pods violating the opted-into
+ policy level from being created. Existing running pods would not be disrupted.
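The gating behavior described above can be sketched as a toy model. This is illustrative only (the function and level ordering here are assumptions for the sketch, not the actual admission plugin code in kube-apiserver):

```python
# Toy model of the enforce-mode gate: new pods must satisfy the namespace's
# enforce label; already-running pods are never re-evaluated here.
LEVELS = ["privileged", "baseline", "restricted"]  # least to most restrictive

def admit_new_pod(namespace_labels: dict, pod_level: str) -> bool:
    """Admit a *new* pod if its level meets or exceeds the enforced level.

    An absent enforce label defaults to "privileged" (no restriction).
    """
    enforced = namespace_labels.get("pod-security.kubernetes.io/enforce", "privileged")
    return LEVELS.index(pod_level) >= LEVELS.index(enforced)

# A namespace that opted into "baseline" before the upgrade:
ns = {"pod-security.kubernetes.io/enforce": "baseline"}
print(admit_new_pod(ns, "privileged"))  # False: new privileged pods are rejected
print(admit_new_pod(ns, "restricted"))  # True
print(admit_new_pod({}, "privileged"))  # True: no opt-in, nothing is blocked
```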
* **What specific metrics should inform a rollback?**
+ On a cluster that has not yet opted into enforcement, non-zero counts for either
+ of the following metrics mean the feature is not working as expected:
+
+ * `pod_security_evaluations_total{decision=deny,mode=enforce}`
+ * `pod_security_evaluations_total{decision=error,mode=enforce}`
+
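As a quick rollback check, those counters can be pulled from the kube-apiserver `/metrics` endpoint and summed; a minimal sketch (the parsing below is deliberately simplistic and assumes the standard Prometheus text exposition format):

```python
def enforce_problem_count(metrics_text: str) -> float:
    """Sum pod_security_evaluations_total samples with mode="enforce" and a
    decision of "deny" or "error"; a non-zero result on a cluster that has
    not opted into enforcement indicates the feature is misbehaving."""
    total = 0.0
    for line in metrics_text.splitlines():
        if not line.startswith("pod_security_evaluations_total{"):
            continue
        if 'mode="enforce"' not in line:
            continue
        if 'decision="deny"' not in line and 'decision="error"' not in line:
            continue
        # Sample value is the last whitespace-separated field.
        total += float(line.rsplit(" ", 1)[1])
    return total

sample = '''pod_security_evaluations_total{decision="allow",mode="enforce"} 120
pod_security_evaluations_total{decision="deny",mode="enforce"} 0
pod_security_evaluations_total{decision="error",mode="enforce"} 2'''
print(enforce_problem_count(sample))  # 2.0
```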
* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
- Describe manual testing that was done and the outcomes.
- Longer term, we may want to require automated upgrade/rollback tests, but we
- are missing a bunch of machinery and tooling and can't do that now.
- * **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
- fields of API types, flags, etc.?**
- Even if applying deprecation policies, they may still surprise some users.
+ * Manual upgrade of the control plane to a version with the feature enabled was tested.
+ Existing pods remained running. Creation of new pods in namespaces that did not opt into enforcement was unaffected.
+
+ * Manual downgrade of the control plane to a version with the feature disabled was tested.
+ Existing pods remained running. Creation of new pods in namespaces that had previously opted into enforcement was allowed once more.
+
+ * **Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?**
+
+ No.
### Monitoring Requirements
* **How can an operator determine if the feature is in use by workloads?**
- non-zero `pod_security_evaluations_total` metrics indicate the feature is in use
- * **What are the SLIs (Service Level Indicators) an operator can use to determine
- the health of the service?**
+ * **What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?**
- [x] Metrics
- Metric name: `pod_security_evaluations_total`
- Components exposing the metric: `kube-apiserver`
@@ -837,99 +845,91 @@ the health of the service?**
preventing a user or controller from successfully writing pods
* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
- At a high level, this usually will be in the form of "high percentile of SLI
- per day <= X". It's impossible to provide comprehensive guidance, but at the very
- high level (needs more precise definitions) those may be things like:
- - per-day percentage of API calls finishing with 5XX errors <= 1%
- - 99% percentile over day of absolute value from (job creation time minus expected
- job creation time) for cron job <= 10%
- - 99,9% of /health requests per day finish with 200 code
-
- * **Are there any missing metrics that would be useful to have to improve observability
- of this feature?**
- Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
- implementation difficulties, etc.).
+
+ - An error rate other than 0 means invalid policy levels or versions were configured
+ on a namespace prior to the feature having been enabled. Until this is corrected,
+ that namespace will use the latest version of the "restricted" policy for the mode
+ that specified an invalid level/version.
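The fallback behavior can be illustrated with a small sketch (hypothetical helper; the accepted label values here are an assumption based on the levels and version formats discussed in this document, not the canonical validation code):

```python
import re

VALID_LEVELS = {"privileged", "baseline", "restricted"}

def effective_policy(level: str, version: str) -> tuple:
    """Return the (level, version) a mode would use. An invalid level or
    version falls back to the latest "restricted" policy, as described above."""
    if level not in VALID_LEVELS or not re.fullmatch(r"v1\.[0-9]+|latest", version):
        return ("restricted", "latest")
    return (level, version)

print(effective_policy("baseline", "v1.22"))  # ('baseline', 'v1.22')
print(effective_policy("basline", "v1.22"))   # typo in level -> ('restricted', 'latest')
print(effective_policy("baseline", "vX"))     # bad version -> ('restricted', 'latest')
```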
- ### Dependencies
+ * **Are there any missing metrics that would be useful to have to improve observability of this feature?**
+
+ - None we are aware of
- _This section must be completed when targeting beta graduation to a release._
+ ### Dependencies
* **Does this feature depend on any specific services running in the cluster?**
- Think about both cluster-level services (e.g. metrics-server) as well
- as node-level agents (e.g. specific version of CRI). Focus on external or
- optional services that are needed. For example, if this feature depends on
- a cloud provider API, or upon an external software-defined storage or network
- control plane.
-
- For each of these, fill in the following—thinking about running existing user workloads
- and creating new ones, as well as about cluster-level services (e.g. DNS):
- - [Dependency name]
- - Usage description:
- - Impact of its outage on the feature:
- - Impact of its degraded performance or high-error rates on the feature:
+ * It exists in the kube-apiserver process and makes use of pre-existing
+ capabilities (etcd, namespace/pod informers) that are already inherent to the
+ operation of the kube-apiserver.
### Scalability
- _For alpha, this section is encouraged: reviewers should consider these questions
- and attempt to answer them._
-
- _For beta, this section is required: reviewers must answer these questions._
-
_For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field._
* **Will enabling / using this feature result in any new API calls?**
Describe them, providing:
- - Updating namespace labels will trigger a list of pods in that namespace. With the built-in
- admission plugin, this call will be local within the apiserver. There will be a hard cap on the
- number of pods analyzed, and a timeout for the review of those pods. See [Namespace policy
- update warnings](#namespace-policy-update-warnings).
+ - Updating namespace enforcement labels will trigger a list of pods in that namespace.
+ With the built-in admission plugin, this call will be local within the apiserver and will use the existing pod informer.
+ There will be a hard cap on the number of pods analyzed, and a timeout for the review of those pods
+ that ensures evaluation does not exceed a percentage of the time allocated to the request.
+ See [Namespace policy update warnings](#namespace-policy-update-warnings).
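The cap-and-timeout bound described above can be sketched as follows. This is an illustrative model only; the function names, cap, and budget values are assumptions, not the plugin's actual limits:

```python
import time

POD_CAP = 3000             # hypothetical hard cap on pods examined
TIME_BUDGET_SECONDS = 1.0  # hypothetical share of the request deadline

def evaluate_namespace_pods(pods, check):
    """Evaluate at most POD_CAP pods within TIME_BUDGET_SECONDS, returning
    the warnings gathered and whether evaluation was cut short."""
    deadline = time.monotonic() + TIME_BUDGET_SECONDS
    warnings, truncated = [], False
    for i, pod in enumerate(pods):
        if i >= POD_CAP or time.monotonic() > deadline:
            truncated = True  # stop early rather than stall the label update
            break
        if not check(pod):
            warnings.append(f"pod {pod['name']} violates the new policy")
    return warnings, truncated

pods = [{"name": f"pod-{i}", "privileged": i == 1} for i in range(3)]
warnings, truncated = evaluate_namespace_pods(pods, lambda p: not p["privileged"])
print(warnings)   # ['pod pod-1 violates the new policy']
print(truncated)  # False
```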
* **Will enabling / using this feature result in introducing new API types?**
- No.
- * **Will enabling / using this feature result in any new calls to the cloud
- provider?**
+ * **Will enabling / using this feature result in any new calls to the cloud provider?**
- No.
- * **Will enabling / using this feature result in increasing size or count of
- the existing API objects?**
+ * **Will enabling / using this feature result in increasing size or count of the existing API objects?**
Describe them, providing:
- API type(s): Namespaces
- Estimated increase in size: new labels, up to 300 bytes if all are provided
- Estimated amount of new objects: 0
- * **Will enabling / using this feature result in increasing time taken by any
- operations covered by [existing SLIs/SLOs]?**
- - This will require negligible additional work in Pod create/update admission. Namespace label
- updates may be heavier, but have limits in place.
+ * **Will enabling / using this feature result in increasing time taken by any operations covered by [existing SLIs/SLOs]?**
+ - This will require negligible additional work in Pod create/update admission.
+ - Namespace label updates may be heavier, but have limits in place.
- * **Will enabling / using this feature result in non-negligible increase of
- resource usage (CPU, RAM, disk, IO, ...) in any components?**
+ * **Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?**
- No. Resource usage will be negligible.
+ - Initial benchmark cost of pod admission to a fully privileged namespace (default on feature enablement without explicit opt-in)
+ - Time: 245.4 ns/op
+ - Memory: 112 B/op
+ - Allocs: 1 allocs/op
+ - Initial benchmark cost of pod admission to a namespace requiring both baseline and restricted evaluation
+ - Time: 4826 ns/op
+ - Memory: 4616 B/op
+ - Allocs: 22 allocs/op
### Troubleshooting
The Troubleshooting section currently serves the `Playbook` role. We may consider
splitting it into a dedicated `Playbook` document (potentially with some monitoring
details). For now, we leave it here.
- _This section must be completed when targeting beta graduation to a release._
-
* **How does this feature react if the API server and/or etcd is unavailable?**
+ - It blocks creation/update of Pod objects, which would have been unavailable anyway.
+
* **What are other known failure modes?**
- For each of them, fill in the following information by copying the below template:
- - [Failure mode brief description]
- - Detection: How can it be detected via metrics? Stated another way:
- how can an operator troubleshoot without logging into a master or worker node?
- - Mitigations: What can be done to stop the bleeding, especially for already
- running user workloads?
- - Diagnostics: What are the useful log messages and their required logging
- levels that could help debug the issue?
- Not required until feature graduated to beta.
- - Testing: Are there any tests for failure mode? If not, describe why.
+
+ - Invalid admission configuration
+ - Detection: API server will not start / is unavailable
+ - Mitigations: Disable the feature or fix the configuration
+ - Diagnostics: API server error log
+ - Testing: unit testing on configuration validation
+
+ - Enforce mode rejects pods because invalid level/version defaulted to `restricted` level
+ - Detection: rising `pod_security_evaluations_total{decision=error,mode=enforce}` metric counts
+ - Mitigations: correct the invalid level/version labels on the affected namespaces
+ - Diagnostics:
+ - Locate audit logs containing `pod-security.kubernetes.io/error` annotations on affected requests
+ - Locate namespaces with malformed level labels:
+ - `kubectl get ns --show-labels -l "pod-security.kubernetes.io/enforce,pod-security.kubernetes.io/enforce notin (privileged,baseline,restricted)"`
+ - Locate namespaces with malformed version labels:
+ - `kubectl get ns --show-labels -l pod-security.kubernetes.io/enforce-version | egrep -v 'pod-security.kubernetes.io/enforce-version=v1\.[0-9]+(,|$)'`
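The version-label shape that the `egrep` filter above targets can also be checked in code; a minimal sketch (the accepted set here mirrors that filter and is an assumption, not the canonical validation — note that, like the filter, it would also flag values such as `latest` if those are in use):

```python
import re

# Mirrors the egrep pattern above: v1.<minor> with a numeric minor version.
VERSION_RE = re.compile(r"^v1\.[0-9]+$")

def is_wellformed_enforce_version(value: str) -> bool:
    """True if a pod-security.kubernetes.io/enforce-version label value
    matches the v1.<minor> shape the egrep filter above looks for."""
    return bool(VERSION_RE.match(value))

for v in ["v1.22", "v1.9", "1.22", "v2.0"]:
    print(v, is_wellformed_enforce_version(v))
```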
* **What steps should be taken if SLOs are not being met to determine the problem?**