
Commit 342208c

2579: PRR answers for beta
1 parent 33f195b commit 342208c

1 file changed: +69 −69 lines changed

keps/sig-auth/2579-psp-replacement/README.md

@@ -798,30 +798,38 @@ _This section must be completed when targeting alpha to a release._
 
 ### Rollout, Upgrade and Rollback Planning
 
-_This section must be completed when targeting beta graduation to a release._
-
 * **How can a rollout fail? Can it impact already running workloads?**
-  Try to be as paranoid as possible - e.g., what if some components will restart
-  mid-rollout?
+
+  If `pod-security.kubernetes.io/enforce` labels are already present on namespaces,
+  upgrading to enable the feature could prevent new pods violating the opted-into
+  policy level from being created. Existing running pods would not be disrupted.
 
 * **What specific metrics should inform a rollback?**
 
+  On a cluster that has not yet opted into enforcement, non-zero counts for either
+  of the following metrics mean the feature is not working as expected:
+
+  * `pod_security_evaluations_total{decision=deny,mode=enforce}`
+  * `pod_security_evaluations_total{decision=error,mode=enforce}`
+
 * **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
-  Describe manual testing that was done and the outcomes.
-  Longer term, we may want to require automated upgrade/rollback tests, but we
-  are missing a bunch of machinery and tooling and can't do that now.
 
-* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
-fields of API types, flags, etc.?**
-  Even if applying deprecation policies, they may still surprise some users.
+  * Manual upgrade of the control plane to a version with the feature enabled was tested.
+    Existing pods remained running. Creation of new pods in namespaces that did not opt into enforcement was unaffected.
+
+  * Manual downgrade of the control plane to a version with the feature disabled was tested.
+    Existing pods remained running. Creation of new pods in namespaces that had previously opted into enforcement was allowed once more.
+
+* **Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?**
+
+  No.
 
 ### Monitoring Requirements
 
 * **How can an operator determine if the feature is in use by workloads?**
   - non-zero `pod_security_evaluations_total` metrics indicate the feature is in use
 
-* **What are the SLIs (Service Level Indicators) an operator can use to determine
-the health of the service?**
+* **What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?**
   - [x] Metrics
     - Metric name: `pod_security_evaluations_total`
     - Components exposing the metric: `kube-apiserver`
@@ -837,99 +845,91 @@ the health of the service?**
     preventing a user or controller from successfully writing pods
 
 * **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
-  At a high level, this usually will be in the form of "high percentile of SLI
-  per day <= X". It's impossible to provide comprehensive guidance, but at the very
-  high level (needs more precise definitions) those may be things like:
-  - per-day percentage of API calls finishing with 5XX errors <= 1%
-  - 99% percentile over day of absolute value from (job creation time minus expected
-    job creation time) for cron job <= 10%
-  - 99,9% of /health requests per day finish with 200 code
-
-* **Are there any missing metrics that would be useful to have to improve observability
-of this feature?**
-  Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
-  implementation difficulties, etc.).
+
+  - An error rate other than 0 means invalid policy levels or versions were configured
+    on a namespace prior to the feature having been enabled. Until this is corrected,
+    that namespace will use the latest version of the "restricted" policy for the mode
+    that specified an invalid level/version.
 
-### Dependencies
+* **Are there any missing metrics that would be useful to have to improve observability of this feature?**
+
+  - None we are aware of
 
-_This section must be completed when targeting beta graduation to a release._
+### Dependencies
 
 * **Does this feature depend on any specific services running in the cluster?**
-  Think about both cluster-level services (e.g. metrics-server) as well
-  as node-level agents (e.g. specific version of CRI). Focus on external or
-  optional services that are needed. For example, if this feature depends on
-  a cloud provider API, or upon an external software-defined storage or network
-  control plane.
-
-  For each of these, fill in the following—thinking about running existing user workloads
-  and creating new ones, as well as about cluster-level services (e.g. DNS):
-  - [Dependency name]
-    - Usage description:
-    - Impact of its outage on the feature:
-    - Impact of its degraded performance or high-error rates on the feature:
 
+  * It exists in the kube-apiserver process and makes use of pre-existing
+    capabilities (etcd, namespace/pod informers) that are already inherent to the
+    operation of the kube-apiserver.
 
 ### Scalability
 
-_For alpha, this section is encouraged: reviewers should consider these questions
-and attempt to answer them._
-
-_For beta, this section is required: reviewers must answer these questions._
-
 _For GA, this section is required: approvers should be able to confirm the
 previous answers based on experience in the field._
 
 * **Will enabling / using this feature result in any new API calls?**
   Describe them, providing:
-  - Updating namespace labels will trigger a list of pods in that namespace. With the built-in
-    admission plugin, this call will be local within the apiserver. There will be a hard cap on the
-    number of pods analyzed, and a timeout for the review of those pods. See [Namespace policy
-    update warnings](#namespace-policy-update-warnings).
+  - Updating namespace enforcement labels will trigger a list of pods in that namespace.
+    With the built-in admission plugin, this call will be local within the apiserver and will use the existing pod informer.
+    There will be a hard cap on the number of pods analyzed, and a timeout for the review of those pods
+    that ensures evaluation does not exceed a percentage of the time allocated to the request.
+    See [Namespace policy update warnings](#namespace-policy-update-warnings).
 
 * **Will enabling / using this feature result in introducing new API types?**
   - No.
 
-* **Will enabling / using this feature result in any new calls to the cloud
-provider?**
+* **Will enabling / using this feature result in any new calls to the cloud provider?**
   - No.
 
-* **Will enabling / using this feature result in increasing size or count of
-the existing API objects?**
+* **Will enabling / using this feature result in increasing size or count of the existing API objects?**
   Describe them, providing:
   - API type(s): Namespaces
   - Estimated increase in size: new labels, up to 300 bytes if all are provided
   - Estimated amount of new objects: 0
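The "up to 300 bytes" estimate above can be sanity-checked with quick arithmetic over the six pod-security label keys. A sketch, assuming `privileged` as the longest level value and `v1.22` as a representative version pin (both values are illustrative, not prescribed by this KEP):

```shell
# Rough arithmetic behind the "up to 300 bytes" namespace-label estimate.
prefix="pod-security.kubernetes.io/"
level="privileged"   # longest valid level value
version="v1.22"      # example version pin
total=0
for mode in enforce audit warn; do
  total=$((total + ${#prefix} + ${#mode} + ${#level}))                # e.g. enforce=privileged
  total=$((total + ${#prefix} + ${#mode} + 8 + ${#version}))          # "-version" suffix adds 8
done
echo "$total"   # comfortably under the 300-byte budget
```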

-* **Will enabling / using this feature result in increasing time taken by any
-operations covered by [existing SLIs/SLOs]?**
-  - This will require negligible additional work in Pod create/update admission. Namespace label
-    updates may heavier, but have limits in place.
+* **Will enabling / using this feature result in increasing time taken by any operations covered by [existing SLIs/SLOs]?**
+  - This will require negligible additional work in Pod create/update admission.
+  - Namespace label updates may be heavier, but have limits in place.
 
-* **Will enabling / using this feature result in non-negligible increase of
-resource usage (CPU, RAM, disk, IO, ...) in any components?**
+* **Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?**
   - No. Resource usage will be negligible.
+  - Initial benchmark cost of pod admission to a fully privileged namespace (the default on feature enablement without explicit opt-in):
+    - Time: 245.4 ns/op
+    - Memory: 112 B/op
+    - Allocs: 1 allocs/op
+  - Initial benchmark cost of pod admission to a namespace requiring both baseline and restricted evaluation:
+    - Time: 4826 ns/op
+    - Memory: 4616 B/op
+    - Allocs: 22 allocs/op

 ### Troubleshooting
 
 The Troubleshooting section currently serves the `Playbook` role. We may consider
 splitting it into a dedicated `Playbook` document (potentially with some monitoring
 details). For now, we leave it here.
 
-_This section must be completed when targeting beta graduation to a release._
-
 * **How does this feature react if the API server and/or etcd is unavailable?**
 
+  - It blocks creation/update of Pod objects, which would have been unavailable anyway.
+
 * **What are other known failure modes?**
-  For each of them, fill in the following information by copying the below template:
-  - [Failure mode brief description]
-    - Detection: How can it be detected via metrics? Stated another way:
-      how can an operator troubleshoot without logging into a master or worker node?
-    - Mitigations: What can be done to stop the bleeding, especially for already
-      running user workloads?
-    - Diagnostics: What are the useful log messages and their required logging
-      levels that could help debug the issue?
-      Not required until feature graduated to beta.
-    - Testing: Are there any tests for failure mode? If not, describe why.
+
+  - Invalid admission configuration
+    - Detection: API server will not start / is unavailable
+    - Mitigations: Disable the feature or fix the configuration
+    - Diagnostics: API server error log
+    - Testing: unit testing on configuration validation
+
+  - Enforce mode rejects pods because an invalid level/version defaulted to the `restricted` level
+    - Detection: rising `pod_security_evaluations_total{decision=error,mode=enforce}` metric counts
+    - Mitigations:
+    - Diagnostics:
+      - Locate audit logs containing `pod-security.kubernetes.io/error` annotations on affected requests
+      - Locate namespaces with malformed level labels:
+        - `kubectl get ns --show-labels -l "pod-security.kubernetes.io/enforce,pod-security.kubernetes.io/enforce notin (privileged,baseline,restricted)"`
+      - Locate namespaces with malformed version labels:
+        - `kubectl get ns --show-labels -l pod-security.kubernetes.io/enforce-version | egrep -v 'pod-security.kubernetes.io/enforce-version=v1\.[0-9]+(,|$)'`
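The second failure mode stems from the fallback described in the SLO answer: a malformed level value is evaluated against the `restricted` policy and surfaces as `decision=error`. A minimal sketch of that decision rule, written as a hypothetical shell helper rather than the actual admission plugin logic:

```shell
# Sketch of the fallback behavior described above: a malformed enforce level
# is treated as "restricted" (and would be counted under decision=error).
# Hypothetical helper, not the real admission plugin.
effective_level() {
  case "$1" in
    privileged|baseline|restricted) echo "$1" ;;
    *) echo "restricted" ;;   # fall back to the most restrictive level
  esac
}
effective_level baseline    # -> baseline
effective_level Baseline    # -> restricted (label values are case-sensitive)
```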

 * **What steps should be taken if SLOs are not being met to determine the problem?**
