Commit 1140e3b

tallclair and liggitt authored
[PodSecurity] Update monitoring proposal (#2990)
* [PodSecurity] Update monitoring proposal
* fixup! [PodSecurity] Update monitoring proposal
* Apply suggestions from code review

Co-authored-by: Jordan Liggitt <[email protected]>
Co-authored-by: Jordan Liggitt <[email protected]>
1 parent b6b7028 commit 1140e3b

2 files changed: +58 -14 lines changed

keps/sig-auth/2579-psp-replacement/README.md

Lines changed: 56 additions & 14 deletions
@@ -598,31 +598,67 @@ coverage of unit tests.

 ### Monitoring

-A single metric will be added to track policy evaluations against pods and [templated pods].
-[Namespace evaluations](#namespace-policy-update-warnings) are not counted.
+Three metrics will be introduced:

 ```
 pod_security_evaluations_total
 ```

+This metric will be added to track policy evaluations against pods and [templated pods].
+[Namespace evaluations](#namespace-policy-update-warnings) are not counted.
+The metric will only be incremented when the policy check is actually performed. In other words,
+this metric will not be incremented if any of the following are true:
+
+- Ignored resource types, subresources, or workload resources without a pod template
+- Update requests that are out of scope (see [Updates](#updates) above)
+- Exempt requests (these are reported in the `pod_security_exemptions_total` metric instead)
+- Errors that make policy evaluation impossible (these are reported in the `pod_security_exemptions_total` metric instead)
+
 The metric will use the following labels:

-1. `decision {exempt, allow, deny, error}` - The policy decision. Error is reserved for panics or
-other errors in policy evaluation. Update requests that are out of scope (see [Updates](#updates)
-above) are not counted.
+1. `decision {allow, deny}` - The policy decision. `allow` is only recorded with `enforce` mode.
 3. `policy_level {privileged, baseline, restricted}` - The policy level that the request was
 evaluated against.
 4. `policy_version {v1.X, v1.Y, latest, future}` - The policy version that was used for the evaluation.
 Explicit versions less than or equal to the build of the API server or webhook are recorded in the form `v1.x` (e.g. `v1.22`).
 Explicit versions greater than the build of the API server or webhook (which are evaluated as `latest`) are recorded as `future`.
 Explicit use of the `latest` version or implicit use by omitting a version or specifying an unparseable version will be recorded as `latest`.
 5. `mode {enforce, warn, audit}` - The type of evaluation mode being recorded. Note that a single
-request can increment this metric 3 times, once for each mode. If this admission controller is
-enabled, every every create request and in-scope update request will at least increment the
-`enforce` total. Privileged evaluations for warn and audit modes are not counted.
+request can increment this metric 3 times, once for each mode. `audit` and `warn` mode metrics
+are only incremented for violations. If this admission controller is enabled, every
+evaluated request will at least increment the `enforce` total.
 6. `request_operation {create, update}` - The operation of the request being checked.
 7. `resource {pod, controller}` - Whether the request object is a Pod, or a [templated
 pod](#podtemplate-resources) resource.
+8. `subresource {ephemeralcontainers}` - The subresource, when relevant & in scope.
+
+```
+pod_security_exemptions_total
+```
+
+This metric will be added to track requests that are considered exempt. Ignored resources and out of
+scope requests do not count towards the total. Errors encountered before the exemption logic will
+not be counted as exempt.
+
+The metric will use the following labels. The definitions match from the above label definitions.
+
+1. `request_operation {create, update}`
+2. `resource {pod, controller}`
+3. `subresource {ephemeralcontainers}`
+
+```
+pod_security_errors_total
+```
+
+This metric will be added to track errors encountered during request evaluation.
+
+The metric will use the following labels. The definitions match from the above label definitions.
+
+1. `fatal {true, false}` - Whether the error prevented evaluation (short-circuit deny). If
+`fatal=false` then the latest restricted profile may be used to evaluate the pod.
+2. `request_operation {create, update}`
+3. `resource {pod, controller}`
+4. `subresource {ephemeralcontainers}`

 ### Audit Annotations

@@ -810,7 +846,7 @@ _This section must be completed when targeting alpha to a release._
 of the following metrics mean the feature is not working as expected:

 * `pod_security_evaluations_total{decision=deny,mode=enforce}`
-* `pod_security_evaluations_total{decision=error,mode=enforce}`
+* `pod_security_errors_total`

 * **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**

@@ -831,15 +867,21 @@ _This section must be completed when targeting alpha to a release._

 * **What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?**
 - [x] Metrics
-- Metric name: `pod_security_evaluations_total`
+- Metric name: `pod_security_evaluations_total`, `pod_security_errors_total`
 - Components exposing the metric: `kube-apiserver`

 * **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
-- `pod_security_evaluations_total{decision=error}`
+- `pod_security_errors_total`
 - any rising count of these metrics indicates an unexpected problem evaluating the policy
-- `pod_security_evaluations_total{decision=error,mode=enforce}`
+- `pod_security_errors_total{fatal=true}`
 - any rising count of these metrics indicates an unexpected problem evaluating the policy that
 is preventing pod write requests
+- `pod_security_errors_total{fatal=false}`,
+`pod_security_evaluations_total{decision=deny,mode=enforce,level=restricted,version=latest}`
+- a rising count of non-fatal errors indicates an error resolving namespace policies, which
+causes PodSecurity to default to enforcing `restricted:latest`
+- a corresponding rise in `restricted:latest` denials may indicate that these errors are
+preventing pod write requests
 - `pod_security_evaluations_total{decision=deny,mode=enforce}`
 - a rising count indicates that the policy is preventing pod creation as intended, but is
 preventing a user or controller from successfully writing pods
@@ -922,8 +964,8 @@ details). For now, we leave it here.
 - Testing: unit testing on configuration validation

 - Enforce mode rejects pods because invalid level/version defaulted to `restricted` level
-- Detection: rising `pod_security_evaluations_total{decision=error,mode=enforce}` metric counts
-- Mitigations:
+- Detection: rising `pod_security_errors_total{fatal=false}` metric counts
+- Mitigations: fix the malformed labels
 - Diagnostics:
 - Locate audit logs containing `pod-security.kubernetes.io/error` annotations on affected requests
 - Locate namespaces with malformed level labels:

keps/sig-auth/2579-psp-replacement/kep.yaml

Lines changed: 2 additions & 0 deletions
@@ -53,3 +53,5 @@ disable-supported: true
 # The following PRR answers are required at beta release
 metrics:
 - pod_security_evaluations_total
+- pod_security_exemptions_total
+- pod_security_errors_total
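
To make the counting rules from the updated Monitoring section easier to follow, here is a minimal Go sketch of which series one in-scope request would increment. It is not the PodSecurity admission implementation: the `Outcome` type, the `Record` function, and the map-based `Counters` are hypothetical stand-ins, and only the `decision`, `mode`, and `fatal` labels are modeled (the other labels are omitted). The sketch also assumes that errors are counted in `pod_security_errors_total`, following that metric's own description.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
)

// Outcome is a hypothetical summary of how one in-scope request was handled.
type Outcome struct {
	Exempt         bool // the request matched an exemption
	Err            bool // an error occurred during evaluation
	Fatal          bool // the error prevented evaluation (short-circuit deny)
	EnforceDenied  bool // the enforce policy rejected the pod
	WarnViolation  bool // the warn policy was violated
	AuditViolation bool // the audit policy was violated
}

// Counters maps a series (metric name plus labels) to its count.
type Counters map[string]int

// evaluation builds the series key for pod_security_evaluations_total.
func evaluation(decision, mode string) string {
	return fmt.Sprintf(`pod_security_evaluations_total{decision=%q,mode=%q}`, decision, mode)
}

// Record increments the series that the proposal text describes for one request.
func Record(c Counters, o Outcome) {
	if o.Exempt {
		// Exempt requests are counted separately and never evaluated.
		c["pod_security_exemptions_total"]++
		return
	}
	if o.Err {
		c[fmt.Sprintf(`pod_security_errors_total{fatal=%q}`, strconv.FormatBool(o.Fatal))]++
		if o.Fatal {
			// Evaluation was impossible, so no evaluation is recorded.
			return
		}
	}
	// Every evaluated request increments the enforce total exactly once.
	if o.EnforceDenied {
		c[evaluation("deny", "enforce")]++
	} else {
		c[evaluation("allow", "enforce")]++
	}
	// warn and audit are only recorded for violations, so never as "allow".
	if o.WarnViolation {
		c[evaluation("deny", "warn")]++
	}
	if o.AuditViolation {
		c[evaluation("deny", "audit")]++
	}
}

func main() {
	c := Counters{}
	Record(c, Outcome{WarnViolation: true})    // allowed, but violates the warn level
	Record(c, Outcome{Exempt: true})           // exempt request
	Record(c, Outcome{Err: true, Fatal: true}) // evaluation impossible

	keys := make([]string, 0, len(c))
	for k := range c {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Println(k, c[k])
	}
}
```

A single request can touch up to three `pod_security_evaluations_total` series (one per mode), while exempt requests and fatal errors touch none of them, which is why the SLO entries above watch `pod_security_errors_total` separately.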
