Add opa policy to restrict PDBs, always allow at least 1 disruption #2459

Merged
viktor-f merged 2 commits into main from vf/restrict-pdb-policy
Mar 25, 2025

Conversation


@viktor-f viktor-f commented Mar 11, 2025

Warning

This is a public repository, ensure not to disclose:

  • personal data beyond what is necessary for interacting with this pull request, nor
  • business confidential information, such as customer names.

What kind of PR is this?

Required: Mark one of the following that is applicable:

  • kind/feature
  • kind/improvement
  • kind/deprecation
  • kind/documentation
  • kind/clean-up
  • kind/bug
  • kind/other

Optional: Mark one or more of the following that are applicable:

Important

Breaking changes should be marked kind/admin-change or kind/dev-change depending on type
Critical security fixes should be marked with kind/security

  • kind/admin-change
  • kind/dev-change
  • kind/security
  • [kind/adr](set-me)

Platform Administrator notice

A new gatekeeper policy has been added that will deny any PodDisruptionBudget and connected Pod controller if the PodDisruptionBudget does not allow at least 1 Pod disruption. Note that this applies in both sc and wc, and it also applies to namespaces even if they have the label owner=operator.

Application Developer notice

A new gatekeeper policy has been added that will deny any PodDisruptionBudget and connected Pod controller if the PodDisruptionBudget does not allow at least 1 Pod disruption.

What does this PR do / why do we need this PR?

This adds a new gatekeeper policy that will deny any PodDisruptionBudget and connected Pod controller if the PodDisruptionBudget does not allow at least 1 Pod disruption. Note that this applies in both sc and wc, and it also applies to namespaces even if they have the label owner=operator.

Pod controllers only include: Deployment, ReplicaSet, StatefulSet, and ReplicationController.

The general logic for this is:

```
If creating or modifying a PDB:
  If maxUnavailable == 0 or "0%":
    deny request
  If minAvailable is set:
    If any matching pod controller:
      If minAvailable >= number of replicas:
        deny request
If creating or modifying a pod controller:
  If any matching PDB:
    Apply the PDB logic above
```

In sentences this means that we want to stop either PDBs or pod controllers if together they do not allow for any disruption. If the PDB uses maxUnavailable, this happens when it is set to 0 or 0%, regardless of the replicas in the pod controller. If the PDB uses minAvailable, this happens when it is equal to or higher than the replicas in the pod controller. The policy includes logic to stop requests that would create or edit either PDBs or pod controllers.
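As a concrete illustration (resource names here are hypothetical, not taken from the PR), the following pair would be denied, since minAvailable equals the Deployment's replica count and therefore allows zero disruptions:

```yaml
# Hypothetical example of a PDB/Deployment pair the policy would deny:
# minAvailable: 2 with replicas: 2 leaves no room for a single disruption.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: myapp
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 2
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: nginx
```

Changing minAvailable to 1 (or using maxUnavailable: 1 instead) would make the pair pass.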

There is also an extra check that allows ReplicaSets to violate this policy if they are controlled by a Deployment. This is because any matching PDB will look at all of the Pods for the Deployment, not any single ReplicaSet.

The logic has been based on an upstream policy, but it has almost entirely been reworked. The logic of matching selectors has been based on the logic in our networkpolicy gatekeeper policy.

I was planning on adding more checks to this. But this took longer to implement than planned, so I'm now planning on turning the rest into a separate task. The things that were not implemented are:

  • Denying pods without pod controllers that are matching a PDB. This is not a pattern we want to allow.
  • Denying PDBs and matching daemonsets/jobs/cronjobs if the PDB is matching a daemonsets/jobs/cronjobs. This is not a pattern we want to allow.
  • Denying PDBs and pods/podcontrollers if the PDB is matching multiple pod controllers or pods without controllers. This is not something we want to allow because "The eviction API will disallow eviction of any pod covered by multiple PDBs, so most users will want to avoid overlapping selectors." (ref)

This PR adds more resources that will be synced (cached) by gatekeeper. This is needed so that we can compare PDBs to pod controllers (otherwise the policy only has access to the object that is being validated).
But adding resources here will increase the resource usage (primarily memory) for gatekeeper. Syncing PDBs will likely not increase the usage significantly, since there are relatively few PDBs in a cluster. However, pod controllers are a lot more common and will noticeably increase the memory usage, but as I show below I think that the usage has not increased so much that it should prevent us from adding this feature. I did some testing to see how much the resource usage increased. Note that some of the testing included pods, because that would be needed for one of the extra features mentioned above that I was planning on adding but am now skipping.
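For reference, resource syncing in gatekeeper is configured through a Config resource along these lines (a sketch, assuming the default gatekeeper-system namespace; the exact resource list in this PR may differ):

```yaml
apiVersion: config.gatekeeper.sh/v1alpha1
kind: Config
metadata:
  name: config
  namespace: gatekeeper-system
spec:
  sync:
    syncOnly:
      # PDBs and the pod controllers the policy compares them against
      - group: "policy"
        version: "v1"
        kind: "PodDisruptionBudget"
      - group: "apps"
        version: "v1"
        kind: "Deployment"
      - group: "apps"
        version: "v1"
        kind: "StatefulSet"
      - group: "apps"
        version: "v1"
        kind: "ReplicaSet"
      - group: ""
        version: "v1"
        kind: "ReplicationController"
```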

Number of resources in the cluster:

  • Pods: 240
  • Deployments: 55
  • StatefulSets: 10
  • ReplicaSets: 73
  • ReplicationControllers: 0

Note: many small pod manifests were used in this test; actual usage with "real pods" is likely higher (tested in scenarios 4 and 5 below).

  1. Without pods or controllers: Gatekeeper controllers 140 MB; Gatekeeper audit 160 MB, 215 MB spikes
  2. With controllers: Gatekeeper controllers 160 MB; Gatekeeper audit 175 MB idle, 250 MB spikes
  3. With pods and controllers: Gatekeeper controllers 180 MB; Gatekeeper audit 200 MB idle, 300 MB spikes
  4. With pods and controllers and extra annotations (to get larger pod manifests): Gatekeeper controllers 190 MB; Gatekeeper audit 210 MB idle, 320 MB spikes
  5. With pods and controllers and extra annotations on a larger cluster (1000 pods, 140 deployments, 160 replicasets): Gatekeeper controllers 290 MB idle, 370 MB spikes; Gatekeeper audit 310 MB idle, 520 MB spikes

Even though this is part of an autoscaling task, this policy is good to have for any type of cluster, since PDBs can disrupt any scenario where we want to drain nodes or similar.

Information to reviewers

As of creating the PR I have not yet added tests or public documentation. I will work on that next, but the code and config should be ok to review for now.

This PR also included some whitespace fixes that I stumbled upon. I hope it is ok that it is fixed in this PR. Otherwise let me know and I will move that to a separate PR.

Checklist

  • Proper commit message prefix on all commits
  • Change checks:
    • The change is transparent
    • The change is disruptive
    • The change requires no migration steps
    • The change requires migration steps
    • The change updates CRDs
    • The change updates the config and the schema
  • Documentation checks:
  • Metrics checks:
    • The metrics are still exposed and present in Grafana after the change
    • The metrics names didn't change (Grafana dashboards and Prometheus alerts required no updates)
    • The metrics names did change (Grafana dashboards and Prometheus alerts required an update)
  • Logs checks:
    • The logs do not show any errors after the change
  • PodSecurityPolicy checks:
    • Any changed Pod is covered by Kubernetes Pod Security Standards
    • Any changed Pod is covered by Gatekeeper Pod Security Policies
    • The change does not cause any Pods to be blocked by Pod Security Standards or Policies
  • NetworkPolicy checks:
    • Any changed Pod is covered by Network Policies
    • The change does not cause any dropped packets in the NetworkPolicy Dashboard
  • Audit checks:
    • The change does not cause any unnecessary Kubernetes audit events
    • The change requires changes to Kubernetes audit policy
  • Falco checks:
    • The change does not cause any alerts to be generated by Falco
  • Bug checks:
    • The bug fix is covered by regression tests

@viktor-f viktor-f added the app/opa-gatekeeper Open Policy Agent Gatekeeper label Mar 11, 2025
@viktor-f viktor-f self-assigned this Mar 11, 2025
@viktor-f viktor-f requested a review from a team as a code owner March 11, 2025 10:53
@davidumea davidumea changed the title Add opa policy to restrics PDBs, allways allow at least 1 disruption Add opa policy to restrict PDBs, always allow at least 1 disruption Mar 11, 2025

@Xartos Xartos left a comment


Super nice! This should definitely fix some issues we have in some clusters


davidumea commented Mar 19, 2025

What happens if a pdb that violates any of the rules already exists? Will it have to be manually removed?

@viktor-f
Contributor Author

What happens if a pdb that violates any of the rules already exists? Will it have to be manually removed?

That is a good point. The PDB will be left in place, but you cannot edit it, or any deployment, in a way that continues to violate these rules. There should not be any issue with modifying the PDB or deployment so that it becomes valid. Alternatively, remove the PDB and start over.
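For example (hypothetical names), an existing violating PDB with minAvailable: 2 over a 2-replica deployment could be edited into a valid state like this:

```yaml
# Hypothetical remediation: switching to maxUnavailable: 1 (or lowering
# minAvailable below the replica count) allows one disruption, so the
# PDB and its deployment can be edited again without being denied.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: myapp
```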

@davidumea
Contributor

Denying PDBs and matching daemonsets/jobs/cronjobs if the PDB is matching a daemonsets/jobs/cronjobs. This is not a pattern we want to allow.

This was one of the patterns you mentioned that was not included in this PR, was it written correctly? If yes, could you expand?


I like that the PR description is very extensive and informative, thanks for that! I think it would be nice if you also wrote, in text format, which patterns to not allow were added in this PR.

This adds a new gatekeeper policy that will deny any PodDisruptionBudget and connected Pod controller if the PodDisruptionBudget does not allow at least 1 Pod disruption.

I think this is good information but it doesn't explain what is actually happening under the hood 🙂

@viktor-f
Contributor Author

Denying PDBs and matching daemonsets/jobs/cronjobs if the PDB is matching a daemonsets/jobs/cronjobs. This is not a pattern we want to allow.

This was one of the patterns you mentioned that was not included in this PR, was it written correctly? If yes, could you expand?

I think the text is accurate, but maybe not very easy to read. Regardless, the idea is that I think PDBs should not be allowed for DaemonSets, CronJobs, and Jobs.
Part of the reasoning is that PDBs do not have full functionality on these resources (or on any resource that does not implement the scale API).
For DaemonSets it is also odd to have a PDB, since the number of pods changes with the number of nodes, and the number of nodes might not be controlled by the users that specify the PDB.
For Jobs and CronJobs the purpose is for the jobs to complete at some point; the number of pods then naturally decreases and will eventually violate the PDB.

Does that make sense?

I like that the PR description is very extensive and informative, thanks for that! I think it would be nice if you also wrote, in text format, which patterns to not allow were added in this PR.

This adds a new gatekeeper policy that will deny any PodDisruptionBudget and connected Pod controller if the PodDisruptionBudget does not allow at least 1 Pod disruption.

I think this is good information but it doesn't explain what is actually happening under the hood 🙂

Thanks, I will try to clarify this.

@viktor-f
Contributor Author

Thanks, I will try to clarify this.

I have now updated the PR description with some more details. Please let me know if this explains it well or if I should clarify further.

Contributor

@davidumea davidumea left a comment


This PR also included some whitespace fixes that I stumbled upon. I hope it is ok that it is fixed in this PR. Otherwise let me know and I will move that to a separate PR.

I think this is fine, thanks for the cleanup!

I have now updated the PR description with some more details. Please let me know if this explains it well or if I should clarify further.

Thanks it's crystal clear now!

I plan on continuing my review tomorrow

@viktor-f viktor-f force-pushed the vf/restrict-pdb-policy branch from e6ee32a to 8d6c09e Compare March 20, 2025 16:05
@viktor-f
Contributor Author

Added a bunch of tests now. I have probably missed some tests, but this should cover most things. I at least have full code coverage:

```
opa test restrict-pod-disruption-budgets.rego tests/restrict-pod-disruption-budgets.rego -v -c
{
  "files": {
    "restrict-pod-disruption-budgets.rego": {
      ...
      "covered_lines": 98,
      "coverage": 100
    },
    "tests/restrict-pod-disruption-budgets.rego": {
      ...
      "covered_lines": 257,
      "coverage": 100
    }
  },
  "covered_lines": 355,
  "not_covered_lines": 0,
  "coverage": 100
}
```

Comment on lines +71 to +73
```rego
input_wrap(obj) = input {
  input := {"review": {"object": obj}}
}
```
Contributor


Nice detail 🙂

Comment on lines +79 to +83
```rego
pod_controller_groups_kinds := [
  {"group": "apps/v1", "kind": "Deployment"},
  {"group": "apps/v1", "kind": "StatefulSet"},
  {"group": "apps/v1", "kind": "ReplicaSet"},
  {"group": "v1", "kind": "ReplicationController"}
]
```
Contributor


You only use the group here in one place (below); you might want to split these into separate functions, one for group and one for kind.

```rego
objs := [controllers | controllers := data.inventory.namespace[pdb.metadata.namespace][pod_controller_group_kind.group][pod_controller_group_kind.kind]]
```

Or do they need to be together to make sure it's the same object? If I understand it correctly, it takes one object at a time, so even if the information were fetched from two functions that shouldn't be a problem.

Contributor Author


The idea was to ensure that it just uses these pairs of group and kind. This was the easiest way that I could think of to do this, but there are probably other ways.
I could also just have two separate lists of groups and kinds and then let it go through all combinations. I assume that is slightly less efficient, but probably not significantly.

IMO the current version feels nice to read and understand which groups and kinds belong to each other. But I think we should use the version that most people find easiest to read and understand. So I'm ok with changing it if that is what you and others want.

Contributor

@davidumea davidumea Mar 21, 2025


Okay, I would be fine with leaving it, but I don't think it reads super well when referencing this: pod_controller_group_kind.group

Contributor Author


I will keep it like this for now then.

Contributor

@davidumea davidumea left a comment


I think the code looks good. Really nice work 🙂

I assume the plan is to add public docs and update the links here before you want to merge this?

@viktor-f
Contributor Author

Public docs PR is now up as well: https://github.com/elastisys/welkin/pull/1073
The code in this PR has been updated with links to the new public docs page (they will not work until the public docs page is merged).
So I think that should be the last thing needed, except for any additional comments you reviewers might have.

Contributor

@davidumea davidumea left a comment


Really nice start on this 🙌

@viktor-f viktor-f force-pushed the vf/restrict-pdb-policy branch from 1f6d705 to 80516c9 Compare March 25, 2025 10:21
@viktor-f viktor-f linked an issue Mar 25, 2025 that may be closed by this pull request
@viktor-f
Contributor Author

Task for the scenarios that were not covered in this PR: https://github.com/elastisys/welkin-apps/issues/68

@viktor-f viktor-f merged commit 7ebda68 into main Mar 25, 2025
12 checks passed
@viktor-f viktor-f deleted the vf/restrict-pdb-policy branch March 25, 2025 10:42

Development

Successfully merging this pull request may close these issues.

[3] Add Safeguard for pods that prevent CAPI cluster-autoscaler
