Conversation

@MirahImage
Member

@MirahImage MirahImage commented Oct 1, 2025

This closes #1910 and #1960

Adds a securityContext to the cluster operator Deployment spec.
Adds a securityContext to the RabbitMQ Pods, containers, and init containers.
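
For reviewers' reference, here is a rough sketch of the kind of fields involved. The authoritative defaults are in the diff; of these, fsGroup, runAsUser, and the seccomp profile come up in the discussion below, and the remaining fields are illustrative only.

# Sketch only; see the PR diff for the actual defaults.
# Pod-level (StatefulSet pod template):
securityContext:
  fsGroup: 0
# Container and init-container level:
securityContext:
  runAsUser: 999   # the rabbitmq user in the container image
  runAsNonRoot: true
  allowPrivilegeEscalation: false
  capabilities:
    drop:
    - ALL
  seccompProfile:
    type: RuntimeDefault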

@MirahImage MirahImage changed the title from "Add security context to manager deployment spec." to "WIP DO NOT MERGE Add security context" Oct 1, 2025
@MirahImage MirahImage changed the title from "WIP DO NOT MERGE Add security context" to "Add security context" Oct 1, 2025
@MirahImage MirahImage marked this pull request as ready for review October 1, 2025 13:18
@MirahImage MirahImage requested a review from Zerpet October 1, 2025 13:18
@MirahImage
Member Author

Given that the security context appears in 4 different places (once in the operator deployment itself, then in three places in the cluster statefulset), I suspect we may need to improve our documentation about overrides for OpenShift.

@Zerpet
Member

Zerpet commented Oct 2, 2025

Docs regarding OpenShift will need some updating. I'm going to run this locally in CRC, but we likely need to override the RunAsUser bit in favour of the FSGroup.

@MirahImage
Member Author

Let me know if you've figured out the necessary OpenShift overrides (or just push a commit to the branch).

@Zerpet
Member

Zerpet commented Oct 6, 2025

I had to switch to something else on Friday. I'm looking into it now.

@Zerpet
Member

Zerpet commented Oct 6, 2025

Indeed, it doesn't like RunAsUser: 999 or FSGroup: 0.

STS condition

Warning  FailedCreate      8m27s (x16 over 11m)  statefulset-controller  create Pod rabbitmq-server-0 in StatefulSet rabbitmq-server failed error: pods "rabbitmq-server-0" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, provider restricted-v2: .spec.securityContext.fsGroup: Invalid value: []int64{0}: 0 is not an allowed group, provider restricted-v2: .initContainers[0].runAsUser: Invalid value: 999: must be in the ranges: [1000660000, 1000669999], provider restricted-v2: .containers[0].runAsUser: Invalid value: 999: must be in the ranges: [1000660000, 1000669999], provider "restricted": Forbidden: not usable by user or serviceaccount, provider "nonroot-v2": Forbidden: not usable by user or serviceaccount, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid-v2": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork-v2": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "hostpath-provisioner": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount]

The error describes the problem:

provider restricted-v2: 
  .spec.securityContext.fsGroup: Invalid value: []int64{0}: 0 is not an allowed group, 
provider restricted-v2: 
  .initContainers[0].runAsUser: Invalid value: 999: must be in the ranges: [1000660000, 1000669999], 
provider restricted-v2: 
  .containers[0].runAsUser: Invalid value: 999: must be in the ranges: [1000660000, 1000669999]

This can be worked around with the suggestion from our docs, which essentially removes the Pod security context. With the security contexts still set on the containers, that's enough to pass the "restricted" validation.
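
For reference, the docs suggestion boils down to an override that empties the Pod-level security context, roughly like this sketch (the cluster name is illustrative):

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: example
spec:
  override:
    statefulSet:
      spec:
        template:
          spec:
            # Drop the Pod-level runAsUser/fsGroup defaults so that
            # OpenShift's restricted-v2 SCC can assign valid values.
            securityContext: {}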

A potential improvement would be to set the problematic fields to nil, or not set them at all. If left unset, OpenShift does the right thing and assigns a valid value. We could customise the behaviour for OpenShift when an annotation, e.g. rabbitmq.com/platform-is-openshift: some-value-doesnt-matter, is present in the RabbitmqCluster object.
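
For illustration, the hypothetical annotation would sit on the RabbitmqCluster metadata, something like this (name and value are made up; nothing here is implemented):

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: example
  annotations:
    # Hypothetical marker: the operator would leave runAsUser/fsGroup
    # unset when it sees this annotation.
    rabbitmq.com/platform-is-openshift: "true"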

I'll leave it to you to decide if you want to implement the annotation behaviour.

@MirahImage MirahImage merged commit cc7a165 into main Oct 6, 2025
13 checks passed
@MirahImage MirahImage deleted the add-security-context branch October 6, 2025 12:59
@Zerpet Zerpet added this to the 2.17.0 milestone Oct 7, 2025
@regressEdo

regressEdo commented Oct 15, 2025

For the cluster operator deployment, shouldn't this be split between pod and container securityContext? Example:

spec:
  template:
    spec:
      securityContext:
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: operator
          ...
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
              - ALL

@Zerpet
Member

Zerpet commented Oct 17, 2025

Not necessarily, because the operator has only one container and no documented/supported sidecar configurations. RabbitMQ, on the other hand, has two containers (init + rabbit), and it has the override feature, which can include additional containers. For RabbitMQ, the Pod security context will "catch" those containers coming from the override.
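
As a sketch (the sidecar here is hypothetical), a container added through the override picks up the Pod-level security context because it doesn't set its own:

spec:
  override:
    statefulSet:
      spec:
        template:
          spec:
            containers:
            - name: extra-sidecar            # hypothetical user-added container
              image: example.org/exporter:latest
              # No container-level securityContext, so the Pod-level
              # securityContext applies to this container as well.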

@bo0ts

bo0ts commented Dec 9, 2025

Quick feedback on this change: this broke all our RabbitMQ deployments in OpenShift and OKD.

We deploy RabbitMQ instances by adding the anyuid permission to the rabbitmq service account (this is the workaround from back when the images couldn't run under an arbitrary user id), and after this change all RabbitMQ pods are blocked because they set the seccomp profile.
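
For context, the grant is the usual SCC-via-RBAC pattern, roughly like this sketch (names here are from our setup and illustrative):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: rabbitmq-anyuid
rules:
- apiGroups: ["security.openshift.io"]
  resources: ["securitycontextconstraints"]
  resourceNames: ["anyuid"]
  verbs: ["use"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: rabbitmq-anyuid
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: rabbitmq-anyuid
subjects:
- kind: ServiceAccount
  name: rabbitmq-server   # the service account the RabbitMQ pods run as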

I understand the OpenShift security model is a bit awkward, but that caught us by surprise in a minor version operator update.

@Zerpet
Member

Zerpet commented Dec 9, 2025

That's quite surprising, because I explicitly tested this change in OpenShift Local (formerly CodeReady Containers), and I only encountered a problem with the run-as-user and fs-group values, as described here: #1961 (comment)

However, after applying the recommendation for OpenShift in our docs [1], RabbitMQ started without any issues. From your comment, I understand that you were applying a different workaround, and this change caused an unexpected breakage.

I still think a minor version bump is appropriate, because the change doesn't break upstream Kubernetes (our main audience), and OCP users applying the documented workaround shouldn't experience breakage.

I'll add some notes to the release notes to mention this potential pitfall.

[1] https://www.rabbitmq.com/kubernetes/operator/using-on-openshift#arbitrary-user-ids

@bo0ts

bo0ts commented Dec 9, 2025

You would eventually have run into the seccomp problem if you had only removed the problematic values instead of the entire security context. The biggest problem in transitioning from the old workaround to the now-recommended one is that once you start the RabbitMQ instances with user 999, they refuse to come up when switching to the randomized UID. Any idea how to fix that?

The main issue is that our workaround used to be the recommended method once upon a time: rabbitmq/rabbitmq-website@a2180ef
