Skip to content

RabbitmqCluster resources status does not consider terminating pods #1993

@WinterNis

Description

@WinterNis

Describe the bug

Context: I have a RabbitmqCluster resources (configured with 3 replicas in our case)

When a pod of the resulting stateful set is in Terminating state, the status of the RabbitmqCluster resource does not change, even if the pod is stuck in terminating.

For example, this can happen if the pod deletion is "blocked" (for good reason) by the pre stop rabbitmq-diagnostics check_if_node_is_quorum_critical command.

This is a real problem when deploying RabbitmqCluster through argocd, as the application will be seen as healthy but one pod is currently stuck in terminating. The RabbitmqCluster resource should reflect the status of the pod/stateful set.

To Reproduce

  • Create a RabbitmqCluster (with 3 replicas for example)

  • Then, have one pod of the stateful set in terminating status. In our case, we can reproduce it easily by removing one member from one of our quorum queue (with rabbitmq-queues delete_member), and then then delete one of the two pods that is remaining in the queue.
    Ex: if I have a queue replicated on rabbitmq-server-0, rabbitmq-server-1 and rabbitmq-server-2 with rabbitmq-server-2 as leader, I delete_member rabbitmq-server-1 and then delete rabbitmq-server-0

  • The rabbitmq-server-0 pod deletion will be "blocked" by the rabbitmq-diagnostics check_if_node_is_quorum_critical (which is expected)

Then check the status of the pods:

NAME                READY   STATUS        RESTARTS   AGE
rabbitmq-server-0   1/1     Terminating   0          11d
rabbitmq-server-1   1/1     Running       0          11d
rabbitmq-server-2   1/1     Running       0          11d

And check the status of the RabbitmqCluster resource:

status:
  binding:
    name: rabbitmq-default-user
  conditions:
    - lastTransitionTime: '2025-10-24T12:27:04Z'
      reason: AllPodsAreReady
      status: 'True'
      type: AllReplicasReady
    - lastTransitionTime: '2025-09-13T11:22:32Z'
      reason: AtLeastOneEndpointAvailable
      status: 'True'
      type: ClusterAvailable
    - lastTransitionTime: '2025-09-09T16:25:40Z'
      reason: NoWarnings
      status: 'True'
      type: NoWarnings
    - lastTransitionTime: '2025-10-24T11:05:13Z'
      message: Finish reconciling
      reason: Success
      status: 'True'
      type: ReconcileSuccess
  defaultUser:
    secretReference:
      keys:
        password: password
        username: username
      name: rabbitmq-default-user
      namespace: [REDACTED]
    serviceReference:
      name: rabbitmq
      namespace: [REDACTED]
  observedGeneration: 11

Expected behavior

I would expect the RabbitmqCluster resource status condition to reflect the status of the terminating pod, with, at least, the AllReplicasReady condition set to false

Version and environment information

  • RabbitMQ: 4.1.4
  • RabbitMQ Cluster Operator: 2.16.0
  • Kubernetes: v1.32.2
  • Cloud provider or hardware configuration: Talos Linux

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions