Describe the bug
Context: I have a RabbitmqCluster resource (configured with 3 replicas in our case).
When a pod of the resulting stateful set is in the Terminating state, the status of the RabbitmqCluster resource does not change, even if the pod is stuck terminating.
For example, this can happen when the pod deletion is "blocked" (for good reason) by the preStop `rabbitmq-diagnostics check_if_node_is_quorum_critical` command.
This is a real problem when deploying a RabbitmqCluster through Argo CD: the application is reported as healthy even though one pod is stuck terminating. The RabbitmqCluster resource should reflect the status of the pod/stateful set.
To Reproduce
- Create a `RabbitmqCluster` (with 3 replicas, for example).
- Have one pod of the stateful set in Terminating status. In our case, we can reproduce it easily by removing one member from one of our quorum queues (with `rabbitmq-queues delete_member`), and then deleting one of the two pods remaining in the queue.
  Ex: if I have a queue replicated on `rabbitmq-server-0`, `rabbitmq-server-1`, and `rabbitmq-server-2`, with `rabbitmq-server-2` as leader, I `delete_member rabbitmq-server-1` and then delete `rabbitmq-server-0`.
- The `rabbitmq-server-0` pod deletion will be "blocked" by `rabbitmq-diagnostics check_if_node_is_quorum_critical` (which is expected).
Then check the status of the pods:
```
NAME                READY   STATUS        RESTARTS   AGE
rabbitmq-server-0   1/1     Terminating   0          11d
rabbitmq-server-1   1/1     Running       0          11d
rabbitmq-server-2   1/1     Running       0          11d
```
And check the status of the RabbitmqCluster resource:
```yaml
status:
  binding:
    name: rabbitmq-default-user
  conditions:
    - lastTransitionTime: '2025-10-24T12:27:04Z'
      reason: AllPodsAreReady
      status: 'True'
      type: AllReplicasReady
    - lastTransitionTime: '2025-09-13T11:22:32Z'
      reason: AtLeastOneEndpointAvailable
      status: 'True'
      type: ClusterAvailable
    - lastTransitionTime: '2025-09-09T16:25:40Z'
      reason: NoWarnings
      status: 'True'
      type: NoWarnings
    - lastTransitionTime: '2025-10-24T11:05:13Z'
      message: Finish reconciling
      reason: Success
      status: 'True'
      type: ReconcileSuccess
  defaultUser:
    secretReference:
      keys:
        password: password
        username: username
      name: rabbitmq-default-user
      namespace: [REDACTED]
    serviceReference:
      name: rabbitmq
      namespace: [REDACTED]
  observedGeneration: 11
```
Expected behavior
I would expect the RabbitmqCluster resource's status conditions to reflect the terminating pod, with at least the `AllReplicasReady` condition set to `False`.
Version and environment information
- RabbitMQ: 4.1.4
- RabbitMQ Cluster Operator: 2.16.0
- Kubernetes: v1.32.2
- Cloud provider or hardware configuration: Talos Linux