-
Hi, we deployed a 3-replica RabbitMQ cluster (version 3.10.5) using the Bitnami chart. If any one of the 3 pods is down, the cluster's queues are not accessible, so I am not sure how HA is supposed to work. We enabled autoheal and rebalance, but when one pod is evicted the queues are not taken over by the other 2 pods; clients get errors, and things only return to normal once the failed pod rejoins the cluster. How can I keep things working when 1 pod out of a 3-pod RabbitMQ cluster fails?
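For context, a setup like the one described would presumably be driven by chart values along these lines. This is a hypothetical sketch, not the actual values used; the key names (`clustering.partitionHandling`, `clustering.rebalance`) are assumed from the Bitnami chart and may differ by chart version, so check your chart's values.yaml:

```yaml
# Hypothetical values.yaml for the Bitnami rabbitmq chart (key names assumed,
# verify against your chart version). This mirrors the setup described above.
replicaCount: 3

clustering:
  enabled: true
  # Intended to rebalance queue leaders after a node (re)joins the cluster.
  rebalance: true
  # Assumed chart default; autoheal only acts once a partition has healed,
  # not when it is detected.
  partitionHandling: autoheal
```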
Replies: 2 comments
-
We cannot suggest much without more details. Quorum queues and streams tolerate failures of a minority of replicas. However, we don't know what partition handling strategy this chart may be using; those strategies can get in the way of the standard Raft recovery and leader election procedures used by QQs and streams. By 4.0 the partition handling strategies will be gone, but right now they play a role, positive or negative. You can also deploy things on Kubernetes in a way that restarts all pods when one of them goes down; obviously that would be completely ill-suited to a distributed stateful data service such as RabbitMQ. See the server logs and the effective node configuration for clues.
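As a side note, the minority-failure tolerance mentioned above applies to quorum queues (and streams), not to classic queues, so it is worth checking how the queues were declared. Below is a minimal sketch using the Python pika client; the host and queue names are placeholders:

```python
import pika

# Placeholder connection details; point this at your RabbitMQ Service.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="my-rabbitmq"))
channel = connection.channel()

# Quorum queues must be durable and are selected via the x-queue-type argument.
# On a 3-node cluster the queue gets a replica on each node, so it can keep
# serving publishes and deliveries while any single node is down.
channel.queue_declare(
    queue="orders",  # placeholder name
    durable=True,
    arguments={"x-queue-type": "quorum"},
)

connection.close()
```

You can confirm the type of existing queues with `rabbitmqctl list_queues name type`, and inspect the effective node configuration mentioned above with `rabbitmq-diagnostics environment`.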
-
In our production-ready example using the operator (i.e. not the Bitnami chart), we set the partition handling strategy to pause_minority, so that the majority partition continues to function when the minority becomes uncontactable. In your case, that would mean the other two Pods continue to serve traffic, and the Service that load-balances across them would only route traffic to those two Pods. It looks like the chart uses autoheal by default. Autoheal takes effect when a partition is recovered, rather than when it is detected, which might explain why you're not seeing high availability with that strategy: it is designed to aid consistency, not availability. If you prefer availability, I would recommend pause_minority as the partition handling strategy instead.
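To make that concrete, this is the underlying RabbitMQ setting the strategy maps to. How you deliver it with the Bitnami chart (a dedicated clustering value versus extraConfiguration) depends on the chart version, so treat that part as an assumption and check the chart's values.yaml:

```ini
# rabbitmq.conf
# With pause_minority, a node that finds itself on the minority side of a
# partition pauses itself; the majority (2 of 3 nodes here) keeps serving.
cluster_partition_handling = pause_minority
```

Note that pause_minority only helps while a majority of nodes remains up, so it is a sensible fit for a 3-node (or larger, odd-sized) cluster.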