-
Hi, we deployed a 3-replica RabbitMQ cluster (version 3.10.5) using the Bitnami chart. If any one of the 3 pods is down, the cluster's queues are not accessible, so I am not sure how HA is supposed to work. We enabled autoheal and rebalance, but when one pod is evicted the queues are not taken over by the other 2 pods; clients get errors, and things only return to normal once the failed pod rejoins the cluster. How can I keep things working when 1 pod out of a 3-pod RabbitMQ cluster fails?
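For context, a setup like the one described would presumably be driven by chart values along these lines. This is a hypothetical sketch, not the actual values used; the key names (`clustering.partitionHandling`, `clustering.rebalance`) are assumed from the Bitnami chart and may differ by chart version, so check your chart's values.yaml:

```yaml
# Hypothetical values.yaml for the Bitnami rabbitmq chart (key names assumed,
# verify against your chart version). This mirrors the setup described above.
replicaCount: 3

clustering:
  enabled: true
  # Intended to rebalance queue leaders after a node (re)joins the cluster.
  rebalance: true
  # Assumed chart default; autoheal only acts once a partition has healed,
  # not when it is detected.
  partitionHandling: autoheal
```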
Replies: 2 comments
-
We cannot suggest much without more details. Quorum queues and streams tolerate failures of a minority of replicas. However, we don't know what partition handling strategy this chart may be using; those strategies can get in the way of the standard Raft recovery and leader election procedures used by QQs and streams. By 4.0 the partition handling strategies will be gone, but right now they play a role, positive or negative. You can also deploy things on Kubernetes in a way that restarts all pods when one of them goes down; obviously that would be completely ill-suited to a distributed stateful data service such as RabbitMQ. See the server logs and the effective node configuration for clues.
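As a side note, the minority-failure tolerance mentioned above applies to quorum queues (and streams), not to classic queues, so it is worth checking how the queues were declared. Below is a minimal sketch using the Python pika client; the host and queue names are placeholders:

```python
import pika

# Placeholder connection details; point this at your RabbitMQ Service.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="my-rabbitmq"))
channel = connection.channel()

# Quorum queues must be durable and are selected via the x-queue-type argument.
# On a 3-node cluster the queue gets a replica on each node, so it can keep
# serving publishes and deliveries while any single node is down.
channel.queue_declare(
    queue="orders",  # placeholder name
    durable=True,
    arguments={"x-queue-type": "quorum"},
)

connection.close()
```

You can confirm the type of existing queues with `rabbitmqctl list_queues name type`, and inspect the effective node configuration mentioned above with `rabbitmq-diagnostics environment`.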
-
In our production-ready example using the operator (i.e. not the Bitnami chart), we set the partition handling strategy to pause_minority, so that the majority partition continues to function when the minority becomes uncontactable. In your case, that would mean the other two Pods continue to serve traffic, and the Service that load-balances across them would only route traffic to those two Pods. It looks like the chart uses autoheal by default. Autoheal takes effect when a partition is recovered, rather than when it is detected, which might explain why you're not seeing high availability with that strategy: it is designed to aid consistency, not availability. If you prefer availability, I would recommend pause_minority as the partition handling strategy instead.
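To make that concrete, this is the underlying RabbitMQ setting the strategy maps to. How you deliver it with the Bitnami chart (a dedicated clustering value versus extraConfiguration) depends on the chart version, so treat that part as an assumption and check the chart's values.yaml:

```ini
# rabbitmq.conf
# With pause_minority, a node that finds itself on the minority side of a
# partition pauses itself; the majority (2 of 3 nodes here) keeps serving.
cluster_partition_handling = pause_minority
```

Note that pause_minority only helps while a majority of nodes remains up, so it is a sensible fit for a 3-node (or larger, odd-sized) cluster.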