3.11: two Raft replicas are in the timeout state, one is a candidate #10934
-
**Describe the bug**

We have a 3-node RabbitMQ cluster deployed on Azure AKS, RabbitMQ version 3.11.24. Two Raft replicas are in the timeout state and one is a candidate.

**Reproduction steps**

Create a 3-node RabbitMQ cluster on AKS.

**Expected behavior**

The cluster should work and the Raft state should not go into the timeout state.

**Additional context**

No response
-
RabbitMQ 3.11 is out of community support. There will be no guidance besides the absolute basics.
Node identity matters for Raft-based features. If you (or AKS) remove a node, it must be removed explicitly; simply deleting a pod is not enough, and it can eventually have side effects on the rest of the cluster, similar to what you are observing: the replicas cannot elect a leader. #10786 is one example of how removing nodes aggressively during a grow-then-shrink upgrade can leave certain queues or streams unable to elect a new leader, because their original peers are gone without explicit removal. Please take it from here.
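For reference, a minimal sketch of what "explicit removal" looks like with the standard CLI tools. The node name `rabbit@old-node` is a placeholder, and exact command behavior should be checked against the docs for your specific 3.11.x patch release:

```shell
# Run from a surviving cluster node, BEFORE the pod/VM is destroyed.

# 1. Move quorum queue / stream replicas off the departing node,
#    so no Raft member set is left pointing at a vanished peer.
rabbitmq-queues shrink rabbit@old-node

# 2. Tell the remaining nodes to forget the departing member,
#    so cluster metadata no longer lists it.
rabbitmqctl forget_cluster_node rabbit@old-node

# 3. Verify cluster membership and quorum queue health afterwards.
rabbitmqctl cluster_status
rabbitmq-queues quorum_status my-queue   # placeholder queue name
```

If the pod is already gone, `forget_cluster_node` can still be run from a remaining node, but any quorum queues whose majority lived on removed members may need their membership repaired (or the queues recreated) before they can elect a leader again.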