Finding the root cause of leadership election #8693
-
RabbitMQ 3.9 is out of community support. If you have a paid support subscription with VMware, please file a ticket. You could have easily checked what is or isn't logged: debug logging can be enabled temporarily using CLI tools, see rabbitmqctl help. This is what it logged at debug log level for a quorum queue:
[...]
RabbitMQ nodes cannot know the root cause; all they can tell is that an election took place. Also note that even 3.9.29 ships with Ra (our Raft implementation) version 2.0, while the latest is 2.6.
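For reference, a minimal sketch of toggling the level with rabbitmqctl (assuming a version where rabbitmqctl set_log_level is available, and run against each node you want to inspect):

    rabbitmqctl set_log_level debug
    # ...wait for or reproduce a leader election...
    rabbitmqctl set_log_level info

Because set_log_level changes the level at runtime, no node restart is needed and the level can be reverted as soon as an election has been captured.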
-
Thank you anyway for your answer @michaelklishin. I could have tried the debug logging, but as we are already facing performance issues I didn't want to take the risk, and, as said, we're not facing the issue in our test environment.
-
Thank you @michaelklishin. Well, that's stating the obvious when talking about the difference between an acceptance and a production environment. One other thing we tried in the meantime is disabling the virus scanner (Trellix/McAfee). When we do this, the number of leadership changes drops to almost 0. We've double-checked that we excluded the RabbitMQ-related directories and ran a debugger on the on-access scan, and this all looks fine: no RabbitMQ files are touched by the on-access scanner.
-
Hi,
We have a 3-node cluster running RabbitMQ 3.9.15 and are experiencing up to 40 leadership changes each day on our quorum queues. I would like to find the root cause of these leader changes. According to the Raft documentation, there are four potential root causes:
Fail-stop (crash): A cluster participant does not recover from a crash and cannot rejoin the cluster.
Fail-recover: A cluster participant leaves the cluster, but returns after an arbitrary time. This can be caused by processing delays, networking delays, or network partition.
Network partition: A cluster participant is separated from the others due to a failure in (part of) the network or a change in network topology.
Byzantine failure: A cluster participant shows arbitrary or malicious behavior and sends contradictory or conflicting data to other participants.
The normal log level (info) does not include the reason for the leader election. In general, I observe the following log lines:
[...] candidate -> leader in term: 324 machine version: 1[...]
[...] granting vote for [...]
[...] detected a new leader [...]
[...] leader saw request_vote_rpc from [...]
We only see this behavior on our production system. I'm thinking of turning on debug logging temporarily, but want to make sure it will achieve my goal. Does debug logging include the reason why a leader election was started?
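For reference, a rough way to filter the election-related lines out of a node's log file (the log path and node name below are illustrative, not taken from the thread; adjust them to your installation):

    # hypothetical log location; substitute your node's actual log file
    grep -E "candidate -> leader|granting vote|detected a new leader|request_vote_rpc" /var/log/rabbitmq/rabbit@node1.log

Counting the matches per day gives a quick measure of how often elections are happening on a given node.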