Finding the root cause of leadership election #8693
-
RabbitMQ 3.9 is out of community support. If you have a paid support subscription with VMware, please file a ticket. You could have easily checked what is or isn't logged: debug logging can be enabled temporarily using CLI tools, see rabbitmqctl help. This is what it logged at debug log level for a quorum queue:
[...]
RabbitMQ nodes cannot know the root cause; all they can tell is that an election took place. Also note that even 3.9.29 ships with Ra (our Raft implementation) version 2.0, while the latest is 2.6.
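For reference, a minimal sketch of toggling the level with rabbitmqctl (assuming a version where rabbitmqctl set_log_level is available, and run against each node you want to inspect):

    rabbitmqctl set_log_level debug
    # ...wait for or reproduce a leader election...
    rabbitmqctl set_log_level info

Because set_log_level changes the level at runtime, no node restart is needed and the level can be reverted as soon as an election has been captured.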
-
Thank you anyway for your answer @michaelklishin. I could have tried the debug logging, but as we are already facing performance issues I didn't want to take the risk, and, as said, we're not facing the issue in our test environment.
-
Thank you @michaelklishin. Well, that's stating the obvious when talking about the difference between an acceptance and a production environment. One other thing we tried in the meantime is disabling the virus scanner (Trellix/McAfee). When we do this, the number of leadership changes drops to almost 0. We've double-checked that we excluded the RabbitMQ-related directories and ran a debugger on the on-access scan, and this all looks fine: no RabbitMQ files are touched by the on-access scanner.
-
Hi,
We have a 3-node cluster running RabbitMQ 3.9.15 and are experiencing up to 40 leadership changes each day on our quorum queues. I would like to find the root cause of these leader changes. According to the Raft documentation, there are four potential root causes:
Fail-stop (crash): A cluster participant does not recover from a crash and cannot rejoin the cluster.
Fail-recover: A cluster participant leaves the cluster, but returns after an arbitrary time. This can be caused by processing delays, networking delays, or network partition.
Network partition: A cluster participant is separated from the others due to a failure in (part of) the network or a change in network topology.
Byzantine failure: A cluster participant shows arbitrary or malicious behavior and sends contradictory or conflicting data to other participants.
The normal log level (info) does not include the reason for the leader election. In general, I observe the following log lines:
[...] candidate -> leader in term: 324 machine version: 1[...]
[...] granting vote for [...]
[...] detected a new leader [...]
[...] leader saw request_vote_rpc from [...]
We only see this behavior on our production system. I'm thinking of turning on debug logging temporarily, but want to make sure it will achieve my goal. Does debug logging include the reason why a leader election was started?
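For reference, a rough way to filter the election-related lines out of a node's log file (the log path and node name below are illustrative, not taken from the thread; adjust them to your installation):

    # hypothetical log location; substitute your node's actual log file
    grep -E "candidate -> leader|granting vote|detected a new leader|request_vote_rpc" /var/log/rabbitmq/rabbit@node1.log

Counting the matches per day gives a quick measure of how often elections are happening on a given node.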