Identify root cause of a partition issue and detect partitions early #4911
Unanswered
mohankumar27
asked this question in Questions
RabbitMQ version: 3.9.8
Erlang version: 23.2.7
Hi everyone.
We recently faced a network partition in our 7-node RabbitMQ cluster. The partition occurred after one of the nodes became unresponsive under high CPU usage. We recovered manually by killing the RabbitMQ process on the unresponsive node and performing a rolling restart of all the nodes.
I was wondering whether there is any way to determine the root cause of a partition from the RabbitMQ logs, Erlang runtime metrics, or Prometheus-exposed metrics. Specifically, I'd like to understand why that particular node ran at high CPU and became unresponsive, and more generally how to diagnose any other condition that can lead to a partition.
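To illustrate the kind of monitoring I have in mind, here is a minimal sketch that polls per-node runtime stats through the management HTTP API. It assumes the management plugin is enabled on the default port 15672; the host and credentials are placeholders:

```python
# Minimal sketch: poll per-node runtime stats from the RabbitMQ
# management HTTP API as a rough early-warning signal for CPU and
# resource pressure. Assumes the management plugin is enabled on
# the default port 15672; host and credentials are placeholders.
import requests

MGMT_URL = "http://localhost:15672"   # placeholder host
AUTH = ("monitoring", "secret")       # placeholder credentials

def node_pressure():
    nodes = requests.get(f"{MGMT_URL}/api/nodes", auth=AUTH, timeout=5).json()
    for n in nodes:
        # run_queue is the Erlang scheduler run queue length; a value
        # that stays high suggests the node is CPU-saturated.
        print(
            f"{n['name']}: run_queue={n.get('run_queue')} "
            f"mem_used={n.get('mem_used')} fd_used={n.get('fd_used')} "
            f"running={n.get('running')}"
        )

if __name__ == "__main__":
    node_pressure()
```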
To add to that, the node that became unresponsive was running all of the shovels in the cluster, more than 60 of them.
Is there also a way to detect an impending partition early so that we can act on it accordingly, given that we have not set up any automatic partition handling?
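Something like the following sketch is what I'd like to automate: each node object returned by `/api/nodes` carries a `partitions` list and a `running` flag, so polling it should flag a partition as soon as any node reports one (again, host and credentials below are placeholders, not production code):

```python
# Sketch: flag a partition as soon as any node reports one.
# The management API's /api/nodes response includes a "partitions"
# list per node (peers it considers itself partitioned from) and a
# "running" flag. Placeholder host/credentials as above.
import time
import requests

MGMT_URL = "http://localhost:15672"   # placeholder host
AUTH = ("monitoring", "secret")       # placeholder credentials

def check_partitions():
    nodes = requests.get(f"{MGMT_URL}/api/nodes", auth=AUTH, timeout=5).json()
    alerts = []
    for n in nodes:
        if n.get("partitions"):
            alerts.append(f"{n['name']} is partitioned from {n['partitions']}")
        if not n.get("running", True):
            alerts.append(f"{n['name']} is reported as not running")
    return alerts

while True:
    for alert in check_partitions():
        print("ALERT:", alert)   # hook up real paging/alerting here
    time.sleep(30)
```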
Any help regarding this would be appreciated.
Thanks