Identify root cause of a partition issue and detect partitions early #4911
Unanswered
mohankumar27
asked this question in Questions
RabbitMQ version: 3.9.8
Erlang version: 23.2.7
Hi everyone.
We recently faced a network partition in our 7-node RabbitMQ cluster. The partition occurred after one of the nodes became unresponsive under high CPU usage. We recovered manually by killing the RabbitMQ process on the unresponsive node and performing a rolling restart of all the nodes.
I was wondering whether there is any way to determine the root cause of a partition from the RabbitMQ logs, Erlang runtime metrics, or Prometheus-exposed metrics. Specifically, I'd like to understand why that particular node ran at high CPU and became unresponsive, and more generally how to diagnose any other condition that can lead to a partition.
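To illustrate the kind of monitoring I have in mind, here is a minimal sketch that polls per-node runtime stats through the management HTTP API. It assumes the management plugin is enabled on the default port 15672; the host and credentials are placeholders:

```python
# Minimal sketch: poll per-node runtime stats from the RabbitMQ
# management HTTP API as a rough early-warning signal for CPU and
# resource pressure. Assumes the management plugin is enabled on
# the default port 15672; host and credentials are placeholders.
import requests

MGMT_URL = "http://localhost:15672"   # placeholder host
AUTH = ("monitoring", "secret")       # placeholder credentials

def node_pressure():
    nodes = requests.get(f"{MGMT_URL}/api/nodes", auth=AUTH, timeout=5).json()
    for n in nodes:
        # run_queue is the Erlang scheduler run queue length; a value
        # that stays high suggests the node is CPU-saturated.
        print(
            f"{n['name']}: run_queue={n.get('run_queue')} "
            f"mem_used={n.get('mem_used')} fd_used={n.get('fd_used')} "
            f"running={n.get('running')}"
        )

if __name__ == "__main__":
    node_pressure()
```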
To add to that, the node that became unresponsive was running all of the shovels in the cluster, more than 60 of them.
Is there also a way to detect an impending partition early so that we can act on it accordingly, given that we have not set up any automatic partition handling?
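Something like the following sketch is what I'd like to automate: each node object returned by `/api/nodes` carries a `partitions` list and a `running` flag, so polling it should flag a partition as soon as any node reports one (again, host and credentials below are placeholders, not production code):

```python
# Sketch: flag a partition as soon as any node reports one.
# The management API's /api/nodes response includes a "partitions"
# list per node (peers it considers itself partitioned from) and a
# "running" flag. Placeholder host/credentials as above.
import time
import requests

MGMT_URL = "http://localhost:15672"   # placeholder host
AUTH = ("monitoring", "secret")       # placeholder credentials

def check_partitions():
    nodes = requests.get(f"{MGMT_URL}/api/nodes", auth=AUTH, timeout=5).json()
    alerts = []
    for n in nodes:
        if n.get("partitions"):
            alerts.append(f"{n['name']} is partitioned from {n['partitions']}")
        if not n.get("running", True):
            alerts.append(f"{n['name']} is reported as not running")
    return alerts

while True:
    for alert in check_partitions():
        print("ALERT:", alert)   # hook up real paging/alerting here
    time.sleep(30)
```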
Any help regarding this would be appreciated.
Thanks