Seeking help with network partition-induced operational problems #11073
-
Hello RabbitMQ Community,

I am reaching out for guidance regarding several pressing issues we have been experiencing in our organization. We have an extensive RabbitMQ setup, with over 600 clusters of 3 nodes each, spread across different datacenters. Currently, we are running RabbitMQ 3.11.20 on Erlang 25.3.2.4 on VMs and are preparing to upgrade to 3.13.

We have been facing three main issues:

Network partitions: We encounter network partitions daily across all clusters. Often the root cause is unknown, but we have observed that even a brief 2-second VM pause for snapshotting triggers one. We have set `net_ticktime` to 120, which has improved the situation. We initially used the `pause_minority` strategy but switched to `autoheal`, as it proved to be less destructive (the relevant settings are sketched below).

Quorum queues: After a network partition, we notice an increase in segment files for quorum queues, occupying hundreds of gigabytes of disk space and causing the quorum queue to enter a NaN state. During this segment file buildup, we usually experience consumer connection failures, including complete crashes, timeouts, or an inability to consume from the queue entirely.

Classic queues: After network partitions, classic mirrored queues frequently go out of sync. Despite having policies for automatic synchronisation, triggering queue sync via the admin console is unresponsive. The only workaround we have found is to restart the nodes that are in the sync minority.

Due to these ongoing issues, our management is starting to believe that RabbitMQ might not be capable of running stably in our environment, and we have even begun migrating to Kafka. We would greatly appreciate any insights or suggestions for possible fixes to these issues. We believe in the power of RabbitMQ and want to ensure we're using it to its fullest potential.

Thank you for your time and assistance.
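For context, here is roughly what the relevant part of our `rabbitmq.conf` looks like (a sketch of the settings described above, not a recommendation):

```
# Partition-handling settings currently in use (new-style rabbitmq.conf)
net_ticktime = 120
cluster_partition_handling = autoheal

# Previously in use; proved more destructive in our environment:
# cluster_partition_handling = pause_minority
```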
-
RabbitMQ 3.11 is out of community support. If you have a commercial support subscription, please file a support ticket instead.

There have been dozens of important changes around quorum queues since 3.11.x; see the release notes. #7370 is one good example.

Classic mirrored queue behavior will not be investigated because that feature has been deprecated for years and is scheduled for removal in RabbitMQ 4.x (and …
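One illustrative way to find which policies still enable classic queue mirroring (anything that sets `ha-mode`) before migrating away from them:

```
# List policies for a vhost and look for "ha-mode" in the definition column;
# any policy that sets ha-mode applies classic queue mirroring.
rabbitmqctl list_policies -p / | grep ha-mode

# Repeat for every vhost (rabbitmqctl list_vhosts enumerates them).
```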
-
@kulbalgis in general, the core team cannot suggest much when all we have is a couple of sentences for each problem. We work with complete logs from all nodes and relevant metrics of multiple kinds.

Please upgrade to 3.13.1, stop using classic mirrored queues ASAP, and disable "VM snapshotting", which results in nodes (correctly) observing their peers as unavailable for a period of time.

You may also consider using Khepri with 3.13.x. It is an experimental feature, but its primary objective is reliable, Raft-based (same as quorum queues and streams) recovery and data safety for the schema data store. That is optional: some users observe meaningful improvements for certain operational issues, as well as fairly subtle behavior changes that may or may not affect your applications.

When Khepri becomes the default in 4.x, the partition handling strategies will also be removed, which should make RabbitMQ's recovery from network failures fundamentally no different from other Raft-based systems (in terms of predictability, data safety and quorum availability).
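For reference, Khepri in 3.13.x is opt-in and is enabled via a feature flag, roughly like this (check the feature flags documentation for your exact version; enabling it cannot be reverted on that cluster, so try it in a test environment first):

```
# Opt into the Khepri-based metadata store (experimental in 3.13.x)
rabbitmqctl enable_feature_flag khepri_db

# Confirm the flag state afterwards
rabbitmqctl list_feature_flags
```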
-
Finally, I cannot tell what "N clusters spread across different data centers" means specifically. Nodes within a single RabbitMQ cluster must not be placed in different data centers unless the links between them are stable and have very low latency. For inter-cluster communication, there are Federation, Shovels and the Warm Standby Replication feature in the commercial edition.

I missed this at first, and it is very important: "VM snapshotting" is an extremely likely cause of what RabbitMQ observes as a network partition, and what Raft-based features observe as a situation that may require a leader election, which clients may observe as a delay of certain operations on the queues. Usually it should last fractions of a second or single-digit seconds if your nodes have enough resources. Clusters spanning data centers and regions are going to experience a lot more partitions, both real and false positives.

RabbitMQ logs a certain type of warning when its runtime reports a network communication delay, and there are directly relevant Grafana dashboards to be used with [Prometheus or other compatible tools](https://www.rabbitmq.com/docs/prometheus).
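On the monitoring side, the metrics those dashboards rely on come from the built-in Prometheus plugin; a minimal setup looks roughly like this (15692 is the default metrics port):

```
# Enable the Prometheus endpoint on every node
rabbitmq-plugins enable rabbitmq_prometheus

# Quick sanity check from a node
curl -s http://localhost:15692/metrics | head
```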