Seeking help with network partition-induced operational problems #11073
-
Hello RabbitMQ Community,

I am reaching out for guidance regarding several pressing issues we have been experiencing in our organization. We have an extensive RabbitMQ setup, with over 600 clusters of 3 nodes each, spread across different datacenters. Currently, we are running RabbitMQ 3.11.20 on Erlang 25.3.2.4 on VMs and are preparing to upgrade to 3.13.

We have been facing three main issues:

Network partitions: We encounter network partitions daily across all clusters. Often the root cause is unknown, but we have observed that even a brief 2-second VM pause for snapshotting triggers one. We have set `net_ticktime` to 120, which has improved the situation. We initially used the `pause_minority` strategy but switched to `autoheal`, as it proved to be less destructive (the relevant settings are sketched below).

Quorum queues: After a network partition, we notice an increase in segment files for quorum queues, occupying hundreds of gigabytes of disk space and causing the quorum queue to enter a NaN state. During this segment file buildup, we usually experience consumer connection failures, including complete crashes, timeouts, or an inability to consume from the queue entirely.

Classic queues: After network partitions, classic mirrored queues frequently go out of sync. Despite having policies for automatic synchronisation, triggering queue sync via the admin console is unresponsive. The only workaround we have found is to restart the nodes that are in the sync minority.

Due to these ongoing issues, our management is starting to believe that RabbitMQ might not be capable of running stably in our environment, and we have even begun migrating to Kafka. We would greatly appreciate any insights or suggestions for possible fixes to these issues. We believe in the power of RabbitMQ and want to ensure we're using it to its fullest potential.

Thank you for your time and assistance.
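For context, here is roughly what the relevant part of our `rabbitmq.conf` looks like (a sketch of the settings described above, not a recommendation):

```
# Partition-handling settings currently in use (new-style rabbitmq.conf)
net_ticktime = 120
cluster_partition_handling = autoheal

# Previously in use; proved more destructive in our environment:
# cluster_partition_handling = pause_minority
```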
-
RabbitMQ 3.11 is out of community support. If you have a commercial support subscription, please file a support ticket instead.

There have been dozens of important changes around quorum queues since 3.11.x; see the release notes. #7370 is one good example.

Classic mirrored queue behavior will not be investigated because that feature has been deprecated for years and is scheduled for removal in RabbitMQ 4.x (and …
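One illustrative way to find which policies still enable classic queue mirroring (anything that sets `ha-mode`) before migrating away from them:

```
# List policies for a vhost and look for "ha-mode" in the definition column;
# any policy that sets ha-mode applies classic queue mirroring.
rabbitmqctl list_policies -p / | grep ha-mode

# Repeat for every vhost (rabbitmqctl list_vhosts enumerates them).
```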
-
@kulbalgis in general, the core team cannot suggest much when all we have is a couple of sentences for each problem. We work with complete logs from all nodes and relevant metrics of multiple kinds.

Please upgrade to 3.13.1, stop using classic mirrored queues ASAP, and disable "VM snapshotting", which results in nodes (correctly) observing their peers as unavailable for a period of time.

You may also consider using Khepri with 3.13.x. It is an experimental feature, but its primary objective is reliable, Raft-based (same as quorum queues and streams) recovery and data safety for the schema data store. That is optional: some users observe meaningful improvements for certain operational issues, as well as fairly subtle behavior changes that may or may not affect your applications.

When Khepri becomes the default in 4.x, the partition handling strategies will also be removed, which should make RabbitMQ's recovery from network failures fundamentally no different from other Raft-based systems (in terms of predictability, data safety and quorum availability).
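For reference, Khepri in 3.13.x is opt-in and is enabled via a feature flag, roughly like this (check the feature flags documentation for your exact version; enabling it cannot be reverted on that cluster, so try it in a test environment first):

```
# Opt into the Khepri-based metadata store (experimental in 3.13.x)
rabbitmqctl enable_feature_flag khepri_db

# Confirm the flag state afterwards
rabbitmqctl list_feature_flags
```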
-
Finally, I cannot tell what "N clusters spread across different data centers" means specifically. Nodes within a single RabbitMQ cluster must not be placed in different data centers unless the links between them are stable and have very low latency. For inter-cluster communication, there are Federation, Shovels and the Warm Standby Replication feature in the commercial edition.

I missed this at first, and it is very important: "VM snapshotting" is an extremely likely cause of what RabbitMQ observes as a network partition, and what Raft-based features observe as a situation that may require a leader election, which clients may observe as a delay of certain operations on the queues. Usually it should last fractions of a second or single-digit seconds if your nodes have enough resources. Clusters spanning data centers and regions are going to experience a lot more partitions, both real and false positives.

RabbitMQ logs a certain type of warning when its runtime reports a network communication delay, and there are directly relevant Grafana dashboards to be used with [Prometheus or other compatible tools](https://www.rabbitmq.com/docs/prometheus).
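On the monitoring side, the metrics those dashboards rely on come from the built-in Prometheus plugin; a minimal setup looks roughly like this (15692 is the default metrics port):

```
# Enable the Prometheus endpoint on every node
rabbitmq-plugins enable rabbitmq_prometheus

# Quick sanity check from a node
curl -s http://localhost:15692/metrics | head
```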