network partitioning breakage with release 3.13.6 #11847

hillen · 2024-07-28T01:05:29Z


I previously had an issue with release 3.13.4, described in discussion #11712.

When 3.13.5 was released, everything seemed to be fixed, at least in the few upgrades that I tried.

Now, after upgrading from 3.13.5 to 3.13.6, I am getting a different breakage: the cluster reports a network partition. This happens after the upgrade, when doing a rollout restart of the statefulset.

I am seeing lines like this in the attached logs:
2024-07-28 00:38:25.963849+00:00 [error] <0.1115.0> Mnesia('[email protected]'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, '[email protected]'}
2024-07-28 00:38:25.963849+00:00 [error] <0.1115.0>
2024-07-28 00:38:25.965275+00:00 [info] <0.1711.0> Autoheal request sent to '[email protected]'
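
For anyone going through the attached logs, here is a minimal sketch of how the partition events could be pulled out of a node's log file. It assumes the log line layout shown in the excerpt above; the function names and the node names in the comments are hypothetical and only serve as illustration.

```python
import re
from collections import Counter

# Matches the Mnesia event from the excerpt above. The layout
# (timestamp, "[error]", then the inconsistent_database tuple) is
# assumed from those lines and may differ with other log settings.
PARTITION_RE = re.compile(
    r"^(?P<ts>\S+ \S+) \[error\] .*"
    r"\{inconsistent_database,\s*(?P<kind>\w+),\s*'(?P<peer>[^']*)'\}"
)


def find_partition_events(lines):
    """Return (timestamp, kind, peer) for every inconsistent_database line."""
    events = []
    for line in lines:
        match = PARTITION_RE.search(line)
        if match:
            events.append(
                (match.group("ts"), match.group("kind"), match.group("peer"))
            )
    return events


def count_by_peer(events):
    """Count how often each peer node was named in a partition event."""
    return Counter(peer for _, _, peer in events)


# Possible usage with one of the attached files:
#   with open("rabbitmq-0.txt") as f:
#       print(count_by_peer(find_partition_events(f)))
```

Comparing the timestamps and peers across the three node logs may show whether the events line up with each pod restart during the rollout.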

The strange thing is that the rebalance reports that it rebalanced across all nodes in the cluster, and running the rabbitmq-queues quorum_status command on every RabbitMQ node shows consistent tables.

The only workaround I have found is to restart the nodes that the portal shows as partitioned from rabbitmq-0.

I am attaching the logs from all three nodes and a screen capture of the network partitioning message in the portal.
rabbitmq-2.txt
rabbitmq-1.txt
rabbitmq-0.txt

michaelklishin (Maintainer) · 2024-07-28T21:38:01Z

inconsistent_database is a message from Mnesia and has nothing to do with quorum queues or the Raft changes in 3.13.6. So most likely this is something that has coincided with a node restart during the upgrade.

We don't have enough specifics to pinpoint the root cause. The only fundamental solution is to use Khepri, which will be considered a mature option by 4.0 this fall.

2 replies

hillen Jul 28, 2024
Author

I agree that it does not appear to be a quorum queue or Raft issue.

However, I had never seen this before the upgrade to 3.13.6, and it does not happen only during the upgrade. After I restart the nodes reported as partitioned, which clears the partition warning, I can run kubectl rollout restart statefulset rabbitmq -n rabbitmq and the problem comes back; it has happened each of the few times I have tried. The node or nodes that report it vary from run to run.

This looks like a false report of network partitioning, given what the quorum queue status and rebalance commands return.
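
One way to compare what the portal shows with what each node itself reports is to read the partitions field from rabbitmq-diagnostics cluster_status. The sketch below is only an illustration: it assumes the JSON formatter output has a "partitions" key that maps a node name to the list of peers it considers partitioned, and that kubectl can reach the pods in the rabbitmq namespace. The function names are hypothetical.

```python
import json
import subprocess


def partitioned_nodes(cluster_status_json):
    """Return {node: [peers]} for nodes that report a non-empty partition list.

    Assumption: "partitions" is a JSON object keyed by node name.
    """
    status = json.loads(cluster_status_json)
    partitions = status.get("partitions") or {}
    return {node: peers for node, peers in partitions.items() if peers}


def fetch_cluster_status(pod, namespace="rabbitmq"):
    """Run cluster_status inside a pod and return its JSON output as text."""
    result = subprocess.run(
        ["kubectl", "exec", "-n", namespace, pod, "--",
         "rabbitmq-diagnostics", "cluster_status", "--formatter", "json"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Possible usage after each rollout restart:
#   for pod in ("rabbitmq-0", "rabbitmq-1", "rabbitmq-2"):
#       print(pod, partitioned_nodes(fetch_cluster_status(pod)))
```

If every node returns an empty result while the portal still shows a partition, that would support the false-report reading; a non-empty result on any node would point to Mnesia having detected a partition on that node.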

After starting this discussion, I noticed that discussion #11154 reports a similar issue with 3.13.0. I had not seen it in any of our six clusters until I upgraded one to 3.13.6, and I stopped upgrading after that one cluster. In one environment I have a 3.13.5 cluster that does not show this behavior and a 3.13.6 cluster that does, and both are set up identically.

michaelklishin Jul 29, 2024
Maintainer

The fact that you have never seen a timing-sensitive behavior before does not prove anything about 3.13.6. RabbitMQ is open source software, so you can take a look at the difference between 3.13.5 (or any earlier version) and 3.13.6. Mnesia's highly idiosyncratic way of detecting partitions and its problematic ability to recover from them (which is why network partition handling in RabbitMQ exists in the first place) have been known for more than a decade. They are exactly the reason why we are replacing Mnesia entirely with Khepri.

Like I said, 4.0 will ship in a few months with matured Khepri. You can adopt Khepri today but the only upgrade option to 4.0 will then be a Blue/Green Deployment upgrade.

One hypothesis I have is that after an upgrade, a bunch of QQs and/or streams had to transfer a lot of data to (or maybe from) one node, and that resulted in an inter-node communication buffer back-up for a while (say, 30-90 seconds), which trips up Mnesia.
