(Mnesia) RabbitMQ queue bindings can get into a broken state and never recover on their own #10888

Rmarian · 2024-03-30T11:45:56Z

Rmarian
Mar 30, 2024

Hi RabbitMQ Team,

I want to start a discussion about a problem we recently experienced with our RabbitMQ cluster.

Our setup

RabbitMQ version 3.11.19
Erlang 25.3.2.2
dockerized three node cluster with pause minority set

Description of issue

At some point the cluster entered into a partial partition from which it recovered automatically (we think it was just a transient network issue) however afterwards we discovered that many queue bindings to exchanges were left in a broken state i.e they appeared OK in the management DB and UI with active consumer but no messages were being routed to them by the exchange.

Worse still, RabbitMQ did not report anything via the prometheus interface that something was wrong.

The issue was resolved after we restarted the cluster nodes.

We are using exclusive server named queues which are re-declared on another node if their home node goes down. Since the two nodes that reported a partial partition have restarted, the broken bindings must have been the ones re-created after the cluster recovered from the partition.

Please see attached relevant server logs that capture the partition event.

We have since tried to reproduce the issue again and found out that it is actually quite easy to reproduce.

Simulating a couple of partial network partitions in the cluster (by way of IP table rules) we could get into the same situation after a few tries.

We think this is some sort of race condition (in Mnesia DB?) where if a queue binding is created just when the cluster data is reconciled from a partition, some internal state could get corrupted. At least this is our intuition.

I would kindly ask you for some help or feedback here because this issue is important for our project as we are operating in a safety critical environment where the messaging broker MUST always either recover automatically from disruptions or at the very least report that something is wrong. Currently we have neither of these.

Is this a know issue you are aware of? Or is it a limitation of the solution in partition recovery?

Log reference: logs.txt

lukebakken · 2024-03-31T22:12:03Z

lukebakken
Mar 31, 2024
Maintainer

We have since tried to reproduce the issue again and found out that it is actually quite easy to reproduce.

Great - what would help is is if you provided a docker compose project and a script that executes the exact commands necessary to reproduce the issue. Right now you're asking us to guess.

Provide a git repository I can clone, with instructions. Thanks.

I would kindly ask you for some help or feedback here because this issue is important for our project as we are operating in a safety critical environment where the messaging broker MUST always either recover automatically from disruptions or at the very least report that something is wrong.

Please note that the version of RabbitMQ you're using is not supported, unless you are a paying customer with extended support - https://www.rabbitmq.com/release-information

Having said that, we may take the time to investigate if you help us help you, as I explained above.

Also, please note that the metadata storage engine in RabbitMQ 4.0 should not be susceptible to this issue. In fact, you can try that storage engine using RabbitMQ 3.13 now - https://www.rabbitmq.com/blog/2024/03/11/rabbitmq-3.13.0-announcement#experimental-support-for-khepri-mnesia-replacement

What would help us out GREATLY is if you can confirm that, by using Khepri, you do NOT see this issue.

4 replies

Rmarian Apr 11, 2024
Author

Hi @lukebakken ,

First, thank you for taking the time to reply!

I have since tested with Khepri DB in 3.13 and I could not reproduce the issue anymore however I tried. It also fixes two other issues we had in the past also related to partition recovery and such.

So this is really good news and we are looking already into switching over from Mnesia in our project.

However, during testing I found three new issues that I would like to share with you as they are probably related to the new database.

All problems below assume a three node RabbitMQ 3.13 cluster (Khperi enabled) with each node running in a docker container

Problem #1

This might actually be a feature though: when I isolate just one of the nodes from the rest of the cluster, so simulating a minority partition, this node no longer accepts operations that require RAFT consensus which is correct. However, it does not close it's active AMQP connections as Mnesia would do and this is problematic for us because we rely on our clients being disconnected so they can re-connect to the majority partition and continue working.

My question is, would you consider implementing the old behavior in Khepri as well: having an isolated node in a minority partition close all it's connections? Even under an optional flag/env var such a behavior would help us a lot and would maintain same behavior as before.

IMO this is a blocker for us in adopting Khepri as it is now.

Problem #2

During one test that repeatedly simulated a network partition between one node and the rest of the cluster, I noticed that, quite frequently, the isolated node would not automatically re-join the cluster on its own after the partition event was resolved. However, if I tried to declare a queue on it, it would immediately figure out that IT CAN and promptly do so and the queue would get create and replicated across the cluster.

Problem #3

During another test, where we would repeatedly stop and start two nodes at a time, I would occasionally see exclusive queues with auto delete and zero consumers linger around without being deleted automatically. I think that is wrong but not a big issue IMO

Summary

For problems #2 and #3, I can try to provide for you o dockized environment with a reproduction script but I need some time to provide a vannilla project because our current setup is heavily tied to existing company code/infrastructure.
Maybe give me a week or so and I will come back to you with a project to fork if you want.

For problem #1 I think we need to open a discussion since I view it as more of a change request/inquiry. But it is the most pressing for us to be addressed somehow.

Hope to hear from you soon,
Radu.

kjnilsson Apr 11, 2024
Maintainer

Reliably detecting a minority situation isn't something that is technically possible which is why khepri does not attempt to do this. A false positive detection that disconnected would be very annoying and also just because khepri is in minority it does not mean that the queues are. They could be classic queues in a 3 node cluster that would still work even when khepri is in minority. Even 3 member quorum queues in a 7 node cluster could still work in a khepri minority subset of nodes so this isn't something that would be straight forward to implement and if we did it would have to be something that connections would have to opt in to, not something that should happen by default.
This could probably be solved by a something periodically attempting reconnections to disconnected cluster members.
Exclusive queues need a quorum to be removed from the database. Did the queues get deleted after you stopped restarting the two nodes?

Rmarian Apr 11, 2024
Author

Reliably detecting a minority situation isn't something that is technically possible which is why khepri does not attempt to do this. A false positive detection that disconnected would be very annoying and also just because khepri is in minority it does not mean that the queues are. They could be classic queues in a 3 node cluster that would still work even when khepri is in minority. Even 3 member quorum queues in a 7 node cluster could still work in a khepri minority subset of nodes so this isn't something that would be straight forward to implement and if we did it would have to be something that connections would have to opt in to, not something that should happen by default.

Here is the scenario behind this request: We have a consumer connected to a node in a cluster. This node at some point becomes isolated for a long time from the cluster and the consumer no longer receives messages by a publisher that is connected to the majority partition. The question is, how can the consumer reconnect automatically to the remaining cluster? We are using RabbitMQ in a safety critical environment and need to quickly and automatically recover from such situations. I was thinking that if a node cannot contact it's peers for some time it should notify it's passive consumers about that. I tried using an active polling where every few seconds the client tries to do an operation that requires consensus but this is only a bad workaround. Any ideas or suggestions are welcome.

kjnilsson Apr 11, 2024
Maintainer

Your monitoring system could implement such a check itself without RabbitMQ having to do something that is specific to your requirement and topology. You don't even know if the publisher is on the other side of the partition so reconnecting elsewhere isn't necessarily the right thing to do.
I am not saying never - we may consider adding some kind of notification link to the AMQP (1.0) protocol in the future where consumers can register interest in their node's potential status in the cluster but doing something drastic like disconnecting all connections and/or shutting down the node isn't a good idea as it is pone to errors and could cause more availability issues than it fixes.

Partitions are very hard to reason about as in reality they can very well be nothing like the partitions that are tested artifically. Asking a system to automatically take action when it thinks there is a partition is very, very hard to do well and has been the cause of many, many RabbitMQ issues in the past.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

(Mnesia) RabbitMQ queue bindings can get into a broken state and never recover on their own #10888

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment 4 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

(Mnesia) RabbitMQ queue bindings can get into a broken state and never recover on their own #10888

Uh oh!

Rmarian Mar 30, 2024

Our setup

Description of issue

Replies: 1 comment · 4 replies

Uh oh!

Uh oh!

lukebakken Mar 31, 2024 Maintainer

Uh oh!

Uh oh!

Rmarian Apr 11, 2024 Author

Problem #1

Problem #2

Problem #3

Summary

Uh oh!

kjnilsson Apr 11, 2024 Maintainer

Uh oh!

Rmarian Apr 11, 2024 Author

Uh oh!

kjnilsson Apr 11, 2024 Maintainer

Rmarian
Mar 30, 2024

Replies: 1 comment 4 replies

lukebakken
Mar 31, 2024
Maintainer

Rmarian Apr 11, 2024
Author

kjnilsson Apr 11, 2024
Maintainer

Rmarian Apr 11, 2024
Author

kjnilsson Apr 11, 2024
Maintainer