-
Great - what would help is if you provided a docker compose project and a script that executes the exact commands necessary to reproduce the issue. Right now you're asking us to guess. Provide a git repository I can clone, with instructions. Thanks.
Please note that the version of RabbitMQ you're using is not supported, unless you are a paying customer with extended support - https://www.rabbitmq.com/release-information

Having said that, we may take the time to investigate if you help us help you, as I explained above. Also, please note that the metadata storage engine in RabbitMQ 4.0 should not be susceptible to this issue. In fact, you can try that storage engine using RabbitMQ 3.13 now - https://www.rabbitmq.com/blog/2024/03/11/rabbitmq-3.13.0-announcement#experimental-support-for-khepri-mnesia-replacement

What would help us out GREATLY is if you can confirm that, by using Khepri, you do NOT see this issue.
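For reference, switching a 3.13 node to Khepri comes down to enabling a feature flag. A minimal sketch (the Python wrapper is purely illustrative; it assumes rabbitmqctl is on PATH and uses the khepri_db flag name from the announcement linked above):

```python
import subprocess

def enable_khepri() -> None:
    # Show the current feature flags so the change is visible in the output.
    subprocess.run(["rabbitmqctl", "list_feature_flags"], check=True)
    # Enabling khepri_db switches the node's metadata store from Mnesia to Khepri.
    subprocess.run(["rabbitmqctl", "enable_feature_flag", "khepri_db"], check=True)

if __name__ == "__main__":
    enable_khepri()
```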
-
Hi RabbitMQ Team,
I want to start a discussion about a problem we recently experienced with our RabbitMQ cluster.
Our setup
Description of issue
At some point the cluster entered a partial partition from which it recovered automatically (we think it was just a transient network issue). Afterwards, however, we discovered that many queue bindings to exchanges were left in a broken state: they appeared OK in the management DB and UI, with active consumers, but no messages were being routed to them by the exchange.
Worse still, RabbitMQ did not report anything via the Prometheus interface to indicate that something was wrong.
The issue was resolved after we restarted the cluster nodes.
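For illustration, the only way we found to observe the symptom from the outside was to probe routing from a client. A minimal sketch with the pika Python client (the exchange and routing-key names are made up, and whether the broker actually returns the probe depends on where exactly the binding is broken):

```python
import pika

# Connect to the broker and enable publisher confirms so unroutable,
# mandatory messages surface as an exception instead of being silently dropped.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.confirm_delivery()

try:
    # A mandatory publish asks the broker to return the message if no queue
    # receives it; a healthy binding should route this probe.
    channel.basic_publish(
        exchange="events",            # hypothetical exchange name
        routing_key="events.probe",   # hypothetical routing key
        body=b"probe",
        mandatory=True,
    )
    print("probe was routed to at least one queue")
except pika.exceptions.UnroutableError:
    print("probe came back unrouted - binding is likely broken")
finally:
    connection.close()
```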
We are using exclusive, server-named queues, which are re-declared on another node if their home node goes down. Since the two nodes that reported a partial partition had restarted, the broken bindings must have been the ones re-created after the cluster recovered from the partition.
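For context, a minimal sketch of our consumer-side declaration pattern (pika client; the exchange name and routing key are placeholders):

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

# An empty queue name asks the broker to generate one; exclusive ties the
# queue to this connection, so it disappears when its home node goes down
# and is re-declared when the client reconnects to another node.
result = channel.queue_declare(queue="", exclusive=True)
queue_name = result.method.queue

# This is the kind of binding we later found silently broken after the
# partition healed, even though it looked fine in the management UI.
channel.queue_bind(exchange="events", queue=queue_name, routing_key="events.#")

channel.basic_consume(
    queue=queue_name,
    on_message_callback=lambda ch, method, props, body: None,  # placeholder consumer
    auto_ack=True,
)
channel.start_consuming()
```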
Please see the attached server logs, which capture the partition event.
We have since tried to reproduce the issue and found that it is actually quite easy to do so.
By simulating a couple of partial network partitions in the cluster (using iptables rules), we could get into the same situation after a few tries.
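Roughly, the reproduction looks like the sketch below (run as root on one of the nodes; the peer IP is hypothetical). Blocking traffic from only one peer, while the remaining node stays reachable, is what makes the partition partial rather than full:

```python
import subprocess
import time

# Hypothetical address of ONE peer RabbitMQ node. Blocking just this peer
# (while the third node stays reachable) creates a partial partition.
PEER = "10.0.0.2"

def set_partition(enabled: bool) -> None:
    """Drop (or restore) inbound traffic from the chosen peer node."""
    action = "-A" if enabled else "-D"
    subprocess.run(
        ["iptables", action, "INPUT", "-s", PEER, "-j", "DROP"],
        check=True,
    )

# Induce a short partial partition, heal it, and repeat a few times while
# clients keep re-declaring their exclusive queues and bindings.
for _ in range(3):
    set_partition(True)
    time.sleep(90)    # long enough for the nodes to detect the partition
    set_partition(False)
    time.sleep(180)   # let the cluster reconcile before the next round
```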
We think this is some sort of race condition (in the Mnesia database?) where, if a queue binding is created just as the cluster data is being reconciled after a partition, some internal state can get corrupted. At least, that is our intuition.
I would kindly ask you for some help or feedback here, because this issue is important for our project: we operate in a safety-critical environment where the messaging broker MUST always either recover automatically from disruptions or, at the very least, report that something is wrong. Currently we have neither.
Is this a known issue? Or is it a limitation of how partition recovery works?
Log reference: logs.txt