Replies: 2 comments 21 replies
-
There is a natural race condition between what clients and nodes do in response to a node failure.
-
I have filed a specific issue with a couple of ideas in mind: #6274. A more attractive option for most people would be to switch to quorum queues or streams with the smallest number of supported replicas (three). Channel operations in flight when a new leader |
-
We have an issue where a transient queue that used to live on a node that went down can disappear after it has been successfully declared, bound, and consumed from on a new node, without the consumer knowing that the queue is gone.
From what I have gathered, this seems to happen in the window between when the running nodes detect that a node is down, logging “rabbit on node '[vhost-address-of-node-that-went-down]' down”, and when they have finished deleting the old transient queues from that node, logging “77 transient queues from an old incarnation of node '[vhost-address-of-node-that-went-down]' deleted in 2.627854s”.
Any transient queue that used to be on the node that went down and is declared on the running nodes during this window has a chance of disappearing.
Looking at the RabbitMQ code and tracing the source of the “77 transient queues from an old incarnation of node '[vhost-address-of-node-that-went-down]' deleted in 2.627854s” log, from rabbit_amqqueue.erl:1998 to rabbit_amqqueue.erl:2009:
It seems that we first get all the queues to be deleted from our mnesia table with `queues_to_delete_when_node_down()` (rabbit_amqqueue.erl:2033). Afterwards we delete those queues in batches of 10 at a time using the `partition_queues()` function (rabbit_amqqueue.erl:2028), with the `delete_queue()` function just deleting each queue without checking which node it belongs to.
Thus it seems possible to create a queue with the same name as one that used to be on the node that went down in between the deletion of two batches, and then have it deleted again in a following batch of queue deletions.
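To make the suspected interleaving concrete, here is a toy Python model of it. It is only a sketch: a plain dict stands in for the mnesia queue table, `on_node_down_batched` and the queue names are hypothetical, and it assumes that a re-declare on a surviving node simply replaces the stale record. It is not the actual RabbitMQ code path.

```python
# Toy model only: a dict {queue_name: owner_node} stands in for the mnesia table.
# All names here are hypothetical and chosen for illustration.

BATCH_SIZE = 10  # the post describes batches of 10 queues


def on_node_down_batched(table, down_node, between_batches=None):
    """Snapshot the queue names first, then delete by name in batches.

    There is deliberately no owner check at delete time, mirroring the
    behaviour described above.
    """
    snapshot = [name for name, owner in table.items() if owner == down_node]
    deleted = []
    for start in range(0, len(snapshot), BATCH_SIZE):
        for name in snapshot[start:start + BATCH_SIZE]:
            if name in table:
                del table[name]  # deletes whatever record currently has this name
                deleted.append(name)
        if between_batches is not None:
            # Simulate a client acting in the gap after the first batch only.
            hook, between_batches = between_batches, None
            hook(table)
    return deleted


def demo():
    table = {f"q{i}": "node_a" for i in range(25)}

    def client_redeclares(t):
        # Assumption: the client reconnects to node_b and re-declares q20,
        # which replaces the stale record left over from node_a.
        t["q20"] = "node_b"

    deleted = on_node_down_batched(table, "node_a", client_redeclares)
    return table, deleted
```

In this model `q20` is re-declared on `node_b` after the first batch, yet the third batch still removes it because the delete is keyed on the name alone.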
Because this on_node_down logic assumes that none of these queues are alive anymore, RabbitMQ doesn’t send a `basic_cancel` to the consumers, so the consumer never knows that the queue was deleted.
With my very limited knowledge of the RabbitMQ codebase, I am wondering whether it would be possible to ensure, in the same mnesia transaction, that the queue we delete from the mnesia table still belongs to the node that went down. Or maybe there is another way to solve this?
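The idea in the question, checking ownership and deleting in one atomic step, can be sketched in Python as follows. This is a toy model under stated assumptions: `QueueTable`, `delete_if_owned_by` and `on_node_down_guarded` are hypothetical names, and a `threading.Lock` stands in for a transaction. It only illustrates the check-then-delete pattern, not how RabbitMQ or mnesia would implement it.

```python
import threading


class QueueTable:
    """Toy stand-in for a transactional table; the lock emulates a transaction."""

    def __init__(self, rows):
        self._rows = dict(rows)  # {queue_name: owner_node}
        self._lock = threading.Lock()

    def declare(self, name, node):
        with self._lock:
            self._rows[name] = node

    def owner(self, name):
        with self._lock:
            return self._rows.get(name)

    def names_on(self, node):
        with self._lock:
            return [n for n, o in self._rows.items() if o == node]

    def delete_if_owned_by(self, name, node):
        # Ownership check and delete happen under the same lock, so a record
        # re-declared on another node in the meantime is left alone.
        with self._lock:
            if self._rows.get(name) == node:
                del self._rows[name]
                return True
            return False


def on_node_down_guarded(table, down_node, batch_size=10, between_batches=None):
    """Batched cleanup that re-checks the owner at delete time."""
    snapshot = table.names_on(down_node)
    deleted = []
    for start in range(0, len(snapshot), batch_size):
        for name in snapshot[start:start + batch_size]:
            if table.delete_if_owned_by(name, down_node):
                deleted.append(name)
        if between_batches is not None:
            # Simulate a client acting in the gap after the first batch only.
            hook, between_batches = between_batches, None
            hook(table)
    return deleted


def demo_guarded():
    table = QueueTable({f"q{i}": "node_a" for i in range(25)})
    deleted = on_node_down_guarded(
        table, "node_a", between_batches=lambda t: t.declare("q20", "node_b")
    )
    return table.owner("q20"), len(deleted)
```

With the guard in place, a queue that is re-declared on `node_b` between two batches survives the cleanup, and only the 24 records still owned by `node_a` are removed.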