transient queue can disappear that used to be on a node that went down and gets successfully declared, bind and start consuming from on a new node #6272

Jaqy · 2022-10-27T15:03:30Z

We have an issue where a transient queue that used to live on a node that went down can disappear after it has been successfully redeclared, bound, and consumed from on a new node, without the consumer knowing that the queue is gone.

From what I have gathered, this happens in the window between two events. First, the running nodes detect that a node is down and log: “rabbit on node '[vhost-address-of-node-that-went-down]' down”

Second, the running nodes finish deleting the old transient queues of the node that went down and log: “77 transient queues from an old incarnation of node '[vhost-address-of-node-that-went-down]' deleted in 2.627854s”

Any transient queue that used to be on the node that went down, and that is declared on a running node during this window, has a chance of disappearing.

Looking at the RabbitMQ code and tracing the source of the “77 transient queues from an old incarnation of node '[vhost-address-of-node-that-went-down]' deleted in 2.627854s” log:

rabbit_amqqueue.erl:1998

on_node_down(Node) ->
    {Time, {QueueNames, QueueDeletions}} = timer:tc(fun() -> delete_queues_on_node_down(Node) end),
    case length(QueueNames) of
        0 -> ok;
        _ -> rabbit_log:info("~p transient queues from an old incarnation of node ~p deleted in ~fs", [length(QueueNames), Node, Time/1000000])
    end,
    notify_queue_binding_deletions(QueueDeletions),
    rabbit_core_metrics:queues_deleted(QueueNames),
    notify_queues_deleted(QueueNames),
    ok.

Going to:
rabbit_amqqueue.erl:2009

delete_queues_on_node_down(Node) ->
    lists:unzip(lists:flatten([
        rabbit_misc:execute_mnesia_transaction(
          fun () -> [{Queue, delete_queue(Queue)} || Queue <- Queues] end
        ) || Queues <- partition_queues(queues_to_delete_when_node_down(Node))
    ])).

It seems that we first collect all the queues to be deleted from the Mnesia table with queues_to_delete_when_node_down() (rabbit_amqqueue.erl:2033).

Afterwards, we delete those queues in batches of 10 using the partition_queues() function (rabbit_amqqueue.erl:2028), and the delete_queue() function deletes each queue without checking which node the queue currently belongs to.

Thus it seems possible to declare a queue with the same name as one that used to be on the node that went down in between two deletion batches, only for it to be deleted again by a following batch.
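To make the suspected interleaving concrete, here is a toy Python model of it. This is only a sketch and not RabbitMQ code: the dictionary standing in for the Mnesia table, the callback standing in for a client, and every function and variable name are assumptions made for illustration. Only the batch size of 10 and the "snapshot first, delete by name later" shape come from the description above.

```python
# Toy model of the suspected race; not RabbitMQ code. All names are hypothetical.
BATCH_SIZE = 10  # mirrors the batch size described above


def partition(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def delete_queues_on_node_down(table, down_node, between_batches=None):
    """Snapshot the doomed queue names once, then delete them batch by batch.

    `table` maps queue name -> owning node. Deletion is by name only, so the
    owner is never re-checked at deletion time.
    """
    doomed = [name for name, node in table.items() if node == down_node]
    deleted = []
    for batch in partition(doomed, BATCH_SIZE):
        for name in batch:  # each batch stands in for one transaction
            if name in table:
                deleted.append((name, table.pop(name)))
        if between_batches is not None:
            between_batches(table)  # a client acting between two transactions
    return deleted


# 25 transient queues live on node "a", which then goes down.
table = {f"q{i}": "a" for i in range(25)}
redeclared = []


def client_redeclares(tbl):
    # Once, after the first batch, a client declares "q20" on live node "b".
    if not redeclared:
        tbl["q20"] = "b"
        redeclared.append("q20")


deleted = delete_queues_on_node_down(table, "a", client_redeclares)

# The third batch removes "q20" even though it now belongs to node "b".
print(("q20", "b") in deleted, "q20" in table)
```

In this model the queue declared on node "b" is removed by the third batch because its name was captured in the snapshot taken before the first batch ran.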

And because the on_node_down logic assumes that none of these queues are alive anymore, RabbitMQ doesn't send a basic_cancel to the consumers, so the consumer never learns that the queue has been deleted.

With my very limited knowledge of the RabbitMQ codebase, I am wondering whether it would be possible to verify, within the same Mnesia transaction, that the queue being deleted still belongs to the node that went down. Or maybe there is another way to solve this?
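As a rough illustration of the check being asked about, the following Python sketch re-validates ownership at deletion time. It is a toy model, not a proposed patch: the dictionary stands in for the Mnesia table, the single function call stands in for one transaction, and the name `delete_if_still_on` is hypothetical.

```python
# Toy model of an ownership check at deletion time; not RabbitMQ code.


def delete_if_still_on(table, name, down_node):
    """Delete `name` only if it is still owned by `down_node`.

    The check and the delete happen in one step here, standing in for a
    read and a delete performed inside the same transaction.
    """
    if table.get(name) == down_node:
        del table[name]
        return True
    return False


table = {"q1": "a", "q2": "a"}

# Snapshot taken when node "a" is detected as down.
doomed = [name for name, node in table.items() if node == "a"]

# Before the deletion reaches "q2", a client declares it on live node "b".
table["q2"] = "b"

results = {name: delete_if_still_on(table, name, "a") for name in doomed}

print(results)  # {'q1': True, 'q2': False}
print(table)    # {'q2': 'b'}
```

With the guard in place, the stale snapshot no longer matters in this model: the entry that moved to a live node fails the ownership check and survives.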

michaelklishin · 2022-10-27T15:19:29Z

Maintainer

There is a natural race condition between what clients and nodes do in response to a node failure.
Use a quorum queue or a stream if you need better data safety guarantees. A transient
non-replicated queue won't offer much.

20 replies

Jaqy Oct 28, 2022
Author

If you wanted to test the latency of quorum queues without the mandatory fsync operation, that is possible by configuring wal_sync_method=none in the ra application. You'd have to use the advanced config option for this, and it isn't something we'd recommend, but it may be an interesting data point.

That would be interesting, and I have forwarded it to my team member who is currently doing some QQ performance testing.

Jaqy Oct 28, 2022
Author

I wouldn't normally suggest this but have you considered the rabbitmq-sharding plugin at all? It has some downsides which may be blockers but in your case it seems like you want the queues to be available even when nodes are down and the sharded queue will always have a "shard" on each node.

As far as I know, we haven't. But maybe it is something we could consider? @haljin

mkuratczyk Oct 28, 2022
Maintainer

I understand the issue, that part is clear. I just wanted to take this opportunity to learn about your workload, since we rarely have a chance to talk to a user who is "continuously performance testing quorum queues", and the fact that you see a performance drop was concerning. But I think we understand what you do now, and you gave me some ideas to add to our benchmarking tests, as some areas are not covered to the extent they could be. Thanks

Jaqy Oct 31, 2022
Author

@mkuratczyk ah ok, I just wanted to make sure that you weren't wasting any time on my behalf. You can never run enough performance tests :)

mkuratczyk Nov 2, 2022
Maintainer

@Jaqy any chance you can share your testing method/tool? We do most of the tests with perf-test, which currently doesn't support Direct Reply-To. We may add this in the future but if you have a tool I could use to test direct reply-to performance, that would be great.

Also, feel free to reach out to me with anything related to performance testing (testing process, results, potential bottlenecks). You will easily find me on the RabbitMQ Slack. I'd love to learn more about your use cases and issues, and hopefully I can help a bit too. :)

Thanks,

michaelklishin · 2022-10-28T07:21:08Z

Maintainer

I have filed a specific issue with a couple of ideas in mind #6274.

A more attractive option for most people would be to switch to quorum queues or streams with the smallest number of supported replicas (three). Channel operations in flight when a new leader election happens will be delayed until after a leader is available, so clients don't have to do anything. On the other hand, latency will inevitably increase for that brief moment.

1 reply

Jaqy Oct 28, 2022
Author

@michaelklishin

Unfortunately, QQs are not an option for our use case, so I am very happy to see you have raised the issue again.
