Replies: 3 comments 5 replies
It looks like a missing clause in Note that any change of |
Hey @michaelklishin, thanks for the response. |
It looks like the raft library had an improvement in |
Context
We use a broadcast setup, where we register a durable, fanout exchange that delivers to multiple quorum queues (replication = 3 nodes, unsubscribed queue expiry = 2 minutes).
These quorum queues are frequently added and removed; we use them much like temporary queues with persistence guarantees.
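For readers who want to picture the topology, here is a minimal sketch of how such a setup could be declared. It assumes a pika-style channel object (the client library is an assumption, the post does not say which one is used), and the exchange and queue names are hypothetical. The queue arguments `x-queue-type`, `x-quorum-initial-group-size` and `x-expires` are standard RabbitMQ arguments.

```python
EXPIRY_MS = 2 * 60 * 1000  # queue expiry of 2 minutes, in milliseconds


def quorum_queue_arguments(replicas=3, expires_ms=EXPIRY_MS):
    """Build the optional arguments for an expiring quorum queue."""
    return {
        "x-queue-type": "quorum",
        "x-quorum-initial-group-size": replicas,
        "x-expires": expires_ms,
    }


def declare_broadcast_topology(channel, exchange, queue_names):
    """Declare a durable fanout exchange and bind each quorum queue to it.

    `channel` is assumed to expose pika's exchange_declare, queue_declare
    and queue_bind methods with keyword arguments.
    """
    channel.exchange_declare(
        exchange=exchange, exchange_type="fanout", durable=True
    )
    for name in queue_names:
        channel.queue_declare(
            queue=name, durable=True, arguments=quorum_queue_arguments()
        )
        channel.queue_bind(queue=name, exchange=exchange)
```

With pika, the channel would typically come from `pika.BlockingConnection(pika.ConnectionParameters("localhost")).channel()`; any object with the same three methods works for the sketch.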
We've encountered race conditions between queue deletion (due to expiry) and message publishing (with publish acknowledgements requested).
Symptoms are either:
In our case, we'd ideally want the publish to be ACKed gracefully when one of the queues fails to receive a message because it is being deleted.
Note that deletion here is caused by expiry. The problem may exist for explicit deletions as well, but we haven't encountered it in production yet, probably because in that case we manually delete the bindings before deleting the queues.
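Until the broker handles this case, one client-side mitigation could be to retry a publish that fails. The sketch below is not something we run in production; it is a generic retry wrapper, and the function name, parameters and defaults are all hypothetical.

```python
import time


def publish_with_retry(publish, retryable=(Exception,), attempts=3, delay_s=0.1):
    """Call `publish()` until it succeeds or `attempts` is exhausted.

    Returns the number of the attempt that succeeded, and re-raises the
    last error if every attempt failed.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            publish()
            return attempt
        except retryable as exc:
            last_error = exc
            if attempt < attempts:
                time.sleep(delay_s)  # brief pause before the next try
    raise last_error
```

With a pika `BlockingChannel` in confirm mode, `publish` would be a closure around `basic_publish`, and `retryable` could be narrowed to the library's nack and channel-closed exceptions. A channel-level error closes the channel, so the closure would also need to reopen it before retrying.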
Bug details
RabbitMQ version: 3.12.5
The exception thrown upon internal errors looks like this:
Right before this crash log, we get Raft logs indicating queue deletion:
And then the crash. Note that the crash can happen on both Raft followers and the leader.
I suspect that there's a short timespan between the raft cluster deletion and the bindings removal propagation, which could leave a window where a published message gets routed to a shut-down quorum queue.
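To make the suspected window concrete, here is a toy model of it. This is purely illustrative and is not RabbitMQ's implementation: the class, its methods and the two-step deletion order are assumptions that mirror the hypothesis above.

```python
class ToyBroker:
    """Toy model: routing uses bindings, delivery needs a running queue."""

    def __init__(self):
        self.bindings = {}    # exchange name -> set of bound queue names
        self.running = set()  # queues whose Raft cluster is still up

    def declare(self, exchange, queue):
        self.bindings.setdefault(exchange, set()).add(queue)
        self.running.add(queue)

    def delete_cluster(self, queue):
        # Step 1 of the hypothesised deletion: the Raft cluster goes away.
        self.running.discard(queue)

    def delete_bindings(self, queue):
        # Step 2: bindings are removed some time later.
        for queues in self.bindings.values():
            queues.discard(queue)

    def publish(self, exchange):
        """Return (queues routed to, routed queues that are already gone)."""
        routed = sorted(self.bindings.get(exchange, ()))
        dead = [q for q in routed if q not in self.running]
        return routed, dead


broker = ToyBroker()
broker.declare("broadcast", "q1")
broker.declare("broadcast", "q2")

broker.delete_cluster("q1")
in_window = broker.publish("broadcast")   # q1 is still routed to, but is gone

broker.delete_bindings("q1")
after_window = broker.publish("broadcast")  # q1 no longer receives messages
```

In this model a publish between the two steps is routed to a queue that no longer exists, which is the situation that would have to be either ACKed gracefully or rejected cleanly.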
I dug into rabbitmq-server's source code, wondering specifically whether bindings are updated after the Raft cluster is deleted:
rabbitmq-server/deps/rabbit/src/rabbit_quorum_queue.erl
Lines 719 to 733 in 10fb936
Here, delete_queue_data happens after ra:delete_cluster, but I couldn't confirm how the rabbit_quorum_queue:delete method gets called by the expiry_timer codepath...
Repro
Haven't had time to build a repro script yet.
Assuming this problem exists for manual queue deletion and not just expiry, reproduction steps should look like this: