Highly available queues without data replication #6376

Jaqy · 2022-11-08T12:04:26Z

Jaqy
Nov 8, 2022

I want to start a discussion on the need for a highly available queue without data replication, one that would immediately be restarted on a different node when the node it is currently on is shutting down.

Currently we are using transient queues in our project, and I know of at least two other projects within my company that use transient queues as well. The deprecation of transient queues in RabbitMQ version 4.0 is thus a problem for us and our use case.

Our highest-priority requirements are performance and availability. We work with latencies measured in milliseconds and follow a die fast, recover fast principle, while data safety is a much lower priority for us.

Why we are using transient queues over other queues:

Transient queues are faster than any Raft-based replicated queues, and because we work at millisecond latencies, the difference is too large for replicated queues to meet our performance requirements.
Durable queues become unavailable when the node goes down, meaning there can be a downtime of anywhere from a couple of minutes to about half an hour (if we get unlucky with reclaiming the PVC). Any downtime is a big problem for us because of our 99.999% availability requirement.
All our services that produce and consume are replicated at least 3 times for redundancy, so exclusive queues can’t be used here either.
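
To put the 99.999% figure in perspective, here is a small back-of-the-envelope calculation. It is plain arithmetic, assuming a 365-day year and treating each failover as a flat 20 seconds of unavailability (the approximate grace period mentioned further down); the function name is made up for illustration.

```python
# Downtime budget implied by an availability target (plain arithmetic).
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def downtime_budget_seconds(availability: float,
                            period_seconds: int = SECONDS_PER_YEAR) -> float:
    """Seconds of allowed downtime per period for an availability fraction."""
    return (1.0 - availability) * period_seconds


budget = downtime_budget_seconds(0.99999)
failover_seconds = 20  # assumed cost of one node restart, taken from this thread
failovers_that_fit = int(budget // failover_seconds)

print(f"yearly budget: {budget:.1f} s ({budget / 60:.2f} min)")
print(f"20 s failovers that fit in the budget: {failovers_that_fit}")
```

The yearly budget comes out at a little over five minutes, so a single outage of "a couple of minutes to half an hour" can consume most or all of it.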

That has left us with transient queues, but they aren’t perfect either:

There is a race condition that resulted in these queues being deleted by the broker without informing the consumer. See the discussion "transient queue can disappear that used to be on a node that went down and gets successfully declared, bind and start consuming from on a new node" (#6272) for more information on that problem.
The shutdown grace period prevented the consumer from re-declaring the queues on a new node until the old node was terminated (I believe this to be about 20 seconds). We have designed a solution for this, to ensure that our consumers don’t experience any downtime during a node restart or shutdown. This does introduce a lot more complexity, and with increased complexity comes an increased risk of things going wrong.

So there is a use case for a highly available queue without data replication that gets immediately restarted on a different node when the current node the queue is on is shutting down.

lhoguin · 2022-11-08T12:52:20Z

lhoguin
Nov 8, 2022
Maintainer

There's a few possibilities. One is sharding where the queue name exists on all nodes and publishers/consumers only speak with the queue process that is on the node they connect to. You would have each service have a consumer on all 3 nodes. If a node goes down, you still have the other ones. Publishers could round-robin to all available nodes as well. But that only works if you don't need a single source of truth. If the order matters somewhat this won't do.

In that case I think a mechanism similar to mirrored queues would work, except of course the data wouldn't be replicated. Instead you would have the queue process running on all 3 nodes, with only 1 active at a time. Down detection could be enhanced to obtain better results.
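
The sharding option described first can be sketched as a toy in-memory model. This is only an illustration, not RabbitMQ code: the `Node`, `RoundRobinPublisher` and `drain` names are hypothetical, and real publishers and consumers would hold a connection per node instead of touching a deque.

```python
from collections import deque
from itertools import cycle


class Node:
    """Stand-in for one broker node holding its own local shard of the queue."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.queue = deque()


class RoundRobinPublisher:
    """Publishes to the next available node, skipping nodes that are down."""

    def __init__(self, nodes):
        self._nodes = nodes
        self._order = cycle(nodes)

    def publish(self, message):
        # Try each node at most once per publish.
        for _ in range(len(self._nodes)):
            node = next(self._order)
            if node.up:
                node.queue.append(message)
                return node.name
        raise RuntimeError("no node available")


def drain(nodes):
    """A consumer attached to every node: collect what each live shard holds."""
    received = []
    for node in nodes:
        while node.up and node.queue:
            received.append(node.queue.popleft())
    return received


nodes = [Node("rabbit-0"), Node("rabbit-1"), Node("rabbit-2")]
publisher = RoundRobinPublisher(nodes)
nodes[1].up = False  # simulate one node going down
placements = [publisher.publish(f"msg-{i}") for i in range(4)]
received = drain(nodes)
print(placements)
print(received)
```

Publishing keeps working with one node down, but the consumer sees the messages grouped by shard and not in global publish order, which is the "no single source of truth" caveat.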

6 replies

lhoguin Nov 8, 2022
Maintainer

Same as mirrored CQ. They would connect to any node, speak to the process on the active node and the processes on the other nodes would be on standby. When a node goes down, the active node changes and producers/consumers get routed to that active process from that point onward. The node that went down would purge any remaining messages when it comes back up and go back to standby mode. There's probably finer details to figure out (like which node becomes active, or how to better detect node down to change the active process) but that's the rough idea.
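
A rough sketch of that active/standby idea, as a toy model. The class and method names are hypothetical, and the "pick the first surviving node by name" rule is just one arbitrary choice, since how the active node is selected is left open above.

```python
from collections import deque


class StandbyQueueGroup:
    """One queue process per node; exactly one is active, the rest stand by."""

    def __init__(self, node_names):
        self.buffers = {name: deque() for name in node_names}
        self.up = set(node_names)
        self.active = node_names[0]

    def publish(self, message):
        # Clients may connect to any node; messages go to the active process.
        self.buffers[self.active].append(message)

    def node_down(self, name):
        self.up.discard(name)
        if name == self.active:
            if not self.up:
                raise RuntimeError("no node left to take over")
            # No data to protect, so any surviving node will do.
            self.active = sorted(self.up)[0]

    def node_up(self, name):
        # A returning node purges leftover messages and rejoins as a standby.
        self.buffers[name].clear()
        self.up.add(name)


group = StandbyQueueGroup(["a", "b", "c"])
group.publish("m1")
group.node_down("a")   # active moves from "a" to "b"; "m1" is not replicated
group.publish("m2")
group.node_up("a")     # "a" purges "m1" and becomes a standby
print(group.active, list(group.buffers["b"]), list(group.buffers["a"]))
```

Note that "m1" is simply lost in this model, which matches the premise that data safety is a lower priority than availability.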

Jaqy Nov 8, 2022
Author

That could be an idea yeah.

What do you think of this idea @haljin ?

Jaqy Nov 8, 2022
Author

I am wondering what impact this logic, agreeing on which queue is active and detecting that a node with an active queue has gone down, will have on performance.

Wouldn't starting the queue on a new node when the node it is currently on goes down be much faster and simpler?

lhoguin Nov 8, 2022
Maintainer

This is already what is happening, it just takes those 20 seconds for RabbitMQ to detect that the node is down. What I believe you need is something at the level of the queue, where the queue could indicate its liveness a lot more accurately than at the node level. There would still be a delay to detect a problem but we can lower it. One option is to make the queue use https://github.com/rabbitmq/aten to detect node downs and react, + monitor the queue process in case it crashes instead. If going for aten we would have the same reaction time as quorum queues, minus leadership election since we can just pick one arbitrarily (no data to worry about). Another option is to have the active queue process regularly ping a controller process on the other nodes and when that stops the controller changes the active node.
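
The second option (the active queue process pinging a controller) could look roughly like the sketch below. It uses a fixed timeout with explicit timestamps so that it is easy to follow and test; it does not model how aten works, and the `HeartbeatController` name, its methods and the 0.5 second timeout are all assumptions for illustration.

```python
class HeartbeatController:
    """Switches the active node when the active queue process stops pinging."""

    def __init__(self, nodes, timeout, now=0.0):
        self.nodes = list(nodes)
        self.timeout = timeout
        self.active = self.nodes[0]
        self.last_ping = now

    def ping(self, node, now):
        # Only pings from the currently active queue process count.
        if node == self.active:
            self.last_ping = now

    def check(self, now):
        """Return the active node, failing over if the last ping is too old."""
        if now - self.last_ping > self.timeout:
            candidates = [n for n in self.nodes if n != self.active]
            # Arbitrary pick: there is no data, so no election is needed.
            self.active = candidates[0]
            self.last_ping = now
        return self.active


controller = HeartbeatController(["a", "b", "c"], timeout=0.5)
controller.ping("a", now=0.2)
still_active = controller.check(now=0.6)   # 0.4 s since last ping: no change
after_silence = controller.check(now=0.8)  # 0.6 s since last ping: fail over
print(still_active, after_silence)
```

The detection delay in this model is bounded by the timeout plus the interval between checks, so both would need to be well below the current 20 seconds to help.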

Jaqy Nov 9, 2022
Author

The faster, the better without overcomplicating things I would say.

Would it be an idea to rely on the consumer to declare the queue on a new node? Currently, the consumers are blocked for about 20 seconds before they can re-declare the queue on a new node, because the old node hasn't terminated yet. The consumer reconnects and tries to declare the queue again within milliseconds, so if we can get rid of those 20 seconds, that would speed things up a lot.

I have no idea how complicated this would be to implement, so I am just thinking out loud 😄

Any efforts to accommodate our use case/needs are very much appreciated.
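
For reference, the consumer-side behaviour described here (reconnect and retry the declare within milliseconds) amounts to a short retry loop. This is a generic sketch: `declare` is a hypothetical callable standing in for whatever the client library does to reconnect and declare the queue, and catching `ConnectionError` is an assumption, since real clients raise their own exception types.

```python
import time


def declare_with_retry(declare, attempts=50, delay_s=0.01, sleep=time.sleep):
    """Call declare() until it succeeds; return the number of attempts used."""
    for attempt in range(1, attempts + 1):
        try:
            declare()
            return attempt
        except ConnectionError:
            if attempt == attempts:
                raise  # give up and surface the last failure
            sleep(delay_s)


# Simulated broker that refuses the first two declares.
calls = {"count": 0}


def flaky_declare():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("queue still owned by the old node")


pauses = []
used = declare_with_retry(flaky_declare, sleep=pauses.append)
print(used, pauses)
```

With a 10 ms delay the loop itself adds very little; the time to recover is dominated by how long the broker keeps rejecting the declare.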

kjnilsson · 2022-11-10T09:36:14Z

kjnilsson
Nov 10, 2022
Maintainer

A solution that doesn't rely on failure detectors is very likely to be more reliable, more available and provide lower overall latency.

Your idea using the alt exchange is quite a good one but the main issue is that it too relies on failure detection in order to fall back to the alternative exchange.

The suggestion to always have a durable active queue on each node where all publishes are local and all consuming apps consume from the queues on all nodes is a good one IMO.

It does not rely on failure detection.
It does not perform additional network hops to distribute the data inside RabbitMQ.
It provides the same ordering guarantees that RabbitMQ is able to provide, i.e. messages published from one channel to a given queue will arrive in the queue in the order they were published.

If we could provide a rabbitmq-sharding like plugin that allowed the binding to be the same for all nodes (it will look like a single queue with a single name) I think this could be a viable option for this type of system.

The downside is that the consuming apps will need some degree of knowledge of the system itself to ensure they consume from all nodes but that is the only downside I can see atm.
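
To illustrate the ordering point, here is a small in-memory model in which every publisher writes only to the queue on its own node and one consumer reads from all nodes. The node and publisher names are made up, and the interleaving in `consume_from_all` is just one arbitrary delivery schedule.

```python
from collections import deque
from itertools import zip_longest


def consume_from_all(shards):
    """Interleave deliveries from every node's local queue."""
    merged = []
    for batch in zip_longest(*(list(q) for q in shards.values())):
        merged.extend(m for m in batch if m is not None)
    return merged


def per_publisher(messages):
    """Group sequence numbers by publisher, keeping arrival order."""
    seen = {}
    for publisher, seq in messages:
        seen.setdefault(publisher, []).append(seq)
    return seen


shards = {"rabbit-0": deque(), "rabbit-1": deque(), "rabbit-2": deque()}
# Each publisher is connected to one node and publishes only to the local queue.
local_node = {"pub-a": "rabbit-0", "pub-b": "rabbit-1", "pub-c": "rabbit-1"}
for seq in range(3):
    for publisher, node in local_node.items():
        shards[node].append((publisher, seq))

arrivals = consume_from_all(shards)
order_by_publisher = per_publisher(arrivals)
print(arrivals)
print(order_by_publisher)
```

Messages from different publishers are interleaved, but each publisher's own sequence still arrives in order, which is the guarantee described above.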

0 replies

safe-bug · 2023-03-29T18:10:35Z

safe-bug
Mar 29, 2023

@Jaqy Have you found a solution for your use case? We are experiencing the exact same problem with our use case.

1 reply

Jaqy Mar 30, 2023
Author

@safe-bug we have not. We are working on replacing RabbitMQ. Currently, we are trying out NATS.

kjnilsson · 2023-05-25T09:44:17Z

kjnilsson
May 25, 2023
Maintainer

We have been working on a new exchange type which doesn't solve every challenge discussed here but does allow for highly available low latency workloads given the appropriate topology and consumer configuration.

Anyone interested in this should have a look at the pull request description, where a use case is outlined.

0 replies
