Replies: 4 comments 7 replies
-
There's a few possibilities. One is sharding where the queue name exists on all nodes and publishers/consumers only speak with the queue process that is on the node they connect to. You would have each service have a consumer on all 3 nodes. If a node goes down, you still have the other ones. Publishers could round-robin to all available nodes as well. But that only works if you don't need a single source of truth. If the order matters somewhat this won't do. In that case I think a mechanism similar to mirrored queues would work, except of course the data wouldn't be replicated. Instead you would have the queue process running on all 3 nodes, with only 1 active at a time. Down detection could be enhanced to obtain better results. |
Beta Was this translation helpful? Give feedback.
-
A solution that doesn't rely of failure detectors is very likely to be more reliable, more available and provide lower overall latency. Your idea using the alt exchange is quite a good one but the main issue is that it too relies on failure detection in order to fall back to the alternative exchange. The suggestion to always have a durable active queue on each node where all publishes are local and all consuming apps consume from the queues on all nodes is a good one IMO.
If we could provide a The downside is that the consuming apps will need some degree of knowledge of the system itself to ensure they consume from all nodes but that is the only downside I can see atm. |
Beta Was this translation helpful? Give feedback.
-
@Jaqy Have you found a solution for your use case? We are experiencing the exact same problem with our use case. |
Beta Was this translation helpful? Give feedback.
-
We have been working on a new exchange type which doesn't solve every challenge discussed here but does allow for highly available low latency workloads given the appropriate topology and consumer configuration. Anyone interested in this have a look at the pull request description where a use case is outlined. |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
I want to start a discussion on the need for a highly available queue without data replication, what would immediately be restarted on a different node when the current node the queue is on is shutting down.
Currently we are using transient queues in our project, and I know of at least two other projects within my company that use transient queues as well. The depreciation of transient queues in RabbitMQ version
4.0
is thus a problem for us and our use-case.One of our highest requirements is performance and availability. We work with a performance that is in ms and we are following a die fast - recover fast principle. While data safety is a much lower priority for us.
Why we are using transient queues over other queues:
Transient queues are faster than any raft based replicated queues, and because we are working with ms, is the difference too large to fulfill our performance requirements.
Durable queues become unavailable when the node goes down, meaning there can be a downtime of somewhere between a couple of minutes to like half an hour (if we get unlucky with reclaiming the PVC). Any down time is a big problem for us, due to our 99.999% availability requirement.
All our services that produce and consume are replicated by a min of 3 for redundancy reasons, so exclusive queues can’t be used here either.
So has left us with transient queues, but they aren’t perfect either;
There is a race condition that resulted in these queues being deleted by the broker without informing the consumer. See discussion transient queue can disappear that used to be on a node that went down and gets successfully declared, bind and start consuming from on a new node #6272 for more information on that problem.
The shutdown grace period prevented the consumer from being able to re-declare the queues on a new node until the old node was terminated. ( I believe this to be about 20 seconds). We have designed a solution for this, to ensure that our consumers don’t experience any down time with a node restart / shut down. This does introduce a lot more complexity, and with an increased complexity, you also increase the risk of things going wrong.
So there is a use case for a highly available queue without data replication that gets immediately restarted on a different node when the current node the queue is on is shutting down.
Beta Was this translation helpful? Give feedback.
All reactions