Potential race condition with 'x-consistent-hash' exchanges #3001
-
Hello, I haven't received any feedback on my post in the rabbitmq-users Google group (https://groups.google.com/g/rabbitmq-users/c/sBVo61hHlWg), so I hope someone can help me with my issue here =) I've found an error that looks like a race condition. I don't know how to reproduce it, but here is what I can share. I'm running RabbitMQ 3.8.9 (Erlang 23.1.5) as a cluster of five Docker containers. I have several 'x-consistent-hash' exchanges, and occasionally some of them begin to drop messages. When the error occurs, the following messages appear in the logs:
I've done some research and executed the command:
(Exch1ConsistentHash is a durable 'x-consistent-hash' exchange that has a single consumer bound with routing key 10.) It shows me this:
I don't know Erlang, so please tell me whether I'm reading this correctly:
If this is true, then there is a race condition somewhere. Thank you.
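To make the suspicion concrete, here is a tiny Python sketch of the failure mode I have in mind. It is purely illustrative (the plugin is written in Erlang, and the names and structure here are made up): if the bucket is computed from one view of the ring while a concurrent binding change rewrites it, the lookup can miss and the message is dropped.

```python
# Illustrative toy model of the suspected race, not the plugin's actual code.
ring = {0: "queue-a", 1: "queue-b"}   # bucket index -> bound queue

def route(ring_snapshot, routing_key):
    # Stand-in for the real hash: map the routing key onto a bucket index
    # using the bucket count the router saw when routing started.
    bucket = int(routing_key) % len(ring_snapshot)
    # By the time the bucket is looked up, a concurrent binding change may
    # already have rewritten the ring, so the bucket can be gone.
    return ring.get(bucket)

snapshot = dict(ring)   # the router's view at the start of a publish
del ring[1]             # concurrent unbind on another channel removes bucket 1

for key in ("10", "11"):
    queue = route(snapshot, key)
    print(key, "->", queue if queue else "bucket not found, message dropped")
```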
-
I will convert this issue to a GitHub discussion. Currently GitHub will automatically close and lock the issue even though your question will be transferred and responded to elsewhere. This is just to let you know that we do not intend to ignore this; it is simply how the current GitHub conversion mechanism makes it look to users :(
-
The ring is updated when bindings change. If this happens on a separate connection or channel right before or in the middle of a routing operation on another, it can affect publishing. This is generally true for all exchange types, but this one is stateful. There have been no major changes to this plugin since rabbitmq/rabbitmq-consistent-hash-exchange#37 except for one thing: rabbitmq/rabbitmq-consistent-hash-exchange#46, which adapts to binding recovery changes in RabbitMQ core. Please upgrade to
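From a client's perspective, the scenario described above looks roughly like the hedged pika sketch below. The host, exchange, and queue names are placeholders, and whether this actually triggers the problem on any given version is not guaranteed; it only demonstrates binding churn on one connection while another connection publishes.

```python
import threading
import pika

# Placeholder connection details and names -- adjust for your environment.
PARAMS = pika.ConnectionParameters(host="localhost")
EXCHANGE = "hash-exchange"

# Set up the exchange and a queue before starting the concurrent work.
setup = pika.BlockingConnection(PARAMS)
setup_ch = setup.channel()
setup_ch.exchange_declare(exchange=EXCHANGE, exchange_type="x-consistent-hash", durable=True)
setup_ch.queue_declare(queue="churn-queue")
setup.close()

def binding_churn():
    # Connection 1: repeatedly bind and unbind, forcing ring updates.
    conn = pika.BlockingConnection(PARAMS)
    ch = conn.channel()
    for _ in range(1000):
        ch.queue_bind(exchange=EXCHANGE, queue="churn-queue", routing_key="20")
        ch.queue_unbind(exchange=EXCHANGE, queue="churn-queue", routing_key="20")
    conn.close()

def publisher():
    # Connection 2: publish while the ring is being rewritten.
    conn = pika.BlockingConnection(PARAMS)
    ch = conn.channel()
    for i in range(1000):
        ch.basic_publish(exchange=EXCHANGE, routing_key=str(i), body=b"payload")
    conn.close()

threads = [threading.Thread(target=binding_churn), threading.Thread(target=publisher)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```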
-
We are not aware of any single-channel cases where this plugin can end up with an inconsistent binding state. If you have a way to reproduce this, please share it. The plugin could be more forgiving and simply ignore the cases where the ring changes concurrently, but not everyone would agree that is a good idea.
-
We just hit this issue today. The only way to fix it was to bring down the entire cluster and start the nodes back up, one at a time. We are on version 3.8.14.
-
We are on 3.8.16 and have been having this issue as well. As others have said, the only solution is to take down the entire cluster and start the nodes back up one by one. We can reproduce this quite easily in our own environment, but it would not be possible to share a repo that does this.

We have a 3-node cluster with about 5,000 publishers sending messages to a consistent-hash exchange, and 24 consumers processing those messages (a rough sketch of this pattern follows at the end of this reply). We can reproduce the problem by taking one of the nodes offline: even though the consumers that were on that node reconnect to other nodes, the hashing gets messed up and we start dropping messages. Even if the missing node comes back, it is still not working. The only solution is to take down all nodes and start them back up one by one.

For now we are running on a single node hosted on a clustered VM. This is working fine, and while we would have a couple of minutes of outage if there were a problem with it, that is better than having to manually reset the cluster. As we can reproduce it quite easily, I am happy to do this next weekend and collect any data you may want, or even provide screen-share access if helpful.
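For context, the topology is roughly the pattern below. This is a minimal sketch with placeholder host, exchange, and queue names (not the actual environment), and it assumes a broker with the consistent-hash plugin enabled: each consumer has its own queue bound to the exchange, the binding's routing key acts as the binding's weight on the ring, and publishers hash on the per-message routing key.

```python
import pika

# Placeholder names; adjust for your environment.
conn = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
ch = conn.channel()
ch.exchange_declare(exchange="hash-exchange", exchange_type="x-consistent-hash", durable=True)

# One queue per consumer; the binding's routing key is the binding weight
# (its share of the hash ring), not a literal routing-key match.
for i in range(24):
    queue = f"consumer-{i}"
    ch.queue_declare(queue=queue, durable=True)
    ch.queue_bind(exchange="hash-exchange", queue=queue, routing_key="1")

# Publishers hash on the per-message routing key to pick a queue.
ch.basic_publish(exchange="hash-exchange", routing_key="some-partition-key", body=b"payload")
conn.close()
```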
-
We are currently running 3.9.10 and are seeing an issue that seems identical to this one. We publish a message without error, and it never shows up. The logs show "bucket 1 not found". RabbitMQ had been up and stable for quite a while before clients connected to it.
-
Hello. Maybe just for the record, or maybe a future release will test it. When we hit this issue, we found that the following code always fails (reconstructed here; the connection, channel, and declaration calls were missing from the original snippet and are filled in as assumptions):

```python
import pika

host = ""  # RabbitMQ host
connection = pika.BlockingConnection(pika.ConnectionParameters(host=host))
channel = connection.channel()

exchange_name = 'TestExchange'
queue_name = 'TestQueue'
# Assumed declarations: the original snippet was truncated before these calls.
channel.exchange_declare(exchange=exchange_name, exchange_type='x-consistent-hash', durable=True)
channel.queue_declare(queue=queue_name)

# Bind the same queue twice with different weights, then delete it.
channel.queue_bind(exchange=exchange_name, queue=queue_name, routing_key='10')
channel.queue_bind(exchange=exchange_name, queue=queue_name, routing_key='20')
channel.queue_delete(queue=queue_name)
```

After executing this code, the ring became corrupted:
Ok, then I must have got confused when I thought I reproduced the problem. As far as I can tell, this issue has been fixed a year ago: 878f369. I've been playing with the latest version and can't reproduce.