Potential race condition with 'x-consistent-hash' exchanges #3001
-
Hello, I haven't received any feedback on my post in the rabbitmq-users Google group (https://groups.google.com/g/rabbitmq-users/c/sBVo61hHlWg), so I hope someone can help me with my issue here =) I've found an error that looks like a race condition. I don't know how to reproduce it, but here is what I can share. I'm running RabbitMQ 3.8.9 (Erlang 23.1.5) as a cluster of five Docker containers. I have several 'x-consistent-hash' exchanges, and occasionally some of them begin to drop messages. When the error occurs, the following messages appear in the logs:
I've done some research and executed the command:
(Exch1ConsistentHash is a durable 'x-consistent-hash' exchange that has a single consumer bound with routing key 10.) It shows me this:
I don't know Erlang, so please tell me whether I'm reading this correctly:
If this is true, then there is a race condition somewhere. Thank you.
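To make the suspicion concrete, here is a tiny Python sketch of the failure mode I have in mind. It is purely illustrative (the plugin is written in Erlang, and the names and structure here are made up): if the bucket is computed from one view of the ring while a concurrent binding change rewrites it, the lookup can miss and the message is dropped.

```python
# Illustrative toy model of the suspected race, not the plugin's actual code.
ring = {0: "queue-a", 1: "queue-b"}   # bucket index -> bound queue

def route(ring_snapshot, routing_key):
    # Stand-in for the real hash: map the routing key onto a bucket index
    # using the bucket count the router saw when routing started.
    bucket = int(routing_key) % len(ring_snapshot)
    # By the time the bucket is looked up, a concurrent binding change may
    # already have rewritten the ring, so the bucket can be gone.
    return ring.get(bucket)

snapshot = dict(ring)   # the router's view at the start of a publish
del ring[1]             # concurrent unbind on another channel removes bucket 1

for key in ("10", "11"):
    queue = route(snapshot, key)
    print(key, "->", queue if queue else "bucket not found, message dropped")
```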
-
I will convert this issue to a GitHub discussion. Currently GitHub will automatically close and lock the issue even though your question will be transferred and responded to elsewhere. This is just to let you know that we do not intend to ignore this; it is simply how the current GitHub conversion mechanism makes it look to users :(
-
The ring is updated when bindings change. If this happens on a separate connection or channel right before or in the middle of a routing operation on another, it can affect publishing. This is generally true for all exchange types, but this one is stateful. There have been no major changes to this plugin since rabbitmq/rabbitmq-consistent-hash-exchange#37 except for one thing: rabbitmq/rabbitmq-consistent-hash-exchange#46, which adapts to binding recovery changes in RabbitMQ core. Please upgrade to
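From a client's perspective, the scenario described above looks roughly like the hedged pika sketch below. The host, exchange, and queue names are placeholders, and whether this actually triggers the problem on any given version is not guaranteed; it only demonstrates binding churn on one connection while another connection publishes.

```python
import threading
import pika

# Placeholder connection details and names -- adjust for your environment.
PARAMS = pika.ConnectionParameters(host="localhost")
EXCHANGE = "hash-exchange"

# Set up the exchange and a queue before starting the concurrent work.
setup = pika.BlockingConnection(PARAMS)
setup_ch = setup.channel()
setup_ch.exchange_declare(exchange=EXCHANGE, exchange_type="x-consistent-hash", durable=True)
setup_ch.queue_declare(queue="churn-queue")
setup.close()

def binding_churn():
    # Connection 1: repeatedly bind and unbind, forcing ring updates.
    conn = pika.BlockingConnection(PARAMS)
    ch = conn.channel()
    for _ in range(1000):
        ch.queue_bind(exchange=EXCHANGE, queue="churn-queue", routing_key="20")
        ch.queue_unbind(exchange=EXCHANGE, queue="churn-queue", routing_key="20")
    conn.close()

def publisher():
    # Connection 2: publish while the ring is being rewritten.
    conn = pika.BlockingConnection(PARAMS)
    ch = conn.channel()
    for i in range(1000):
        ch.basic_publish(exchange=EXCHANGE, routing_key=str(i), body=b"payload")
    conn.close()

threads = [threading.Thread(target=binding_churn), threading.Thread(target=publisher)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```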
-
We are not aware of any single-channel cases where this plugin can end up with an inconsistent binding state. If you have a way to reproduce this, please share it. The plugin could be more forgiving and simply ignore the cases where the ring changes concurrently, but not everyone would agree that is a good idea.
-
We just hit this issue today. The only way to fix it was to bring down the entire cluster and start the nodes back up, one at a time. We are on version 3.8.14.
-
We are on 3.8.16 and have been having this issue as well. As others have said, the only solution is to take down the entire cluster and start the nodes back up one by one. We can reproduce this quite easily in our own environment, but it would not be possible to share a repo that does this.

We have a 3-node cluster with about 5,000 publishers sending messages to a consistent-hash exchange, and 24 consumers processing those messages (a rough sketch of this pattern follows at the end of this reply). We can reproduce the problem by taking one of the nodes offline: even though the consumers that were on that node reconnect to other nodes, the hashing gets messed up and we start dropping messages. Even if the missing node comes back, it is still not working. The only solution is to take down all nodes and start them back up one by one.

For now we are running on a single node hosted on a clustered VM. This is working fine, and while we would have a couple of minutes of outage if there were a problem with it, that is better than having to manually reset the cluster. As we can reproduce it quite easily, I am happy to do this next weekend and collect any data you may want, or even provide screen-share access if helpful.
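For context, the topology is roughly the pattern below. This is a minimal sketch with placeholder host, exchange, and queue names (not the actual environment), and it assumes a broker with the consistent-hash plugin enabled: each consumer has its own queue bound to the exchange, the binding's routing key acts as the binding's weight on the ring, and publishers hash on the per-message routing key.

```python
import pika

# Placeholder names; adjust for your environment.
conn = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
ch = conn.channel()
ch.exchange_declare(exchange="hash-exchange", exchange_type="x-consistent-hash", durable=True)

# One queue per consumer; the binding's routing key is the binding weight
# (its share of the hash ring), not a literal routing-key match.
for i in range(24):
    queue = f"consumer-{i}"
    ch.queue_declare(queue=queue, durable=True)
    ch.queue_bind(exchange="hash-exchange", queue=queue, routing_key="1")

# Publishers hash on the per-message routing key to pick a queue.
ch.basic_publish(exchange="hash-exchange", routing_key="some-partition-key", body=b"payload")
conn.close()
```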
-
We are currently running 3.9.10 and are seeing an issue that seems identical to this one. We publish a message without error, and it never shows up. The logs show "bucket 1 not found". RabbitMQ had been up and stable for quite a while before clients connected to it.
-
Hello. Maybe just for the record, or maybe a future release will test it. When we hit this issue, we found that the following code always fails (reconstructed here; the connection, channel, and declaration calls were missing from the original snippet and are filled in as assumptions):

```python
import pika

host = ""  # RabbitMQ host
connection = pika.BlockingConnection(pika.ConnectionParameters(host=host))
channel = connection.channel()

exchange_name = 'TestExchange'
queue_name = 'TestQueue'
# Assumed declarations: the original snippet was truncated before these calls.
channel.exchange_declare(exchange=exchange_name, exchange_type='x-consistent-hash', durable=True)
channel.queue_declare(queue=queue_name)

# Bind the same queue twice with different weights, then delete it.
channel.queue_bind(exchange=exchange_name, queue=queue_name, routing_key='10')
channel.queue_bind(exchange=exchange_name, queue=queue_name, routing_key='20')
channel.queue_delete(queue=queue_name)
```

After executing this code, the ring became corrupted:
Ok, then I must have got confused when I thought I reproduced the problem. As far as I can tell, this issue has been fixed a year ago: 878f369. I've been playing with the latest version and can't reproduce.