Replies: 4 comments 4 replies
-
I should point out the upgrades were performed as a rolling upgrade: Node 1 was taken offline, its packages were upgraded, and it was added back to the cluster, followed by Node 2, and so on.
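For reference, a minimal sketch of that rolling procedure on Debian/Ubuntu, assuming the default rabbitmq-server package name and placeholder hostnames (node1..node3); this is an illustration of the steps described above, not the exact commands used:

```shell
# Upgrade one node at a time; the rest of the cluster keeps serving traffic.
# Hostnames and the package name are placeholders for this sketch.
for node in node1 node2 node3; do
  ssh "$node" 'sudo rabbitmqctl stop_app'      # take the node out of the cluster
  ssh "$node" 'sudo apt-get update && sudo apt-get install -y rabbitmq-server'
  ssh "$node" 'sudo rabbitmqctl start_app'     # rejoin the cluster
  ssh "$node" 'sudo rabbitmqctl await_startup' # block until the node reports ready
done
```

Each node must finish rejoining before the next one is taken down, hence the sequential loop rather than a parallel upgrade.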
-
I will convert this issue to a GitHub discussion. Currently GitHub will automatically close and lock the issue even though your question will be transferred and responded to elsewhere. This is to let you know that we do not intend to ignore it; this is just how the current GitHub conversion mechanism makes it look to users :(
-
very likely means a classic queue leader process is not running. This has nothing to do with CLI tools.
means the node responded with an authentication error that usually indicates a mismatched Erlang cookie. Consider restarting the nodes one by one and trying the rebalance again. Note that classic mirrored queues are a deprecated feature that will be removed in a future version.
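One quick way to rule a cookie mismatch in or out is to compare the cookie file across nodes and for the user running the CLI tools. The paths below are the defaults for the Debian packages, and the hostnames are placeholders:

```shell
# The server-side cookie must be byte-identical on every cluster node.
# Default location for the Debian/Ubuntu packages.
for node in node1 node2 node3; do
  ssh "$node" 'sudo md5sum /var/lib/rabbitmq/.erlang.cookie'
done

# CLI tools (rabbitmqctl, rabbitmq-queues, ...) read the cookie of the
# invoking user, which must match the server cookie as well.
md5sum ~/.erlang.cookie
```

If all the checksums match, the cookie is not the problem and the authentication error has another cause.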
-
Hello,
Here’s the reference to converting this to a GitHub Discussion…
michaelklishin (Maintainer)<https://github.com/michaelklishin>, on Mar 17, 2021 (#2903 (comment)):
I will convert this issue to a GitHub discussion. Currently GitHub will automatically close and lock the issue<community/community#3197> even though your question will be transferred and responded to elsewhere. This is to let you know that we do not intend to ignore this but this is how the current GitHub conversion mechanism makes it seem for the users :(
Thanks for the info you provided. I will pass this on to the developers.
From: Michael Klishin
Sent: Wednesday, May 3, 2023 1:41 PM
To: rabbitmq/rabbitmq-server
Subject: Re: [rabbitmq/rabbitmq-server] rabbitmq-queues rebalance fails after an upgrade from 3.8.5 to 3.8.14 (Discussion #2903)
I do not see any references to GitHub discussions in my reply from two years ago.
There have been hardly any changes to classic mirrored queues since then, and there will not be any in the future (classic non-mirrored queues have seen a non-trivial investment, and the plan is to continue improving them).
My best guess is that this classic mirrored queue ended up without an electable leader<https://rabbitmq.com/ha.html#unsynchronised-mirrors>. Disabling mirroring may or may not help; deleting and redeclaring the queue should.
Quorum queues do not have this problem (but require a quorum of nodes to be online). They have been around for several years and there is a set of pretty well understood migration steps<https://blog.rabbitmq.com/posts/2023/02/quorum-queues-migration/> besides Blue Green deployment.
Classic mirrored queues will be completely removed in 4.0 which is expected early next year. This was announced over one year ago.
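The delete-and-redeclare step suggested above could look roughly like the following; the queue name and vhost are placeholders, and rabbitmqadmin (from the management plugin) is just one of several ways to redeclare a queue:

```shell
# Delete the stuck classic mirrored queue (rabbitmqctl delete_queue is
# available in 3.8.x; -p selects the vhost).
sudo rabbitmqctl delete_queue my-queue -p /

# Redeclare it as a quorum queue instead. The x-queue-type argument must
# be supplied at declaration time; it cannot be changed afterwards.
rabbitmqadmin declare queue name=my-queue durable=true \
  arguments='{"x-queue-type":"quorum"}'
```

Applications declaring the queue themselves would need the same x-queue-type argument in their queue.declare call.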
-
I upgraded a few RabbitMQ clusters from 3.8.5 to 3.8.14 using official deb packages today and noticed an issue with the rabbitmq-queues CLI. From a user perspective it errors out with:
In the crash logs this is seen as:
This made me think it's a cookie issue, but I am able to run rabbitmqctl and rabbitmq-diagnostics without issue. In addition, the info.log looks like the command connected in and started the process:
Erlang Version: 23.2.7
RabbitMQ 3.8.14
Ubuntu 16.04.6 LTS
Interestingly, I was able to reproduce this on all nodes that were upgraded from 3.8.5 to 3.8.14, but the problem does not exist on nodes in clusters that were upgraded from 3.7.16.
Please let me know if there is any other info you'd like me to collect.