3.9: how to force remove a node? #8833

falcoriss · 2023-07-11T16:12:24Z

falcoriss
Jul 11, 2023

OS : Ubuntu 22.04
Version : 3.9.13-1
Erlang : 1:24.2.1

Describe the bug

Something caught my attention on one of our clusters.
The web UI takes about 10 seconds to respond when navigating, and sometimes shows the following message: Error: could not connect to server since 2023-07-11 18:03:31. Will retry at 2023-07-11 18:03:50.
The last time I had problems like this, they were related to quorum queues, so I checked them and found something fishy.

I checked the quorum queue status and got:

Status of quorum queue ms_rails;destroy_photo_set on node rabbit@ip-10-1-103-210 ...
┌────────────────────────┬─────────────────────────────────────┬───────────┬──────────────┬────────────────┬──────┬─────────────────┐
│ Node Name │ Raft State │ Log Index │ Commit Index │ Snapshot Index │ Term │ Machine Version │
├────────────────────────┼─────────────────────────────────────┼───────────┼──────────────┼────────────────┼──────┼─────────────────┤
│ rabbit@ip-10-1-103-210 │ follower │ 1291 │ 1291 │ 941 │ 11 │ 1 │
├────────────────────────┼─────────────────────────────────────┼───────────┼──────────────┼────────────────┼──────┼─────────────────┤
│ rabbit@ip-10-1-103-190 │ {nodedown,'rabbit@ip-10-1-103-190'} │ │ │ │ │ │
├────────────────────────┼─────────────────────────────────────┼───────────┼──────────────┼────────────────┼──────┼─────────────────┤
│ rabbit@ip-10-1-101-9 │ leader │ 1291 │ 1291 │ 941 │ 11 │ 1 │
├────────────────────────┼─────────────────────────────────────┼───────────┼──────────────┼────────────────┼──────┼─────────────────┤
│ rabbit@ip-10-1-102-24 │ follower │ 1291 │ 1291 │ 941 │ 11 │ 1 │
└────────────────────────┴─────────────────────────────────────┴───────────┴──────────────┴────────────────┴──────┴─────────────────┘

I checked AWS and, indeed, one of the nodes had crashed and been replaced in the autoscaling group.

But now I am in a situation where I cannot get rid of this ghost node. It prevents rebalancing quorum queues across all 3 nodes, causes the UI problems, and maybe more.

Reproduction steps

Create a cluster with 3 nodes
Create quorum queues and balance them on all 3 nodes
Force crash a node
Replace it with a new node
Check the quorum queue status with: for queue in $(rabbitmqctl list_queues --vhost staging2 type, name, members | egrep "^quorum" | awk '{print $2}') ; do rabbitmq-queues quorum_status --vhost "staging2" "$queue" ; done
You will get output like:
┌────────────────────────┬─────────────────────────────────────┬───────────┬──────────────┬────────────────┬──────┬─────────────────┐
│ Node Name │ Raft State │ Log Index │ Commit Index │ Snapshot Index │ Term │ Machine Version │
├────────────────────────┼─────────────────────────────────────┼───────────┼──────────────┼────────────────┼──────┼─────────────────┤
│ rabbit@ip-10-1-103-210 │ leader │ 24 │ 24 │ undefined │ 13 │ 1 │
├────────────────────────┼─────────────────────────────────────┼───────────┼──────────────┼────────────────┼──────┼─────────────────┤
│ rabbit@ip-10-1-103-190 │ {nodedown,'rabbit@ip-10-1-103-190'} │ │ │ │ │ │
├────────────────────────┼─────────────────────────────────────┼───────────┼──────────────┼────────────────┼──────┼─────────────────┤
│ rabbit@ip-10-1-101-9 │ follower │ 24 │ 24 │ undefined │ 13 │ 1 │
├────────────────────────┼─────────────────────────────────────┼───────────┼──────────────┼────────────────┼──────┼─────────────────┤
│ rabbit@ip-10-1-102-24 │ follower │ 24 │ 24 │ undefined │ 13 │ 1 │
└────────────────────────┴─────────────────────────────────────┴───────────┴──────────────┴────────────────┴──────┴─────────────────┘

rabbitmqctl forget_cluster_node rabbit@ip-10-1-103-190 will give you:

Removing node rabbit@ip-10-1-103-190 from the cluster
Error:
{:not_a_cluster_node, 'The node selected is not in the cluster.'}

rabbitmq-queues rebalance quorum will give you:

Re-balancing leaders of quorum queues...

15:47:33.621 [warn] Error migrating queue {:resource, "staging2", :queue, "jmtest1"}: :unknown_member

15:47:33.622 [warn] Error migrating queue {:resource, "staging2", :queue, "jmtest2"}: :unknown_member

15:47:33.623 [warn] Error migrating queue {:resource, "staging2", :queue, "jmtest3"}: :unknown_member

15:47:33.624 [warn] Error migrating queue {:resource, "staging2", :queue, "jmtest4"}: :unknown_member

15:47:33.629 [warn] Node :"rabbit@ip-10-1-101-9" contains 4 queues, but all have already migrated. Do nothing
┌────────────────────────┬─────────────────────────┬─────────────────────────────────────┐
│ Node name │ Number of quorum queues │ Number of replicated classic queues │
├────────────────────────┼─────────────────────────┼─────────────────────────────────────┤
│ rabbit@ip-10-1-103-210 │ 1 │ 0 │
├────────────────────────┼─────────────────────────┼─────────────────────────────────────┤
│ rabbit@ip-10-1-102-24 │ 1 │ 0 │
├────────────────────────┼─────────────────────────┼─────────────────────────────────────┤
│ rabbit@ip-10-1-101-9 │ 4 │ 0 │
└────────────────────────┴─────────────────────────┴─────────────────────────────────────┘
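
The status loop from the reproduction steps can be written a little more defensively. The following is a minimal sketch, not a tested procedure: the helper name quorum_queue_names is made up for illustration, and it assumes list_queues is asked for exactly the columns type and name, in that order, with --quiet and --no-table-headers so that only tab-separated rows reach the filter.

```shell
#!/bin/sh
# Sketch only: pick quorum queue names out of `rabbitmqctl list_queues` output.

# quorum_queue_names (hypothetical helper)
# Reads tab-separated "type<TAB>name" rows on stdin and prints the names
# of the rows whose type is exactly "quorum".
quorum_queue_names() {
  awk -F '\t' '$1 == "quorum" { print $2 }'
}

# Intended use (needs a running node, so it is left commented out):
# rabbitmqctl list_queues --vhost staging2 --quiet --no-table-headers type name |
#   quorum_queue_names |
#   while IFS= read -r queue; do
#     rabbitmq-queues quorum_status --vhost staging2 "$queue" </dev/null
#   done
```

Splitting on tabs rather than whitespace keeps queue names that contain spaces intact, and quoting "$queue" protects names such as ms_rails;destroy_photo_set from the shell.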

How can I get rid of this node?
How can I end up in such a state after losing only 1 out of 3 nodes?

Expected behavior

I am not sure what I understand less: the lack of a way to force remove a dead node, or the fact that losing 1 of 3 nodes creates such problems.

Additional context

No response

Answered by michaelklishin


michaelklishin · 2023-07-11T16:21:15Z

michaelklishin
Jul 11, 2023
Maintainer

RabbitMQ 3.9 is out of community support.

To remove a QQ replica you must have a quorum of nodes online, then use CLI tools. The fact that this is a manual step is by design: the team was extra careful about making QQ replica management "highly dynamic", as that was one of the fundamental issues with classic mirrored queues (now deprecated).

You can force remove a cluster member (a node, not a queue replica). IIRC since 3.9 this should remove all QQ replicas from that node.
I am less sure whether that operation is forced for cluster nodes that are no longer available.
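
As a rough illustration of the CLI route described above, here is a minimal sketch that removes the dead node from each queue's membership with rabbitmq-queues delete_member. It assumes a quorum of the remaining replicas is online for every queue. The function name remove_replicas and the RABBITMQ_QUEUES variable are hypothetical conveniences, not part of RabbitMQ; the node and vhost names in the usage line come from the report above.

```shell
#!/bin/sh
# Sketch only: remove the quorum queue replicas hosted on a node that is gone.
# Assumes a quorum of the remaining members is online for every queue.

# Hypothetical knob: set RABBITMQ_QUEUES=echo to print the arguments
# instead of running the real CLI.
RABBITMQ_QUEUES="${RABBITMQ_QUEUES:-rabbitmq-queues}"

# remove_replicas VHOST NODE (hypothetical helper)
# Reads queue names from stdin, one per line, skips empty lines, and
# removes NODE from the membership of each queue in VHOST.
remove_replicas() (
  vhost="$1"
  node="$2"
  while IFS= read -r queue; do
    [ -n "$queue" ] || continue
    # </dev/null keeps the CLI from consuming the list of queue names.
    "$RABBITMQ_QUEUES" delete_member --vhost "$vhost" "$queue" "$node" </dev/null
  done
)

# Dry run:
#   printf 'jmtest1\njmtest2\n' | RABBITMQ_QUEUES=echo remove_replicas staging2 rabbit@ip-10-1-103-190
```

rabbitmq-queues shrink followed by the node name is the one-step alternative that removes every quorum queue replica hosted on that node. Whether either command succeeds on 3.9 for a node that has already left the cluster is not confirmed in this thread.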

0 replies

michaelklishin · 2023-07-11T16:23:32Z

michaelklishin
Jul 11, 2023
Maintainer

In 3.13, scheduled to ship later in 2023, quorum queue replica management will be more automatic, including a (likely opt-in) way of removing replicas from nodes that are no longer cluster members: #8218.

In a similar spirit, several cluster formation plugins offer periodic removal of unavailable nodes, a feature that I would not recommend to most but which was originally introduced for exactly this kind of environment: AWS autoscaling groups. Again, since 3.9 this should be handled better by QQs that have replicas on the node that's being removed.

0 replies

falcoriss · 2023-07-12T12:18:48Z

falcoriss
Jul 12, 2023
Author

Thanks for your reply.

You can force remove a cluster member (a node, not a queue replica). IIRC since 3.9 this should remove all QQ replicas from that node.
I am less sure whether that operation is forced for cluster nodes that are no longer available.

As stated above, commands such as rabbitmqctl forget_cluster_node do not work for removing an offline node.

I understand that from 3.13 there will be tools to address this, but can something be done on 3.9?

0 replies

falcoriss · 2023-07-20T08:42:00Z

falcoriss
Jul 20, 2023
Author

I managed to find the setting in my configuration that was responsible for the node removal:

cluster_formation.node_cleanup.only_log_warning = false

I removed it, and now at least if AWS decides to kill one of my ASG nodes, the node won't be removed from the cluster before I get to remove it from the quorum.
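
To check other nodes for the same setting, a small sketch like the one below may help. The function name node_cleanup_mode is made up for illustration. It only looks for an explicit only_log_warning = false line in the given file, and it reports "warn-only" in every other case on the assumption, consistent with the result described above, that nodes are not removed when the line is absent.

```shell
#!/bin/sh
# Sketch only: report whether a rabbitmq.conf explicitly enables removal
# of unavailable nodes by peer discovery cleanup.

# node_cleanup_mode CONF_FILE (hypothetical helper)
# Prints "remove" when the file sets
#   cluster_formation.node_cleanup.only_log_warning = false
# and "warn-only" otherwise (assumption: absence of the line means warn-only).
node_cleanup_mode() {
  if grep -Eq '^[[:space:]]*cluster_formation\.node_cleanup\.only_log_warning[[:space:]]*=[[:space:]]*false[[:space:]]*$' "$1"; then
    echo remove
  else
    echo warn-only
  fi
}

# Example (the path is an assumption and depends on the installation):
#   node_cleanup_mode /etc/rabbitmq/rabbitmq.conf
```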

What I'm still missing is the reason why the UI was laggy when one of the quorum members was missing.

1 reply

michaelklishin Jul 20, 2023
Maintainer

We don't have many details to work with here, but my guess would be this: some UI operations contact the queue, and if a QQ does not have a quorum, all (key) operations on it will fail, but some will fail only after a timeout.
