3.9: how to force remove a node? #8833
-
OS: Ubuntu 22.04

Describe the bug

Something drew my attention on one of our clusters. I checked the quorum queue status and got:

Status of quorum queue ms_rails;destroy_photo_set on node rabbit@ip-10-1-103-210 ...

I checked AWS and, indeed, one of the nodes had crashed and been replaced in the autoscaling group. Now I can't seem to get rid of this ghost node. It prevents rebalancing quorum queues across all 3 nodes, causes UI problems, and maybe more.

Reproduction steps
Re-balancing leaders of quorum queues...
15:47:33.621 [warn] Error migrating queue {:resource, "staging2", :queue, "jmtest1"}: :unknown_member
15:47:33.622 [warn] Error migrating queue {:resource, "staging2", :queue, "jmtest2"}: :unknown_member
15:47:33.623 [warn] Error migrating queue {:resource, "staging2", :queue, "jmtest3"}: :unknown_member
15:47:33.624 [warn] Error migrating queue {:resource, "staging2", :queue, "jmtest4"}: :unknown_member
15:47:33.629 [warn] Node :"rabbit@ip-10-1-101-9" contains 4 queues, but all have already migrated. Do nothing

How can I get rid of this node?

Expected behavior

I'm not sure which I understand less: the lack of a way to force remove a dead node, or the fact that losing 1 of 3 nodes creates such a problem.

Additional context

No response
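To see at a glance which queues the rebalance could not migrate, the warnings above can be filtered mechanically. The following is only a minimal sketch: the function name `failed_migrations` and the regular expression are illustrative assumptions based on the log format shown in this post, not part of any RabbitMQ tooling.

```python
import re

# Matches rebalance warnings in the format shown above, e.g.
#   [warn] Error migrating queue {:resource, "staging2", :queue, "jmtest1"}: :unknown_member
# The pattern is an assumption derived from these log lines only.
WARNING = re.compile(
    r'Error migrating queue \{:resource, "(?P<vhost>[^"]+)", '
    r':queue, "(?P<queue>[^"]+)"\}: :(?P<reason>\w+)'
)


def failed_migrations(log_text, reason="unknown_member"):
    """Return (vhost, queue) pairs whose migration failed for the given reason."""
    return [
        (m["vhost"], m["queue"])
        for m in WARNING.finditer(log_text)
        if m["reason"] == reason
    ]
```

Fed the output above, this would list the four `jmtest` queues in the `staging2` virtual host, which are the ones that still reference the missing member.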
Replies: 4 comments 1 reply
-
In 3.13, scheduled to ship later in 2023, quorum queue replica management will be more automatic, including a (likely opt-in) way of removing replicas from nodes that are no longer cluster members: #8218. In a similar spirit, several cluster formation plugins offer periodic removal of unavailable nodes, a feature that I would not recommend to most but which was originally introduced for exactly this kind of environment: AWS autoscaling groups. Again, since 3.9 this should be handled better by QQs that have replicas on the node that's being removed.
-
Thx for your reply.
As stated above, commands such as rabbitmqctl forget_cluster_node don't work to remove an offline node. I understand that from 3.13 there will be tools to address this, but can something be done on 3.9?
-
I managed to find the setting in my conf that was responsible for the node removal: cluster_formation.node_cleanup.only_log_warning = false. I removed it, and now at least if AWS decides to kill one of my ASG nodes, it won't be removed from the cluster before I get to remove it from the quorum. What I'm still missing is the reason why the UI was laggy when one of the quorum members was missing.
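For anyone wanting to check their own configuration for the same setting, here is a rough sketch of a check over the text of a rabbitmq.conf file. The helper name `only_log_warning` is hypothetical, and the `default=True` value is an assumption (that the cleanup only logs warnings when the key is absent), consistent with what I observed after removing the line but not something confirmed elsewhere in this thread.

```python
KEY = "cluster_formation.node_cleanup.only_log_warning"


def only_log_warning(conf_text, default=True):
    """Return the effective value of the node cleanup 'only log' setting.

    `default` is an assumption: when the key is absent, the cleanup is
    taken to log warnings only and leave cluster membership alone.
    """
    value = default
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" not in line:
            continue
        name, _, setting = line.partition("=")
        if name.strip() == KEY:
            value = setting.strip().lower() == "true"  # last occurrence wins
    return value
```

If this returns `False`, nodes unknown to the peer discovery backend may be removed from the cluster automatically, which is the situation described above.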
RabbitMQ 3.9 is out of community support.
To remove a QQ replica you must have a quorum of nodes online, then use CLI tools. That this is a manual step is by design: the team was wary of making QQ replica management "highly dynamic", as that was one of the fundamental issues with classic mirrored queues (now deprecated).
You can force remove a cluster member (a node, not a queue replica). IIRC since 3.9 this should remove all QQ replicas from that node.
I am less sure whether that operation is forced for cluster nodes that are no longer available.
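The two manual steps mentioned above can be sketched as a small helper that builds the CLI invocations: `rabbitmq-queues delete_member` for each queue replica and `rabbitmqctl forget_cluster_node` for the node itself. This is only a sketch: the function names and the node name `rabbit@dead-node` are placeholders (the thread does not show the ghost node's name), and whether these commands succeed against a node that has already left the cluster on 3.9 is not confirmed here.

```python
import subprocess


def removal_commands(vhost, queues, dead_node):
    """Build argv lists that remove QQ replicas on dead_node, then forget the node."""
    commands = [
        ["rabbitmq-queues", "delete_member", "--vhost", vhost, queue, dead_node]
        for queue in queues
    ]
    commands.append(["rabbitmqctl", "forget_cluster_node", dead_node])
    return commands


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            # A non-zero exit status raises CalledProcessError.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # "rabbit@dead-node" is a hypothetical name; substitute the real one.
    run_all(removal_commands("staging2", ["jmtest1", "jmtest2"], "rabbit@dead-node"))
```

Passing arguments as a list rather than a shell string avoids quoting problems with queue names such as `ms_rails;destroy_photo_set`, where the semicolon would otherwise be interpreted by the shell. Run these on a node that is part of an online quorum.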