Upgrading from 3.10.17 to 3.11.9: rabbit_peer_discovery_cleanup keeps crashing and will not remove the node #7297
-
Upgrading from 3.10.17 to 3.11.9 (Erlang 25.2), rabbit_peer_discovery_cleanup keeps crashing and will not remove the node. I also cannot manually remove the node using forget_cluster_node. I see mention of this in 3.8 (rabbitmq/discussions#183 (comment)), but not in 3.11. How do I remove the node?
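For reference, the removal attempt that fails looks something like this; the node name is a placeholder, not taken from the original report:

```bash
# rabbit@old-node stands in for the stale cluster member; run from a
# healthy node. In this state the command fails instead of removing it.
rabbitmqctl forget_cluster_node rabbit@old-node
```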
-
Does anyone have a clue how to remove the node? I have two nodes that are stuck in this state.
-
Same issue here after an upgrade from 3.10.18 to 3.11.15.
-
Nobody here has provided steps to reproduce this issue. They need to be more detailed than "upgrade from X to Y": what are the exact commands you are using to upgrade?
-
Hi. Unfortunately, I have not been able to reproduce this issue on three other clusters that we have. The cluster is created via
-
I did exactly the same steps as @brycechesternewman: adding a new AMI with the new RabbitMQ/Erlang version to the AWS autoscaling group, waiting to have a working mixed-version cluster, and removing the old cluster nodes by destroying them. (From 3.7 to 3.10 this worked without issues, following the compatibility matrix.) These logs are returned by the running nodes when trying to remove the old nodes (those instances no longer exist, having been destroyed by downsizing the scaling group) with the following (trying to follow the CLI usage):
(I can try to reproduce this in a lab environment with a fresh 3.10 cluster and try to upgrade it to 3.11.)
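For completeness, a minimal sketch of how the mixed-version state can be checked before the old instances are destroyed; run it on any cluster member, nothing here is ASG-specific:

```bash
# Lists disc and running nodes and, on modern releases, the RabbitMQ
# and Erlang versions per node, which is how a mixed-version cluster
# shows up before the old members are removed.
rabbitmqctl cluster_status
```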
-
Ran into the same issue after upgrading from 3.11.18 to 3.12.4.
-
This thread makes it sound like some folks believe that AWS autoscaling groups are an upgrade mechanism. That's not really how the feature flags subsystem was meant to be used; in particular, it creates a scenario where previously reachable nodes on older versions are already gone, which is not the case with rolling cluster upgrades. What is also missing from such "upgrades" is any automation that knows when and how to enable all feature flags. During traditional rolling upgrades, that moment is right after the last node is rolled and boots (rejoins the cluster).
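For anyone scripting this, that step is a single command; a minimal sketch, run once after the last upgraded node has rejoined:

```bash
# Enable all feature flags cluster-wide; only safe to run once every
# member is on the new version.
rabbitmqctl enable_feature_flag all
```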
-
We use multiple ASGs for upgrades because it allows a mostly-automated process to get onto a new version of RMQ/Erlang/AMIs without reducing capacity, while also having a reasonable checkpoint for verification and trivial completion/rollback: simply delete the ASG for the nodes you want to go away. In-place rolling upgrades with automatic feature flag enabling don't have the same benefits. The unreachable and irremovable nodes make it impossible to upgrade without migrating to another cluster entirely. Such nodes cannot be forgotten with forget_cluster_node. I was able to work around the unreachable and irremovable node issue by creating some impostor nodes and then removing those; a sketch follows below.
I had to do some extra work with the ASGs given the AWS permissions I had access to, but in principle the same trick would be doable without the ASG-related steps and the forgetting/resetting steps, just by having some fresh nodes that had never joined a cluster and editing the files on those before they ever tried. Really it would be nice if
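A rough sketch of the impostor trick, with every name hypothetical; it assumes DNS or /etc/hosts resolves the dead node's hostname to the fresh host and that the fresh host uses the cluster's Erlang cookie (the on-disk file edits mentioned above are omitted):

```bash
# Boot a fresh, never-clustered node under the dead member's name.
export RABBITMQ_NODENAME=rabbit@old-node-1   # name of the gone node
rabbitmq-server -detached

# Stop the application (the Erlang VM stays up) so the name can be
# removed through the normal path, then forget it from a healthy node:
rabbitmqctl stop_app
rabbitmqctl -n rabbit@healthy-node forget_cluster_node rabbit@old-node-1
```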
@dumbbell has suggested that, since the table only has one copy in 3.11 and up to 3.12.3, if the node that was hosting it is gone forever, all operations on said table will fail. The best RabbitMQ can do is ignore the errors, but as of #9005, that should not be necessary in theory.
This is a limitation that won't be relevant starting with 3.13 and Khepri, and with 4.0 for all users. A workaround should be fairly straightforward: separate upgrades from AWS autoscaling group actions, for example by using an AMI or pinning the RabbitMQ version some other way, so that an ASG action does not introduce a cluster member with a different version.
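One hedged example of such pinning, assuming a Debian/Ubuntu-based AMI; the package version string is a placeholder, and the point is only that an ASG scale-out installs the same version the cluster already runs:

```bash
# Bake a fixed version into the AMI so new ASG instances match the
# running cluster; 3.12.4-1 is a placeholder version string.
apt-get install -y rabbitmq-server=3.12.4-1
apt-mark hold rabbitmq-server
```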