Upgrading from 3.10.17 to 3.11.9: rabbit_peer_discovery_cleanup keeps crashing and will not remove the node #7297
-
Upgrading from 3.10.17 to 3.11.9 (Erlang 25.2), rabbit_peer_discovery_cleanup keeps crashing and will not remove the node. I also cannot manually remove the node using forget_cluster_node. I see mention of this in 3.8 (rabbitmq/discussions#183 (comment)), but not in 3.11. How do I remove the node?
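For reference, the removal attempt that fails looks something like this; the node name is a placeholder, not taken from the original report:

```bash
# rabbit@old-node stands in for the stale cluster member; run from a
# healthy node. In this state the command fails instead of removing it.
rabbitmqctl forget_cluster_node rabbit@old-node
```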
-
Does anyone have a clue how to remove the node? I have two nodes that are stuck in this state.
-
Same issue here after an upgrade from 3.10.18 to 3.11.15.
-
Nobody here has provided steps to reproduce this issue. They need to be more detailed than "upgrade from X to Y": what are the exact commands you are using to upgrade?
-
Hi. Unfortunately, I have not been able to reproduce this issue on three other clusters that we have. The cluster is created via
-
I did exactly the same steps as @brycechesternewman: adding a new AMI with the new RabbitMQ/Erlang version to the AWS autoscaling group, waiting to have a working mixed-version cluster, and removing the old cluster nodes by destroying them. (From 3.7 to 3.10 this worked without issues, following the compatibility matrix.) These logs are returned by the running nodes when trying to remove the old nodes (those instances no longer exist, having been destroyed by downsizing the scaling group) with the following (trying to follow the CLI usage):
(I can try to reproduce this in a lab environment with a fresh 3.10 cluster and try to upgrade it to 3.11.)
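For completeness, a minimal sketch of how the mixed-version state can be checked before the old instances are destroyed; run it on any cluster member, nothing here is ASG-specific:

```bash
# Lists disc and running nodes and, on modern releases, the RabbitMQ
# and Erlang versions per node, which is how a mixed-version cluster
# shows up before the old members are removed.
rabbitmqctl cluster_status
```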
-
Ran into the same issue after upgrading from 3.11.18 to 3.12.4.
-
This thread makes it sound like some folks believe that AWS autoscaling groups are an upgrade mechanism. That's not really how the feature flags subsystem was meant to be used; in particular, it creates a scenario where previously reachable nodes on older versions are already gone, which is not the case with rolling cluster upgrades. What is also missing from such "upgrades" is any automation that knows when and how to enable all feature flags. During traditional rolling upgrades, that moment is right after the last node is rolled and boots (rejoins the cluster).
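For anyone scripting this, that step is a single command; a minimal sketch, run once after the last upgraded node has rejoined:

```bash
# Enable all feature flags cluster-wide; only safe to run once every
# member is on the new version.
rabbitmqctl enable_feature_flag all
```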
-
We use multiple ASGs for upgrades because it allows a mostly-automated process to get onto a new version of RMQ/Erlang/AMIs without reducing capacity, while also having a reasonable checkpoint for verification and trivial completion/rollback: simply delete the ASG for the nodes you want to go away. In-place rolling upgrades with automatic feature flag enabling don't have the same benefits. The unreachable and irremovable nodes make it impossible to upgrade without migrating to another cluster entirely. Such nodes cannot be forgotten with forget_cluster_node. I was able to work around the unreachable and irremovable node issue by creating some impostor nodes and then removing those; a sketch follows below.
I had to do some extra work with the ASGs given the AWS permissions I had access to, but in principle the same trick would be doable without the ASG-related steps and the forgetting/resetting steps, just by having some fresh nodes that had never joined a cluster and editing the files on those before they ever tried. Really it would be nice if
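A rough sketch of the impostor trick, with every name hypothetical; it assumes DNS or /etc/hosts resolves the dead node's hostname to the fresh host and that the fresh host uses the cluster's Erlang cookie (the on-disk file edits mentioned above are omitted):

```bash
# Boot a fresh, never-clustered node under the dead member's name.
export RABBITMQ_NODENAME=rabbit@old-node-1   # name of the gone node
rabbitmq-server -detached

# Stop the application (the Erlang VM stays up) so the name can be
# removed through the normal path, then forget it from a healthy node:
rabbitmqctl stop_app
rabbitmqctl -n rabbit@healthy-node forget_cluster_node rabbit@old-node-1
```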
@dumbbell has suggested that, since the table only has one copy in 3.11 and up to 3.12.3, if the node that was hosting it is gone forever, all operations on said table will fail. The best RabbitMQ can do is ignore the errors, but as of #9005, that should not be necessary in theory.
This is a limitation that won't be relevant starting with 3.13 and Khepri, and with 4.0 for all users. A workaround should be fairly straightforward: separate upgrades from AWS autoscaling group actions, for example by using an AMI or pinning the RabbitMQ version some other way, so that an ASG action does not introduce a cluster member with a different version.
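One hedged example of such pinning, assuming a Debian/Ubuntu-based AMI; the package version string is a placeholder, and the point is only that an ASG scale-out installs the same version the cluster already runs:

```bash
# Bake a fixed version into the AMI so new ASG instances match the
# running cluster; 3.12.4-1 is a placeholder version string.
apt-get install -y rabbitmq-server=3.12.4-1
apt-mark hold rabbitmq-server
```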