Commit 289d4da

Clustering guide: rework the sections on cluster member removal and node reset
1 parent cf2b6cc commit 289d4da

File tree: 4 files changed, +530 −238 lines

docs/clustering.md: 137 additions & 44 deletions
@@ -975,51 +975,52 @@ rabbitmqctl cluster_status
# => ...done.
```

## How to Remove a Node from the Cluster {#removing-nodes}

Sometimes it is necessary to remove a node from the cluster.

The sequence of actions will be slightly different for the two most common scenarios:

* The node is online and reachable
* The node is offline and cannot be recovered

In addition, some [peer discovery mechanisms](./cluster-formation) support node health checks and
[forced removal of nodes](./cluster-formation#node-health-checks-and-cleanup) not known to the discovery backend.
That feature is opt-in (deactivated by default).
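The forced removal feature is controlled via peer discovery settings in `rabbitmq.conf`. A minimal sketch, with key names as documented in the [cluster formation](./cluster-formation) guide and illustrative values:

```ini
# Allow the discovery backend to remove nodes it no longer knows about.
# The default (true) only logs a warning instead of removing the node.
cluster_formation.node_cleanup.only_log_warning = false

# How often (in seconds) to check for unknown nodes
cluster_formation.node_cleanup.interval = 60
```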

Continuing with the three node cluster example used in this guide,
let's demonstrate how to remove `rabbit@rabbit3` from the cluster, returning it to
independent operation.

### Removal of a Reachable Node

The first step before removing a node from the cluster is to stop it:

```bash
# on rabbit3
rabbitmqctl stop_app
# => Stopping node rabbit@rabbit3 ...done.
```

Then use `rabbitmqctl forget_cluster_node` on another node
and specify the node to remove as **the first positional argument**:

```bash
# on rabbit2
rabbitmqctl forget_cluster_node rabbit@rabbit3
# => Removing node rabbit@rabbit3 from cluster ...
```

Running the

```shell
rabbitmq-diagnostics cluster_status
```

command on the nodes confirms that `rabbit@rabbit3` is no longer part of
the cluster and operates independently:

```bash
@@ -1036,17 +1037,32 @@ rabbitmqctl cluster_status
# => [{nodes,[{disc,[rabbit@rabbit1,rabbit@rabbit2]}]},
# => {running_nodes,[rabbit@rabbit1,rabbit@rabbit2]}]
# => ...done.
```

Now node `rabbit@rabbit3` can be reset and started as a standalone node:

```shell
# on rabbit3
rabbitmqctl reset

rabbitmqctl start_app
# => Starting node rabbit@rabbit3 ...

rabbitmqctl cluster_status
# => Cluster status of node rabbit@rabbit3 ...
# => [{nodes,[{disc,[rabbit@rabbit3]}]},{running_nodes,[rabbit@rabbit3]}]
# => ...done.
```

1047-
We can also remove nodes remotely. This is useful, for example, when
1048-
having to deal with an unresponsive node. We can for example remove
1049-
`rabbit@rabbit1` from `rabbit@rabbit2`.
1059+
Nodes can be removed remotely, that is, from a different host, as long as CLI tools
1060+
on said host can [connect and authenticate](./cli) to the target node.
1061+
1062+
This can useful, for example, when having to deal with a host that cannot be accessed.
1063+
1064+
In the rest of this example, `rabbit@rabbit1` will be removed from its remaining
1065+
two node cluster with `rabbit@rabbit2`:
10501066

10511067
```bash
10521068
# on rabbit1
@@ -1059,16 +1075,32 @@ rabbitmqctl forget_cluster_node rabbit@rabbit1
# => ...done.
```

### Removal of Stopped Nodes and Their Revival

:::important

A node that was removed from the cluster while stopped with `rabbitmqctl stop_app`
must be either reset or decommissioned. If started without a reset,
it won't be able to rejoin its original cluster.

:::

At this point `rabbit1` still thinks it is clustered with
`rabbit2`, and trying to start it will result in an
error because the rest of the cluster no longer considers it to be a known member:

```bash
# on rabbit1
rabbitmqctl start_app
# => Starting node rabbit@rabbit1 ...
# => Error: inconsistent_cluster: Node rabbit@rabbit1 thinks it's clustered with node rabbit@rabbit2, but rabbit@rabbit2 disagrees
```

In order to completely detach it from the cluster, such a
stopped node must be reset:

```shell
rabbitmqctl reset
# => Resetting node rabbit@rabbit1 ...done.

@@ -1078,7 +1110,7 @@ rabbitmqctl start_app
```

The `cluster_status` command now shows all three nodes
operating as independent RabbitMQ nodes (single node clusters):

```bash
# on rabbit1
@@ -1117,18 +1149,48 @@ rabbitmqctl start_app
# => Starting node rabbit@rabbit2 ...done.
```

### Removal of Unresponsive Nodes

When the target node is not running, it can still be removed from the cluster
using `rabbitmqctl forget_cluster_node`:

```bash
# Tell rabbit@rabbit1 to permanently remove rabbit@rabbit2
rabbitmqctl forget_cluster_node -n rabbit@rabbit1 rabbit@rabbit2
# => Removing node rabbit@rabbit2 from cluster ...
# => ...done.
```

### What Happens to Quorum Queue and Stream Replicas?

When a node is removed from the cluster using CLI tools, all [quorum queue](./quorum-queues#replica-management)
and [stream replicas](./streams#replica-management) on the node will be removed,
even if that means that queues and streams would temporarily have an even number of replicas (e.g. two).

### Node Removal is Explicit (Manual) or Opt-in

Besides `rabbitmqctl forget_cluster_node` and the automatic cleanup of unknown nodes
by some [peer discovery](./cluster-formation) plugins, there are no scenarios
in which a RabbitMQ node will permanently remove its peer node from a cluster.

## How to Reset a Node {#resetting-nodes}

:::danger

Resetting a node will delete all of its data, cluster membership information, configured [runtime parameters](./parameters),
users, virtual hosts and any other node data. It will also alter the node's internal identity.

:::

Sometimes it may be necessary to reset a node (see the warning above for what this involves),
and later make it rejoin the cluster as a new node.

Generally speaking, there are two possible scenarios: when the node is running, and when the node cannot start
or won't respond to CLI tool commands for any reason.

### Reset a Running and Responsive Node

To reset a running and responsive node, first stop RabbitMQ on it using `rabbitmqctl stop_app`
and then reset it using `rabbitmqctl reset`:
@@ -1141,16 +1203,47 @@ rabbitmqctl reset
# => Resetting node rabbit@rabbit1 ...done.
```

:::info

If the reset node is online and its cluster peers are reachable, the node
will first try to permanently remove itself from its cluster.

:::

### Reset an Unresponsive Node

A non-responsive node must first be stopped using any means necessary.
For nodes that fail to start this is already the case. Then [override](./relocate)
the node's data directory location or remove the existing data store. This will make the node
start as a blank one. It will have to be instructed to [rejoin its original cluster](#cluster-formation), if any.

### Resetting a Node to Re-add It as a Brand New Node to Its Original Cluster

A reset node that was [removed from the cluster](#removing-nodes) can be re-added to its original
cluster as a brand new node.

In that case it will sync all virtual hosts, users, permissions, topology (queues, exchanges, bindings),
runtime parameters and policies.

For [quorum queue](./quorum-queues) and [stream](./streams) contents to be replicated to the newly [re]added node,
the node must be added to the list of nodes to place replicas on using `rabbitmq-queues grow`.
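As a sketch, growing replicas onto the re-added node might look like this (the node name is illustrative; the command must be run against a live cluster):

```bash
# Place replicas of all matching quorum queues and streams
# on the newly re-added node rabbit@rabbit3
rabbitmq-queues grow rabbit@rabbit3 all
```

The `all` selector targets all matching queues; `even` would target only those with an even number of replicas.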
Non-replicated queue contents on a reset node will be lost.

## Forcing Node Boot in Case of Unavailable Peers {#forced-boot}

In some cases the last node to go offline cannot be brought back up. It can be removed from the
cluster using the `forget_cluster_node` [rabbitmqctl](./cli) command.

Alternatively, the `force_boot` [rabbitmqctl](./cli) command can be used on a node to make it boot
without trying to sync with any peers (as if they were the last to shut down). This is
usually only necessary if the last node to shut down, or a set of nodes, will never be brought back online.
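A sketch of the forced boot sequence on the surviving node (the service start command depends on the platform and is illustrative):

```bash
# on the surviving node, while RabbitMQ is stopped:
rabbitmqctl force_boot
# then start the node as usual, for example:
systemctl start rabbitmq-server
```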
## Upgrading clusters {#upgrading}

You can find instructions for upgrading a cluster in [the upgrade
