pause_minority Cluster Partition Handling does not work as expected #8111
Replies: 9 comments 14 replies
-
Please provide the exact commands you are using to do this. Right now you're asking us to guess how to reproduce the issue the same way you have. In addition, please attach your complete configuration file.
-
The cluster is not expected to pause on a single partition since two nodes will be in a majority. Nodes do not necessarily detect such conditions instantly: the default inactivity timeout is 60 seconds IIRC, and it can be configured, but not to a value that's too low (say, not 5s).
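To make the majority rule concrete, here is a minimal sketch of the decision each node makes under pause_minority, assuming the documented rule that a node pauses when it can see half or fewer of the cluster members. The function name `should_pause` and its parameters are hypothetical and are not part of RabbitMQ.

```python
def should_pause(reachable_nodes: int, cluster_size: int) -> bool:
    """Return True if a node should pause under a pause_minority-style rule.

    reachable_nodes counts this node itself plus every peer it can still see.
    Assumption: "minority" means half or fewer of the cluster members, so an
    even split pauses both sides.
    """
    if not 1 <= reachable_nodes <= cluster_size:
        raise ValueError("reachable_nodes must be between 1 and cluster_size")
    return 2 * reachable_nodes <= cluster_size


# Three-node cluster split into a group of two and a group of one.
lone_node_pauses = should_pause(1, 3)      # the isolated node
pair_pauses = should_pause(2, 3)           # either node in the group of two
# Four-node cluster split evenly: neither half has a strict majority.
even_split_pauses = should_pause(2, 4)
```

In the three-node case from this thread, only the isolated node would be expected to pause, and only after it has detected that its peers are gone.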
-
In addition, it's important to distinguish scenarios where, say, node A loses connection to both B and C. In that case A will pause itself, as it will be in the minority. A much trickier situation is when A disconnects from B but not from C. In this case some nodes will

In RabbitMQ 4.0, these partition handling strategies will go away. The recovery strategy with Khepri will be that of Raft: the majority of nodes keeps going, and a majority of nodes must always be online, or your cluster will lose availability. So, if we oversimplify, only the connectivity of each replica to the currently elected leader controls whether it is available or needs to recover.
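The Raft-style availability requirement can be sketched with a little arithmetic. This is an illustration of the general majority quorum rule only, not of Khepri's implementation; the helper names `quorum_size`, `tolerated_failures` and `is_available` are hypothetical.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """How many nodes can be offline while a majority is still online."""
    return cluster_size - quorum_size(cluster_size)


def is_available(online_nodes: int, cluster_size: int) -> bool:
    """True while the online nodes still form a majority."""
    return online_nodes >= quorum_size(cluster_size)


# Quorum size and tolerated failures for a few common cluster sizes.
summary = {n: (quorum_size(n), tolerated_failures(n)) for n in (3, 4, 5)}
```

Under this rule a three-node cluster tolerates one node being offline, and a four-node cluster also tolerates only one, which is why odd cluster sizes are usually preferred.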
-
I still don't get it. Node 3 was disconnected from nodes 1 and 2, so it was not connected to any node at all. Why didn't it pause then? (It has been in this state for 3 days.)
-
This is the complete config: rabbitmq.conf
advanced.config
enabled_plugins
-
@justsomescripts the
This doesn't appear to be a 3-node cluster...?
-
@justsomescripts I couldn't reproduce what you report using this project: https://github.com/lukebakken/docker-rabbitmq-cluster

When I disconnect a container from the network, RabbitMQ on that node pauses as expected. When I re-connect the container, I can see RabbitMQ try to start.

What I did find, however, is this serious bug: #8114
-
I encountered a similar problem. The
Reproduction steps: Pause any of the virtual hosts in VMware vSphere, wait for 30 seconds to 1 minute until the Kubernetes cluster reports that the node is unavailable, and then resume the host from the pause. As a result, the RabbitMQ cluster will split, with the cluster divided into two parts – 2 nodes and 1 node will operate in parallel. The nature of the partition is as follows:
At the same time, none of the cluster state check commands will indicate that the 2-node group is in the minority. Examples:
I would expect that in such a scenario, the cluster node that is in the minority would at least shut down its port so that applications can't connect to it, but that's not the case. Example of checking port availability from inside the container:
Example of checking port availability from another Kubernetes namespace:
Could you suggest an alternative solution, other than manually restarting the node as mentioned in the documentation, or waiting for version 4.0?
-
FTR, https://youtu.be/y2HAJBiXsw0?feature=shared&t=1232 is one example of
-
Describe the bug
When the connections between multiple nodes in a RabbitMQ cluster are interrupted at the same time, the partition that contains the minority of nodes keeps running and accepting messages.
RabbitMQ version: 3.11.13
Erlang version: 25.3
Host: Ubuntu 22.04.2 LTS
Reproduction steps
Set cluster_partition_handling to pause_minority.
All nodes still accept messages despite cluster_partition_handling being set to pause_minority and the cluster being partitioned. After a restart of the single node, the cluster works again as expected.
Expected behavior
RabbitMQ is paused on the node in the single-node partition.
Additional context
Output of the first nodes (partition of two)
Output of the 3rd node (partition of one)
Relevant log
Configuration