False-positive Network Partitions Warning When All Nodes are Connected and Synchronized #11154
Replies: 3 comments 5 replies
-
Could you please provide more detail? One prose sentence is not enough, and we don't have time to guess.
-
I do not have the details you ask for, Luke, but I will poke my team to provide them. I did some very quick playing around with the code, faking a scenario where the nodes had different values in their partitions state, and my eyes got stuck on this line: https://github.com/rabbitmq/rabbitmq-server/blob/main/deps/rabbit/src/rabbit_node_monitor.erl#L876 It might be correct, but to me it seems incorrect, as
-
Please find detailed information here:
-
Description
We have a live broker running both CMQ and QQ that shows a false-positive network partitions warning (in fact, we have observed many false-positive partition warnings before). The broker's cluster_status reports a network partition, but when we publish and consume messages through the broker, all queue members (quorum queues in this case) report the identical last applied index, indicating that all nodes are connected and synchronized. Therefore, we believe the network partitions warning is incorrect. This incident was triggered by consuming 3 million 10 KB messages as quickly as possible while conducting rolling replacements of nodes in the broker cluster.
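One way to see the partition state each node holds is to query rabbit_node_monitor directly. A hedged sketch, assuming a standard rabbitmqctl installation and using the node names from this report (the `rabbit@...` prefixes are assumptions):

```shell
# rabbit_node_monitor:partitions/0 returns the list of nodes this node
# believes it is partitioned from; an empty list means no partition is
# recorded locally. Run it against each suspect node.
rabbitmqctl -n rabbit@Node-10-0-23-51 eval 'rabbit_node_monitor:partitions().'
rabbitmqctl -n rabbit@Node-10-0-12-240 eval 'rabbit_node_monitor:partitions().'
```

If the two nodes each list the other here while traffic flows normally between them, that is the stale state described above.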
In rabbitmqctl cluster_status
But on Node-10-0-23-51 and Node-10-0-12-240 a partition is reported, even though both of them can reach each other, and publishing new messages causes the quorum queue on each node to advance its applied index to the same value:
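The applied-index comparison above can be reproduced with the rabbitmq-queues CLI. A hedged sketch, assuming the default vhost and a placeholder queue name (`my-qq` is not from the report):

```shell
# quorum_status prints the Raft state of each member of a quorum queue,
# including log/commit indices; matching indices across members after
# publishing indicates the members are replicating normally.
rabbitmq-queues -n rabbit@Node-10-0-23-51 quorum_status my-qq --vhost /
rabbitmq-queues -n rabbit@Node-10-0-12-240 quorum_status my-qq --vhost /
```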
Our Probing
We have tried the following actions but could neither locate the root cause nor mitigate the issue:
In a three-node cluster, with Node-51 and Node-240 each reporting a partition with the other, we stopped the app on Node-61, which should be able to reach both Node-51 and Node-240, hoping that the nodedown event would trigger a re-evaluation of partitions on each node and clear the incorrect state. In the log, we observed:
This indicates that rabbit_node_monitor:handle_info({'DOWN', _MRef, process, {rabbit, Node}, _Reason}, ...) was triggered. However, it did not correct the network partitions value.
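The probing step above can be sketched as the following command sequence. This is a hedged sketch: the `rabbit@Node-61` node name is an assumption based on the shorthand used in this report.

```shell
# Stop the RabbitMQ application on the third node to emit a nodedown event
# to its peers, then restart it and re-check the reported partition state.
rabbitmqctl -n rabbit@Node-61 stop_app
rabbitmqctl -n rabbit@Node-61 start_app
rabbitmqctl -n rabbit@Node-61 cluster_status
```

As described above, the DOWN handler fired on the remaining nodes, but the stale partitions entry survived the restart.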
Questions
We would like to seek help with: