False-positive Network Partitions Warning When All Nodes are Connected and Synchronized #11154
Replies: 3 comments 5 replies
-
Could you please provide more detail? One prose sentence is not enough, and we don't have time to guess.
-
I do not have the details you ask for, Luke, but I will poke my team to provide them. I did some very quick playing around with the code, faking a scenario where the nodes had different values in their partitions state, and my eyes got stuck on this line: https://github.com/rabbitmq/rabbitmq-server/blob/main/deps/rabbit/src/rabbit_node_monitor.erl#L876 It might be correct, but to me it seems incorrect, as
-
Please find detailed information here:
-
Description
We have a live broker running both CMQ and QQ that shows a false-positive network partitions warning (in fact, we have observed many false-positive partition warnings before). The broker's cluster_status reports a network partition, but when we publish and consume messages through the broker, all queue members (quorum queues in this case) report the identical last applied index, indicating that all nodes are connected and synchronized. Therefore, we believe the network partitions warning is incorrect. This incident was triggered by consuming 3 million 10 KB messages as quickly as possible while conducting rolling replacements of nodes in the broker cluster.
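One way to see the partition state each node holds is to query rabbit_node_monitor directly. A hedged sketch, assuming a standard rabbitmqctl installation and using the node names from this report (the `rabbit@...` prefixes are assumptions):

```shell
# rabbit_node_monitor:partitions/0 returns the list of nodes this node
# believes it is partitioned from; an empty list means no partition is
# recorded locally. Run it against each suspect node.
rabbitmqctl -n rabbit@Node-10-0-23-51 eval 'rabbit_node_monitor:partitions().'
rabbitmqctl -n rabbit@Node-10-0-12-240 eval 'rabbit_node_monitor:partitions().'
```

If the two nodes each list the other here while traffic flows normally between them, that is the stale state described above.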
In rabbitmqctl cluster_status
But on Node-10-0-23-51 and Node-10-0-12-240 a partition is reported, even though both of them can reach each other, and publishing new messages causes the quorum queue on each node to advance its applied index to the same value:
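The applied-index comparison above can be reproduced with the rabbitmq-queues CLI. A hedged sketch, assuming the default vhost and a placeholder queue name (`my-qq` is not from the report):

```shell
# quorum_status prints the Raft state of each member of a quorum queue,
# including log/commit indices; matching indices across members after
# publishing indicates the members are replicating normally.
rabbitmq-queues -n rabbit@Node-10-0-23-51 quorum_status my-qq --vhost /
rabbitmq-queues -n rabbit@Node-10-0-12-240 quorum_status my-qq --vhost /
```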
Our Probing
We have tried the following actions but could neither locate the root cause nor mitigate the issue:
In a three-node cluster, with Node-51 and Node-240 each reporting a partition with the other, we stopped the app on Node-61, which should be able to reach both Node-51 and Node-240, hoping that the nodedown event would trigger a re-evaluation of partitions on each node and clear the incorrect state. In the log, we observed:
This indicates that rabbit_node_monitor:handle_info({'DOWN', _MRef, process, {rabbit, Node}, _Reason}, ...) was triggered. However, it did not correct the network partitions value.
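The probing step above can be sketched as the following command sequence. This is a hedged sketch: the `rabbit@Node-61` node name is an assumption based on the shorthand used in this report.

```shell
# Stop the RabbitMQ application on the third node to emit a nodedown event
# to its peers, then restart it and re-check the reported partition state.
rabbitmqctl -n rabbit@Node-61 stop_app
rabbitmqctl -n rabbit@Node-61 start_app
rabbitmqctl -n rabbit@Node-61 cluster_status
```

As described above, the DOWN handler fired on the remaining nodes, but the stale partitions entry survived the restart.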
Questions
We would like to seek help with: