Classic mirrored queues fail to sync in v3.11 #8975
-
Describe the bug
Not all classic mirrored queues sync after a node becomes unavailable (caused by the network, not by the RabbitMQ server on that node). The difference between a queue that is successfully synced and one that is not is the missing
from the queues that remain unsynced. Synced queue
Unsynced queue
In this example the node IDs are:
The third node (index 2) was made unavailable from the network for 3 minutes.
Reproduction steps
Repeat the steps above several times in a row, each time giving the cluster a few minutes to recover/sync.
Expected behavior
Repeating the steps above very often (>50% of the time) leaves unsynchronized queues after a node comes back online. I tested the queue health after each node was brought back online (after waiting a couple of minutes) and collected the logs after each step. Below it is visible how often the issue occurs under this load.
In our production systems we have an order of magnitude more queues and messages. We have several such clusters, and 50% of them were affected (a few unsynced queues) after one Availability Zone (one node from each cluster) went down. Successive restarts (
Additional context
Here are the scripts I ran in order to replicate the issue:
#!/bin/bash
# Run the runbook in a loop, appending its output to runbook.txt.
rm -f runbook.txt
while true
do
    ./runbook.sh >> runbook.txt
    sleep 420
done
#!/bin/bash
date
./delete_queues.sh
./create_queues.sh 100
# let the cluster calm down (sync)
sleep 60
# make sure we start with clean logs
./archive_logs.sh "_after_queue_creation"
# ============================
for ((i = 0; i <= 2; i++))
do
    date
    # send 50 messages from each of 40 parallel threads (50 x 40)
    ./publish_messages.sh 50 40 &
    # make the node unreachable for 180s (ifconfig down)
    ./nuke_rabbit_node.sh "${i}" 180
    # let the cluster become healthy again
    sleep 420
    # tag the archived logs when the health check fails
    if ./check_queue_health.sh
    then
        archive_suffix=""
    else
        archive_suffix="_with_error"
    fi
    ./archive_logs.sh "_after_node_${i}_recovered${archive_suffix}"
done
date
all of the other scripts make
And how the cluster was perceived by our monitoring:
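The contents of `./check_queue_health.sh` are not shown in this post. As a rough illustration only, a check for unsynced mirrors could compare the `slave_pids` and `synchronised_slave_pids` columns of `rabbitmqctl list_queues`. The function name `find_unsynced_queues` is hypothetical, and the sketch assumes tab-separated input in which every PID contains exactly one `@` (for example `<rabbit@node1.1.2.3>`); that output format is an assumption and should be verified against the RabbitMQ version in use.

```shell
#!/bin/bash
# Hypothetical sketch, not the script used in this report.
# Reads lines of the form:
#   <queue name> TAB <slave_pids> TAB <synchronised_slave_pids>
# Prints the name of every queue with fewer synchronised mirrors than
# mirrors, and returns 1 if at least one such queue was found.
find_unsynced_queues() {
    awk -F'\t' '
        {
            # Assumption: each PID contains exactly one "@", so counting
            # "@" characters counts the PIDs in the column.
            mirrors = gsub(/@/, "@", $2)
            synced  = gsub(/@/, "@", $3)
            if (synced < mirrors) {
                print $1
                bad = 1
            }
        }
        END { exit (bad ? 1 : 0) }
    '
}
```

It could then be fed from the CLI, for example `rabbitmqctl list_queues --quiet --no-table-headers name slave_pids synchronised_slave_pids | find_unsynced_queues`, with a non-zero exit status signalling at least one unsynced queue.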
Replies: 2 comments
-
Classic mirrored queues are no longer maintained and will be removed from RabbitMQ in 4.0 in the first half of 2024. Please migrate to quorum queues and/or streams, or to non-mirrored classic queues, which do not have unfixable design issues and are still maintained.
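As a minimal sketch of the migration target, a quorum queue can be declared by setting the `x-queue-type` argument at declaration time. The example assumes `rabbitmqadmin` is installed and configured for the cluster; the function name `declare_quorum_queue`, the `RABBITMQADMIN` override variable, and the queue name in the usage note are hypothetical. The queue type is fixed at declaration, so an existing classic mirrored queue cannot be converted in place; a new queue has to be declared and the applications moved over to it.

```shell
#!/bin/bash
# Hypothetical sketch: declare a durable quorum queue via rabbitmqadmin.
# RABBITMQADMIN is an illustrative override so that the command can be
# replaced (for example with "echo" for a dry run).
RABBITMQADMIN="${RABBITMQADMIN:-rabbitmqadmin}"

declare_quorum_queue() {
    local name="$1"
    # Quorum queues are always durable; the type is selected through
    # the x-queue-type queue argument.
    "$RABBITMQADMIN" declare queue "name=${name}" durable=true \
        'arguments={"x-queue-type":"quorum"}'
}
```

For example, `declare_quorum_queue orders.v2` would declare a quorum queue named `orders.v2`.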
-
Two more things to consider:
Publishing without confirms is the most common data safety issue in such tests. For example, the original Jepsen test issued automatic (on the consumer side) acknowledgments at some point, unintentionally.