Published messages are lost until the original host of a QQ leader replica comes back up (even after a new leader is elected) #9209
-
Describe the bug
I have a 3-node RabbitMQ cluster. All queues are quorum type. Telegraf writes AMQP to port 5672, has all three nodes configured, and picks one at random. If one node goes offline (crash/network failure/hard power-down/etc.):
Clearly, case 3 is a problem. Restarting Telegraf (new connection) immediately fixes it, and the moment the downed node comes back online, messages appear in the queue. There is no indication from Telegraf (with debug logging) that anything changed; from one logged write to the next, it simply starts working again. As long as the RabbitMQ node stays down and Telegraf is not restarted, messages are lost. (I left it running for 5-10 minutes, which should be plenty long enough.) I have tested this with simple Python writer/consumer scripts and have not been able to reproduce it there, only with Telegraf. But because it immediately starts working again when the downed node comes back online and messages are …

Reproduction steps
Case 3 from above:
Expected behavior
I would expect to be able to lose a node from a 3-node cluster without (significant) data loss. One or two lost writes due to not using delivery_mode = "persistent" is my call, but losing all data for the 5 minutes (or however long) the node is down is clearly not how this should work.

Additional context
RabbitMQ Tracing log (rabbitmq_tracing plugin)
Grepped for 'Message' for clarity; the rest checks out, and the raw data is correct too. Note: the "received" log entry refers to the message being consumed from the queue, not to it being queued per se, but the consumer process stays connected as usual for the whole 5 minutes and the queue isn't filling up either. As soon as the missing node comes back, the data comes through again.

Rabbit-3 debug log
Telegraf debug log (note the 2-hour timezone difference; this one logs UTC, but it's the same event as above)
There is NO indication whatsoever from Telegraf that something is wrong. Part of this is Telegraf not running with delivery_mode = "persistent"; the two missing writes while it figures out that it no longer has a connection I can handle. But the five minutes of writes after reconnecting at 13:37:08 do make it into RabbitMQ, they just don't make it out again until rabbit-2 rejoins.

Versions
Config Telegraf
Config RabbitMQ
/etc/rabbitmq/rabbitmq.conf:
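For reference, a minimal publisher along the lines of the "simple Python writer" scripts mentioned above, as a purely hypothetical sketch (pika assumed; host, queue name, and payload are made up and this is not the actual test script):

```python
# Hypothetical stand-in for the "simple python writer" test script mentioned
# above; host, queue name, and payload are made up.
import pika

params = pika.ConnectionParameters(host="rabbit-1", port=5672)

conn = pika.BlockingConnection(params)
ch = conn.channel()
# Quorum queues must be durable; the queue type is fixed at declaration time.
ch.queue_declare(queue="telegraf.metrics", durable=True,
                 arguments={"x-queue-type": "quorum"})
while True:
    ch.basic_publish(
        exchange="",                     # default exchange, routed by queue name
        routing_key="telegraf.metrics",
        body=b"cpu usage_idle=99.1",
        properties=pika.BasicProperties(delivery_mode=2),  # persistent
    )
    conn.sleep(10)                       # mirrors the 10 s write interval
```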
-
Quorum queues do not lose data when their leader goes down; we have been running Jepsen tests continuously for more than six years now to prove it. Consumers can consume from the moment a new leader is elected, but it is extremely important to understand what the Telegraf consumer does when it recovers. It should not matter when node 2 rejoins the cluster because its replicas will rejoin as followers. Whether messages are published as persistent or not should not matter to QQs: they always store all data on disk, and the transient delivery mode is ignored. Whether the publisher uses confirms matters a lot more. Our team uses PerfTest to simulate workloads. Please reproduce the behavior using PerfTest and let us know what flags were used.
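To illustrate the confirms point only (a hedged pika sketch with made-up names, not how PerfTest or the Telegraf output are implemented): with confirms, a publish is either acknowledged by the broker or fails loudly.

```python
import pika
from pika.exceptions import NackError, UnroutableError

conn = pika.BlockingConnection(pika.ConnectionParameters(host="rabbit-2"))
ch = conn.channel()
ch.queue_declare(queue="qq.test", durable=True,
                 arguments={"x-queue-type": "quorum"})

# Put the channel into confirm mode: basic_publish now waits for the broker's
# ack instead of being fire-and-forget.
ch.confirm_delivery()

try:
    ch.basic_publish(exchange="", routing_key="qq.test", body=b"payload",
                     properties=pika.BasicProperties(delivery_mode=2))
    print("confirmed by the broker")
except (NackError, UnroutableError) as err:
    # Without confirm mode this failure would be silent.
    print(f"publish failed, retry or log it: {err}")
```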
-
@tubemeister can you please clarify two more things?
-
Note that the Telegraf in question is a publisher, not a consumer. The Telegraf publisher reconnects to a different node and starts writing, as shown in the included Telegraf log, while the first node is down. Those writes make it to RabbitMQ, as shown in the included trace log, but they don't make it into the actual queue. It's not about the queue losing data that's already in it; it's about data not making it into the queue while a node is down. It's as if the binding is not there while one node is down. The consumer isn't the problem; there might as well be no consumer, in which case messages would simply pile up in the queue. Re your questions:
The tracing log shows they are not being consumed, and given that writes are logged in that same tracing log but the queue remains empty, it follows that they are lost somewhere between exchange and queue. The missing node rejoining is simple: it shows green in the management interface (and the management interface stops having 40-50 second timeouts, but that's a different subject), and it shows up as reconnected in the rabbit-3 log as included. And when it does, the messages that have been written at 10 s intervals for the past 5 minutes suddenly DO appear in the queue. I don't know what you mean by re-register a consumer; the downed node doesn't have consumers on it when it comes back, and the one consumer is not connected to that node. The consumer doesn't really matter at all: as soon as that downed node rejoins the cluster, the queue starts filling up.
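If it helps to narrow down where they go, one way to see whether the broker can route a publish to any queue at all is to publish with the mandatory flag in confirm mode, so anything unroutable is returned. A hedged pika sketch with made-up exchange/queue names; whether it would catch this particular case is exactly what would need testing:

```python
import pika
from pika.exceptions import UnroutableError

conn = pika.BlockingConnection(pika.ConnectionParameters(host="rabbit-3"))
ch = conn.channel()
ch.confirm_delivery()  # needed so returned messages surface as exceptions

try:
    ch.basic_publish(
        exchange="telegraf.exchange",   # made-up exchange name
        routing_key="telegraf.metrics",
        body=b"cpu usage_idle=99.1",
        mandatory=True,                 # broker returns the message if no queue receives it
    )
    print("routed to at least one queue and confirmed")
except UnroutableError as err:
    # The exchange accepted the publish but could not place it in any queue.
    print(f"returned as unroutable: {err}")
```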
-
@tubemeister I have provided evidence that with a well-behaved client (PerfTest), a quorum queue elects a new leader and the flow of deliveries continues with a very brief interruption for QQ leader election. I have listed the steps and PerfTest flags that were used. Leader election on QQs and streams is transparent to consumers. I cannot immediately think of a scenario where a consumer not connected to the killed node would have to manually re-register. Therefore, either the consumer is connected to the failed node, or consumer re-registration with …
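To illustrate the re-registration point with a minimal example (a hedged pika sketch, made-up queue name): a consumer registers once and then simply receives deliveries; it has nothing to re-register as long as its own connection stays up.

```python
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters(host="rabbit-3"))
ch = conn.channel()

def on_message(channel, method, properties, body):
    # Process the delivery, then ack it.
    print(body)
    channel.basic_ack(delivery_tag=method.delivery_tag)

# Registered once; deliveries resume automatically after the quorum queue
# elects a new leader on another node.
ch.basic_consume(queue="qq.test", on_message_callback=on_message)
ch.start_consuming()
```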
-
FTR, the Telegraf RabbitMQ output (that seemingly does the publishing) does not use publisher confirms.
-
After clarifying a few things, the scenario to test seems to be this:
-
So they were not lost, just delayed?
-
Per discussion with @kjnilsson: our working hypothesis is that the publishing channel does not observe a QQ leader identifier change in some cases. Collecting publishing channel process state and comparing it to queue metrics would help validate this hypothesis. So far, coming up with a snippet to run via a remote (Erlang) shell or …
-
If you have the channel pid then sys:get_state/1 should get it all
On Tue, 29 Aug 2023 at 18:21, Michael Klishin wrote:
> rabbit_channel:list_queue_states/1 only exposes a queue resource name, that's not useful in this case. So the state is completely opaque, we need a way to reproduce locally in order to prove this hypothesis right or wrong.
--
Karl Nilsson
-
Right. Ran the test scenario again. Attached a tarball with all log files and this timeline again.

Basics
All predeclared. The publisher needs to be connected to the queue leader. That server then goes hard-offline without a chance to clean up connections or anything. This simulates a hard crash or network separation of the server, VM, Docker container, or whatever, not just the rabbitmq-server process being down on an otherwise running and connected machine.

IPs
172.17.255.16: Telegraf, publisher

Timeline
Note: the Telegraf log is UTC, everything else is CEST/UTC+2
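To make the failover part of this scenario concrete, a publisher that has all three nodes configured and falls back to another one when its current node goes hard-offline could look roughly like this (pika sketch with made-up host names; the real Telegraf output plugin is written in Go and its reconnect logic differs):

```python
import pika
from pika.exceptions import AMQPError

# All three nodes; pika will try these parameters until one accepts the connection.
PARAMS = [pika.ConnectionParameters(host=h) for h in ("rabbit-1", "rabbit-2", "rabbit-3")]

def connect():
    conn = pika.BlockingConnection(PARAMS)
    ch = conn.channel()
    ch.confirm_delivery()   # so a publish the broker never accepts is not silent
    return conn, ch

conn, ch = connect()
while True:
    try:
        ch.basic_publish(exchange="telegraf.exchange",
                         routing_key="telegraf.metrics",
                         body=b"cpu usage_idle=99.1",
                         properties=pika.BasicProperties(delivery_mode=2))
    except AMQPError:
        # Broad catch for the sketch: the node went away or the channel died.
        # Reconnect via the remaining hosts and retry the same publish.
        conn, ch = connect()
        continue
    conn.sleep(10)          # 10 s publish interval, as in the timeline above
```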
-
@tubemeister after doing a brief review of the …
-
It's been a busy couple of days; I've finally managed to bundle and clean up the log files from two runs I did last week, one failing and one working. This is still on the 3.13 alpha. I'll include them here. It looks like I'm going to be away from the internet for the rest of the week. Next week I'll rebuild my test cluster with the 3.12 alpha and run the test again. Thanks for all the effort so far.
-
3.12.6 installed, tested, and working, thanks everyone.
-
We will likely ship 3.12.5 early next week. It has some things in flight that I would not rush.