Published messages are lost until the original host of a QQ leader replica comes back up (even after a new leader is elected) #9209
-
Describe the bug
I have a 3-node RabbitMQ cluster. All queues are quorum type. Telegraf writes AMQP to port 5672, has all three nodes configured, and picks one at random. If one node goes offline (crash/network failure/hard power-down/etc.):
Clearly, case 3 is a problem. Restarting Telegraf (new connection) immediately fixes it, and the moment the downed node comes back online, messages appear in the queue. There is no indication from Telegraf (with debug logging) that anything changed; from one logged write to the next, it simply starts working again. As long as the RabbitMQ node stays down and Telegraf is not restarted, messages are lost. (I left it running for 5-10 minutes, which should be plenty long enough.) I have tested this with simple Python writer/consumer scripts and have not been able to reproduce it there, only with Telegraf. But because it immediately starts working again when the downed node comes back online and messages are …

Reproduction steps
Case 3 from above:
Expected behavior
I would expect to be able to lose a node from a 3-node cluster without (significant) data loss. One or two lost writes due to not using delivery_mode = "persistent" is my call, but losing all data for the 5 minutes (or however long) the node is down is clearly not how this should work.

Additional context
RabbitMQ Tracing log (rabbitmq_tracing plugin)
Grepped for 'Message' for clarity; the rest checks out, and the raw data is correct too. Note: the "received" log entry refers to the message being consumed from the queue, not to it being queued per se, but the consumer process stays connected as usual for the whole 5 minutes and the queue isn't filling up either. As soon as the missing node comes back, the data comes through again.

Rabbit-3 debug log
Telegraf debug log (note the 2-hour timezone difference; this one logs UTC, but it's the same event as above)
There is NO indication whatsoever from Telegraf that something is wrong. Part of this is Telegraf not running with delivery_mode = "persistent"; the two missing writes while it figures out that it no longer has a connection I can handle. But the five minutes of writes after reconnecting at 13:37:08 do make it into RabbitMQ, they just don't make it out again until rabbit-2 rejoins.

Versions
Config Telegraf
Config RabbitMQ
/etc/rabbitmq/rabbitmq.conf:
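For reference, a minimal publisher along the lines of the "simple Python writer" scripts mentioned above, as a purely hypothetical sketch (pika assumed; host, queue name, and payload are made up and this is not the actual test script):

```python
# Hypothetical stand-in for the "simple python writer" test script mentioned
# above; host, queue name, and payload are made up.
import pika

params = pika.ConnectionParameters(host="rabbit-1", port=5672)

conn = pika.BlockingConnection(params)
ch = conn.channel()
# Quorum queues must be durable; the queue type is fixed at declaration time.
ch.queue_declare(queue="telegraf.metrics", durable=True,
                 arguments={"x-queue-type": "quorum"})
while True:
    ch.basic_publish(
        exchange="",                     # default exchange, routed by queue name
        routing_key="telegraf.metrics",
        body=b"cpu usage_idle=99.1",
        properties=pika.BasicProperties(delivery_mode=2),  # persistent
    )
    conn.sleep(10)                       # mirrors the 10 s write interval
```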
-
Quorum queues do not lose data when their leader goes down; we have been running Jepsen tests continuously for more than six years now to prove it. Consumers can consume from the moment a new leader is elected, but it is extremely important to understand what the Telegraf consumer does when it recovers. It should not matter when node 2 rejoins the cluster because its replicas will rejoin as followers. Whether messages are published as persistent or not should not matter to QQs: they always store all data on disk, and the transient delivery mode is ignored. Whether the publisher uses confirms matters a lot more. Our team uses PerfTest to simulate workloads. Please reproduce the behavior using PerfTest and let us know what flags were used.
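To illustrate the confirms point only (a hedged pika sketch with made-up names, not how PerfTest or the Telegraf output are implemented): with confirms, a publish is either acknowledged by the broker or fails loudly.

```python
import pika
from pika.exceptions import NackError, UnroutableError

conn = pika.BlockingConnection(pika.ConnectionParameters(host="rabbit-2"))
ch = conn.channel()
ch.queue_declare(queue="qq.test", durable=True,
                 arguments={"x-queue-type": "quorum"})

# Put the channel into confirm mode: basic_publish now waits for the broker's
# ack instead of being fire-and-forget.
ch.confirm_delivery()

try:
    ch.basic_publish(exchange="", routing_key="qq.test", body=b"payload",
                     properties=pika.BasicProperties(delivery_mode=2))
    print("confirmed by the broker")
except (NackError, UnroutableError) as err:
    # Without confirm mode this failure would be silent.
    print(f"publish failed, retry or log it: {err}")
```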
-
@tubemeister can you please clarify two more things?
-
Note that the Telegraf in question is a publisher, not a consumer. The Telegraf publisher reconnects to a different node and starts writing, as shown in the included Telegraf log, while the first node is down. Those writes make it to RabbitMQ, as shown in the included trace log, but they don't make it into the actual queue. It's not about the queue losing data that's already in it; it's about data not making it into the queue while a node is down. It's as if the binding is not there while one node is down. The consumer isn't the problem; there might as well be no consumer, in which case messages would simply pile up in the queue. Re your questions:
The tracing log shows they are not being consumed, and given that writes are logged in that same tracing log but the queue remains empty, it follows that they are lost somewhere between exchange and queue. The missing node rejoining is simple: it shows green in the management interface (and the management interface stops having 40-50 second timeouts, but that's a different subject), and it shows up as reconnected in the rabbit-3 log as included. And when it does, the messages that have been written at 10 s intervals for the past 5 minutes suddenly DO appear in the queue. I don't know what you mean by re-register a consumer; the downed node doesn't have consumers on it when it comes back, and the one consumer is not connected to that node. The consumer doesn't really matter at all: as soon as that downed node rejoins the cluster, the queue starts filling up.
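If it helps to narrow down where they go, one way to see whether the broker can route a publish to any queue at all is to publish with the mandatory flag in confirm mode, so anything unroutable is returned. A hedged pika sketch with made-up exchange/queue names; whether it would catch this particular case is exactly what would need testing:

```python
import pika
from pika.exceptions import UnroutableError

conn = pika.BlockingConnection(pika.ConnectionParameters(host="rabbit-3"))
ch = conn.channel()
ch.confirm_delivery()  # needed so returned messages surface as exceptions

try:
    ch.basic_publish(
        exchange="telegraf.exchange",   # made-up exchange name
        routing_key="telegraf.metrics",
        body=b"cpu usage_idle=99.1",
        mandatory=True,                 # broker returns the message if no queue receives it
    )
    print("routed to at least one queue and confirmed")
except UnroutableError as err:
    # The exchange accepted the publish but could not place it in any queue.
    print(f"returned as unroutable: {err}")
```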
-
@tubemeister I have provided evidence that with a well-behaved client (PerfTest), a quorum queue elects a new leader and the flow of deliveries continues with a very brief interruption for QQ leader election. I have listed the steps and PerfTest flags that were used. Leader election on QQs and streams is transparent to consumers. I cannot immediately think of a scenario where a consumer not connected to the killed node would have to manually re-register. Therefore, either the consumer is connected to the failed node, or consumer re-registration with …
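To illustrate the re-registration point with a minimal example (a hedged pika sketch, made-up queue name): a consumer registers once and then simply receives deliveries; it has nothing to re-register as long as its own connection stays up.

```python
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters(host="rabbit-3"))
ch = conn.channel()

def on_message(channel, method, properties, body):
    # Process the delivery, then ack it.
    print(body)
    channel.basic_ack(delivery_tag=method.delivery_tag)

# Registered once; deliveries resume automatically after the quorum queue
# elects a new leader on another node.
ch.basic_consume(queue="qq.test", on_message_callback=on_message)
ch.start_consuming()
```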
-
FTR, the Telegraf RabbitMQ output (that seemingly does the publishing) does not use publisher confirms.
-
After clarifying a few things, the scenario to test seems to be this:
-
So they were not lost, just delayed?
-
Per discussion with @kjnilsson: our working hypothesis is that the publishing channel does not observe a QQ leader identifier change in some cases. Collecting publishing channel process state and comparing it to queue metrics would help validate this hypothesis. So far, coming up with a snippet to run via a remote (Erlang) shell or …
-
If you have the channel pid then sys:get_state/1 should get it all
On Tue, 29 Aug 2023 at 18:21, Michael Klishin wrote:
> rabbit_channel:list_queue_states/1 only exposes a queue resource name, that's not useful in this case. So the state is completely opaque, we need a way to reproduce locally in order to prove this hypothesis right or wrong.
--
Karl Nilsson
-
Right. Ran the test scenario again. Attached a tarball with all log files and this timeline again.

Basics
All predeclared. The publisher needs to be connected to the queue leader. That server then goes hard-offline without a chance to clean up connections or anything. This simulates a hard crash or network separation of the server, VM, Docker container, or whatever, not just the rabbitmq-server process being down on an otherwise running and connected machine.

IPs
172.17.255.16: Telegraf, publisher

Timeline
Note: the Telegraf log is UTC, everything else is CEST/UTC+2
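To make the failover part of this scenario concrete, a publisher that has all three nodes configured and falls back to another one when its current node goes hard-offline could look roughly like this (pika sketch with made-up host names; the real Telegraf output plugin is written in Go and its reconnect logic differs):

```python
import pika
from pika.exceptions import AMQPError

# All three nodes; pika will try these parameters until one accepts the connection.
PARAMS = [pika.ConnectionParameters(host=h) for h in ("rabbit-1", "rabbit-2", "rabbit-3")]

def connect():
    conn = pika.BlockingConnection(PARAMS)
    ch = conn.channel()
    ch.confirm_delivery()   # so a publish the broker never accepts is not silent
    return conn, ch

conn, ch = connect()
while True:
    try:
        ch.basic_publish(exchange="telegraf.exchange",
                         routing_key="telegraf.metrics",
                         body=b"cpu usage_idle=99.1",
                         properties=pika.BasicProperties(delivery_mode=2))
    except AMQPError:
        # Broad catch for the sketch: the node went away or the channel died.
        # Reconnect via the remaining hosts and retry the same publish.
        conn, ch = connect()
        continue
    conn.sleep(10)          # 10 s publish interval, as in the timeline above
```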
-
@tubemeister after doing a brief review of the …
-
It's been a busy couple of days; I've finally managed to bundle and clean up the log files from two runs I did last week, one failing and one working. This is still on the 3.13 alpha. I'll include them here. It looks like I'm going to be away from the internet for the rest of the week. Next week I'll rebuild my test cluster with the 3.12 alpha and run the test again. Thanks for all the effort so far.
-
3.12.6 installed, tested, and working, thanks everyone.
-
We will likely ship 3.12.5 early next week. It has some things in flight that I would not rush.