client connection drops and timeouts during rabbitmq_queue_messages_published_total statistic drop #4743

motmot80 · 2022-05-05T20:40:28Z

Environment

RabbitMQ: 3.9.13
Erlang version: 24.3.2

Error description

We are seeing sporadic but recurring peaks of send timeouts, connection drops and TLS handshake timeouts, each lasting a short period of time. All nodes in our cluster were affected, regardless of their load.

Reproduction steps

Unknown.

Analysis

We have been analyzing this problem for several months without a clue as to what is causing it, other than RabbitMQ itself.

BUT today we found out that these events EXACTLY correlate with massive drops in the rabbitmq_queue_messages_published_total statistics. This seems to be another indication that the problem is located within RabbitMQ.

We don't know if this is the root cause or just another aspect of the problem. All other RabbitMQ statistics seem to be normal before, during and after the event.

We don't even understand how the rabbitmq_queue_messages_published_total statistics could drop without any queues being deleted. Garbage collection should be no issue in Erlang. So what could cause these events?

@gerhard @michaelklishin Maybe you could give a hint what could cause such statistic drops, so we can dig deeper into the root cause.

Thanks in advance and best regards
Thomas

Answered by michaelklishin

michaelklishin · 2022-05-05T20:50:05Z

Maintainer

Correlation does not mean causation. Metrics related to publishing will drop when the publishing rate drops, in particular when publishing connections can no longer publish for any reason. No queues have to be deleted for that to happen.

Both can easily have the same root cause: anything that prevents network activity on a connection (could be a resource alarm or anything around the infrastructure involved) will affect both clients and the metric values reported.

The idea that a metric drop can lead to client connection closure seems a bit far-fetched to me. It very likely works the other way around.

4 replies

motmot80 May 5, 2022
Author

I really don't think that this is a causation.

So - if I understand you right - you think that the dropped publishers are causing the statistics drop?

But the connection drops are not in the same dimension:

rabbitmq_channels: [graph not preserved]

rabbitmq_connections_closed_total: [graph not preserved]

rabbitmq_queue_messages_published_total: [graph not preserved]

motmot80 May 5, 2022
Author

Concerning the network during this phase (1st cluster node):

Traffic: [graph not preserved]

Sockets: [graph not preserved]

michaelklishin May 5, 2022
Maintainer

Metric calculation is derived from actual publisher activity and does not affect it. I don't see how any parts of the code that compute, store or serve metrics can affect publishers.

I can think of plenty of scenarios where something affects new and currently open publisher
connections, and that also causes a rapid drop in publisher-related metrics.

In fact, in applications with short lived publishers that churn through connections a lot it will be visible particularly quickly.

michaelklishin May 5, 2022
Maintainer

There is an uptick in the number of sockets used. If a RabbitMQ node, a proxy or something like that goes above the open file handle limit, no new connections will be accepted for a period of time.
With short lived publishers that will also result in a sharp drop of several publisher-related metrics.

These are the things I'd look out for, starting with potential evidence in the logs.
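To illustrate the file handle point, here is a minimal sketch in Python of how the headroom below such a limit could be estimated. The function names and the numbers are hypothetical and are not taken from this cluster or from any RabbitMQ API.

```python
def fd_headroom(sockets_used: int, other_fds_used: int, fd_limit: int) -> float:
    """Return the fraction of the open file handle limit that is still free."""
    if fd_limit <= 0:
        raise ValueError("fd_limit must be positive")
    used = sockets_used + other_fds_used
    return max(0.0, 1.0 - used / fd_limit)


def accepts_new_connections(sockets_used: int, other_fds_used: int, fd_limit: int) -> bool:
    """A process at its limit cannot open another socket for a new client."""
    return sockets_used + other_fds_used < fd_limit


# Illustrative values only (assumptions, not measurements from this thread).
print(fd_headroom(60_000, 2_000, 65_536))
print(accepts_new_connections(65_000, 536, 65_536))
```

With short-lived publishers, even a brief period in which the second function would return False means that reconnect attempts fail, which would also show up in publisher-related metrics.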

motmot80 · 2022-05-05T23:06:03Z

Author

I understand and follow your reasoning. No causation - just a correlation - which might help to find the root cause.
The connection churn is pretty much the same every day and is throttled to about 20 connections/sec on each cluster node.

So just to understand this in detail.

There's a drop of about 20 million published messages in the statistics (70% of the total).
Restarting all connections at night has never once caused a statistics drop this large.
So, following your theory, there must be a very small number of publishers that together produced roughly 20 million messages and disconnected at ~8 AM on the last two days, but not the day before.
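The scenario described here can be sketched with a toy model in Python. It assumes that the reported total is a sum over per-channel counters whose entries disappear when a channel closes. That is a simplification for illustration, not the actual RabbitMQ implementation, and all names and numbers are hypothetical.

```python
from collections import defaultdict


class ToyChannelStats:
    """Toy model: per-channel publish counters, summed when the metric is read."""

    def __init__(self):
        self._published = defaultdict(int)  # channel id -> messages published

    def publish(self, channel_id, count=1):
        self._published[channel_id] += count

    def close_channel(self, channel_id):
        # Closing a channel removes its entry, and with it its contribution.
        self._published.pop(channel_id, None)

    def published_total(self):
        return sum(self._published.values())


stats = ToyChannelStats()
stats.publish("long-lived-1", 9_000_000)
stats.publish("long-lived-2", 8_000_000)
for i in range(1000):
    stats.publish(f"short-lived-{i}", 3_000)

before = stats.published_total()
stats.close_channel("long-lived-1")
stats.close_channel("long-lived-2")
after = stats.published_total()
print(before, after)
```

In this model, two long-lived channels account for most of the total, so closing them removes far more from the sum than restarting a thousand short-lived ones would.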

So I tried to find the Erlang code that resets the metrics:

rabbitmq_queue_messages_published_total ==
queue_messages_published_total ==
channel_stats(queue_exchange_stats, publish, Id, Value) ==
channel_queue_exchange_metrics (ETS Id 2)

Within the channel (isn't handle_consuming_queue_down_or_eol the consumer side?):

rabbitmq-server/deps/rabbit/src/rabbit_channel.erl

Line 2443 in cca1164

erase_queue_stats(QName) ->

And during a metric garbage collection event (every 2 minutes?)

rabbitmq-server/deps/rabbit/src/rabbit_core_metrics_gc.erl

Line 97 in 7abf749

gc_process_and_entities(channel_queue_exchange_metrics, GbSet, ExchangeGbSet).

Because we have a huge cluster with several thousand queues and bindings, the garbage collection caught my attention.
Could it be that the garbage collection work is building up for longer than the 2 minute interval?
Maybe the local queue gb_set is getting out of sync or the query for queues is timing out.

Taking a look at the ETS allocations:

1st node (24h): [graph not preserved]

2nd node (24h): [graph not preserved]

3rd node (24h): [graph not preserved]

There seems to be a magical limit for the ETS memory of about 1.5 GB which is hit 30 minutes before the "statistic events" are starting.
Just to mention there's plenty of RAM and CPU left.

But maybe just another correlating event.

0 replies

lhoguin · 2022-05-06T07:17:14Z

Maintainer

Hello! Please provide the logs around the time of one of those events.

7 replies

michaelklishin May 6, 2022
Maintainer

There is no such metric because there is no universally accepted definition of "an active publisher". The closest metric there is is the connection egress data rate (outbound data rate), which for publishers will be constantly above a certain threshold. It won't be zero for consumers either but, on average, will be quite a bit lower.

motmot80 May 6, 2022
Author

@michaelklishin I expressed myself misleadingly: "100% connection churn in our hands" means throttling the consumer attaches. The IoT production scenario is much more complex.

We have reproduced it in an isolated scenario.

Reproduction steps:

1. Connect 100 publishers
2. Constantly publish a small amount of messages (classic queues - non-persistent messages - 200 msg/sec per node)
3. Let the target queues fill up to about 1 million messages (that wasn't part of the normal daily test scenario)
4. Start all the consumers (perftest) to consume the message backlog

Expected:

Consumers are able to consume messages at a medium rate.
Publishers are able to send messages at a low rate.

Actual:

Consumers are able to consume messages at about 2K msg/sec.
Publishers are timing out, causing them to close and reconnect.

The system load (1m) is below 40%.
CPU usage is below 50%.
The cluster is normally capable of about 15-20 K msg/sec.
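For a rough sense of scale, the figures above can be turned into an estimate of how long the consumers would need to work through the backlog. This is a back-of-the-envelope sketch in Python. It assumes the rates stay constant and that the 200 msg/sec publish rate applies to a single node, neither of which is confirmed in this thread, and the function name is hypothetical.

```python
def backlog_drain_seconds(backlog: int, consume_rate: float, publish_rate: float) -> float:
    """Estimate the time to empty a backlog at constant consume and publish rates."""
    net_rate = consume_rate - publish_rate
    if net_rate <= 0:
        # The backlog never shrinks if publishing keeps up with consuming.
        return float("inf")
    return backlog / net_rate


# About 1 million queued messages, ~2K msg/sec consumed, 200 msg/sec published.
seconds = backlog_drain_seconds(1_000_000, 2_000, 200)
print(f"{seconds / 60:.1f} minutes")
```

Under these assumptions the drain takes several minutes, which is the period during which publishers and consumers compete for the same queues.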

On the other hand:
With almost no message backlog in the queues, the same consumer attach (step 4) seems to have almost no impact on the publishers.

Result:

@michaelklishin So what we see is exactly what you predicted. The rabbitmq_queue_messages_published_total statistics drop when publishers are closing.
The root cause is that a huge number of publishers were closed (send timeouts) although there were plenty of system resources left.
So the cluster wasn't capable of handling exchange publishes during a higher consuming rate.

Unfortunately there is no metric getting the total publisher count. Maybe that would have helped us to find the cause faster.

Thanks very much for your support and keep up the good work.

We are working on a solution where the connection churn is reduced to below 10-20 connections/sec, with more cluster nodes and fewer misbehaving devices.

Best regards
Thomas

michaelklishin May 6, 2022
Maintainer

Thank you for reporting back.

I see how a metric that indicates the number of publishers would help. Unfortunately for us, the definition of a "publisher" will vary from user to user. It's a function of a publishing rate, and what constitutes a "publisher" is a matter of opinion. Is 1 message an hour good enough to be considered a publisher? :)
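A small Python sketch of why the count depends on the definition: the same set of channels yields a different number of "publishers" for each rate threshold. The channel names, rates and function are hypothetical and do not correspond to any existing RabbitMQ metric.

```python
def count_publishers(publish_rates: dict, min_rate_per_second: float) -> int:
    """Count channels whose publish rate meets the chosen threshold."""
    return sum(1 for rate in publish_rates.values() if rate >= min_rate_per_second)


# Made-up publish rates in messages per second, keyed by channel.
rates = {"ch-1": 50.0, "ch-2": 0.5, "ch-3": 1 / 3600, "ch-4": 0.0}

# Thresholds: 1 msg/sec, 1 msg/hour, and "any open channel".
for threshold in (1.0, 1 / 3600, 0.0):
    print(threshold, count_publishers(rates, threshold))
```

Depending on the threshold, this example reports one, three or four publishers for identical traffic.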

But we will discuss how this can be done with a few folks who are involved with RabbitMQ monitoring.

motmot80 May 10, 2022
Author

In my little clueless user/developer world I would have assumed exactly what the RabbitMQ docs are describing:

https://www.rabbitmq.com/publishers.html#terminology

"[...]In general in messaging a publisher (also called "producer") is an application (or application instance) that publishes (produces) messages.[...]"

And yes. In my personal opinion 1 message a year is good enough to be considered a publisher as long as the channel isn't closed! :D

mkuratczyk May 23, 2022
Maintainer

@motmot80 Hey. I'd like to investigate that scenario you provided but I'm missing some of the details. Could you provide full perf-test commands to publish and consume? The exact numbers (number of queues, msg/s, etc) usually need to be adjusted to trigger a given issue on a certain hardware, so I'd appreciate if you could provide all other details so that I don't end up with something significantly different than what you have.

Do you mean something like this?

# fill up the queues
perf-test -x 200 -y 0 -qa x-max-length=1000000 --auto-delete false -C 900000 -qp q-%d -qpf 1 -qpt 200 -c 1000

# start publishing slowly (here: 1 msg/s per queue)
perf-test -x 200 -y 0 -qa x-max-length=1000000  --auto-delete false -qp q-%d -qpf 1 -qpt 200 -c 1 -P 1

# start consuming while the previous command is running
perf-test -x 0 -y 200 -qa x-max-length=1000000 --auto-delete false -qp q-%d -qpf 1 -qpt 200

I'm not sure about the publisher confirms, consumer acks, consumption rate (-R), etc.
