client connection drops and timeouts during rabbitmq_queue_messages_published_total statistic drop #4743
-
Environment
RabbitMQ: 3.19.13

Error description
We are seeing sporadic but recurring peaks of send timeouts, connection drops and TLS handshake timeouts, each lasting a short period of time. All nodes in our cluster were affected, independently of their load.

Reproduction steps
Unknown.

Analysis
We analyzed this problem for several months without having a clue what is causing it, except RabbitMQ itself. But today we found out that these events correlate exactly with massive drops in the rabbitmq_queue_messages_published_total statistic. We don't know if this is the root cause or just another aspect of the problem. All other RabbitMQ statistics seem to be normal before, during and after the event. We don't even understand why the …

@gerhard @michaelklishin Maybe you could give a hint about what could cause such statistic drops, so we can dig deeper into the root cause. Thanks in advance and best regards
Replies: 3 comments 11 replies
-
Correlation does not mean causation. Metrics related to publishing will drop when the publishing rate drops, in particular when publishing connections can no longer publish for any reason. No queues have to be deleted for that to happen.
Both can easily have the same root cause: anything that prevents network activity on a connection (could be a resource alarm or anything around the infrastructure involved) will affect both clients and the metric values reported.
The idea that a metric drop can lead to client connection closure seems a bit far-fetched to me. It very likely works the other way around.
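Since a resource alarm is named as one possible shared root cause, a quick way to rule it in or out is to look at the alarm flags each node reports. The following is only a minimal sketch: it assumes the node list has already been fetched as JSON from the management HTTP API (`GET /api/nodes`) and that each node object carries the boolean fields `mem_alarm` and `disk_free_alarm`; please verify the field names against the API reference for your version. The helper name `nodes_with_alarms` and the sample payload are hypothetical.

```python
import json


def nodes_with_alarms(nodes_json):
    """Return {node_name: [active alarm fields]} for nodes reporting an alarm.

    Assumes the payload is a JSON list of node objects with boolean
    `mem_alarm` and `disk_free_alarm` fields (an assumption to verify).
    """
    alarmed = {}
    for node in json.loads(nodes_json):
        active = [key for key in ("mem_alarm", "disk_free_alarm") if node.get(key)]
        if active:
            alarmed[node.get("name", "<unknown>")] = active
    return alarmed


# Hypothetical sample payload, not taken from the cluster discussed here.
sample_payload = json.dumps([
    {"name": "rabbit@node-1", "mem_alarm": False, "disk_free_alarm": False},
    {"name": "rabbit@node-2", "mem_alarm": True, "disk_free_alarm": False},
])

alarm_report = nodes_with_alarms(sample_payload)
print(alarm_report)
```

Running such a check at the time of an event (or looking for alarm entries in the node logs) would show whether publishers were blocked while the metric dropped.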
-
I understand and follow your reasoning. No causation, just a correlation, which might help to find the root cause. So, just to understand this in detail: there is a drop of about 20 million published messages in the statistics (70% total). So I tried to find the Erlang code that resets the metrics:
Within the channel (isn't handle_consuming_queue_down_or_eol the consumer side?):
And during a metric garbage collection event (every 2 minutes?)
Because we have a huge cluster with several thousand queues and bindings, the garbage collection caught my attention. Taking a look at the ETS allocations:
There seems to be a magical limit of about 1.5 GB for the ETS memory, which is hit 30 minutes before the "statistic events" start. But this may be just another correlating event.
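To pin down exactly when the statistic drops and by how much, the scraped samples can be scanned for decreases. This is a rough sketch, assuming the metric has been exported as time-ordered `(timestamp, value)` pairs; the helper name `find_counter_drops` and the sample numbers are made up for illustration and only mirror the "about 20 million, 70%" figures mentioned above.

```python
def find_counter_drops(samples):
    """Return (timestamp, lost, fraction) for every decrease in the series.

    `samples` is a list of (timestamp, value) pairs sorted by timestamp.
    `fraction` is the lost amount relative to the value just before the drop.
    """
    drops = []
    for (_, previous), (timestamp, current) in zip(samples, samples[1:]):
        if current < previous:
            lost = previous - current
            drops.append((timestamp, lost, lost / previous))
    return drops


# Illustrative samples only: a total that loses 20 million between two scrapes.
samples = [
    (0, 28_000_000),
    (60, 28_500_000),
    (120, 8_500_000),
    (180, 8_600_000),
]

drops = find_counter_drops(samples)
for timestamp, lost, fraction in drops:
    print(f"t={timestamp}s lost={lost} ({fraction:.0%} of previous value)")
```

The resulting timestamps can then be lined up against the connection drops, the metric garbage collection interval and the ETS memory curve.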
-
Hello! Please provide the logs around the time of one of those events. |