[Suggestion] Fix potential dump of huge queue process state to log #14349
Replies: 2 comments 3 replies
-
The format depth part would also be good to investigate for use in other (non-gen_server2) parts of the broker. I have seen a very large error log from Crash details...I saw this running a horrible perf-test workload against
This was from publishing with perf-test with a workload that sends a crazy ~3.7 GB/sec of message bodies (my beefy computer was unsurprisingly unhappy!).
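For context on what a format depth limit buys, here is a minimal sketch using the standard `io_lib:format/2` control sequences `~p` (full term) and `~P` (term printed to a given depth). The module name `depth_demo`, the sample state, and the depth of 8 are assumptions chosen for illustration; none of this is taken from the broker or from the proposed patch.

```erlang
-module(depth_demo).
-export([run/0]).

%% Hypothetical demo module; not part of RabbitMQ.
run() ->
    %% Stand-in for a large process state: 1000 binaries of 100 bytes each.
    State = [binary:copy(<<"x">>, 100) || _ <- lists:seq(1, 1000)],
    %% ~p formats the whole term. ~P takes an extra depth argument and
    %% replaces everything below that depth with "...".
    Full = lists:flatlength(io_lib:format("~p", [State])),
    Limited = lists:flatlength(io_lib:format("~P", [State, 8])),
    io:format("full: ~b chars, depth 8: ~b chars~n", [Full, Limited]),
    {Full, Limited}.
```

Calling `depth_demo:run()` in a shell prints both sizes; the depth-limited output should be far smaller because most list elements are elided.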
-
This change can make investigating certain issues harder but I see the benefits, of course. @lukebakken instead of using just atoms, e.g. |
-
RabbitMQ series
4.1.x
Operating system (distribution) used
Ubuntu 24
How is RabbitMQ deployed?
Community Docker image
What would you like to suggest for a future version of RabbitMQ?
Currently, if a crash happens in rabbit_amqqueue_process, a huge amount of data can be formatted and logged. If enough message data is in memory, this can OOM-kill the server or container on which RabbitMQ is running.

See the following diff for an idea of how to address this issue:
main...lukebakken:rmq-rabbitmq-server:lukebakken/fix-format-status
Note that rabbit_amqqueue_process already exports the format_status function, as defined by the gen_server2 behavior. Since RabbitMQ changed its logging to use the latest OTP logging library, and OTP interacts with gen_server differently, this function is not called when the process crashes.

Arguably, the correct fix is to update gen_server2 to the current gen_server code, or find a way to remove gen_server2 entirely. For the time being, I think this change to rabbit_amqqueue_process and the backing queue modules is the least intrusive fix.

In addition to these changes, setting the following kernel parameter in advanced.config also drastically reduces the amount of logging for crashes with large mailboxes (like queue processes):

That configuration setting is deprecated, and I'm looking into how rabbit_prelaunch_logging can be modified to do the same thing as that setting.

To see the amount of data logged without format_state, modify my patch so that gen_server2:terminate does not call maybe_format_state. Start up RabbitMQ, and use PerfTest to publish some messages:

After the 1000th message, the rabbit_amqqueue_process process will crash due to badmatch. If you catch the backing queue state with a lot of messages in RAM pending ack, the vqstate will look like this:

rabbit-no-truncated-state.log
Everything after 11,11, are messages stored in RAM. With large enough messages, and enough of them, this can OOM a RabbitMQ instance.

With my proposed code, the message data is modified by rabbit_variable_queue:format_state to be the atom ram_pending_ack_truncated.

Thoughts?
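The truncation idea can be sketched roughly as follows. This is not the code from the linked diff: the module name `format_state_sketch` is hypothetical, and a map with a `ram_pending_ack` key stands in for the real vqstate record purely as an assumption for illustration. Only the marker atom `ram_pending_ack_truncated` comes from the proposal above.

```erlang
-module(format_state_sketch).
-export([format_state/1, demo/0]).

%% Hypothetical sketch: a map stands in for the real backing queue state.
%% Swap the in-memory messages for a marker atom before the state is logged.
format_state(#{ram_pending_ack := _} = State) ->
    State#{ram_pending_ack := ram_pending_ack_truncated};
format_state(State) ->
    %% States without that key are returned unchanged.
    State.

demo() ->
    %% 1000 fake messages of 4 KiB each, keyed by sequence number.
    Msgs = maps:from_list([{N, binary:copy(<<"m">>, 4096)}
                           || N <- lists:seq(1, 1000)]),
    State = #{len => 1000, ram_pending_ack => Msgs},
    Safe = format_state(State),
    %% Only the truncated view is formatted, so the log line stays small.
    io:format("~p~n", [Safe]),
    Safe.
```

The point of the design is that the large term is never handed to the formatter at all, so the cost of a crash report no longer scales with the amount of message data held in RAM.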