tls_sender processes holding on to memory during memory alarm #5346
-
We noticed that after a spike of traffic, BEAM memory grew to a level that set off the memory resource limit alarm. At that point publishers were blocked and consumers drained all messages, yet memory still did not decrease. A large amount of it showed up under binaries > other in the memory breakdown. Memory did not come down until a GC of all processes was executed manually. The issue is that, because the node is now inactive, GC never triggers automatically, so RabbitMQ cannot recover from the memory alarm on its own. As it turns out, the large amount of binaries is referenced by tls_sender processes. RabbitMQ has a smart way to trigger GC of its own processes (https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbit_common/src/rabbit_writer.erl#L412), but if the protocol uses TLS instead of plain TCP, there are extra OTP proxy processes that are prone to the same issue.

Question

Wonder if there is anything to do about this on the RabbitMQ side (eg could

To reproduce
The issue requires a TLS connection and a few large messages rather than many smaller ones. The customer had a few 100 MB messages (I know, an anti-pattern :( ). Also, if a connection is closed, the associated tls_sender process terminates, obviously freeing up the binaries. Screenshots from the reproduction:
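For reference, the manual recovery mentioned above, forcing a GC of all processes, can be done from rabbitmqctl eval or a remote shell. A minimal sketch, assuming shell access to the node:

```erlang
%% Force a full garbage collection of every process on the node. This
%% releases the refc binaries still referenced by idle tls_sender
%% processes, after which the memory alarm can clear.
[erlang:garbage_collect(Pid) || Pid <- erlang:processes()].
```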
-
I am not aware of a way to obtain a reference to their tls_sender. You can enable background GC, which runs periodically, for such environments. 20 MB messages arguably belong to a blob store.
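A minimal sketch of enabling background GC via advanced.config; the interval value is illustrative, not a recommendation:

```erlang
%% advanced.config sketch: enable RabbitMQ's periodic background GC so
%% that binaries held by idle processes (such as tls_sender) are
%% reclaimed even when the node sees no traffic.
[
  {rabbit, [
    {background_gc_enabled, true},
    %% target interval between GC runs, in milliseconds
    {background_gc_target_interval, 60000}
  ]}
].
```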
-
NOTE: see the following Erlang SSL application setting that may help in this situation: https://www.erlang.org/doc/man/ssl.html#type-hibernate_after
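A sketch of how that option could be applied to the TLS listener through advanced.config; the 15-second timeout is illustrative and the certificate paths are placeholders:

```erlang
[
  {rabbit, [
    {ssl_options, [
      {cacertfile, "/path/to/ca_certificate.pem"},
      {certfile,   "/path/to/server_certificate.pem"},
      {keyfile,    "/path/to/server_key.pem"},
      %% Hibernate TLS connection processes (including tls_sender) after
      %% 15s of inactivity; hibernation performs a full GC, releasing the
      %% binaries they reference even while the node is otherwise idle.
      {hibernate_after, 15000}
    ]}
  ]}
].
```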