Rabbit Event can become a bottleneck and crash RabbitMQ with OOM #4872

luos · 2022-05-20T13:00:01Z

luos
May 20, 2022

Hi,

In installations with high connection churn, the rabbit_event process can become a bottleneck. We know it is not advised to have this high connection churn - but it happens sometimes.

This can lead to RabbitMQ crashing with OOM. This can even happen if only the management interface is started and no other plugins. When RabbitMQ is allocated with a high number of CPU cores then it is especially susceptible to this as it's able to accept more connections / second.

I made some tests and if rabbit_event process is set as high priority then the issue does not happen.

Would you be open to exposing this as a config option?

           gen_event:start_link(
             {local, ?MODULE},
             [{spawn_opt, [{fullsweep_after, 0}, {priority, high}]}]
           );

This could also affect pg_local as it seems like that is the next in line for this bottleneck. Though it seems that only happens with very high core counts (128).

Maybe it could be just set default as high?

Probably in the future is to rework rabbit_event as now it is getting more and more overloaded with events, for example remove stats from it and move it to a different event handler at least.

To reproduce this issue is to start at least a RabbitMQ with 36 cores ore more and start an application with many workers connecting/reconnecting from at least two other hosts configured with the following:

echo 1024 65535 > /proc/sys/net/ipv4/ip_local_port_range

It's easy to reproduce if you turn on the rabbit_event_exchange plugin, but it can happen without it as well.

Let me know what you think.

michaelklishin · 2022-05-20T13:19:06Z

michaelklishin
May 20, 2022
Maintainer

Any single process part can become a bottleneck given a degenerate enough case. Unless there is a specific suggestion as to what can be done to make rabbit_event more resilient, our recommendation would still be to avoid connection, channel and queue churn.

Dropping events much like Erlang logger and Lager do would not be well received by some users.

0 replies

michaelklishin · 2022-05-20T13:22:36Z

michaelklishin
May 20, 2022
Maintainer

rabbitmq_event_exchange can be used naively by using a single queue to consume a disproportionately large ingress message flow. We know very well by now that a single consumer, N fast publishers does not end well. Credit flow won't be particularly effective in that case and dropping some data is the only approach the Erlang community has developed to mitigate.

0 replies

luos · 2022-05-20T13:22:39Z

luos
May 20, 2022
Author

My suggestion would be to add a configuration option with sets rabbit_event process to high priority. Would you accept that change?

0 replies

michaelklishin · 2022-05-20T14:05:27Z

michaelklishin
May 20, 2022
Maintainer

Sure but our prior experience suggests that it's a band aid that won't help much. Have you
measured the effects of this?

@mkuratczyk @lhoguin @dumbbell FYI.

0 replies

lhoguin · 2022-05-20T14:35:00Z

lhoguin
May 20, 2022
Maintainer

I would highly suggest never using high priority for anything other than distribution processes and other similar processes deeply tied to the inner workings of the VM. It will otherwise mess things up far more than it helps.

0 replies

mkuratczyk · 2022-07-22T12:14:02Z

mkuratczyk
Jul 22, 2022
Maintainer

@luos would you be able to test #5301? Ideally with and without the additional rabbit_event change suggested in my comment?

1 reply

luos Jul 22, 2022
Author

Thank you, I will try to do it next week if I can. The change is definitely a good idea. 👍

If you are looking into this issue generally, I'd also suggest to see if removing flush from the call below would improve performance as I saw pg_local just scanning the mailbox forever. This is also related to channel and connection tracking but another mechanism as they connections enter / exit the group.

https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbit/src/pg_local.erl#L214

Though your numbers show that this is already a good improvement! 💯

luos · 2022-07-22T12:56:15Z

luos
Jul 22, 2022
Author

Regarding earlier discussions, we set the rabbit_event priority to high but the cluster still locked up, even though rabbit_event was empty, so definitely the bottleneck is somewhere else, so probably that can not be a (full) solution. We were also suspecting mnesia but then just refactored stuff to not have high churn...

0 replies

mkuratczyk · 2022-08-05T17:23:11Z

mkuratczyk
Aug 5, 2022
Maintainer

I just noticed this PR. Haven't tested but there's a pretty good chance that it will further improve the situation (although moving tracking tables to ETS was probably the most important step): erlang/otp#6199

0 replies

Rabbit Event can become a bottleneck and crash RabbitMQ with OOM #4872

Uh oh!

luos May 20, 2022

Replies: 8 comments · 1 reply

Uh oh!

Uh oh!

michaelklishin May 20, 2022 Maintainer

Uh oh!

michaelklishin May 20, 2022 Maintainer

Uh oh!

luos May 20, 2022 Author

Uh oh!

michaelklishin May 20, 2022 Maintainer

Uh oh!

lhoguin May 20, 2022 Maintainer

Uh oh!

mkuratczyk Jul 22, 2022 Maintainer

Uh oh!

luos Jul 22, 2022 Author

Uh oh!

Uh oh!

luos Jul 22, 2022 Author

Uh oh!

mkuratczyk Aug 5, 2022 Maintainer

luos
May 20, 2022

Replies: 8 comments 1 reply

michaelklishin
May 20, 2022
Maintainer

michaelklishin
May 20, 2022
Maintainer

luos
May 20, 2022
Author

michaelklishin
May 20, 2022
Maintainer

lhoguin
May 20, 2022
Maintainer

mkuratczyk
Jul 22, 2022
Maintainer

luos Jul 22, 2022
Author

luos
Jul 22, 2022
Author

mkuratczyk
Aug 5, 2022
Maintainer