Improve rabbitmq-management-agent gc handling #9320

SimonUnge · 2023-09-06T23:20:52Z

SimonUnge
Sep 6, 2023
Maintainer

Note: I debated for a while whether this should be an 'issue' or a 'discussion', and clearly ended up here.

Also note that this is a case of the following (though I still found it interesting):

Doctor, it hurts when I do X
Then don't do X

Setup: 3-node cluster, CMQ (classic mirrored queues).

We recently noticed a scenario where a client created a lot of consumers (over 100k) in a single channel, and then the client died.
This led to CPU usage on the leader node shooting through the roof, and the node became very unresponsive for a long time (tens of minutes, stabilizing within an hour or so). The other nodes occasionally thought the node was down, CLI commands could not be run, and so on.

I reproduced the scenario with a single broker, ran some flamegraphs, and concluded that the issue was with the ETS deletions in the rabbit_mgmt_metrics_gc module. With the mgmt plugin turned off, the same test did not lead to any abnormal CPU usage.

So, I looked into how the cleanup works, and I think we can make some improvements.
First, my observations:

When a channel dies like that, RabbitMQ produces a bunch of events. Among others, it produces a channel_closed event and, for each consumer on that channel, a consumer_deleted event.

rabbitmq-management-agent subscribes to both.
For channel_closed, it cleans up, among other things, consumer_stats and, per consumer, the consumer_stats_queue_index and consumer_stats_channel_index tables.

For consumer_deleted, it cleans up consumer_stats and, for that consumer, the consumer_stats_queue_index and consumer_stats_channel_index tables.

So, there is a bit of duplicated work there.
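
To make the overlap concrete, here is a minimal sketch using a single toy ETS table. The module name, the handle/2 function, and the row shape ({{Queue, Channel, Tag}, Props}) are all assumptions made for illustration; this is not the plugin's code. It only shows that once one of the two events has cleaned up a channel's rows, the other event has nothing left to remove.

```erlang
-module(dup_cleanup_sketch).
-export([demo/0]).

%% Hypothetical toy handlers. The real handlers live in the
%% management agent and touch more tables than this.
handle({channel_closed, Ch}, Stats) ->
    %% Remove every row whose key mentions the channel.
    ets:match_delete(Stats, {{'_', Ch, '_'}, '_'});
handle({consumer_deleted, Key}, Stats) ->
    %% Remove the single row for one consumer.
    ets:delete(Stats, Key).

demo() ->
    Stats = ets:new(stats, [set]),
    Keys = [{q1, ch_a, Tag} || Tag <- lists:seq(1, 5)],
    lists:foreach(fun(K) -> ets:insert(Stats, {K, []}) end, Keys),
    handle({channel_closed, ch_a}, Stats),
    AfterClose = ets:info(Stats, size),
    %% The per-consumer events find nothing left to delete.
    lists:foreach(fun(K) -> handle({consumer_deleted, K}, Stats) end, Keys),
    AfterConsumers = ets:info(Stats, size),
    {AfterClose, AfterConsumers}.
```

Calling dup_cleanup_sketch:demo() returns {0, 0}: the table is already empty after the channel_closed cleanup, so the five consumer_deleted calls are pure overhead in this toy setup.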

I suggest that when a channel goes down, in rabbit_amqqueue_process:handle_ch_down/2, we do not produce the consumer_deleted event per consumer tag, as that cleanup will be handled by the channel_closed event regardless. We still need to call rabbit_core_metrics:consumer_deleted/3 though, otherwise there will be a weird condition where the table entries keep getting recreated. Unless there are plugins out there that rely on getting both events?

The above change will not improve the CPU usage much though, as the channel_closed handler will still loop over all consumers and call ets:delete or ets:delete_object for each one. So, instead of doing that, I suggest calling ets:match_delete. In rabbit_mgmt_metrics_gc:index_delete, stop doing this:

index_delete(Table, Type, Id) ->
    IndexTable = rabbit_mgmt_metrics_collector:index_table(Table, Type),
    Keys = ets:lookup(IndexTable, Id),
    [ begin
          ets:delete(Table, Key),
          cleanup_index(Table, Key)
      end
      || {_Index, Key} <- Keys ],
    ets:delete(IndexTable, Id),
    ok.

And instead do the following:

index_delete(consumer_stats = Table, channel = Type, Id) ->
    IndexTable = rabbit_mgmt_metrics_collector:index_table(Table, Type),
    MatchPattern = {'_', Id, '_'},
    %% Delete consumer_stats_queue_index
    ets:match_delete(consumer_stats_queue_index,
                     {'_', MatchPattern}),
    %% Delete consumer_stats
    ets:match_delete(consumer_stats,
                     {MatchPattern,'_'}),
    %% Delete consumer_stats_channel_index
    ets:delete(IndexTable, Id),
    ok;

I've only tried the above for the consumer_stats table, and I imagine the MatchPattern might differ for other tables.
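
For anyone who wants to play with the two approaches outside the broker, here is a self-contained sketch on toy tables. Everything in it is an assumption for illustration: the module name, the helper names, and the row shapes ({{Queue, Channel, Tag}, Props} for stats, {Queue, Key} and {Channel, Key} for the two indexes) are hypothetical stand-ins, not the plugin's actual tables. It only checks that the per-key loop and the ets:match_delete variant leave the same rows behind.

```erlang
-module(gc_sketch).
-export([new_tables/0, populate/4, delete_per_key/2,
         delete_by_match/2, demo/0]).

%% Hypothetical toy tables, unnamed so the demo can be rerun.
new_tables() ->
    #{stats => ets:new(stats, [set, public]),
      queue_index => ets:new(queue_index, [bag, public]),
      channel_index => ets:new(channel_index, [bag, public])}.

%% Insert N consumers for one queue/channel pair.
populate(#{stats := S, queue_index := QI, channel_index := CI},
         Queue, Channel, N) ->
    lists:foreach(
      fun(Tag) ->
              Key = {Queue, Channel, Tag},
              ets:insert(S, {Key, []}),
              ets:insert(QI, {Queue, Key}),
              ets:insert(CI, {Channel, Key})
      end, lists:seq(1, N)).

%% Per-key approach: one index lookup, then two ETS calls per consumer.
delete_per_key(#{stats := S, queue_index := QI, channel_index := CI},
               Channel) ->
    Keys = ets:lookup(CI, Channel),
    lists:foreach(
      fun({_Index, {Queue, _, _} = Key}) ->
              ets:delete(S, Key),
              ets:delete_object(QI, {Queue, Key})
      end, Keys),
    ets:delete(CI, Channel),
    ok.

%% Pattern approach: one ets:match_delete call per table.
delete_by_match(#{stats := S, queue_index := QI, channel_index := CI},
                Channel) ->
    Pattern = {'_', Channel, '_'},
    ets:match_delete(QI, {'_', Pattern}),
    ets:match_delete(S, {Pattern, '_'}),
    ets:delete(CI, Channel),
    ok.

%% Run both approaches on identical data and report remaining row counts.
demo() ->
    Sizes = fun(#{stats := S, queue_index := QI, channel_index := CI}) ->
                    {ets:info(S, size), ets:info(QI, size),
                     ets:info(CI, size)}
            end,
    T1 = new_tables(),
    populate(T1, q1, ch_a, 1000),
    populate(T1, q1, ch_b, 10),
    ok = delete_per_key(T1, ch_a),
    T2 = new_tables(),
    populate(T2, q1, ch_a, 1000),
    populate(T2, q1, ch_b, 10),
    ok = delete_by_match(T2, ch_a),
    {Sizes(T1), Sizes(T2)}.
```

gc_sketch:demo() returns {{10,10,10},{10,10,10}}: in both cases only the ten consumers on ch_b survive. One thing to keep in mind is that ets:match_delete with an unbound key has to scan the whole table, so the trade-off is one full scan per table against a number of individual deletes proportional to the consumer count.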

With the above changes, I saw no CPU issue killing 100k consumers in one go.

Thoughts? Am I missing something vital?

I have a draft PR but would like to have this discussed here first.

Answered by michaelklishin


michaelklishin · 2023-09-08T00:32:47Z

michaelklishin
Sep 8, 2023
Maintainer

@SimonUnge please submit a PR, both changes sound good to me.

I have changed my mind on whether we should keep the consumer_deleted events in place. These internal events are used by audit and monitoring systems. In this particular case, we are not really dealing with consumer cancellation, so I suspect there isn't much use to emitting the consumer_deleted event. In fact, it may even be counterintuitive and counterproductive to delete them.

0 replies
