Node has consumed 500 GiB of disk space within a short time window #8871
Replies: 4 comments
-
Currently you are asking us to mostly guess how to reproduce your issue. Two log files and some prose are insufficient. Please provide the following:
Thanks
-
RabbitMQ 3.9 reached end of life for users without a support subscription some six months ago. Classic mirrored queues are deprecated. Disk space usage cannot be considered a bug in itself: it is almost always a function of what applications do, sometimes intentionally and sometimes not. See the scenarios described in this discussion in the context of quorum queues.

My best guess without any data to work with

Neither short-lived connections nor classic queue mirroring per se contribute to such massive spikes in disk space usage. When that happens, nodes will continue accumulating messages in memory because applications […]

How modern versions are different

Classic queues v2 and modern quorum queues move data to disk with a very small "working set" in memory, ignoring […]

Or the disk space was used by something else; it's not uncommon to see people co-locate services […]
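Not from the original comment, but as an illustration of the modern approach mentioned above: a minimal sketch, assuming the Python pika client, of declaring a quorum queue instead of a classic mirrored one. The host, credentials and queue name are placeholders.

```python
import pika

# Sketch (not from the thread): declare a quorum queue with the pika client
# as an alternative to deprecated classic mirrored queues.
# Host, credentials and queue name are placeholders.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

# Quorum queues must be declared durable; the x-queue-type argument selects the type.
channel.queue_declare(
    queue="example-queue",
    durable=True,
    arguments={"x-queue-type": "quorum"},
)

connection.close()
```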
-
@lukebakken unless this behavior can be reproduced with 3.11 or 3.12, we should disengage. "End of life" for a series means "end of life", in particular since the OpenStack community in general never contributes or pays for support.
-
One more relevant topic here is the ratio of the node memory limit to its available disk space. Historically the recommendation has been one to one. For a node to use up 500 GiB of disk space, it must either have a comparable backlog (including unconfirmed messages) or a comparable heap size. For OpenStack installations the latter is fairly rare to see. That makes me think of another couple of hypotheses: something else on the host has used up the disk space, or RabbitMQ uses a different filesystem volume from the one being monitored.
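One way to test the last hypothesis is to compare what the node itself reports with the external monitoring. A minimal sketch, assuming the rabbitmq_management plugin is enabled and default guest/guest credentials on localhost:15672 (adjust for a real deployment):

```python
import requests

# Sketch: compare what each node itself reports with external disk monitoring.
# Assumes the rabbitmq_management plugin and default guest/guest credentials
# on localhost:15672.
resp = requests.get("http://localhost:15672/api/nodes", auth=("guest", "guest"))
resp.raise_for_status()

for node in resp.json():
    print(node["name"])
    print("  memory used / limit:", node["mem_used"], "/", node["mem_limit"])
    print("  disk free / limit:  ", node["disk_free"], "/", node["disk_free_limit"])
```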
-
Describe the bug
We have had issues running RabbitMQ 3.9.24 for our OpenStack services with mirrored queues: we've observed quite frequent crashes with a "no space left" error, even though the filesystem that hosts RabbitMQ has ~500 GiB of free space. We set the low disk space watermark to 10 GiB in order to try to catch what is actually writing so much data, but with no luck: the node still crashed. After the RabbitMQ crash the disk space is reclaimed and frees up instantly.
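Not part of the original report, but a sketch of how one might poll the management API during such a spike to see whether any queue backlog grows along with the disk usage; it assumes the management plugin and guest/guest credentials on localhost:15672.

```python
import time
import requests

# Sketch: poll the management API during a disk-usage spike and print the five
# queues holding the most message bytes. Assumes guest/guest on localhost:15672.
URL = "http://localhost:15672/api/queues?columns=name,messages,message_bytes"

while True:
    queues = requests.get(URL, auth=("guest", "guest")).json()
    for q in sorted(queues, key=lambda q: q.get("message_bytes", 0), reverse=True)[:5]:
        print(q["name"], q.get("messages", 0), "messages,", q.get("message_bytes", 0), "bytes")
    print("---")
    time.sleep(30)
```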

Log provided below:
rabbitmq_crash.log
erl crash dump:
erl_crash_dump.log
Screenshot of our disk space monitoring, showing it filling up quickly:
I would appreciate support and guidance on how to resolve this issue.
Reproduction steps
Expected behavior
A running RabbitMQ node does not take up 500 GiB of disk space out of nowhere, and when the disk resource limit alarm is raised it should not keep writing to disk.
Additional context
No response