RabbitMQ on GKE: pod (node) failure due to insufficient free disk space #10611
-
Hello everyone, I'm encountering a recurring issue with RabbitMQ deployed on a Google Kubernetes Engine (GKE) cluster. RabbitMQ is installed with three replicas, but every morning one of the nodes fails due to insufficient disk space, which causes the other nodes to become blocked as well. The strange thing is that after restarting the affected pod, everything starts working fine again. However, the problem repeats itself every morning, and I'm not sure what's causing it.

Details:

Any insights or suggestions would be greatly appreciated. Thank you in advance for your help!
-
You haven't provided much information.
-
Please see this discussion: #10516 (comment)
-
Streams with long retention intervals and quorum queues with consumers that do not acknowledge messages (usually due to incorrect or incomplete error handling) are the two most likely root causes. There can be other reasons, of course. Without a breakdown of the node's data directory footprint, and possibly the age of the oldest segment file on disk for Raft-based features, we cannot tell, and we do not guess in this community.
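
If unacknowledged consumers are a suspect, it is worth double-checking that every consumer code path either acknowledges or explicitly rejects each delivery, including on error. Below is a minimal Python sketch using pika; the host, credentials, queue name, and the `process()` handler are placeholders for illustration, not anything taken from this thread.

```python
import pika


def process(body: bytes) -> None:
    # Hypothetical application logic; replace with real message handling.
    print("processing", body)


# Placeholder connection details; adjust host, credentials and queue name.
params = pika.ConnectionParameters(
    host="rabbitmq.example.internal",
    credentials=pika.PlainCredentials("app_user", "app_password"),
)
connection = pika.BlockingConnection(params)
channel = connection.channel()

# Cap unacknowledged deliveries per consumer so a slow or broken consumer
# cannot hold a large window of unacked messages open.
channel.basic_qos(prefetch_count=50)


def handle(ch, method, properties, body):
    try:
        process(body)
        ch.basic_ack(delivery_tag=method.delivery_tag)
    except Exception:
        # Reject without requeueing so a poison message does not loop forever;
        # route it to a dead-letter exchange instead if one is configured.
        ch.basic_nack(delivery_tag=method.delivery_tag, requeue=False)


channel.basic_consume(queue="work.orders", on_message_callback=handle)
channel.start_consuming()
```

With a bounded prefetch and an explicit ack or nack on every path, a consumer cannot quietly accumulate an ever-growing window of unacknowledged messages.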
-
According to the (very little) relevant data shared, this is a very generic question about peak disk space usage of a node.

The node has a backlog of at least 647K messages, maybe more, with lots of classic queues and no QQs or streams in sight. Therefore, less obvious situations like a stuck consumer at the head of a quorum queue probably do not apply.

Some relevant doc guides:
Disk is the cheapest resource out there in most environments, so when in doubt, overprovision it.
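
Before overprovisioning blindly, it can also help to confirm which queues actually hold the backlog and how many message payload bytes they account for. Here is a minimal sketch against the RabbitMQ management HTTP API; it assumes the management plugin is enabled, and the endpoint URL and credentials are placeholders for illustration.

```python
import requests

# Placeholders: management API endpoint and credentials for a monitoring user.
BASE_URL = "http://rabbitmq.example.internal:15672"
AUTH = ("monitoring_user", "monitoring_password")

# /api/queues returns one JSON object per queue, including its message
# backlog ("messages") and the payload bytes it currently holds
# ("message_bytes").
queues = requests.get(f"{BASE_URL}/api/queues", auth=AUTH, timeout=10).json()

# Print the ten queues holding the most message payload bytes.
for q in sorted(queues, key=lambda q: q.get("message_bytes", 0), reverse=True)[:10]:
    print(
        f"{q['vhost']}/{q['name']}: "
        f"{q.get('messages', 0)} messages, "
        f"{q.get('message_bytes', 0) / 1024 ** 2:.1f} MiB"
    )
```

That kind of per-queue breakdown, together with the data directory footprint mentioned above, is what makes disk usage questions like this one answerable.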