RabbitMQ on GKE: pod (node) failure due to insufficient free disk space #10611
-
Hello everyone, I'm encountering a recurring issue with RabbitMQ deployed on a Google Kubernetes Engine (GKE) cluster. RabbitMQ is installed with three replicas, but every morning one of the nodes fails due to insufficient disk space, which causes the other nodes to become blocked as well. The strange thing is that after restarting the affected pod, everything starts working fine again. However, the problem repeats itself every morning, and I'm not sure what's causing it.

Details:

Any insights or suggestions would be greatly appreciated. Thank you in advance for your help!
-
You haven't provided much information.
-
Please see this discussion: #10516 (comment)
-
Streams with long retention intervals and quorum queues with consumers that do not acknowledge messages (usually due to incorrect or incomplete error handling) are the two most likely root causes. There can be other reasons, of course. Without a breakdown of the node's data directory footprint, and possibly the age of the oldest segment file on disk for Raft-based features, we cannot tell, and we do not guess in this community.
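
If unacknowledged consumers are a suspect, it is worth double-checking that every consumer code path either acknowledges or explicitly rejects each delivery, including on error. Below is a minimal Python sketch using pika; the host, credentials, queue name, and the `process()` handler are placeholders for illustration, not anything taken from this thread.

```python
import pika


def process(body: bytes) -> None:
    # Hypothetical application logic; replace with real message handling.
    print("processing", body)


# Placeholder connection details; adjust host, credentials and queue name.
params = pika.ConnectionParameters(
    host="rabbitmq.example.internal",
    credentials=pika.PlainCredentials("app_user", "app_password"),
)
connection = pika.BlockingConnection(params)
channel = connection.channel()

# Cap unacknowledged deliveries per consumer so a slow or broken consumer
# cannot hold a large window of unacked messages open.
channel.basic_qos(prefetch_count=50)


def handle(ch, method, properties, body):
    try:
        process(body)
        ch.basic_ack(delivery_tag=method.delivery_tag)
    except Exception:
        # Reject without requeueing so a poison message does not loop forever;
        # route it to a dead-letter exchange instead if one is configured.
        ch.basic_nack(delivery_tag=method.delivery_tag, requeue=False)


channel.basic_consume(queue="work.orders", on_message_callback=handle)
channel.start_consuming()
```

With a bounded prefetch and an explicit ack or nack on every path, a consumer cannot quietly accumulate an ever-growing window of unacknowledged messages.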
-
According to the (very little) relevant data shared, this is a very generic question about peak disk space usage of a node.

The node has a backlog of at least 647K messages, maybe more, with lots of classic queues and no QQs or streams in sight. Therefore, less obvious situations like a stuck consumer at the head of a quorum queue probably do not apply.

Some relevant doc guides:
Disk is the cheapest resource out there in most environments, so when in doubt, overprovision it.
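
Before overprovisioning blindly, it can also help to confirm which queues actually hold the backlog and how many message payload bytes they account for. Here is a minimal sketch against the RabbitMQ management HTTP API; it assumes the management plugin is enabled, and the endpoint URL and credentials are placeholders for illustration.

```python
import requests

# Placeholders: management API endpoint and credentials for a monitoring user.
BASE_URL = "http://rabbitmq.example.internal:15672"
AUTH = ("monitoring_user", "monitoring_password")

# /api/queues returns one JSON object per queue, including its message
# backlog ("messages") and the payload bytes it currently holds
# ("message_bytes").
queues = requests.get(f"{BASE_URL}/api/queues", auth=AUTH, timeout=10).json()

# Print the ten queues holding the most message payload bytes.
for q in sorted(queues, key=lambda q: q.get("message_bytes", 0), reverse=True)[:10]:
    print(
        f"{q['vhost']}/{q['name']}: "
        f"{q.get('messages', 0)} messages, "
        f"{q.get('message_bytes', 0) / 1024 ** 2:.1f} MiB"
    )
```

That kind of per-queue breakdown, together with the data directory footprint mentioned above, is what makes disk usage questions like this one answerable.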