Describe the bug
We occasionally find in our GKE RabbitMQ cluster (3 nodes) that memory alarms get triggered, causing issues with publishes. Digging through the docs, I've learned that:
- Quorum queues can use up to 512 MB of memory for write-ahead logs
- We don't specify resource requests for our cluster, so by default the cluster operator requests 2Gi of memory for each node
- RabbitMQ sets a high memory watermark at 40% of available memory (655 MiB here), and once that's exceeded an alarm is triggered that blocks new publishes
It seems that, for a cluster using these defaults, a memory alarm will almost certainly be triggered at some point, even with just one quorum queue. There are reports of people hitting this issue unless they lower the default WAL size limit.
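
For reference, this is roughly the kind of override we're considering as a workaround. It's only a sketch: the cluster name and values are placeholders, and I haven't confirmed they are the recommended numbers, just that `raft.wal_max_size_bytes` and explicit resource requests appear to be the two knobs involved:

```yaml
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: my-rabbitmq              # placeholder name
spec:
  replicas: 3
  resources:
    requests:
      memory: 4Gi                # raise the request so 40% of it leaves more headroom
    limits:
      memory: 4Gi
  rabbitmq:
    additionalConfig: |
      # cap the per-node quorum queue WAL well below the watermark (64 MB here)
      raft.wal_max_size_bytes = 64000000
```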
I was also unclear on what happens when these alarms are set. The docs say that publishes are blocked, but is that on the offending node only, or across the whole cluster? I did find this in the AWS MQ docs:

> In cluster deployments, queues might experience paused synchronization of messages between replicas on different nodes. Paused queue syncs prevent consumption of messages from queues and must be addressed separately while resolving the memory alarm.
So a couple questions:
- Do memory alarms indeed block publishes to all nodes? Or just the one?
- If so, why is this the case? Is there any way to mitigate the risk of the entire cluster being paused because one node triggered an alarm?
- If not, can a memory alarm on one node cause quorum queue replicas on other nodes to stop accepting publishes too, like what's described in the AWS docs? (The checks I was planning to run are sketched after this list.)
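
I haven't verified the answer myself; for what it's worth, these are the checks I was planning to run to see where an alarm is actually in effect (namespace and pod names are placeholders):

```shell
# Alarms as reported across the cluster
kubectl -n rabbitmq exec my-rabbitmq-server-0 -- rabbitmq-diagnostics alarms

# Exits non-zero if an alarm is in effect on this particular node
kubectl -n rabbitmq exec my-rabbitmq-server-0 -- rabbitmq-diagnostics check_local_alarms
```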
Regardless, it seems to me that more sensible defaults could be configured here.
To Reproduce
Steps to reproduce the behavior:
- Deploy a cluster using RabbitMQ Cluster Kubernetes Operator
- Publish to and consume from a quorum queue on the cluster
- Observe memory usage increase on a node until a memory alarm is set
- From the memory use reporting on that node, observe that quorum queue tables are what's growing and causing the alarm (the commands we used are sketched after this list)
- Publishes start getting blocked, even though the other two nodes are under the high watermark
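
The memory use reporting mentioned above came from the per-node breakdown, roughly like this (namespace and pod names are placeholders):

```shell
# Per-category memory use on the node under pressure; the quorum queue related rows are the ones growing
kubectl -n rabbitmq exec my-rabbitmq-server-0 -- rabbitmq-diagnostics memory_breakdown --unit "MB"
```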
Expected behavior
By default, I expect the quorum queue WAL size threshold and the cluster operator's memory requests to work together so that memory alarms aren't triggered by normal usage of quorum queues.
Version and environment information
- RabbitMQ: Cluster operator default (3.10.2)
- RabbitMQ Cluster Operator: 2.1.0
- Kubernetes: 1.26.6-gke.1700
- Cloud provider or hardware configuration: GKE autopilot