Quorum queue memory usage, and sensible defaults for node resource requests #1553

@tkoft

Describe the bug

We occasionally find that memory alarms get triggered in our 3-node RabbitMQ cluster on GKE, which causes issues with publishes. Digging through docs, I've learned that:

So for a cluster using these defaults, it seems a memory alarm will certainly be triggered at some point, even with just one quorum queue? There are reports of people hitting this issue unless they lower the default WAL size limit.
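A rough sketch of the arithmetic behind this concern. The numbers here are assumptions for illustration, not values read from our cluster: a 2 GiB container memory limit, a high watermark of 0.4 of that limit, and a 512 MiB WAL size limit. The function names are made up for this example; substitute whatever your deployment actually uses.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def alarm_threshold_bytes(memory_limit_bytes: int, high_watermark: float) -> int:
    """Memory usage at which a node would raise a memory alarm."""
    return int(memory_limit_bytes * high_watermark)


def headroom_after_wal(memory_limit_bytes: int, high_watermark: float,
                       wal_limit_bytes: int) -> int:
    """Bytes left under the alarm threshold if the WAL grows to its size limit."""
    return alarm_threshold_bytes(memory_limit_bytes, high_watermark) - wal_limit_bytes


# Assumed, illustrative values (not measured on the cluster in this report).
memory_limit = 2 * GIB
watermark = 0.4
wal_limit = 512 * MIB

threshold = alarm_threshold_bytes(memory_limit, watermark)
headroom = headroom_after_wal(memory_limit, watermark, wal_limit)

print(f"alarm threshold: {threshold / MIB:.0f} MiB")
print(f"headroom with a full WAL: {headroom / MIB:.0f} MiB")
```

Under these assumed numbers, a WAL at its size limit takes up more than half of the memory available before the alarm fires, leaving roughly 307 MiB for everything else on the node.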

I was also unclear on what happens when these alarms are set. The docs say that publishes are blocked, but is that on the offending node only or across the whole cluster? I did find this in the AWS MQ docs:

In cluster deployments, queues might experience paused synchronization of messages between replicas on different nodes. Paused queue syncs prevent consumption of messages from queues and must be addressed separately while resolving the memory alarm.

So a couple questions:

  • Do memory alarms indeed block publishes to all nodes? Or just the one?
  • If so, why is this the case? Is there any way to mitigate the risk of the entire cluster being paused from one node triggering an alarm?
  • If not, can a memory alarm on one node cause quorum queue replicas on other nodes to stop accepting publishes too, like what's described in the AWS docs?

Regardless, it seems to me that more sensible defaults could be configured here.

To Reproduce

Steps to reproduce the behavior:

  1. Deploy a cluster using RabbitMQ Cluster Kubernetes Operator
  2. Publish to and consume from a quorum queue on the instance
  3. Observe that memory usage increases on a node until a memory alarm is set
  4. From memory use reporting on the node, observe that quorum queue tables are what's growing and causing the alarm
  5. Publishes start getting blocked, even though two other nodes are under the high-watermark
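One way to check the last step, that only one node is over its limit, is to compare the per-node memory fields. This is a minimal sketch, assuming the JSON comes from the management plugin's `GET /api/nodes` endpoint and that each node object carries `name`, `mem_used`, `mem_limit`, and `mem_alarm`. The sample payload and node names below are made up for illustration and are not output from our cluster.

```python
import json


def nodes_in_alarm(nodes: list[dict]) -> list[str]:
    """Return the names of nodes whose memory alarm flag is set."""
    return [n["name"] for n in nodes if n.get("mem_alarm")]


def memory_ratio(node: dict) -> float:
    """Fraction of the node's memory limit that is currently in use."""
    return node["mem_used"] / node["mem_limit"]


# Hypothetical sample payload, shaped like a response from GET /api/nodes.
sample = json.loads("""
[
  {"name": "rabbit@server-0", "mem_used": 900, "mem_limit": 800, "mem_alarm": true},
  {"name": "rabbit@server-1", "mem_used": 300, "mem_limit": 800, "mem_alarm": false},
  {"name": "rabbit@server-2", "mem_used": 200, "mem_limit": 800, "mem_alarm": false}
]
""")

alarmed = nodes_in_alarm(sample)
for node in sample:
    print(f"{node['name']}: {memory_ratio(node):.0%} of limit, alarm={node['mem_alarm']}")
```

If `alarmed` lists a single node while publishes to the other nodes are still blocked, that matches the behavior described in this report.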

Expected behavior
By default, I expect the quorum queue WAL size threshold and the cluster operator's memory requests to work with each other so that memory alarms aren't triggered by normal usage of quorum queues.

Screenshots

Screenshot 2024-02-06 at 1 40 45 PM

Version and environment information

  • RabbitMQ: Cluster operator default (3.10.2)
  • RabbitMQ Cluster Operator: 2.1.0
  • Kubernetes: 1.26.6-gke.1700
  • Cloud provider or hardware configuration: GKE autopilot
