Cluster with all Quorum Queues performance limits #8113
Replies: 5 comments 14 replies
-
40K-50K messages a second per quorum queue is reasonable. You can grow up to a point by using more queues. With PerfTest, that may require a different set of queue name patterns. Network I/O, disk I/O, and CPU load can all be good indicators. Queue leader distribution can help, too. If you need more than 50K per queue, you can always use streams. You can reach several million messages a second with a few streams or superstreams with three replicas. Note that streams in part achieve this via parallelism: clients connect to ALL replicas when they can. If all connections go via a load balancer, the results will differ.
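For instance, a minimal PerfTest sketch that declares a set of quorum queues via a queue name pattern and spreads producers and consumers across them (every value here is an illustrative placeholder, not a setting taken from this thread):

```
# Placeholder URI, queue count and rates -- adjust to your environment.
# Invoke PerfTest however it is installed (Docker image, or
# bin/runjava com.rabbitmq.perf.PerfTest from the distribution).
perf-test \
  --uri amqp://user:pass@rabbitmq:5672 \
  --queue-pattern 'qq-%d' --queue-pattern-from 1 --queue-pattern-to 50 \
  --quorum-queue \
  --producers 50 --consumers 50 \
  --rate 1000 --confirm 100
```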
-
Hey Michael, thanks for your reply. 40-50K with X queues: I would scale the StatefulSet to X replicas, and each replica would create its own queue.
I looked at network, disk and CPU; all look healthy, which is why I'm a bit puzzled as to what is hitting the limit.
-
Thanks, the clients are distributed through the cluster DNS.
We are not looking to use streams; we are trying to understand why the RabbitMQ queues are not providing more performance given no signs of resource exhaustion.
On Sat, 6 May 2023 at 15:53, Michael Klishin wrote:
If clients predominantly connect to one node, it won't help to have more nodes. Streams do not have this problem when clients can discover and connect to the specific nodes they need. The stream protocol has this feature, but none of the messaging protocols RabbitMQ supports do.
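One way to check whether connections and queue leaders really are skewed toward a single node, as a sketch using the standard CLI tools (the `leader` queue info item assumes a reasonably recent RabbitMQ release):

```
# Where do client connections land? One line per connection, with its node.
rabbitmqctl list_connections node

# Where do the quorum queue leaders live?
rabbitmqctl list_queues name type leader

# If leaders have piled up on one node, spread them across the cluster.
rabbitmq-queues rebalance quorum
```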
-
It looks like you're connecting to a load balancer; more than likely this results in all queue leaders running on the same node. @michaelklishin alluded to that in this comment. I recommend taking the load balancer out of the equation and pointing your connections at the individual nodes, something like this:
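Something along these lines, shown here as one PerfTest instance per cluster node so that connections, and therefore the leaders of the queues they declare, end up spread across the nodes (the hostnames are hypothetical Kubernetes pod/headless-service addresses, not taken from this thread):

```
# Point each PerfTest instance (or pod) at a specific node instead of the
# load-balanced address. Hostnames and credentials below are made up.
perf-test --uri amqp://user:pass@my-rabbit-server-0.my-rabbit-nodes:5672 ...
perf-test --uri amqp://user:pass@my-rabbit-server-1.my-rabbit-nodes:5672 ...
perf-test --uri amqp://user:pass@my-rabbit-server-2.my-rabbit-nodes:5672 ...
```

The same idea applies to your own applications: give clients the individual node addresses rather than a single load balancer endpoint.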
-
Having a single metric to express "the server is overloaded" would be awesome, but would be extremely hard, if not impossible. Based on your description, I'm fairly sure the bottleneck for you is the shared write-ahead log (WAL).

I'm not clear on what your expectation is here. If you thought that since you can do 30k/s with a single queue you should be able to do 300k/s with 10 queues, then that's not something RabbitMQ can do (it'd be nice though! ;) ). We have a PoC branch with multiple WAL files (a queue, when it's declared, is assigned to one of them). This could help with scaling, but we need to dig it up and test again. It will also require more work, as we can foresee all kinds of additional challenges (users trying to change the number of WAL files up or down, etc).

We keep improving performance in many areas and would love to do that based on a real use case. If you can provide the details of a real workload you are planning to run (or wish RabbitMQ could handle), please share it.
-
Hi! We are running a cluster with around 50 queues and using PerfTest to test the limits of a given cluster setup, trying to understand which metrics will indicate that our cluster has no more capacity.
So far, we have been unable to find clear metrics showing the cluster reaching its limit, aside from the side effect of rising latency in the PerfTest numbers.
Setup:
c5.4xlarge
Cluster config:
PerfTest Config
With this setup we can reach around 30-40K msg/s, but even with more PerfTest pods we can't seem to get more throughput, and our PerfTest latencies start to go through the roof.
We tried:
Any help would be appreciated.
A bit of context: we are essentially trying to run RabbitMQ as a platform for our teams, but we have so far found no way of knowing when the cluster is reaching its limits, and we are struggling quite a bit to guarantee or tune performance in this multi-tenancy scenario (e.g. a client pushes fat messages, or suddenly shoves a lot of data (100K/s) at the cluster, etc.). So we are exploring multi-cluster and other options, but at its core we are struggling to get good observability into our cluster: every time we need to analyze good or bad performance, we have to dive into how each consumer/producer is using the cluster (is it using confirms, acks, etc.).
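For reference, that per-channel information (confirms, unacknowledged messages, prefetch) can also be pulled from the broker itself rather than from each application; a sketch assuming a reasonably recent rabbitmqctl:

```
# List channels with the fields relevant to publisher confirms and consumer acks.
rabbitmqctl list_channels name user confirm \
  messages_unconfirmed messages_unacknowledged prefetch_count consumer_count
```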