New node in a cluster with thousands of Shovels takes too long to restart (more than 40 minutes) #9714
-
Describe the bug

Hi,

System Specifications:
Cluster Specification:
Issue in detail: the typical shovel configuration looks like the one below; both the source and the destination are within the same cluster.

```json
{
  "component": "shovel",
  "name": "to-vhost-a",
  "value": {
    "ack-mode": "on-confirm",
    "dest-add-forward-headers": true,
    "dest-exchange": "command",
    "dest-protocol": "amqp091",
    "dest-uri": "amqp:///vhost-a",
    "src-delete-after": "never",
    "src-exchange": "command",
    "src-exchange-key": "*.<uuid>.#",
    "src-prefetch-count": 1,
    "src-protocol": "amqp091",
    "src-uri": "amqp:///vhost-b"
  },
  "vhost": "vhost-b"
}
```

Also note that importing the original definitions into a fresh 3-node cluster only takes about 10 minutes.

Reproduction steps
Expected behavior

The 3rd node should finish starting up in about 10 minutes at most (comparable to a fresh definitions import) and bring the AMQP and HTTP/Prometheus ports online.

Additional context

The reason we have such a high number of shovels is that we need to pass messages between each customer vhost and our own managed-services vhost. We do so at the exchange level (with both command and event exchanges).
-
Starting hundreds of virtual hosts can take a long time, since every virtual host waits for all nodes to report success.
-
@yohanAnushka any time you report a software issue, you must say what version you are using. In this case it would be:
-
@yohanAnushka a large number of vhosts is almost certainly the culprit. There's a lot of Mnesia locking going on in such scenarios. It's worth mentioning, however, that we are getting close to replacing Mnesia altogether: RabbitMQ 3.13-rc1 shipped last week and already adds experimental (opt-in) support for Khepri, our new metadata store. I just ran a quick test to import 1000 empty vhosts and then stop/start a node (no clustering). With Mnesia, I get:
With Khepri (in 3.13, you need to do
This is not to say that every scenario will see this level of improvement, but we are addressing your issue as part of this multi-year effort to replace how metadata is stored and replicated within the cluster. It would be fantastic if you could re-run your tests with 3.13-rc1 and Khepri enabled (again, you need to enable it first).
It should get much faster in the next 3.11/3.12 patch releases: from a few minutes to a few seconds to import 1000 shovels on a single node, and from 20-30 minutes to just seconds to scale out the cluster. Thanks for reporting this.
#9785