-
We run six different Kubernetes clusters with three nodes each. In the last week, after updating to 3.13.4, we have had problems where, on a node image update, the quorum queues on a node go into a `noproc` state. It does not happen every time, but it happens more than half of the time, and it usually affects only one of the nodes. We have never enabled Khepri on any of our clusters. We have been running RabbitMQ on Kubernetes for around 3.5 years and quorum queues for about a year, and we have not seen an issue like this before.

Once this happens, the only fix we have found is to delete the /opt/bitnami/rabbitmq/.rabbitmq/mnesia/[email protected]/quorum/[email protected] directory on the node with the `noproc` queues and then restart all of the nodes.

The version update from 3.13.3 to 3.13.4 itself seemed fine, but the next time the nodes were restarted the issue occurred. We went back to 3.13.3 and a restart produced the same problem. However, we had been running 3.13.3 for over a month, with some node restarts, without ever seeing this; it only started after we moved to 3.13.4, yet now it also occurs on 3.13.3 after restarting the nodes. We cannot try 3.13.2 because the message_containers_deaths_v2 feature flag has been enabled.

I am attaching the log of the failing node after the restart. I did not capture the logs from the previous shutdown, but I can probably pull them from our log storage if needed. We could also recreate the cluster with a version before 3.13.3 in one of our dev environments if necessary.
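For reference, here is roughly what that workaround looks like as a shell sketch. The namespace, pod, and StatefulSet names are placeholders for our environment, and `<node-name>` stands in for the obfuscated node name in the path above, so adjust everything before running it.

```bash
# Placeholder names -- adjust NAMESPACE, STS, POD and the node directory
# for your own deployment. <node-name> is the node's directory name
# (obfuscated in the path above).
NAMESPACE=rabbitmq
STS=rabbitmq
POD=rabbitmq-1
QUORUM_DIR='/opt/bitnami/rabbitmq/.rabbitmq/mnesia/<node-name>/quorum/<node-name>'

# Delete the quorum (Raft) data directory on the node whose queues are in noproc.
kubectl exec -n "$NAMESPACE" "$POD" -- rm -rf "$QUORUM_DIR"

# Then restart all of the nodes in the cluster.
kubectl rollout restart "statefulset/$STS" -n "$NAMESPACE"
kubectl rollout status "statefulset/$STS" -n "$NAMESPACE"
```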
-
Hi! The warning comes from the code that is responsible for selecting the correct database, so it runs regardless of which database is active. The situation is harmless and won't cause any issues; in this case, Mnesia is correctly selected. In other words, you can ignore the warning.
-
I have a similar problem. Our system also logs about a hundred of these lines per second. In our case, however, besides the warnings, after a few hours of operation the problematic node also drops out of the cluster. The node comes back after a stop and start. We are now planning to replace the problematic node with a new one.
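In case it helps others seeing the same thing, this is roughly how we check whether a node has dropped out and do the stop and start in place. The pod and namespace names are placeholders, and it assumes a Bitnami-style image where `rabbitmqctl` and `rabbitmq-diagnostics` are on the PATH inside the pod.

```bash
# Placeholder pod/namespace names for illustration.
NAMESPACE=rabbitmq
POD=rabbitmq-2

# See which nodes the cluster currently considers members and running.
kubectl exec -n "$NAMESPACE" "$POD" -- rabbitmqctl cluster_status

# Basic health checks on the suspect node.
kubectl exec -n "$NAMESPACE" "$POD" -- rabbitmq-diagnostics check_running
kubectl exec -n "$NAMESPACE" "$POD" -- rabbitmq-diagnostics check_port_connectivity

# The stop and start mentioned above, done in place on that node.
kubectl exec -n "$NAMESPACE" "$POD" -- rabbitmqctl stop_app
kubectl exec -n "$NAMESPACE" "$POD" -- rabbitmqctl start_app
```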
-
Is there any chance you could share the quorum directory with me, taken when one of these events happens?
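If it helps, here is a sketch of how the directory could be captured from the affected pod while the bad state is present. The names below are assumptions about the deployment, and `<node-name>` again stands for the obfuscated node name from the path above.

```bash
# Placeholder names; adjust for your deployment.
NAMESPACE=rabbitmq
POD=rabbitmq-1
MNESIA_DIR='/opt/bitnami/rabbitmq/.rabbitmq/mnesia/<node-name>'

# Archive the quorum directory inside the pod while the bad state is present...
kubectl exec -n "$NAMESPACE" "$POD" -- tar czf /tmp/quorum.tar.gz -C "$MNESIA_DIR" quorum

# ...then copy the archive out of the pod so it can be shared.
kubectl cp "$NAMESPACE/$POD:/tmp/quorum.tar.gz" "./quorum-$POD.tar.gz"
```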
-
https://github.com/rabbitmq/rabbitmq-server/releases/tag/v3.13.5 is out. Our team cannot comment on when Bitnami images might adopt this version. You are welcome to politely ask them to do it ASAP.
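Once a Bitnami image built on 3.13.5 is available, switching to it is typically just a matter of pointing the chart at the new tag. A minimal sketch, assuming the Bitnami `rabbitmq` Helm chart and its usual `image.tag` value; the exact tag is whatever Bitnami publishes for 3.13.5, so check their chart and image listings first.

```bash
# Assumes the Bitnami "rabbitmq" chart; the tag below is illustrative --
# substitute the actual 3.13.5-based tag Bitnami publishes.
helm upgrade rabbitmq oci://registry-1.docker.io/bitnamicharts/rabbitmq \
  --namespace rabbitmq \
  --reuse-values \
  --set image.tag=3.13.5
```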
Here is a PR that addresses this issue: #11758