-
We run six different Kubernetes clusters with three nodes each. In the last week, after updating to 3.13.4, we have had problems where, on a node image update, the quorum queues on a node go into a `noproc` state. It does not happen every time, but it happens more than half of the time, and it usually affects only one of the nodes. We have never enabled Khepri on any of our clusters. We have been running RabbitMQ on Kubernetes for around 3.5 years and quorum queues for about a year, and we have not seen an issue like this before.

Once this happens, the only fix we have found is to delete the /opt/bitnami/rabbitmq/.rabbitmq/mnesia/[email protected]/quorum/[email protected] directory on the node with the `noproc` queues and then restart all of the nodes.

The version update from 3.13.3 to 3.13.4 itself seemed fine, but the next time the nodes were restarted the issue occurred. We went back to 3.13.3 and a restart produced the same problem. However, we had been running 3.13.3 for over a month, with some node restarts, without ever seeing this; it only started after we moved to 3.13.4, yet now it also occurs on 3.13.3 after restarting the nodes. We cannot try 3.13.2 because the message_containers_deaths_v2 feature flag has been enabled.

I am attaching the log of the failing node after the restart. I did not capture the logs from the previous shutdown, but I can probably pull them from our log storage if needed. We could also recreate the cluster with a version before 3.13.3 in one of our dev environments if necessary.
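For reference, here is roughly what that workaround looks like as a shell sketch. The namespace, pod, and StatefulSet names are placeholders for our environment, and `<node-name>` stands in for the obfuscated node name in the path above, so adjust everything before running it.

```bash
# Placeholder names -- adjust NAMESPACE, STS, POD and the node directory
# for your own deployment. <node-name> is the node's directory name
# (obfuscated in the path above).
NAMESPACE=rabbitmq
STS=rabbitmq
POD=rabbitmq-1
QUORUM_DIR='/opt/bitnami/rabbitmq/.rabbitmq/mnesia/<node-name>/quorum/<node-name>'

# Delete the quorum (Raft) data directory on the node whose queues are in noproc.
kubectl exec -n "$NAMESPACE" "$POD" -- rm -rf "$QUORUM_DIR"

# Then restart all of the nodes in the cluster.
kubectl rollout restart "statefulset/$STS" -n "$NAMESPACE"
kubectl rollout status "statefulset/$STS" -n "$NAMESPACE"
```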
-
Hi! The warning comes from the code that is responsible for selecting the correct database, so it runs regardless of which database is active. The situation is harmless and won't cause any issues; in this case, Mnesia is correctly selected. In other words, you can ignore the warning.
-
I have a similar problem. Our system also logs about a hundred of these lines per second. In our case, however, besides the warnings, after a few hours of operation the problematic node also drops out of the cluster. The node comes back after a stop and start. We are now planning to replace the problematic node with a new one.
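In case it helps others seeing the same thing, this is roughly how we check whether a node has dropped out and do the stop and start in place. The pod and namespace names are placeholders, and it assumes a Bitnami-style image where `rabbitmqctl` and `rabbitmq-diagnostics` are on the PATH inside the pod.

```bash
# Placeholder pod/namespace names for illustration.
NAMESPACE=rabbitmq
POD=rabbitmq-2

# See which nodes the cluster currently considers members and running.
kubectl exec -n "$NAMESPACE" "$POD" -- rabbitmqctl cluster_status

# Basic health checks on the suspect node.
kubectl exec -n "$NAMESPACE" "$POD" -- rabbitmq-diagnostics check_running
kubectl exec -n "$NAMESPACE" "$POD" -- rabbitmq-diagnostics check_port_connectivity

# The stop and start mentioned above, done in place on that node.
kubectl exec -n "$NAMESPACE" "$POD" -- rabbitmqctl stop_app
kubectl exec -n "$NAMESPACE" "$POD" -- rabbitmqctl start_app
```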
-
Is there any chance you could share the quorum directory with me, taken when one of these events happens?
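If it helps, here is a sketch of how the directory could be captured from the affected pod while the bad state is present. The names below are assumptions about the deployment, and `<node-name>` again stands for the obfuscated node name from the path above.

```bash
# Placeholder names; adjust for your deployment.
NAMESPACE=rabbitmq
POD=rabbitmq-1
MNESIA_DIR='/opt/bitnami/rabbitmq/.rabbitmq/mnesia/<node-name>'

# Archive the quorum directory inside the pod while the bad state is present...
kubectl exec -n "$NAMESPACE" "$POD" -- tar czf /tmp/quorum.tar.gz -C "$MNESIA_DIR" quorum

# ...then copy the archive out of the pod so it can be shared.
kubectl cp "$NAMESPACE/$POD:/tmp/quorum.tar.gz" "./quorum-$POD.tar.gz"
```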
-
https://github.com/rabbitmq/rabbitmq-server/releases/tag/v3.13.5 is out. Our team cannot comment on when Bitnami images might adopt this version. You are welcome to politely ask them to do it ASAP.
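Once a Bitnami image built on 3.13.5 is available, switching to it is typically just a matter of pointing the chart at the new tag. A minimal sketch, assuming the Bitnami `rabbitmq` Helm chart and its usual `image.tag` value; the exact tag is whatever Bitnami publishes for 3.13.5, so check their chart and image listings first.

```bash
# Assumes the Bitnami "rabbitmq" chart; the tag below is illustrative --
# substitute the actual 3.13.5-based tag Bitnami publishes.
helm upgrade rabbitmq oci://registry-1.docker.io/bitnamicharts/rabbitmq \
  --namespace rabbitmq \
  --reuse-values \
  --set image.tag=3.13.5
```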
Here is a PR that addresses this issue: #11758