RabbitMQ stream coordinator fails to start after restoring a PVC snapshot backup taken with Kasten K10 #8060
-
We cannot suggest anything without full RabbitMQ logs from all nodes.
A node-level backup tool that simply stops the node, takes a volume snapshot (or something along those lines), and starts it again can result in a stream or quorum queue leader re-election. It is also important to understand what exactly this tool does on restore. RabbitMQ nodes assume they can contact their previously known peers, which specifically means that their hostnames must not change; if that's not the case, they will refuse to boot after a timeout.
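To collect those logs from a Bitnami-style StatefulSet, something along these lines should work (a minimal sketch; the pod and namespace names are assumptions, adjust them to your deployment):

```shell
# Collect logs from all three nodes (pod and namespace names are assumed).
for i in 0 1 2; do
  kubectl logs "rabbitmq-$i" --namespace rabbitmq > "rabbitmq-$i.log"
done

# Cluster membership as seen by one node:
kubectl exec rabbitmq-0 --namespace rabbitmq -- rabbitmqctl cluster_status
```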
-
Bitnami chart versioning does not make it easy to figure out what RabbitMQ version they ship. I would not speculate further without logs and the stream's declaration properties (the screenshot only tells us that it is durable, but streams are always durable).
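For the declaration properties, one option is to list each queue's type, durability, and arguments from inside any node's container (a sketch using standard rabbitmqctl columns):

```shell
# Type, durability, and declaration arguments of every queue in the default vhost:
rabbitmqctl list_queues name type durable arguments
```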
-
This is unlikely to be related to what is happening to the streams, but nonetheless: someone reminded me that a good idea for such backup tools is to put the node into maintenance mode first, before shutting it down.
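For reference, maintenance mode is controlled with the rabbitmq-upgrade CLI (available in modern 3.8.x and later releases; a sketch):

```shell
# Drain the node: transfers stream/quorum queue leaders away and closes
# client connections, so a subsequent shutdown is less disruptive.
rabbitmq-upgrade drain

# ...stop the node, take the volume snapshot, start the node...

# Take the node out of maintenance mode.
rabbitmq-upgrade revive
```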
-
Describe the bug
We need to back up a snapshot of the volumes for RabbitMQ streams with all of their messages, in order to recover them.
RabbitMQ is deployed on an AKS cluster through the Bitnami chart (version 11.13.0) in high availability mode with 3 nodes (podAntiAffinityPreset: hard).
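For context, that deployment shape corresponds to Bitnami chart values along these lines (a sketch; the actual values file was not shared):

```yaml
# Assumed values for the Bitnami rabbitmq chart matching the description above.
replicaCount: 3
podAntiAffinityPreset: hard
```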
Kasten is deployed at version 1.0.31:
https://docs.kasten.io/latest/install/azure/azure.html
Following the RabbitMQ backup documentation:
https://www.rabbitmq.com/backup.html
we made a custom blueprint with a pre-backup hook and a post-backup hook (a sketch of this approach is shown below).
What we are trying to achieve is for the RabbitMQ application to stay up and running during the backup of the PersistentVolume, so we stop node 0 of the StatefulSet and restart it after the backup. We back up and restore JUST node 0 of the RabbitMQ application.
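The blueprint itself was not included in the report; a minimal sketch of what such a Kanister blueprint for K10 might look like, assuming KubeExec hooks that stop and start the RabbitMQ application on pod 0 (the action and phase names, template parameters, and the stop_app/start_app choice are all assumptions based on the description above):

```yaml
apiVersion: cr.kanister.io/v1alpha1
kind: Blueprint
metadata:
  name: rabbitmq-hooks
actions:
  backupPrehook:
    phases:
      - func: KubeExec
        name: stopRabbitNode0
        args:
          namespace: "{{ .StatefulSet.Namespace }}"
          # First pod of the StatefulSet, i.e. node 0 (assumed target).
          pod: "{{ index .StatefulSet.Pods 0 }}"
          command:
            - rabbitmqctl
            - stop_app
  backupPosthook:
    phases:
      - func: KubeExec
        name: startRabbitNode0
        args:
          namespace: "{{ .StatefulSet.Namespace }}"
          pod: "{{ index .StatefulSet.Pods 0 }}"
          command:
            - rabbitmqctl
            - start_app
```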
That seems to work.
We expected that restoring node 0 would mirror all the stream queues across the other nodes.

The Kasten restore completes successfully, but the restored stream queues are all in a down state,
as you can see here:
Even if we try to create a new stream, RabbitMQ freezes for a few seconds and then this error is shown:

The logs say:
I would really appreciate your support; any idea would be useful.
Let me know if you need more details.
Thank you all.
Reproduction steps
...
Expected behavior
Stream queues should be up and running after the restore of the volume.
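One way to verify this after a restore is to check the queue states from a node (a sketch; the stream name in the second command is a placeholder, and stream_status requires a RabbitMQ version that ships the rabbitmq-streams CLI, 3.11+):

```shell
# Each restored stream should report "running", not "down".
rabbitmqctl list_queues name type state

# Per-replica detail for a single stream ("my-stream" is a placeholder).
rabbitmq-streams stream_status my-stream
```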
Additional context
Kubernetes cluster