Description
We recently encountered an issue where a RabbitMQ cluster fails to resume operations after a sequence of failover events among the cluster nodes.
Environment: Kubernetes. RabbitMQ servers are deployed as pods (via a k8s StatefulSet) with persistent volume claims to persist the data on disk.
Versions: tested and found the issue to be present across versions 3.7.23 to 3.8.21.
Below is the sequence of steps to reproduce the issue:
- Install a 2-node RabbitMQ cluster (call the nodes server0 and server1). ha-promote-on-failure and ha-promote-on-shutdown are left at their defaults.
- Create an HA policy for queues whose names start with "ha", with this command from server0: rabbitmqctl set_policy ha-all "^ha" '{"ha-mode":"all"}'
- Open the management UI and create a queue named ha1 with Durable = true. Synchronization is manual. Assume server0 is the leader and server1 is the follower of the queue ha1.
- Use a pika Python client program to publish 100 records to the queue ha1, connected to server0 (i.e. the leader of the queue), with delivery_mode=2 (persistent messages) and channel.confirm_delivery() (publisher confirms enabled). A minimal publisher sketch is shown after this list.
- After the 100 records are published, abruptly kill the follower server process (not a graceful shutdown), i.e. server1 in this case, via kill -9 pid.
- The data publisher continues publishing and adds another 500 records to the queue (total records in the queue at this point: 100 published while both servers were running + 500 published while only server0 was running = 600).
- Now abruptly kill the leader server process (not a graceful shutdown), i.e. server0, via kill -9 pid.
- The publisher's connection to leader server0 fails and it stops (since both brokers in the cluster are now down).
- Now boot server1 (the previous follower) back up again.
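For reference, below is a minimal sketch of the kind of publisher used in the steps above, assuming default credentials. The host name and message payloads here are placeholders, not the exact program we ran.

```python
import pika

# Connect to server0, the current leader of queue ha1.
# The host name is a placeholder for our actual environment.
connection = pika.BlockingConnection(
    pika.ConnectionParameters(host="rabbitcheck-crmq-0"))
channel = connection.channel()

# Durable queue whose name matches the "^ha" pattern, so the ha-all policy applies.
channel.queue_declare(queue="ha1", durable=True)

# Enable publisher confirms.
channel.confirm_delivery()

# Publish 100 persistent messages (delivery_mode=2).
for i in range(100):
    channel.basic_publish(
        exchange="",
        routing_key="ha1",
        body="message-%d" % i,
        properties=pika.BasicProperties(delivery_mode=2))

connection.close()
```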
The expectation at this point is that, since ha-promote-on-failure is at its default (always), server1 will be promoted to leader of the queue, and since
server1 was active only for the initial 100 records and was unsynchronized (synchronization is manual by default and was never explicitly triggered in this sequence), it should boot up normally and serve the initial 100 records to consumers.
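(For context, the synchronisation state of the mirror before the failovers can be inspected with something like the command below; this is just an illustration, not output we captured at the time.)

```
rabbitmqctl list_queues name policy slave_pids synchronised_slave_pids
```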
But server1 fails to come up and continuously terminates, with the logs below on the server.
Running boot step worker_pool defined by app rabbit
2021-12-02 14:16:32.104 [info] <0.377.0> Will use 6 processes for default worker pool
2021-12-02 14:16:32.104 [info] <0.377.0> Starting worker pool 'worker_pool' with 6 processes in it
2021-12-02 14:16:32.105 [info] <0.274.0> Running boot step database defined by app rabbit
2021-12-02 14:16:32.115 [info] <0.274.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2021-12-02 14:17:02.115 [warning] <0.274.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,['rabbit@rabbitcheck-crmq-1','rabbit@rabbitcheck-crmq-0'],[rabbit_durable_queue]}
2021-12-02 14:17:02.116 [info] <0.274.0> Waiting for Mnesia tables for 30000 ms, 8 retries left
2021-12-02 14:17:32.116 [warning] <0.274.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,['rabbit@rabbitcheck-crmq-1','rabbit@rabbitcheck-crmq-0'],[rabbit_durable_queue]}
2021-12-02 14:17:32.117 [info] <0.274.0> Waiting for Mnesia tables for 30000 ms, 7 retries left
2021-12-02 14:18:02.117 [warning] <0.274.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,['rabbit@rabbitcheck-crmq-1','rabbit@rabbitcheck-crmq-0'],[rabbit_durable_queue]}
2021-12-02 14:18:02.118 [info] <0.274.0> Waiting for Mnesia tables for 30000 ms, 6 retries left
2021-12-02 14:18:32.118 [warning] <0.274.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,['rabbit@rabbitcheck-crmq-1','rabbit@rabbitcheck-crmq-0'],[rabbit_durable_queue]}
2021-12-02 14:18:32.119 [info] <0.274.0> Waiting for Mnesia tables for 30000 ms, 5 retries left
2021-12-02 14:19:02.119 [warning] <0.274.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,['rabbit@rabbitcheck-crmq-1','rabbit@rabbitcheck-crmq-0'],[rabbit_durable_queue]}
2021-12-02 14:19:02.120 [info] <0.274.0> Waiting for Mnesia tables for 30000 ms, 4 retries left
2021-12-02 14:19:23.728 [info] <0.60.0> SIGTERM received - shutting down
2021-12-02 14:19:23.729 [warning] <0.274.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,['rabbit@rabbitcheck-crmq-1','rabbit@rabbitcheck-crmq-0'],[rabbit_durable_queue]}
2021-12-02 14:19:23.729 [info] <0.274.0> Waiting for Mnesia tables for 30000 ms, 3 retries left
2021-12-02 14:19:23.729 [warning] <0.274.0> Error while waiting for Mnesia tables: {failed_waiting_for_tables,['rabbit@rabbitcheck-crmq-1','rabbit@rabbitcheck-crmq-0'],{node_not_running,'rabbit@rabbitcheck-crmq-0'}}
2021-12-02 14:19:23.729 [info] <0.274.0> Waiting for Mnesia tables for 30000 ms, 2 retries left
2021-12-02 14:19:23.730 [warning] <0.274.0> Error while waiting for Mnesia tables: {failed_waiting_for_tables,['rabbit@rabbitcheck-crmq-1','rabbit@rabbitcheck-crmq-0'],{node_not_running,'rabbit@rabbitcheck-crmq-0'}}
2021-12-02 14:19:23.730 [info] <0.274.0> Waiting for Mnesia tables for 30000 ms, 1 retries left
2021-12-02 14:19:23.730 [warning] <0.274.0> Error while waiting for Mnesia tables: {failed_waiting_for_tables,['rabbit@rabbitcheck-crmq-1','rabbit@rabbitcheck-crmq-0'],{node_not_running,'rabbit@rabbitcheck-crmq-0'}}
2021-12-02 14:19:23.731 [info] <0.274.0> Waiting for Mnesia tables for 30000 ms, 0 retries left
2021-12-02 14:19:23.734 [error] <0.274.0> Feature flag quorum_queue: migration function crashed: {error,{failed_waiting_for_tables,['rabbit@rabbitcheck-crmq-0','rabbit@rabbitcheck-crmq-1'],{node_not_running,'rabbit@rabbitcheck-crmq-0'}}}
[{rabbit_table,wait,3,[{file,"src/rabbit_table.erl"},{line,121}]},{rabbit_core_ff,quorum_queue_migration,3,[{file,"src/rabbit_core_ff.erl"},{line,68}]},{rabbit_feature_flags,run_migration_fun,3,[{file,"src/rabbit_feature_flags.erl"},{line,1602}]},{rabbit_feature_flags,'-verify_which_feature_flags_are_actually_enabled/0-fun-0-',3,[{file,"src/rabbit_feature_flags.erl"},{line,2269}]},{maps,fold_1,3,[{file,"maps.erl"},{line,410}]},{rabbit_feature_flags,verify_which_feature_flags_are_actually_enabled,0,[{file,"src/rabbit_feature_flags.erl"},{line,2267}]},{rabbit_feature_flags,sync_feature_flags_with_cluster,3,[{file,"src/rabbit_feature_flags.erl"},{line,2082}]},{rabbit_mnesia,ensure_feature_flags_are_in_sync,2,[{file,"src/rabbit_mnesia.erl"},{line,644}]}]
2021-12-02 14:19:23.735 [error] <0.274.0> Feature flag user_limits: migration function crashed: {aborted,{no_exists,rabbit_user,attributes}}
[{mnesia,abort,1,[{file,"mnesia.erl"},{line,361}]},{rabbit_core_ff,user_limits_migration,3,[{file,"src/rabbit_core_ff.erl"},{line,163}]},{rabbit_feature_flags,run_migration_fun,3,[{file,"src/rabbit_feature_flags.erl"},{line,1602}]},{rabbit_feature_flags,'-verify_which_feature_flags_are_actually_enabled/0-fun-0-',3,[{file,"src/rabbit_feature_flags.erl"},{line,2269}]},{maps,fold_1,3,[{file,"maps.erl"},{line,410}]},{rabbit_feature_flags,verify_which_feature_flags_are_actually_enabled,0,[{file,"src/rabbit_feature_flags.erl"},{line,2267}]},{rabbit_feature_flags,sync_feature_flags_with_cluster,3,[{file,"src/rabbit_feature_flags.erl"},{line,2082}]},{rabbit_mnesia,ensure_feature_flags_are_in_sync,2,[{file,"src/rabbit_mnesia.erl"},{line,644}]}]
2021-12-02 14:19:23.736 [error] <0.274.0> Feature flag virtual_host_metadata: migration function crashed: {aborted,{no_exists,rabbit_vhost,attributes}}
[{mnesia,abort,1,[{file,"mnesia.erl"},{line,361}]},{rabbit_core_ff,virtual_host_metadata_migration,3,[{file,"src/rabbit_core_ff.erl"},{line,131}]},{rabbit_feature_flags,run_migration_fun,3,[{file,"src/rabbit_feature_flags.erl"},{line,1602}]},{rabbit_feature_flags,'-verify_which_feature_flags_are_actually_enabled/0-fun-0-',3,[{file,"src/rabbit_feature_flags.erl"},{line,2269}]},{maps,fold_1,3,[{file,"maps.erl"},{line,410}]},{rabbit_feature_flags,verify_which_feature_flags_are_actually_enabled,0,[{file,"src/rabbit_feature_flags.erl"},{line,2267}]},{rabbit_feature_flags,sync_feature_flags_with_cluster,3,[{file,"src/rabbit_feature_flags.erl"},{line,2082}]},{rabbit_mnesia,ensure_feature_flags_are_in_sync,2,[{file,"src/rabbit_mnesia.erl"},{line,644}]}]
/usr/lib/rabbitmq/bin/rabbitmq-server: line 89: 30 Killed start_rabbitmq_server "$@"
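(The 30000 ms wait and the retry countdown in the log above appear to correspond to the node's Mnesia table loading retry settings; we did not override them, so they should be at their defaults, roughly as below.)

```
# rabbitmq.conf - table loading retry settings (defaults, shown for reference only)
mnesia_table_loading_retry_timeout = 30000
mnesia_table_loading_retry_limit = 10
```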
And when we look at the on-disk snapshot of the queue directory, the idx files are missing from the disk; only metadata files are present at this point.
-rw-rw-r--. 1 rabbitmq 10000 20 Dec 2 17:31 .queue_name
drwxrwsr-x. 3 rabbitmq 10000 4096 Dec 2 17:31 ..
drwxrwsr-x. 2 rabbitmq 10000 4096 Dec 2 17:34 .
-rw-rw-r--. 1 rabbitmq 10000 15275 Dec 2 17:34 journal.jif
The RabbitMQ cluster remains stuck at this stage and never manages to boot successfully from this point on.
We are able to reproduce this scenario every time with the above-mentioned steps.