Description
We recently encountered an issue where a RabbitMQ cluster fails to resume operations after a sequence of failover events among the cluster nodes.
Environment: Kubernetes. RabbitMQ servers are deployed as pods (via a k8s StatefulSet) with persistent volume claims to persist the data on disk.
Versions: tested and found the issue to be present across versions 3.7.23 to 3.8.21.
Below is the sequence of steps to reproduce the issue:
- Install a 2-node RabbitMQ cluster (call the nodes server0 and server1). ha-promote-on-failure and ha-promote-on-shutdown are left at their defaults.
- Create an HA policy for queues whose names start with "ha", with this command from server0: rabbitmqctl set_policy ha-all "^ha" '{"ha-mode":"all"}'
- Open the management UI and create a queue named ha1 with Durable = true. Synchronization is manual. Assume server0 is the leader and server1 is the follower of the queue ha1.
- Use a pika Python client program to publish 100 records to the queue ha1, connected to server0 (i.e. the leader of the queue), with delivery_mode=2 (persistent messages) and channel.confirm_delivery() (publisher confirms enabled). A minimal publisher sketch is shown after this list.
- After the 100 records are published, abruptly kill the follower server process (not a graceful shutdown), i.e. server1 in this case, via kill -9 pid.
- The data publisher continues publishing and adds another 500 records to the queue (total records in the queue at this point: 100 published while both servers were running + 500 published while only server0 was running = 600).
- Now abruptly kill the leader server process (not a graceful shutdown), i.e. server0, via kill -9 pid.
- The publisher's connection to leader server0 fails and it stops (since both brokers in the cluster are now down).
- Now boot server1 (the previous follower) back up again.
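For reference, below is a minimal sketch of the kind of publisher used in the steps above, assuming default credentials. The host name and message payloads here are placeholders, not the exact program we ran.

```python
import pika

# Connect to server0, the current leader of queue ha1.
# The host name is a placeholder for our actual environment.
connection = pika.BlockingConnection(
    pika.ConnectionParameters(host="rabbitcheck-crmq-0"))
channel = connection.channel()

# Durable queue whose name matches the "^ha" pattern, so the ha-all policy applies.
channel.queue_declare(queue="ha1", durable=True)

# Enable publisher confirms.
channel.confirm_delivery()

# Publish 100 persistent messages (delivery_mode=2).
for i in range(100):
    channel.basic_publish(
        exchange="",
        routing_key="ha1",
        body="message-%d" % i,
        properties=pika.BasicProperties(delivery_mode=2))

connection.close()
```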
The expectation at this point is that, since ha-promote-on-failure is at its default (always), server1 will be promoted to leader of the queue, and since
server1 was active only for the initial 100 records and was unsynchronized (synchronization is manual by default and was never explicitly triggered in this sequence), it should boot up normally and serve the initial 100 records to consumers.
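(For context, the synchronisation state of the mirror before the failovers can be inspected with something like the command below; this is just an illustration, not output we captured at the time.)

```
rabbitmqctl list_queues name policy slave_pids synchronised_slave_pids
```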
But server1 fails to come up and continuously terminates, with the logs below on the server.
Running boot step worker_pool defined by app rabbit
2021-12-02 14:16:32.104 [info] <0.377.0> Will use 6 processes for default worker pool
2021-12-02 14:16:32.104 [info] <0.377.0> Starting worker pool 'worker_pool' with 6 processes in it
2021-12-02 14:16:32.105 [info] <0.274.0> Running boot step database defined by app rabbit
2021-12-02 14:16:32.115 [info] <0.274.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2021-12-02 14:17:02.115 [warning] <0.274.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,['rabbit@rabbitcheck-crmq-1','rabbit@rabbitcheck-crmq-0'],[rabbit_durable_queue]}
2021-12-02 14:17:02.116 [info] <0.274.0> Waiting for Mnesia tables for 30000 ms, 8 retries left
2021-12-02 14:17:32.116 [warning] <0.274.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,['rabbit@rabbitcheck-crmq-1','rabbit@rabbitcheck-crmq-0'],[rabbit_durable_queue]}
2021-12-02 14:17:32.117 [info] <0.274.0> Waiting for Mnesia tables for 30000 ms, 7 retries left
2021-12-02 14:18:02.117 [warning] <0.274.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,['rabbit@rabbitcheck-crmq-1','rabbit@rabbitcheck-crmq-0'],[rabbit_durable_queue]}
2021-12-02 14:18:02.118 [info] <0.274.0> Waiting for Mnesia tables for 30000 ms, 6 retries left
2021-12-02 14:18:32.118 [warning] <0.274.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,['rabbit@rabbitcheck-crmq-1','rabbit@rabbitcheck-crmq-0'],[rabbit_durable_queue]}
2021-12-02 14:18:32.119 [info] <0.274.0> Waiting for Mnesia tables for 30000 ms, 5 retries left
2021-12-02 14:19:02.119 [warning] <0.274.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,['rabbit@rabbitcheck-crmq-1','rabbit@rabbitcheck-crmq-0'],[rabbit_durable_queue]}
2021-12-02 14:19:02.120 [info] <0.274.0> Waiting for Mnesia tables for 30000 ms, 4 retries left
2021-12-02 14:19:23.728 [info] <0.60.0> SIGTERM received - shutting down
2021-12-02 14:19:23.729 [warning] <0.274.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,['rabbit@rabbitcheck-crmq-1','rabbit@rabbitcheck-crmq-0'],[rabbit_durable_queue]}
2021-12-02 14:19:23.729 [info] <0.274.0> Waiting for Mnesia tables for 30000 ms, 3 retries left
2021-12-02 14:19:23.729 [warning] <0.274.0> Error while waiting for Mnesia tables: {failed_waiting_for_tables,['rabbit@rabbitcheck-crmq-1','rabbit@rabbitcheck-crmq-0'],{node_not_running,'rabbit@rabbitcheck-crmq-0'}}
2021-12-02 14:19:23.729 [info] <0.274.0> Waiting for Mnesia tables for 30000 ms, 2 retries left
2021-12-02 14:19:23.730 [warning] <0.274.0> Error while waiting for Mnesia tables: {failed_waiting_for_tables,['rabbit@rabbitcheck-crmq-1','rabbit@rabbitcheck-crmq-0'],{node_not_running,'rabbit@rabbitcheck-crmq-0'}}
2021-12-02 14:19:23.730 [info] <0.274.0> Waiting for Mnesia tables for 30000 ms, 1 retries left
2021-12-02 14:19:23.730 [warning] <0.274.0> Error while waiting for Mnesia tables: {failed_waiting_for_tables,['rabbit@rabbitcheck-crmq-1','rabbit@rabbitcheck-crmq-0'],{node_not_running,'rabbit@rabbitcheck-crmq-0'}}
2021-12-02 14:19:23.731 [info] <0.274.0> Waiting for Mnesia tables for 30000 ms, 0 retries left
2021-12-02 14:19:23.734 [error] <0.274.0> Feature flag quorum_queue: migration function crashed: {error,{failed_waiting_for_tables,['rabbit@rabbitcheck-crmq-0','rabbit@rabbitcheck-crmq-1'],{node_not_running,'rabbit@rabbitcheck-crmq-0'}}}
[{rabbit_table,wait,3,[{file,"src/rabbit_table.erl"},{line,121}]},{rabbit_core_ff,quorum_queue_migration,3,[{file,"src/rabbit_core_ff.erl"},{line,68}]},{rabbit_feature_flags,run_migration_fun,3,[{file,"src/rabbit_feature_flags.erl"},{line,1602}]},{rabbit_feature_flags,'-verify_which_feature_flags_are_actually_enabled/0-fun-0-',3,[{file,"src/rabbit_feature_flags.erl"},{line,2269}]},{maps,fold_1,3,[{file,"maps.erl"},{line,410}]},{rabbit_feature_flags,verify_which_feature_flags_are_actually_enabled,0,[{file,"src/rabbit_feature_flags.erl"},{line,2267}]},{rabbit_feature_flags,sync_feature_flags_with_cluster,3,[{file,"src/rabbit_feature_flags.erl"},{line,2082}]},{rabbit_mnesia,ensure_feature_flags_are_in_sync,2,[{file,"src/rabbit_mnesia.erl"},{line,644}]}]
2021-12-02 14:19:23.735 [error] <0.274.0> Feature flag user_limits: migration function crashed: {aborted,{no_exists,rabbit_user,attributes}}
[{mnesia,abort,1,[{file,"mnesia.erl"},{line,361}]},{rabbit_core_ff,user_limits_migration,3,[{file,"src/rabbit_core_ff.erl"},{line,163}]},{rabbit_feature_flags,run_migration_fun,3,[{file,"src/rabbit_feature_flags.erl"},{line,1602}]},{rabbit_feature_flags,'-verify_which_feature_flags_are_actually_enabled/0-fun-0-',3,[{file,"src/rabbit_feature_flags.erl"},{line,2269}]},{maps,fold_1,3,[{file,"maps.erl"},{line,410}]},{rabbit_feature_flags,verify_which_feature_flags_are_actually_enabled,0,[{file,"src/rabbit_feature_flags.erl"},{line,2267}]},{rabbit_feature_flags,sync_feature_flags_with_cluster,3,[{file,"src/rabbit_feature_flags.erl"},{line,2082}]},{rabbit_mnesia,ensure_feature_flags_are_in_sync,2,[{file,"src/rabbit_mnesia.erl"},{line,644}]}]
2021-12-02 14:19:23.736 [error] <0.274.0> Feature flag virtual_host_metadata: migration function crashed: {aborted,{no_exists,rabbit_vhost,attributes}}
[{mnesia,abort,1,[{file,"mnesia.erl"},{line,361}]},{rabbit_core_ff,virtual_host_metadata_migration,3,[{file,"src/rabbit_core_ff.erl"},{line,131}]},{rabbit_feature_flags,run_migration_fun,3,[{file,"src/rabbit_feature_flags.erl"},{line,1602}]},{rabbit_feature_flags,'-verify_which_feature_flags_are_actually_enabled/0-fun-0-',3,[{file,"src/rabbit_feature_flags.erl"},{line,2269}]},{maps,fold_1,3,[{file,"maps.erl"},{line,410}]},{rabbit_feature_flags,verify_which_feature_flags_are_actually_enabled,0,[{file,"src/rabbit_feature_flags.erl"},{line,2267}]},{rabbit_feature_flags,sync_feature_flags_with_cluster,3,[{file,"src/rabbit_feature_flags.erl"},{line,2082}]},{rabbit_mnesia,ensure_feature_flags_are_in_sync,2,[{file,"src/rabbit_mnesia.erl"},{line,644}]}]
/usr/lib/rabbitmq/bin/rabbitmq-server: line 89: 30 Killed start_rabbitmq_server "$@"
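(The 30000 ms wait and the retry countdown in the log above appear to correspond to the node's Mnesia table loading retry settings; we did not override them, so they should be at their defaults, roughly as below.)

```
# rabbitmq.conf - table loading retry settings (defaults, shown for reference only)
mnesia_table_loading_retry_timeout = 30000
mnesia_table_loading_retry_limit = 10
```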
And when we look at the on-disk snapshot of the queue directory, the idx files are missing from the disk; only metadata files are present at this point.
-rw-rw-r--. 1 rabbitmq 10000 20 Dec 2 17:31 .queue_name
drwxrwsr-x. 3 rabbitmq 10000 4096 Dec 2 17:31 ..
drwxrwsr-x. 2 rabbitmq 10000 4096 Dec 2 17:34 .
-rw-rw-r--. 1 rabbitmq 10000 15275 Dec 2 17:34 journal.jif
The RabbitMQ cluster remains stuck at this stage and never manages to boot successfully from this point on.
We are able to reproduce this scenario every time with the above-mentioned steps.