Database was corrupted because of the federation bindings, unable to recover vhost data #6660

baoanh194 · 2022-12-13T11:42:59Z


Hi everyone,

We've found a crash that leaves RabbitMQ unable to recover or restart. To work around it, we had to delete and recreate the vhost.

RabbitMQ 3.9.19 on Erlang 23.3.4.14
OS: Windows

Error log

2022-10-10 18:24:53.764000+02:00 [error] <0.644.0> Unable to recover vhost <<"/">> data. Reason {badmatch,{error,not_found}}
2022-10-10 18:24:53.764000+02:00 [error] <0.644.0>  Stacktrace [{rabbit_binding,recover_semi_durable_route,3,
2022-10-10 18:24:53.764000+02:00 [error] <0.644.0>                              [{file,"rabbit_binding.erl"},{line,112}]},
2022-10-10 18:24:53.764000+02:00 [error] <0.644.0>              {rabbit_binding,'-recover/2-lc$^1/1-0-',3,
2022-10-10 18:24:53.764000+02:00 [error] <0.644.0>                              [{file,"rabbit_binding.erl"},{line,102}]},
2022-10-10 18:24:53.764000+02:00 [error] <0.644.0>              {rabbit_binding,recover,2,
2022-10-10 18:24:53.764000+02:00 [error] <0.644.0>                              [{file,"rabbit_binding.erl"},{line,103}]},
2022-10-10 18:24:53.764000+02:00 [error] <0.644.0>              {timer,tc,1,[{file,"timer.erl"},{line,166}]},
2022-10-10 18:24:53.764000+02:00 [error] <0.644.0>              {rabbit_vhost,recover,1,[{file,"rabbit_vhost.erl"},{line,63}]},
2022-10-10 18:24:53.764000+02:00 [error] <0.644.0>              {rabbit_vhost_process,init,1,
2022-10-10 18:24:53.764000+02:00 [error] <0.644.0>                                    [{file,"rabbit_vhost_process.erl"},
2022-10-10 18:24:53.764000+02:00 [error] <0.644.0>                                     {line,43}]},
2022-10-10 18:24:53.764000+02:00 [error] <0.644.0>              {gen_server2,init_it,6,[{file,"gen_server2.erl"},{line,565}]},
2022-10-10 18:24:53.764000+02:00 [error] <0.644.0>              {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]
2022-10-10 18:24:53.765000+02:00 [error] <0.642.0>     supervisor: {<0.642.0>,rabbit_vhost_sup_wrapper}
2022-10-10 18:24:53.765000+02:00 [error] <0.642.0>     errorContext: start_error
2022-10-10 18:24:53.765000+02:00 [error] <0.642.0>     reason: {badmatch,{error,not_found}}
2022-10-10 18:24:53.765000+02:00 [error] <0.642.0>     offender: [{pid,undefined},
2022-10-10 18:24:53.765000+02:00 [error] <0.642.0>                {id,rabbit_vhost_process},
2022-10-10 18:24:53.765000+02:00 [error] <0.642.0>                {mfargs,{rabbit_vhost_process,start_link,[<<"/">>]}},
2022-10-10 18:24:53.765000+02:00 [error] <0.642.0>                {restart_type,permanent},
2022-10-10 18:24:53.765000+02:00 [error] <0.642.0>                {shutdown,300000},
2022-10-10 18:24:53.765000+02:00 [error] <0.642.0>                {child_type,worker}]
2022-10-10 18:24:53.765000+02:00 [error] <0.642.0>
2022-10-10 18:24:53.765000+02:00 [error] <0.644.0>   crasher:
2022-10-10 18:24:53.765000+02:00 [error] <0.644.0>     initial call: rabbit_vhost_process:init/1
2022-10-10 18:24:53.765000+02:00 [error] <0.644.0>     pid: <0.644.0>
2022-10-10 18:24:53.765000+02:00 [error] <0.644.0>     registered_name: []
2022-10-10 18:24:53.765000+02:00 [error] <0.644.0>     exception exit: {badmatch,{error,not_found}}
2022-10-10 18:24:53.765000+02:00 [error] <0.644.0>       in function  gen_server2:init_it/6 (gen_server2.erl, line 600)
2022-10-10 18:24:53.765000+02:00 [error] <0.644.0>     ancestors: [<0.642.0>,rabbit_vhost_sup_sup,rabbit_sup,<0.226.0>]
2022-10-10 18:24:53.765000+02:00 [error] <0.644.0>     message_queue_len: 0
2022-10-10 18:24:53.765000+02:00 [error] <0.644.0>     messages: []
2022-10-10 18:24:53.765000+02:00 [error] <0.644.0>     links: [<0.642.0>,<0.677.0>]
2022-10-10 18:24:53.765000+02:00 [error] <0.644.0>     dictionary: [{{xtype_to_module,topic},rabbit_exchange_type_topic}]
2022-10-10 18:24:53.765000+02:00 [error] <0.644.0>     trap_exit: true
2022-10-10 18:24:53.765000+02:00 [error] <0.644.0>     status: running
2022-10-10 18:24:53.765000+02:00 [error] <0.644.0>     heap_size: 318187
2022-10-10 18:24:53.765000+02:00 [error] <0.644.0>     stack_size: 28
2022-10-10 18:24:53.765000+02:00 [error] <0.644.0>     reductions: 43491
2022-10-10 18:24:53.765000+02:00 [error] <0.644.0>   neighbours:
2022-10-10 18:24:53.765000+02:00 [error] <0.644.0>     neighbour:
2022-10-10 18:24:53.765000+02:00 [error] <0.644.0>       pid: <0.677.0>
2022-10-10 18:24:53.765000+02:00 [error] <0.644.0>       registered_name: []
2022-10-10 18:24:53.765000+02:00 [error] <0.644.0>       initial call: gatherer:init/1
2022-10-10 18:24:53.765000+02:00 [error] <0.644.0>       current_function: {gen_server2,process_next_msg,1}
2022-10-10 18:24:53.765000+02:00 [error] <0.644.0>       ancestors: [<0.644.0>,<0.642.0>,rabbit_vhost_sup_sup,rabbit_sup,
2022-10-10 18:24:53.765000+02:00 [error] <0.644.0>                   <0.226.0>]
2022-10-10 18:24:53.765000+02:00 [error] <0.644.0>       message_queue_len: 0
2022-10-10 18:24:53.765000+02:00 [error] <0.644.0>       links: [<0.644.0>]
2022-10-10 18:24:53.765000+02:00 [error] <0.644.0>       trap_exit: false
2022-10-10 18:24:53.765000+02:00 [error] <0.644.0>       status: waiting
2022-10-10 18:24:53.765000+02:00 [error] <0.644.0>       heap_size: 233
2022-10-10 18:24:53.765000+02:00 [error] <0.644.0>       stack_size: 10
2022-10-10 18:24:53.765000+02:00 [error] <0.644.0>       reductions: 145
2022-10-10 18:24:53.765000+02:00 [error] <0.644.0>       current_stacktrace: [{gen_server2,process_next_msg,1,
2022-10-10 18:24:53.765000+02:00 [error] <0.644.0>                                [{file,"gen_server2.erl"},{line,673}]},
2022-10-10 18:24:53.765000+02:00 [error] <0.644.0>                   {proc_lib,init_p_do_apply,3,
2022-10-10 18:24:53.765000+02:00 [error] <0.644.0>                             [{file,"proc_lib.erl"},{line,226}]}]

We tried to reproduce it by restarting RabbitMQ and killing processes, but without success.

We discovered that all the "missing" exchanges were created by exchange federation. All exchanges are marked as durable, internal, and auto-delete, so there are no non-durable exchanges.

We loaded the Mnesia database from the crashed node to inspect it more closely. We found that some bindings were still in the rabbit_durable_route table, but their exchanges were not in the rabbit_durable_exchange table. Somehow, the two tables had gotten out of sync.
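The check described above can be sketched as a toy model. This is not RabbitMQ or Mnesia code: the two tables are plain Python collections, and the `Binding` type, the `find_orphaned_bindings` helper, and the exchange name `fed-internal-x` are all hypothetical names made up for illustration. It also assumes that only the binding's source exchange matters for the lookup.

```python
from typing import NamedTuple


class Binding(NamedTuple):
    source: str        # exchange the binding routes from
    destination: str   # queue or exchange the binding routes to
    routing_key: str


def find_orphaned_bindings(durable_routes, durable_exchanges):
    """Return bindings whose source exchange has no row in the exchange table."""
    known = set(durable_exchanges)
    return [b for b in durable_routes if b.source not in known]


# Hypothetical table contents; "fed-internal-x" stands in for an exchange
# whose row is missing while a binding that references it is still present.
durable_exchanges = ["amq.topic", "events"]
durable_routes = [
    Binding("events", "orders", "order.*"),
    Binding("fed-internal-x", "events", "#"),
]

orphans = find_orphaned_bindings(durable_routes, durable_exchanges)
for b in orphans:
    print(f"orphaned binding: {b.source} -> {b.destination} ({b.routing_key})")
```

Any binding reported by such a check is one whose exchange lookup would fail during recovery.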

Based on the crash report in the log, the failure happens when the recovery code tries to look up the exchange in the rabbit_exchange table (line 112): https://github.com/rabbitmq/rabbitmq-server/blob/main/deps/rabbit/src/rabbit_binding.erl#L112

This code is the same on the master branch, and a similar issue was seen on the khepri branch (#5100), so we assume the problem affects both versions.

Probably the recovery procedure should skip bindings where the exchange no longer exists. What do you think?
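To make the proposal concrete, here is a minimal Python sketch of a recovery loop that logs and skips a binding when its exchange cannot be found, instead of failing on the lookup result. It is only a model of the idea, not the actual rabbit_binding code: `lookup_exchange`, `recover_bindings`, and the sample data are hypothetical, and the tuple return value merely imitates an `{ok, X}` / `{error, not_found}` style result.

```python
import logging

log = logging.getLogger("recovery_sketch")


def lookup_exchange(exchanges, name):
    """Imitate an ok/error lookup result with a tuple."""
    if name in exchanges:
        return ("ok", exchanges[name])
    return ("error", "not_found")


def recover_bindings(bindings, exchanges):
    """Recover bindings, skipping those whose source exchange no longer exists."""
    recovered, skipped = [], []
    for source, destination in bindings:
        status, _exchange = lookup_exchange(exchanges, source)
        if status != "ok":
            # Matching strictly on "ok" here would abort the whole recovery;
            # the proposal is to log the stale binding and move on.
            log.warning("skipping binding %s -> %s: exchange not found",
                        source, destination)
            skipped.append((source, destination))
            continue
        recovered.append((source, destination))
    return recovered, skipped


# Hypothetical data: the second binding refers to an exchange that is gone.
exchanges = {"events": {"durable": True}}
bindings = [("events", "orders"), ("fed-internal-x", "events")]
recovered, skipped = recover_bindings(bindings, exchanges)
```

Returning the skipped bindings alongside the recovered ones keeps the stale entries visible, so they could be reported or cleaned up later rather than silently ignored.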

Best Regards,
Bao

Answered by michaelklishin

michaelklishin · 2022-12-13T11:53:00Z

Maintainer

Federation has nothing to do with it. It's just that federated exchanges create a higher binding churn, and at some point, a node was shut down before it had a chance to update all routing tables (that's an educated guess).

The exception mentions a "semi-durable route". They are bindings between a durable exchange and a transient queue (or vice versa). Using all durable entities should reduce the likelihood of this rare issue to more or less zero.

Skipping binding recovery for exchanges that no longer exist makes sense to me.

As for Khepri, it no longer uses the same table schema for bindings, so it is quite likely that this specific code path won't exist there at all.
