Database was corrupted because of the federation bindings, unable to recover vhost data #6660
Hi everyone,
We've found a crash that makes RabbitMQ unable to recover/restart. To solve the issue, we had to delete and create the vhost again.
RabbitMQ 3.9.19 on Erlang 23.3.4.14
Error log
We tried reproducing it by restarting RabbitMQ, killing processes, but unfortunately, we were unsuccessful in reproducing the issue.
We discovered that all the "missing" exchanges are created by the exchange federation. All exchanges are marked as durable, internal, and auto-delete. So there is no non-durable exchange.
We loaded the Mnesia database from the crashed node to have a deeper check inside the database. We found that some binding was still in
Based on the crash report in the log, the issue happened when we are trying to look up the exchange in the
This code is the same on
Probably the recovery procedure should skip bindings where the exchange no longer exists. What do you think?
Best Regards,
Replies: 1 comment
Federation has nothing to do with it. It's just that federated exchanges create a higher binding churn, and at some point, a node was shut down before it had a chance to update all routing tables (that's an educated guess).
The exception mentions a "semi-durable route". They are bindings between a durable exchange and a transient queue (or vice versa). Using all durable entities should reduce the likelihood of this rare issue to more or less zero.
Skipping binding recovery for exchanges that no longer exist makes sense to me.
As for Khepri, it no longer uses the same table schema for bindings, so it is quite likely that this specific code path won't exist there at all.
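The idea of skipping bindings whose exchange no longer exists can be illustrated with a small, language-neutral sketch. This is not RabbitMQ's recovery code (which is written in Erlang and works on Mnesia tables); the `Binding` type, the `recover_bindings` function, and the exchange and queue names below are all hypothetical and only model the filtering step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Binding:
    """Hypothetical stand-in for a stored binding record."""
    source: str        # name of the source exchange
    destination: str   # name of the destination queue or exchange
    routing_key: str


def recover_bindings(bindings, existing_exchanges):
    """Split stored bindings into those that can be recovered and those skipped.

    A binding is skipped (instead of raising an error) when its source
    exchange is not present in the set of exchanges that still exist.
    """
    recovered, skipped = [], []
    for binding in bindings:
        if binding.source in existing_exchanges:
            recovered.append(binding)
        else:
            skipped.append(binding)
    return recovered, skipped


# Example data: one binding points at an exchange that no longer exists.
existing_exchanges = {"orders"}
stored_bindings = [
    Binding("orders", "q.orders", "new"),
    Binding("missing_exchange", "q.orders", "new"),
]

recovered, skipped = recover_bindings(stored_bindings, existing_exchanges)
```

In this sketch the skipped bindings are returned rather than silently dropped, so a caller could log them or clean them up; whether a real implementation should do that is a design choice not settled in this thread.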