WAL checksum validation failed #4252
-
Hi, I posted this issue on the rabbitmq-users Google group but it didn't draw anyone's attention, so I am posting it here as well: https://groups.google.com/g/rabbitmq-users/c/MzHE1Vgpq9U We recently had a RabbitMQ crash with the error wal_checksum_validation_failure in one of our internal customers' environments; the pod log is attached for reference. To resolve the issue, I had to delete the PVC and recreate the RabbitMQ pods, which resulted in data loss. Can you please let us know what scenarios could lead to this, the steps we can follow to mitigate it without losing data, and possible ways to prevent it? Thanks. RabbitMQ Operator Version: 1.5.0
-
I will convert this issue to a GitHub discussion. Currently, GitHub will automatically close and lock the issue even though your question will be transferred and responded to elsewhere. This is to let you know that we do not intend to ignore this; it is just how the current GitHub conversion mechanism makes it appear to users :(
-
All we can tell from the message is that a Raft WAL segment file failed checksum validation. Deleting that specific segment would have helped. The only relevant discussion I could find was #2817; the conclusion there was that it was caused by a disk failure.
-
Hm, the only thing #2817 concluded was disk failure, but nothing of that sort happened in this environment, and so far I have seen this issue in two customer environments. Are there any specific logs I could collect that would help debug the issue?
-
Version details:
`rabbitmq@rabbitmq-server-0:/$ rabbitmqctl status`
`rabbitmq@rabbitmq-server-0:/$ rabbitmq-diagnostics erlang_version`
We use Longhorn block storage as our storage abstraction on Kubernetes, so Longhorn provisions the PVCs that back RabbitMQ, and this customer has multipath enabled on the underlying disks, if that makes any difference.
-
I will check with the customer and try to get the log files, in case they haven't already deleted the PVC to get RabbitMQ back up again.
-
I ran into the same problem. If you use quorum queues, a node poweroff may lead to this, because the tail of the WAL file may be only partially written; for details, see rabbitmq/ra#284. I wrote a Python script to truncate the corrupted entries in those WAL files; you can get the script here: https://github.com/wysobj/rabbitmq-tinker/tree/main/ra_wal. After the corrupted entries are truncated, your RabbitMQ node can start normally, and you also won't lose your data (at least most of it!).
-
Please upgrade.
> On Wed, 18 Oct 2023 at 13:12, badihi wrote:
> 3.8.16
RabbitMQ 3.8 reached end of life some 15 months ago. See the Upgrading guide and the release notes of each series. Going from 3.8 to 3.12 is likely to be easiest with a Blue-Green deployment, instead of a 3.8 => 3.9 => 3.10 => 3.11 => 3.12 sequence of upgrades in which you enable all feature flags at every step of the way.