WAL checksum validation failed #4252
-
Hi, I posted this issue on the rabbitmq-users Google group but it didn't draw anyone's attention, so I am posting it here as well: https://groups.google.com/g/rabbitmq-users/c/MzHE1Vgpq9U We recently had a RabbitMQ crash with the error wal_checksum_validation_failure in one of our internal customers' environments; the pod log is attached for reference. To resolve the issue, I had to delete the PVC and recreate the RabbitMQ pods, which resulted in data loss. Can you please let us know what scenarios could lead to this, the steps we can follow to mitigate it without losing data, and possible ways to prevent it? Thanks. RabbitMQ Operator Version: 1.5.0
-
I will convert this issue to a GitHub discussion. Currently, GitHub will automatically close and lock the issue even though your question will be transferred and responded to elsewhere. This is to let you know that we do not intend to ignore this; it is just how the current GitHub conversion mechanism makes it appear to users :(
-
All we can tell from the message is that a Raft WAL segment file failed checksum validation. Deleting that specific segment would have helped. The only relevant discussion I could find was #2817; the conclusion there was that it was caused by a disk failure.
-
Hm, the only thing #2817 concluded was disk failure, but nothing of that sort happened in this environment, and so far I have seen this issue in two customer environments. Are there any specific logs I could collect that would help debug the issue?
-
Version details:
`rabbitmq@rabbitmq-server-0:/$ rabbitmqctl status`
`rabbitmq@rabbitmq-server-0:/$ rabbitmq-diagnostics erlang_version`
We use Longhorn block storage as our storage abstraction on Kubernetes, so Longhorn provisions the PVCs that back RabbitMQ, and this customer has multipath enabled on the underlying disks, if that makes any difference.
-
I will check with the customer and try to get the log files, in case they haven't already deleted the PVC to get RabbitMQ back up again.
-
I ran into the same problem. If you use quorum queues, a node poweroff may lead to this, because the tail of the WAL file may be only partially written; for details, see rabbitmq/ra#284. I wrote a Python script to truncate the corrupted entries in those WAL files; you can get the script here: https://github.com/wysobj/rabbitmq-tinker/tree/main/ra_wal. After the corrupted entries are truncated, your RabbitMQ node can start normally, and you also won't lose your data (at least most of it!).
-
Please upgrade.
> On Wed, 18 Oct 2023 at 13:12, badihi wrote:
> 3.8.16
RabbitMQ 3.8 reached end of life some 15 months ago. See the Upgrading guide and the release notes of each series. Going from 3.8 to 3.12 is likely to be easiest with a Blue-Green deployment, instead of a 3.8 => 3.9 => 3.10 => 3.11 => 3.12 sequence of upgrades in which you enable all feature flags at every step of the way.