-
We did fix the situation by keeping the pool but sending a fresh copy of the dataset. The corruption has not resurfaced on the new dataset. There are several closed issues about send/receive corruption with 1M-recordsize datasets, so hopefully we hit one of those and the bug is already fixed. Unfortunately, I cannot know for sure, since the replication history of the original dataset is not fully documented. As this discussion has not received any response, I am now going to destroy the faulty dataset, so further investigation will not be possible.
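The fix described above (keep the pool, snapshot the healthy source, and re-seed a brand-new target dataset) can be sketched roughly as follows. All dataset and snapshot names are hypothetical, and the script only prints the `zfs` commands rather than executing them, since real `zfs` commands require an actual pool:

```shell
# Dry-run sketch: re-seed a replication target with a fresh full copy.
# Names are hypothetical; adjust to your own pool layout.
SRC="tank/data"            # healthy replication source (hypothetical)
DST="backup/data-fresh"    # brand-new target dataset (hypothetical)
SNAP="reseed"

# Compose the commands instead of running them.
SNAP_CMD="zfs snapshot ${SRC}@${SNAP}"
SEND_CMD="zfs send ${SRC}@${SNAP} | zfs receive -s ${DST}"

echo "$SNAP_CMD"
echo "$SEND_CMD"
```

`zfs receive -s` saves resumable state, so an interrupted transfer of a large dataset can be resumed with a `zfs send -t` token instead of restarting from scratch.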
-
We have experienced silent data corruption on a replication target. I first thought the problem was related to #6224, but this happened on TrueNAS CORE 12.0-U5, which should already include a fix for that issue.
Most files are okay, some are holes (i.e., fully zeroed), and other files look like this:
That looks like 1 MB of data in a 128k record? (The dataset currently has a 1M recordsize, but this particular file has 128k records on the replication source.)
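One way to locate every silently corrupted file like the ones above is to compare per-file checksums between the mounted source and target datasets. A minimal sketch; the mountpoints in the example call are hypothetical:

```shell
# Walk a source tree and report files whose checksum differs (or is
# missing) on a replica tree. Useful for finding silently corrupted
# files after a suspect replication.
compare_trees() {
  src=$1
  dst=$2
  # NUL-separated paths so filenames with spaces survive the loop.
  (cd "$src" && find . -type f -print0) |
  while IFS= read -r -d '' f; do
    a=$(md5sum < "$src/$f" | cut -d' ' -f1)
    b=$(md5sum < "$dst/$f" 2>/dev/null | cut -d' ' -f1)
    [ "$a" = "$b" ] || printf 'MISMATCH: %s\n' "$f"
  done
}

# Example (hypothetical mountpoints):
# compare_trees /mnt/tank/data /mnt/backup/data
```

Any checksum tool works here; md5 is fine because this is corruption detection, not a security check.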
I can `zfs send` the bad dataset, with a resulting stream like this:

but the kernel panics on `zfs receive`:

Is there some known bug/gotcha that can cause this situation?
Can I fix the situation by destroying the affected dataset or should I rebuild the entire pool?
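For inspecting a suspect send stream without receiving it (and thus without risking the receive-side panic), OpenZFS ships `zstreamdump`, which prints the records in a send stream. A dry-run sketch with a hypothetical snapshot name; the script only prints the command rather than executing it:

```shell
# Dry-run sketch: examine a send stream's records without receiving it.
# The dataset/snapshot name is hypothetical.
DS="tank/bad@snap"
DUMP_CMD="zfs send ${DS} | zstreamdump -v"
echo "$DUMP_CMD"
```

Comparing the record sizes reported by `zstreamdump -v` against the dataset's expected recordsize may show whether the oversized-record mismatch is already present in the stream itself.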