ZFS gives incorrect error messages, pretends total data loss, and urges the user to reformat disks when there is actually no data problem. #17272

kernschmelze · 2025-04-24T22:26:16Z

System A: Debian 12.9, zfs-2.1.11-1+deb12u1 zfs-kmod-2.1.11-1+deb12u1
System B: Ubuntu 24.04.02 Server LTS zfs-2.2.2-0ubuntu9.2 zfs-kmod-2.2.2-0ubuntu9.1

I am trying to physically transport some data from one computer to another.
On system A, the data was packed onto a transport pool on the 4TB drive via a lengthy local zfs send/receive, and the pool was exported afterwards, before the drive was moved to system B.

I expected to just import the pool on system B and access the data.
But when I put that drive into system B, I got an error message along the lines of "Oops! Your precious data now is irretrievably lost forever! Your only remaining option is to format the drive!".

When I researched this, I found that it is a longstanding issue whose PR was apparently closed without any resolving action:

So when I correct the symbolic links so that they actually match what zdb -l displays in the "path:" section, I get another "joke" message, this time the following: "The device listed as FAULTED with 'corrupted data' cannot be opened due to a corrupt label. ZFS will be unable to use the pool, and all data within the pool is irrevocably lost."

This message, again, cannot be correct IMHO, because putting the drive back into system A, importing the pool again, and inspecting and scrubbing it did not reveal anything unusual!

I have never observed such behaviour on FreeBSD.
Totally misleading and incorrect error messages!
Even urging the user to actually irrevocably delete their data!

Right now I am repeating the zfs send/receive, this time onto a pool created using -d and using a manually created link as the path instead of the links in /dev/disk, just to find out whether this behaviour is caused by the gazillion active features or by some issue with /dev/... paths. But, again, on FreeBSD I haven't had such an issue in years. IMHO such behavior should not happen on Linux either.

Any idea why ZFS on Linux gives such shocking, grotesquely wrong messages suggesting total data loss, and even suggests that the user format the drives, which would cause actual data loss, when in fact there are no data errors? I just don't get it...

Again, the only thing that differed on system B was the device path.
But isn't that to be expected, at least when the disks are imported on a different system?

Refusing to import the pool at its new hardware path and pretending total data loss just because of this minor (if at all) label inconsistency is IMHO not OK behavior... What do you think? What did I miss or get wrong?

IvanVolosyuk · 2025-04-25T03:03:50Z

Can you include the output of zpool status [yourpool] on the original system and of zpool import on the target system?
Did you try importing with zpool import -d /dev/sdXXNN yourpool?
zdb -l /dev/sdXXNN can shed more light as well.
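
For comparing labels between the two systems, the few fields that matter here can be pulled out of the zdb -l output. A minimal sketch, assuming the usual "key: value" layout that zdb -l prints; label_summary is a hypothetical helper name, not part of ZFS:

```shell
# label_summary: hypothetical helper, not part of ZFS.
# Reads `zdb -l` output on stdin and prints the first occurrence of
# name, pool_guid, path and phys_path as key=value lines.
label_summary() {
    awk -v q="'" '
        $1 == "name:" || $1 == "pool_guid:" || $1 == "path:" || $1 == "phys_path:" {
            key = $1
            sub(/:$/, "", key)
            if (key in seen) next                     # keep only the first value seen
            seen[key] = 1
            value = $0
            sub(/^[ \t]*[a-z_]+:[ \t]*/, "", value)   # drop the leading "key: "
            gsub(q, "", value)                        # drop the single quotes
            print key "=" value
        }
    '
}

# Usage (device name is a placeholder):
#   zdb -l /dev/sdXXNN | label_summary
```

Running this on both systems and comparing the lines side by side makes it easier to see whether only the stored path differs or whether the name or pool_guid differ too.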

kernschmelze · 2025-04-25T06:25:20Z

zpool status on the original system was completely normal; it just indicated that the most recent scrub had not found or corrected any issues.

zpool import on the target system:
# zdb -l /dev/disk/by-id/ata-HGST_HUS726T4TALE6L4_V6H9TVXR

LABEL 0

version: 5000
name: 'zpool_meta'
state: 1
txg: 66107
pool_guid: 1353412334590388366
errata: 0
hostid: 1329746763
hostname: 'exile'
top_guid: 8008196159326087980
guid: 8008196159326087980
vdev_children: 1
vdev_tree:
    type: 'disk'
    id: 0
    guid: 8008196159326087980
    path: '/dev/sdg'
    phys_path: 'id1,enc@n3061686369656d30/type@0/slot@2/elmdesc@Slot_01'
    whole_disk: 1
    metaslab_array: 256
    metaslab_shift: 34
    ashift: 12
    asize: 4000782221312
    is_log: 0
    DTL: 7090
    create_txg: 4
features_for_read:
    com.delphix:hole_birth
    com.delphix:embedded_data
labels = 0 1 2 3

# zpool import zpool_meta
cannot import 'zpool_meta': no such pool available
# zpool import -d /dev/sde zpool_meta
cannot import 'zpool_meta': no such pool available
# zpool import -d /dev/disk/by-id/ata-HGST_HUS726T4TALE6L4_V6H9TVXR
pool: zpool_meta
id: 1353412334590388366
state: UNAVAIL
status: One or more devices contains corrupted data.
action: The pool cannot be imported due to damaged devices or data.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-5E
config:

zpool_meta                           UNAVAIL  insufficient replicas
  ata-HGST_HUS726T4TALE6L4_V6H9TVXR  UNAVAIL  invalid label

#

As I said, I could not believe that this garbage output about "corrupted data" was actually true, so I took the drive back to the old system and examined it.
As said, zpool status after a completed scrub was totally normal, with no hint of anything unusual, so I did not bother to keep the zpool status output.

I followed the advice to place fixed device links so the unreliable Linux device naming can be avoided.
This had an effect on the error output, but did not help to work around the apparent ZFS bugs on Linux...

# mkdir /zvdevs
# ln -s /dev/disk/by-id/ata-HGST_HUS726T4TALE6L4_V6H9TVXR /zvdevs/zpool_meta
# zdb -l /zvdevs/zpool_meta

LABEL 0

version: 5000
name: 'zpool_meta2'
state: 1
txg: 66487
pool_guid: 1353412334590388366
errata: 0
hostid: 1329746763
hostname: 'exile'
top_guid: 8008196159326087980
guid: 8008196159326087980
vdev_children: 1
vdev_tree:
    type: 'disk'
    id: 0
    guid: 8008196159326087980
    path: '/zvdevs/zpool_meta'
    phys_path: 'id1,enc@n3061686369656d30/type@0/slot@2/elmdesc@Slot_01'
    whole_disk: 1
    metaslab_array: 256
    metaslab_shift: 34
    ashift: 12
    asize: 4000782221312
    is_log: 0
    DTL: 7090
    create_txg: 4
features_for_read:
    com.delphix:hole_birth
    com.delphix:embedded_data
labels = 0 1 2 3

# zpool import zpool_meta2
cannot import 'zpool_meta2': no such pool available
# zpool import -d /zvdevs/zpool_meta
pool: zpool_meta2
id: 1353412334590388366
state: FAULTED
status: The pool metadata is corrupted.
action: The pool cannot be imported due to damaged devices or data.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-72
config:

zpool_meta2           FAULTED  corrupted data
  /zvdevs/zpool_meta  ONLINE

#

This all boils down to the question:
Can the bugs in ZFS on Linux be either fixed or worked around, so that the system no longer gives such grossly wrong error messages instead of just importing the pool at the physical path where it was found?

Refusing to import and giving such appallingly wrong messages about the status of the precious data, only because the physical path is different from what has been stored in the label, should IMHO be considered a severe bug.

kernschmelze · 2025-04-25T10:19:36Z

My next attempt was to see whether things go differently if I use manually created device links.
So I created a directory zpooldevs that would have to exist on every machine that needs to use this pool.

On the target system I then entered:

# ln -s /dev/disk/by-id/ata-HGST_HUS726T4TALE6L4_V6H9TVXR-part1 /zpooldevs/bak_4T_wwn_5000cca097d28c32
# zdb -l /zpooldevs/bak_4T_wwn_5000cca097d28c32

LABEL 0

version: 5000
name: 'bak_4T_wwn_5000cca097d28c32'
state: 1
txg: 8352
pool_guid: 16703327145615827938
errata: 0
hostid: 1329746763
hostname: 'exile'
top_guid: 3290381994946855134
guid: 3290381994946855134
vdev_children: 1
vdev_tree:
    type: 'disk'
    id: 0
    guid: 3290381994946855134
    path: '/zpooldevs/bak_4T_wwn_5000cca097d28c32'
    whole_disk: 0
    metaslab_array: 256
    metaslab_shift: 34
    ashift: 12
    asize: 4000781172736
    is_log: 0
    create_txg: 4
features_for_read:
labels = 0 1 2 3

# zpool import bak_4T_wwn_5000cca097d28c32
cannot import 'bak_4T_wwn_5000cca097d28c32': no such pool available
# zpool import -d /zpooldevs/bak_4T_wwn_5000cca097d28c32
pool: bak_4T_wwn_5000cca097d28c32
id: 16703327145615827938
state: ONLINE
status: Some supported features are not enabled on the pool.
(Note that they may be intentionally disabled if the
'compatibility' property is set.)
action: The pool can be imported using its name or numeric identifier, though
some features will not be available without an explicit 'zpool upgrade'.
config:

bak_4T_wwn_5000cca097d28c32               ONLINE
  /zpooldevs/bak_4T_wwn_5000cca097d28c32  ONLINE

# zpool import bak_4T_wwn_5000cca097d28c32
cannot import 'bak_4T_wwn_5000cca097d28c32': one or more devices is currently unavailable
#

So now there is no longer a phys_path in the label.
Personally, I guess that the underlying cause is that the "path" property is ignored and only "phys_path" is used, and that ZFS errors out rather than falling back to the user-defined/user-definable "path" if the "phys_path" property is missing.

The next step is to prepare the data transporter drive on FreeBSD and try again.
Maybe Linux ZFS behaves differently if the partition type is a504 (FreeBSD ZFS) rather than bf01, the apparent ZoL default...

kernschmelze · 2025-04-25T10:46:10Z

Finally, this thread showed a "solution". Thanks @amotin

Well, I do not consider it really intuitive that running wipefs is apparently a prerequisite for being able to import...

Maybe the pages behind the error message links mentioned in the entry post could be updated accordingly, instead of effectively playing bad April Fools' Day jokes on users by withholding that unintuitive fix and falsely telling them their data is gone forever?

kernschmelze · 2025-04-25T19:50:33Z

Nooooo!

Unfortunately this "solution" does not always work.

When putting the transport drive into the target system again after having loaded it with another dataset, I now get told by zpool import -d that I can import the pool using its name or its pool GUID...
When using the name, it always says "cannot import '': no such pool available" !!!

But it imports the pool when using the pool GUID !!
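
Since the name fails and the GUID works, it helps to pull the numeric id out of the listing that zpool import -d prints. A rough sketch, assuming the "pool:" / "id:" layout shown in the outputs above; pool_ids is a made-up helper name, not a ZFS command:

```shell
# pool_ids: hypothetical helper, not part of ZFS.
# Reads a `zpool import` listing on stdin and prints "name id" per pool.
pool_ids() {
    awk '
        $1 == "pool:" { name = $2 }                              # remember the pool name
        $1 == "id:" && name != "" { print name, $2; name = "" }  # pair it with its id
    '
}

# Usage (directory is a placeholder); the printed id can then be passed
# to zpool import in place of the pool name:
#   zpool import -d /zpooldevs | pool_ids
```

If the same name is printed more than once with different ids, that alone would explain why importing by name is ambiguous while importing by id is not.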

1 reply

amotin Apr 28, 2025
Collaborator

> But it imports the pool when using the pool GUID !!

I wonder if it means you have some disks with labels of that pool name, and it just tries to import the wrong and obviously incomplete pool.
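
One way to check that idea is to collect the name and pool_guid from zdb -l for every device and look for a name that shows up with more than one GUID. A minimal sketch, assuming the input is one "device name pool_guid" triple per line (gathered by hand or by a script); find_name_conflicts is a hypothetical helper, not part of ZFS:

```shell
# find_name_conflicts: hypothetical helper, not part of ZFS.
# Reads "device name pool_guid" lines on stdin and prints every pool name
# that appears with more than one distinct pool_guid.
find_name_conflicts() {
    awk '
        {
            pair = $2 SUBSEP $3
            if (!(pair in seen)) {          # count each (name, guid) pair once
                seen[pair] = 1
                count[$2]++
                guids[$2] = guids[$2] " " $3
            }
        }
        END {
            for (name in count)
                if (count[name] > 1)
                    print name ":" guids[name]
        }
    ' | sort
}

# Usage: feed it the triples collected from the labels, for example
#   printf '%s\n' '/dev/sdX1 mypool 111' '/dev/sdY mypool 222' | find_name_conflicts
```

Any name it prints is carried by labels of at least two different pools, which is the situation where importing by name could pick the wrong one.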
