Unsuitable SSD/NVMe hardware for ZFS - WD BLACK SN770 and others #14793

admnd · 2023-04-25T13:20:51Z

admnd
Apr 25, 2023

Originally started as a bug report, but after investigation and comments it is clearly more a hardware issue related to ZFS than a ZFS bug, so I am opening a general discussion here. Feel free to post constructive observations, ideas, workarounds, or suggestions.

TL;DR: Some NVMe sticks just crash with ZFS, probably because they are unable to sustain I/O bursts. It is not clear why this happens: the controller might simply crash, or a combination of firmware/BIOS/hardware makes it unstable when used in a ZFS pool.

Hardware

OS: Gentoo Linux x86/64 with kernel 6.2.12 and ZFS 2.1.11.
Hardware:
- CPU: AMD Ryzen 7950X
- Motherboard: Asus TUF Gaming X670E-Plus WiFi (upgraded to the latest available BIOS => 1410 as of 05/25/2023)
- 3x NVMe WD Black SN770 2TB with latest firmware as of 05/25/2023 (731100WD) configured with 4K sectors
- PSU: MSI 850W

Issue observed

My system zpool is composed of a single RAID-Z1 VDEV made of 3x WD Black SN770 2TB, themselves configured with 4K logical sectors (I have not yet tested with 512-byte sectors to see if the issue still happens). The VDEV uses LZ4 compression and is not encrypted, nor are the underlying modules (they do not support it); the standard 128K recordsize is used. No L2ARC cache is used. The system has plenty of free RAM, so there is no memory pressure.

Under "normal" daily usage I have not experienced any problem: the zpool is regularly scrubbed with nothing to report, no checksum errors, no frozen tasks, no crashes, and the pool completes every scrub without trouble. The machine also experiences no freezes or kernel crashes/"oopses" and no stuck tasks (I reported an issue with auditd here a couple of weeks ago, but auditd is now inactive, see bug #14697). Even "emerging" big packages like dev-qt/qtwebengine with 32 CMake jobs in parallel, or re-emerging the whole system from scratch with 32 parallel tasks and heavy packages rebuilt at the same time, succeeds. No crashes.

However, if I use zfs send to back up the system datasets to a local TrueNAS box over a 10GbE link, it is another story: most of the time one of the NVMe modules randomly crashes. The issue also happens at different points in the data transfer: sometimes after 12 GB, sometimes after 78 GB, sometimes after 93 GB, and so on. If I am lucky, the operation completes successfully (less than a quarter of the time). Itchy and annoying. I have also managed to reproduce it by rsync-ing a dataset to a new empty one in the same pool, although this happens more rarely. The TrueNAS box and the network are not at fault: they run smoothly, and I can reproduce the issue locally by sending the ZFS stream to /dev/null (zfs send .... | cat > /dev/null).

When the crash happens, the following trace appears in the kernel logs:

[430771.216723] nvme nvme2: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
[430771.216727] nvme nvme2: Does your device have a faulty power saving mode enabled?
[430771.216729] nvme nvme2: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
[430771.266732] nvme 0000:13:00.0: enabling device (0000 -> 0002)
[430771.266814] nvme nvme2: Disabling device after reset failure: -19
[430771.283392] I/O error, dev nvme2n1, sector 1812765936 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[430771.283397] zio pool=rpool vdev=/dev/nvme2n1p1 error=5 type=1 offset=928127770624 size=16384 flags=180880
[430771.283397] zio pool=rpool vdev=/dev/nvme2n1p1 error=5 type=1 offset=1394183585792 size=24576 flags=180880
[430771.283397] zio pool=rpool vdev=/dev/nvme2n1p1 error=5 type=2 offset=1575062740992 size=4096 flags=180880
[430771.283399] nvme2n1: detected capacity change from 3907029168 to 0
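
To see at a glance which module dropped in a given incident, the zio records in a trace like the one above can be tallied per device. The following is only a minimal sketch: the function name `summarize_zio_errors` is made up for illustration, and the regular expression is written against the exact line format shown in this log, which may differ between ZFS versions.

```python
import re
from collections import Counter

# Matches the "zio pool=... vdev=... error=... type=..." records as they
# appear in the kernel log excerpt above (format assumed from that excerpt).
ZIO_RE = re.compile(
    r"zio pool=(?P<pool>\S+) vdev=(?P<vdev>\S+) error=(?P<error>\d+) "
    r"type=(?P<type>\d+) offset=(?P<offset>\d+) size=(?P<size>\d+)"
)


def summarize_zio_errors(log_text):
    """Count failed zios per (vdev, zio type number) in a kernel log dump."""
    counts = Counter()
    for match in ZIO_RE.finditer(log_text):
        counts[(match["vdev"], int(match["type"]))] += 1
    return counts
```

Feeding it the saved output of `dmesg` gives one counter entry per device and zio type, which makes it easier to compare several crashes and check whether the same module fails every time.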

At this point, if I am lucky enough, I can manage to bring it back to life using a sledgehammer:

echo 1 > /sys/bus/pci/devices/0000\:12\:00.0/remove
echo 1 > /sys/bus/pci/rescan

If the faulted device reappears, the zpool becomes ONLINE again and completes its resilvering (a couple of KB or MB). In the worst case, another NVMe also drops out of the pool, which then becomes suspended, so I have to power-cycle the machine or push its reset button. Of course, running nvme list at this point either freezes completely or lists only the two remaining NVMe modules, depending on what is still alive.

My best guess so far is that the controller of the Western Digital SN770 modules is not beefy enough to handle a burst of I/O requests (knowing they have no DRAM cache), so it is brought to its knees and becomes so unresponsive that it cannot complete a reset request on its own (no AER reported in the logs, BTW). As it is not always the same module that crashes, they do not all seem to be defective, unless I am extremely unlucky. Pool scrubbing might be a bit lighter on the controller, so scrubs/resilvers work without any issue (the maximum observed speed is around 4.5~5 GB/s when scrubbing the pool, according to zpool status).

What has been tried so far

Several things, unfortunately without any improvement:

- As suggested in the error message, putting nvme_core.default_ps_max_latency_us=0 pcie_aspm=off on the kernel command line
- Moving the NVMe modules around in different slots (temperatures seem reasonable and they all have heatsinks)
- Playing around with some ZFS kernel module parameters: lowering the values of zfs_vdev_sync_read_min_active, zfs_vdev_sync_read_max_active and their async counterparts (I used the same values as the defaults for zfs_vdev_scrub_min_active and zfs_vdev_scrub_max_active)
- Throttling with throttle: zfs send ... | throttle -M 300 | ...
- Tinkering with the blkio cgroup
- Running a short S.M.A.R.T. test: nothing special to say, all three NVMe modules pass it
- Putting the whole machine's hardware settings back to their BIOS defaults (no PBO, no RAM overclocking)
- Memtesting the RAM (3 passes, no errors)
- rsync-ing the system dataset to a virtual disk over iSCSI (no crash! yeah! impractical, however)
- zfs send from a FreeBSD live medium: FreeBSD allocates a 200MB host memory buffer for each module, but unfortunately with no more success: zfs send also hangs :/
- Enforcing PCIe 3.0 and 2.0 on all M.2 slots => still crashes
- Setting PCIe power management to "off" in BIOS/UEFI
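
For anyone wanting to repeat the throttling experiment without the `throttle` utility, a rate-limited pipe stage can be sketched in a few lines. This is an illustration only, not a ZFS feature: it limits the byte stream leaving `zfs send`, it does not control how ZFS issues reads to the devices, and the function name `throttled_copy` and its parameters are invented for this example.

```python
import time


def throttled_copy(src, dst, max_bytes_per_sec, chunk_size=1 << 20,
                   clock=time.monotonic, sleep=time.sleep):
    """Copy src to dst, sleeping so the average rate stays at or below the cap.

    src and dst are binary file objects; clock and sleep are injectable so
    the pacing logic can be exercised without real delays.
    """
    start = clock()
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        total += len(chunk)
        # Earliest moment at which `total` bytes are allowed to have passed.
        earliest = start + total / max_bytes_per_sec
        delay = earliest - clock()
        if delay > 0:
            sleep(delay)
    return total
```

Used as a filter, it would be called as `throttled_copy(sys.stdin.buffer, sys.stdout.buffer, 300 * 10**6)` for a cap of roughly 300 MB/s, placed in the pipeline between `zfs send` and the receiving command.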

Some thoughts / ideas of tests to try

Use 512b sectors (pool has to be destroyed)
Swap the WD Black SN 850 modules of my secondary machine with those and see if this solves the issue on this machine (while being functional on the other machine)
Burn a candle

Is there a "ZFS native" way to throttle I/O operations in the case of doing a zfs send?

Has anybody here experienced something like this? If so, what are the other brands/models subject to a similar issue?

admnd · 2023-04-25T21:07:52Z

admnd
Apr 25, 2023
Author

Found something interesting in a proposed patch, in a discussion titled "[PATCH] nvme-pci: fix host memory buffer allocation size" dating from May 10th, 2022. The discussion starts here => https://www.spinics.net/lists/kernel/msg4339024.html

At some point (https://www.spinics.net/lists/kernel/msg4352567.html), it is mentioned that:

WD SN770 NVMe modules are problematic (the author experiences the very same freezes as me but does not mention ZFS, so I guess that he uses a single standalone drive with something other than ZFS)
Switching the I/O scheduler to "mq-deadline" improved the situation without solving it completely.
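
For reference, on Linux the scheduler of a block device is exposed in sysfs as `/sys/block/<device>/queue/scheduler`: reading the file lists the available schedulers with the active one in square brackets, and writing a name to it switches. The helper below is a minimal sketch of that mechanism; the function names and the `sysfs_root` parameter are made up for illustration, and writing to the real file requires root.

```python
from pathlib import Path


def parse_scheduler(text):
    """Return (active, available) from the content of queue/scheduler.

    The kernel marks the active scheduler with brackets, for example
    "[none] mq-deadline kyber bfq".
    """
    active = None
    available = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        available.append(token)
    return active, available


def set_scheduler(device, name, sysfs_root="/sys"):
    """Select an I/O scheduler for a block device such as "nvme0n1"."""
    path = Path(sysfs_root) / "block" / device / "queue" / "scheduler"
    _, available = parse_scheduler(path.read_text())
    if name not in available:
        raise ValueError(f"{name!r} is not offered for {device}: {available}")
    path.write_text(name)
```

The setting does not survive a reboot, so a permanent change would need something like a udev rule or a boot script.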

In a subsequent message ( https://www.spinics.net/lists/kernel/msg4372632.html ) it is also mentioned that the situation improved drastically with the patch.

Another point of the discussion is the Host Memory Buffer being just 32MB. According to my logs, I have the same allocation:

[    3.264207] nvme nvme2: pci function 0000:08:00.0
[    3.264207] nvme nvme1: pci function 0000:0e:00.0
[    3.264207] nvme nvme0: pci function 0000:04:00.0
[    3.302554] nvme nvme2: allocated 32 MiB host memory buffer.
[    3.303343] nvme nvme0: allocated 32 MiB host memory buffer.
[    3.303721] nvme nvme1: allocated 32 MiB host memory buffer.
[    3.306596] nvme nvme2: 32/0/0 default/read/poll queues
[    3.307029] nvme nvme0: 32/0/0 default/read/poll queues
[    3.307622] nvme nvme1: 32/0/0 default/read/poll queues

For the record, here are excerpts of some messages:

Taken from https://www.spinics.net/lists/kernel/msg4352567.html :

On my current setup (WD SN770 on ThinkPad X1 Carbon Gen9) frequently the NVME
controller stops responding. Switching from no scheduler to mq-deadline reduced
this but did not eliminate it.
Since switching to HMB of 1 * 200MiB and no scheduler this did not happen anymore.
(But I'll need some more time to gain real confidence in this)

Initially I assumed that the PAGE_SIZE * MAX_ORDER_NR_PAGES was indeed
meant as a minimum for DMA allocation.
As that is not the case, removing the min() completely instead of the max() I
proposed would obviously be the correct thing to do.

Taken from https://www.spinics.net/lists/kernel/msg4372632.html :

So this patch dramatically improves the stability of my disk.
Without it and queue/scheduler=none the controller stops responding after a few
minutes. mq-deadline reduced it to every few hours.
With the patch it happens roughly once a week.

Current parameters for the nvme kernel modules on my system are on their defaults:

parm:           use_threaded_interrupts:int => 0
parm:           use_cmb_sqes:use controller's memory buffer for I/O SQes (bool) => Y
parm:           max_host_mem_size_mb:Maximum Host Memory Buffer (HMB) size per controller (in MiB) (uint) => 128
parm:           sgl_threshold:Use SGLs when average request segment size is larger or equal to this size. Use 0 to disable SGLs. (uint) => 32768
parm:           io_queue_depth:set io queue depth, should >= 2 and < 4096 => 1024
parm:           write_queues:Number of queues to use for writes. If not set, reads and writes will share a queue set. => 0
parm:           poll_queues:Number of queues to use for polled IO. => 0
parm:           noacpi:disable acpi bios quirks (bool) => N

Going through the code of drivers/nvme/host/pci.c (checked with a 6.2.12 Linux kernel) suggests that the famous patch has not been applied, because the "min_t" is still there:

static int nvme_alloc_host_mem(struct nvme_dev *dev, u64 min, u64 preferred)
{
        u64 min_chunk = min_t(u64, preferred, PAGE_SIZE * MAX_ORDER_NR_PAGES);
        u64 hmminds = max_t(u32, dev->ctrl.hmminds * 4096, PAGE_SIZE * 2);
        u64 chunk_size;

        /* start big and work our way down */
        for (chunk_size = min_chunk; chunk_size >= hmminds; chunk_size /= 2) {
                if (!__nvme_alloc_host_mem(dev, preferred, chunk_size)) {
                        if (!min || dev->host_mem_size >= min)
                                return 0;
                        nvme_free_host_mem(dev);
                }
        }

        return -ENOMEM;
}

The patch in question is mentioned at the very beginning of the discussion and is this one:

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 3aacf1c0d5a5..0546523cc20b 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2090,7 +2090,7 @@ static int __nvme_alloc_host_mem(struct nvme_dev *dev, u64 preferred,
 
 static int nvme_alloc_host_mem(struct nvme_dev *dev, u64 min, u64 preferred)
 {
-	u64 min_chunk = min_t(u64, preferred, PAGE_SIZE * MAX_ORDER_NR_PAGES);
+	u64 min_chunk = max_t(u64, preferred, PAGE_SIZE * MAX_ORDER_NR_PAGES);
 	u64 hmminds = max_t(u32, dev->ctrl.hmminds * 4096, PAGE_SIZE * 2);
 	u64 chunk_size;

Another related thread is here => https://lore.kernel.org/linux-nvme/[email protected]/
Quoting:

I am wondering about the calculation of the NVMe Host Memory Buffer sizes.
It seems to me that the current algorithm to calculate this size does not lead
to an optimal result.

Hardware information:
mn : WD_BLACK SN770 1TB
fr : 731030WD
hmpre : 51200 (limited by max_host_mem_size_mb to 32768 -> 128MiB)
hmmin : 823
hmminds : 0
hmmaxd : 8

To me this looks like the disk wants 200MiB allocated that can be described in
eight descriptors.
However the kernel log has the following entry:

[ 8.981685] nvme nvme0: allocated 32 MiB host memory buffer.

Tracing through drivers/nvme/host/pci.c the following happens:

The loop in nvme_alloc_host_mem() is only entered once.
min: 3371008
preferred: 134217728
min_chunk: 4194304
chunk_size: 4194304

Now in __nvme_alloc_host_mem() the loop is called the eight times for hmmaxd,
each time allocating 4194304 bytes (4 MiB).
The end result is that a total of 32MiB of Host Memory Buffer are allocated
which is the bare minimum instead of the 200 MiB that are preferred and
available.

It seems that the logic to calculate min_chunk in nvme_alloc_host_mem() starts
with a too small value.

All of this is on a normal x86 laptop with plenty of system memory.
It's reproducible with current git (46cf2c613f4b10eb12f749207b0fd2c1bfae3088)
and 5.17.4.
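
The arithmetic in the quoted message can be reproduced with a small model. This is a simplified sketch and not the driver code: it assumes 4 KiB pages and MAX_ORDER_NR_PAGES = 1024 (which yields the 4194304-byte min_chunk quoted above), it assumes every DMA allocation succeeds on the first pass of the loop, and the function name `hmb_size` is made up for illustration.

```python
PAGE_SIZE = 4096            # assumption: x86-64 with 4 KiB pages
MAX_ORDER_NR_PAGES = 1024   # assumption: gives the 4 MiB chunk quoted above
MIB = 1 << 20


def hmb_size(hmpre, hmmaxd, max_host_mem_size_mb, patched=False):
    """Model the HMB size in bytes, assuming no allocation failure.

    hmpre is the drive's preferred size in 4 KiB units, hmmaxd the maximum
    number of descriptor entries (0 means no limit).
    """
    # The preferred size is capped by the max_host_mem_size_mb parameter.
    preferred = min(hmpre * 4096, max_host_mem_size_mb * MIB)
    largest_chunk = PAGE_SIZE * MAX_ORDER_NR_PAGES
    # Unpatched code uses min_t(); the proposed patch turns it into max_t().
    pick = max if patched else min
    chunk_size = pick(preferred, largest_chunk)
    # Number of chunks needed (ceiling division), capped by hmmaxd.
    max_entries = -(-preferred // chunk_size)
    if hmmaxd and hmmaxd < max_entries:
        max_entries = hmmaxd
    return min(preferred, max_entries * chunk_size)


print(hmb_size(51200, 8, 128) // MIB)                # 32
print(hmb_size(51200, 8, 512, patched=True) // MIB)  # 200
```

With the values reported for the SN770 (hmpre 51200, hmmaxd 8) and the default cap of 128 MiB, the model gives eight chunks of 4 MiB, so 32 MiB, matching the kernel log. With max_t() and a 512 MiB cap it gives a single 200 MiB chunk, provided the kernel can actually allocate it.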


admnd · 2023-04-25T23:23:25Z

admnd
Apr 25, 2023
Author

Tried the above patch, but in my case it worsens the issue :( The crash happens much earlier than before.
Fiddling around with parameters of nvme.ko, I managed to have a higher allocation of 200 MB with nvme.max_host_mem_size_mb=512 + the above patch applied.

1 reply

foolab Aug 15, 2025

And just now, I got an error with SN850 :-(

[8336605.935521] nvme nvme1: I/O tag 820 (4334) opcode 0x0 (I/O Cmd) QID 2 timeout, aborting req_op:FLUSH(2) size:0
[8336605.935579] nvme nvme1: Abort status: 0x0
[8336606.571489] nvme nvme1: I/O tag 104 (5068) opcode 0x0 (I/O Cmd) QID 6 timeout, aborting req_op:FLUSH(2) size:0
[8336606.571539] nvme nvme1: Abort status: 0x0
[8336607.079686] nvme nvme1: I/O tag 655 (a28f) opcode 0x0 (I/O Cmd) QID 11 timeout, aborting req_op:FLUSH(2) size:0
[8336607.079740] nvme nvme1: Abort status: 0x0
[8336621.023691] nvme nvme1: I/O tag 871 (e367) opcode 0x0 (I/O Cmd) QID 3 timeout, aborting req_op:FLUSH(2) size:0
[8336621.023738] nvme nvme1: Abort status: 0x0
[8336636.143857] nvme nvme1: I/O tag 820 (4334) opcode 0x0 (I/O Cmd) QID 2 timeout, reset controller
[8336706.776713] nvme nvme1: Device not ready; aborting reset, CSTS=0x1
[8336716.836834] nvme nvme1: Device not ready; aborting reset, CSTS=0x1
[8336716.836973] nvme nvme1: Disabling device after reset failure: -19
[8336716.898251] I/O error, dev nvme1n1, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
[8336716.898251] I/O error, dev nvme1n1, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
[8336716.898255] zio pool=zroot vdev=/dev/nvme1n1p2 error=5 type=5 offset=0 size=0 flags=2098304
[8336716.898256] zio pool=zroot vdev=/dev/nvme1n1p2 error=5 type=5 offset=0 size=0 flags=2098304
[8336716.898260] zio pool=zroot vdev=/dev/nvme1n1p2 error=5 type=5 offset=0 size=0 flags=2098304
[8336716.898273] zio pool=zroot vdev=/dev/nvme1n1p2 error=5 type=2 offset=584146255872 size=118784 flags=3145856
[8336716.898274] zio pool=zroot vdev=/dev/nvme1n1p2 error=5 type=2 offset=584145731584 size=131072 flags=3145856
[8336716.898274] zio pool=zroot vdev=/dev/nvme1n1p2 error=5 type=2 offset=584146509824 size=36864 flags=3145856
[8336716.898275] zio pool=zroot vdev=/dev/nvme1n1p2 error=5 type=2 offset=584145600512 size=131072 flags=3145856
[8336716.898276] zio pool=zroot vdev=/dev/nvme1n1p2 error=5 type=2 offset=584145862656 size=131072 flags=3145856
[8336716.898276] zio pool=zroot vdev=/dev/nvme1n1p2 error=5 type=2 offset=584145260544 size=20480 flags=3145856
[8336716.898279] zio pool=zroot vdev=/dev/nvme1n1p2 error=5 type=5 offset=0 size=0 flags=2098304
[8336716.898286] zio pool=zroot vdev=/dev/nvme1n1p2 error=5 type=2 offset=584145993728 size=131072 flags=3145856
[8336716.898286] zio pool=zroot vdev=/dev/nvme1n1p2 error=5 type=2 offset=446787076096 size=16384 flags=3145856
[8336716.898289] zio pool=zroot vdev=/dev/nvme1n1p2 error=5 type=2 offset=584146378752 size=131072 flags=3145856
[8336716.901291] zio pool=zroot vdev=/dev/nvme1n1p2 error=5 type=2 offset=584146124800 size=126976 flags=3145856
[8336716.901302] zio pool=zroot vdev=/dev/nvme1n1p2 error=5 type=5 offset=0 size=0 flags=2098304
[8336716.901320] zio pool=zroot vdev=/dev/nvme1n1p2 error=5 type=5 offset=0 size=0 flags=2098304
[8336716.920808] zio pool=zroot vdev=/dev/nvme1n1p2 error=5 type=5 offset=0 size=0 flags=2098304

admnd · 2023-04-25T23:54:25Z

admnd
Apr 25, 2023
Author

Basically at this point, I am out of options with those sticks. They are a replacement for a trio of ADATA Gammix S70 Blade which were also problematic because their namespaces had a bad EUI64 value: all of them were set to eui64=0000000000000000, which made the system totally confused about who was who.

So my only option at this point is to get another model :/ Perhaps I will keep them for a much-less intensive use.

Reality is: not all NVMe hardware can play nicely with ZFS. It seems that investing in higher-end hardware is not optional, especially with ZFS. I won't consider switching them back to 512-byte sectors: I don't think this would solve the issue and, even if it did, there would be a significant performance penalty.

Hoping my hours of investigation will save someone from wasting money on junk hardware. It is a bit disappointing that this junk comes from a well-known brand.

PS: Feel free to elaborate further. I will post if I get something new on this.


IvanVolosyuk · 2023-04-26T03:45:03Z

IvanVolosyuk
Apr 26, 2023

I would try replacing the PSU with another one, probably a 1000W one. Mysterious problems often end up being solved by replacing a faulty PSU.


2 replies

admnd Apr 26, 2023
Author

Thank you for this suggestion. It is still plausible and I am keeping it in mind. However, it is very unlikely to be the cause here, for three main reasons: 1. the PSU is not even at half load, 2. I would have seen other symptoms while the machine is under very high load or while a scrub is running, 3. someone experienced a similar issue with other hardware (and managed to fix it).

Replacing the PSU means spending a significant amount of cash on a test that might not succeed. Better to save the money for beefier modules. But if I manage to get one, one way or another, it is worth a try. I might try swapping the PSU for a trusted one from my secondary machine (not 1000W, however), as I cannot reproduce the issue on it (3x SN850, working perfectly with ZFS since day 1).

Manawyrm Jun 22, 2023

NVMe storage uses 3.3V supply voltage, which gets created locally on the mainboard from the 12V (sometimes 5V) supply rails on basically all mainboards. The 3.3V rail on the ATX connector is unused on most boards.
If that doesn't work properly, the mainboard is at fault.

Flaaxxx · 2023-04-26T06:58:12Z

Flaaxxx
Apr 26, 2023

This might be a long shot, but where have you connected your NVMe drives? Did you use the onboard slots or a riser card with bifurcation? And if you used the onboard slots, which ones did you use?

From the manual you can see that one of the slots shares bandwidth with the SATA ports; if there is anything connected there, it could cause a problem. Furthermore, X670 daisy-chains two chipset chips to give more connectivity. A guess of mine is that this issue could be caused by limited bandwidth between the chipsets and the CPU, which might make the controller look like it is dropping.

My suggestion to troubleshoot this is to get a bifurcating riser card, put it in the x16 slot and have all the NVMe drives connected directly to the CPU. This would avoid going through the chipsets.

Unfortunately ASUS provides no block diagram of the board showing which PCIe lanes go where and at which speed. But I would check whether limiting the speed of the drives could also be causing this issue. PCIe link speed switching caused me a lot of headaches with my RX 5700 XT GPU: it caused some weird issues such as the card disconnecting, the drivers crashing, etc. So pretty similar to what you are experiencing.

Those 2 would be my guesses for this issue.

1 reply

admnd Apr 26, 2023
Author

Very savvy, thank you. I have no riser here to try your first suggestion (as 7950X has a built in GPU I can pull out the dGPU) this week. But what I can do is to rebuild a pool with 2x NVMe in mirror rather than 3 in RAID-Z1 and see what would happen.

Indeed, the description is a bit hidden in the technical details:
https://www.asus.com/ca-en/motherboards-components/motherboards/tuf-gaming/tuf-gaming-x670e-plus-wifi/techspec/

The paragraph "Storage" says:

AMD Ryzen™ 7000 Series Desktop Processors
M.2_1 slot (Key M), type 2242/2260/2280 (supports PCIe 5.0 x4 mode)
M.2_3 slot (Key M), type 2242/2260/2280 (supports PCIe 4.0 x4 mode)
AMD X670 Chipset
M.2_2 slot (Key M), type 2242/2260/2280/22110 (supports PCIe 3.0 x4 & SATA modes)**
M.2_4 slot (Key M), type 2242/2260/2280 (supports PCIe 4.0 x4 mode)

The actual configuration is one NVMe module in M.2_1, one in M.2_2 and the third in M.2_3. Two of them connected directly to the CPU, the third going via the chipset. I also tried M.2_1, M.2_3, M.2_4 but with similar results. BIOS being on auto settings, they run at their native speed (PCIe 4.0). I will try to lower to PCIe 3.0 or even 2.0 and see what happens.

I have the impression of being just above a certain threshold, not that far away.

Lyndeno · 2023-04-26T13:55:33Z

Lyndeno
Apr 26, 2023

It's interesting you're having issues with the SN770.

I was having issues with mine (2TB as well) in my laptop. With ZFS, Btrfs on LVM/LUKS, even ext4, my drive would reset just like yours. Whether during boot, when sitting there doing nothing, or while doing something. Seemingly random.

I took it to my computer store to get it replaced. Through their testing the drive passed all tests, so they did not replace it. I believe they were testing with windows.

I am going to RMA it with WD, hopefully my replacement performs better.

I have the exact same drive in my desktop (X570, 5950X), using a single ZFS vdev as root. I have not experienced these issues there. I would try putting the desktop drive in my laptop (XPS 9560) to see if it has issues, but that would be quite an inconvenience for me. So I am just going to RMA it. The previous drive in my laptop did not have these issues.

This occurred with both 512-byte and 4K sectors, I believe.

4 replies

admnd Apr 27, 2023
Author

It seems some other people encounter problems with this model (see the links in the next comment). This model has no DRAM cache and seems very prone to crashing, even when idle. That is definitely not expected (a performance loss, however, was).

My guess is that these target the general market, where I/O is not that heavy and only one module is used, in a machine that is not up 24 hours a day. Thus, WD engineers have (maybe) not put them under high stress because it is not the use case they are supposed to fit ;) WD is a reputable brand and its products are tested. Companies are not always too big to fail, and sometimes mistakes or more-or-less-stupid management decisions are made for various reasons: not having thought about a detail, cutting costs with sub-standard components, etc. This is pure speculation at this point; I cannot tell what the real cause is, as I do not work at WD or have contacts there.

I am curious to see if your replacement improves your situation or if it is just as unstable as the replaced SN770.

"If you want high performance NVMe, use a model with DRAM". I learnt life the hard way on this one.

Lyndeno Apr 27, 2023

We'll see, I still have to send it in. But the SN770 in my desktop has been performing well, no errors to report.

I would have got a Firecuda (I do like Seagate, and in the case of nvme firmware upgrades, Seagate is way more Linux friendly) but for the capacity, it was almost double the price.

I am not doing heavy i/o normally, but I do game, compile and stuff on this computer and the WD has been performing fine. Which is why I suspect (I hope) it's a faulty drive in my laptop.

mabra Jun 22, 2023

Saw your message late.
I started with two FireCudas.
The first died after 4 weeks; the other one is causing pool crashes and gives messages like this:
Device: /dev/nvme0, number of Error Log entries increased from 756 to 760
According to the specs, they have ram as cache.
The WD never caused a problem for me.
Can say this, because my storage crashed again yesterday.

Lyndeno Aug 19, 2024

An update to my situation.

The RMA SN770 replacement was exhibiting the same issues on my laptop.

I have been running two SN770 in my desktop in a ZFS mirror for around a year and a half now. Recently, one of them is resetting/disconnecting, degrading the pool.

I have not checked to see if it is the same drive each time.

It seems to happen as a result of something. Sometimes, simply logging in to Gnome causes it to happen. Not sure why, as this pool does not hold any root files.

I also noticed it sometimes occurs when my phone starts backing up to immich, I have the postgres database stored on that mirror. I have not yet tried any troubleshooting, kernel params, settings, etc. Only change I have made is turn on the fan on my Hyper M.2 card. Still occurs occasionally.

My Firecuda 520 root on XFS has been rock solid for four years.

admnd · 2023-04-26T18:37:36Z

admnd
Apr 26, 2023
Author

Other pointers (FreeBSD):

At this point, I have opened a case with WD; perhaps something can be done at their level. As I should have some free time tomorrow, I will try to exchange modules between my two machines.

3 replies

Lyndeno Apr 26, 2023

These are similar to other posts I have seen (different drives) where the power supply was the issue.

I am hoping that is not the case for my laptop. I guess I could replace the battery? But I just got a new battery last year.

In the meantime, I will continue with my RMA with WD. I hope it is simply a bad drive.

Lyndeno May 24, 2023

I have received a replacement SN770, within two days that drive started exhibiting the same problems as the last one.

I have ordered a cheap 1TB Timetec SSD for my laptop. It has been a few days so far and no issues. I will put the SN770 into my desktop to go with the other one. There seems to be some incompatibility between the drive model and my laptop.

Lyndeno Aug 19, 2024

See my other comment #14793 (reply in thread), the WD drives are exhibiting these issues on my Asus desktop now.

admnd · 2023-04-28T00:27:02Z

admnd
Apr 28, 2023
Author

SN770 Swapped out for 3x WD SN 850 configured in 4K. Day & night! My 7950X is literally breathing again! Over 100K IOPS while emerging GCC 13, zpool scrubs are going easily to 5-6 GB/s.

Earlier this afternoon, I tried to swap one module at a time. Guess what? One SN770 quit the pool seconds after the resilvering started, and the second reset in the middle. I had thousands of checksum errors reported. Fortunately I have daily snapshots stored on a TrueNAS box, so not an issue. This junk is not even able to sustain a pool resilver.

So, gentlemen, moral of the story: don't use DRAM-less NVMe stuff with ZFS.
The trouble it brings is not worth it, not to mention that these drives are a real bottleneck.

I will give news on what happens with my now famous SN770 when I have some :) Perhaps they will do better in my secondary machine or in the junk box.

Thank you, again, for jumping in and taking some of your time to post suggestions here. This is greatly appreciated.

2 replies

mabra Jun 22, 2023

This does not explain why each scrub/resilver works fine for me with this model.
Unlike my FireCuda, it does not even log errors.
For me, all the crashes followed a "return from hibernate", though not directly.

Lyndeno Aug 19, 2024

Scrubs also work just fine for me after rebooting after having one of the drives reset. Full speed

mabra · 2023-05-23T18:24:10Z

mabra
May 23, 2023

Stumbled over this while searching for the consequences of my pool crash.
Just a side note: I am not as deep into Linux and modern hardware as in earlier times.
I have been using a ZFS mirror of two NVMe drives, a "Seagate FireCuda 520 SSD ZP2000" (2 TB) and a "WD_BLACK SN770 2TB" (2 TB), in the original slots on a Supermicro H12SSL-C motherboard with an AMD EPYC 7252 (8 cores) for about a year.
Originally, I started with two of the FireCudas, but one gave up very early, an experience I have had with Seagate over and over during my lifetime. To get an immediate replacement (because it is only a mirror), I bought the WD and was able to recover.
The first failed FireCuda was completely dead; it looks like a hardware-only failure.
The crash, which led to the loss of my complete pool, happened immediately after a return from hibernate (it is a workstation) .....,
which fails very often (using Debian 11) with kernel 6.1 (installed 14 days before!!).
I see no evidence that this is a ZFS problem; it looks more like a kernel one ...
In this crash of 2023-05-19, the WD was the first one to be flagged, but the second (immediately following) line was the FireCuda. The order MAY mean nothing, even though the ZED mails arrived in the same order.
Just as a note.

1 reply

mabra Jun 28, 2023

Found the debate about ZFS + hibernate late, yesterday. On the one hand, people say something like "hibernate should not be used with ZFS"; on the other hand, patches are being worked on.
Now I can see that my observations, for my crash scenarios, were quite right: it happened always and only after a return from hibernate. No crashes or errors otherwise with the mentioned disks (WD/Seagate).

gregorst3 · 2023-06-08T18:07:46Z

gregorst3
Jun 8, 2023

Hello @admnd, I'm experiencing the same problems on my server infrastructure. I recently added a WD NVMe (SN850X) just for some low-spec VMs that I preferred not to run on my main NVMe storage, which is composed of several PM9A3 drives.
As soon as I installed that NVMe, I started getting woken up during the night by crashes on my servers (at random times, every x days).
I found out that this can be related to a firmware problem on the NVMe. I had to temporarily boot a Windows machine to update the firmware of the SN850X (because they only provide the tool for Windows), and after that the problem seems to be gone.

3 replies

admnd Jun 8, 2023
Author

No issues here with a pool composed of SN850X modules (and an older one with SN850 modules), but yes, it is recommended to apply the latest updates from the manufacturer and, personally, this is the very first thing I do when I unbox an NVMe.

The issue appears with the SN770 and probably some other DRAM-less NVMe drives. Perhaps WD will release a fix in the future that corrects the issue; until then, avoid that model.

posixpoet Mar 25, 2024

Firmware upgrades with Linux:
https://community.frame.work/t/western-digital-drive-update-guide-without-windows-wd-dashboard/20616

Thaodan Nov 29, 2024

I have also a WD SN850X. I never experienced issues with 4K LBA. Firmware is 620311WD.

Maybe this is something that is fixed in some WD SSDs but not in others.

x0rzavi · 2023-06-29T15:35:39Z

x0rzavi
Jun 29, 2023

I don't know if it's related somehow, but here's my 2 cents.

I had an SN570 500GB (DRAM-less) NVMe, which was actually quite new (less than 1 year old). I never had any issues initially with ZFS and Gentoo on it, having used ZFS for the last 5 months. Then recently I started noticing random kernel crashes and ZFS status reporting permanent errors while scrubbing. My RAM was perfectly fine, judging from the fact that memtest86+ passed twice in a row.

To my surprise, upon rebooting to Windows, WD Dashboard reported that "NVM subsystem reliability has degraded" with 99% lifetime remaining. Even SMART tests started failing. Unfortunately, the drive had to be replaced.

0 replies

dm17 · 2023-07-04T17:55:18Z

dm17
Jul 4, 2023

Would be cool for a "ZFS NVMe Recommendations List" to come out of this discussion.

I imagine SLC and MLC NVMes would be above the rest. What are the other criteria of which ZFS users should be aware when identifying the best SSD hardware?

3 replies

justinclift Sep 1, 2023

As a potential starting point for this, these are the NVMe drive models we're using in our production servers (no issues at all for 12+ months):

SAMSUNG MZVL21T0HCLR-00B00 - 1TB model
KXG60ZNV1T02 TOSHIBA - 1TB model
SAMSUNG MZQLB1T9HAJR-00007 - 2TB model
SAMSUNG MZVLB1T0HBLR-00000 - 1TB model

They're all configured on our servers as ZFS mirrors, using two of each model per server. So, one server will have (say) 2x SAMSUNG MZVL21T0HCLR-00B00 1TB. Another server might have (say) 2x SAMSUNG MZQLB1T9HAJR-00007 2TB, etc.

justinclift Feb 23, 2024

~~For consumer level NVMe drives, the 2x (ZFS mirrored) 1TB Crucial CT1000P5SSD8 drives in my workstation have been working without issue since July 2021.~~

~~Would buy them again, but they don't seem to be available for sale any more. 😵‍💫~~

Since writing the above I've moved to using SAS drives (any generation really, but SAS3+ preferred) and no longer use consumer drives in my systems.

Ironically, it's actually cheaper to buy an Ebay SAS controller + a bunch of 2nd hand SAS SSDs (mostly with ~95% of their endurance left) than buy brand new SATA drives. And the SAS ones often have ~40x the endurance of consumer SATA drives. (!)

justinclift Jun 28, 2024

On the Proxmox forums, the Kingston DC1000B NVMe drives seem to be commonly recommended:

https://www.kingston.com/en/ssd/dc1000b-data-center-boot-ssd

Unfortunately they're tiny (480GB max), and the write speed of even those "large" 480GB ones is around SATA speeds. Their rated endurance is only 475TBW (.5 DWPD/5 years) so not great for write heavy use cases either.
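The endurance figures above can be cross-checked with the usual rule of thumb, DWPD = TBW / (capacity in TB x 365 x warranty years). A small sketch; the helper name `dwpd` is hypothetical, and the formula is the common approximation rather than anything taken from Kingston's datasheet:

```shell
# dwpd TBW CAPACITY_GB YEARS
# Prints drive writes per day, rounded to two decimals.
dwpd() {
    LC_ALL=C awk -v tbw="$1" -v cap="$2" -v years="$3" \
        'BEGIN { printf "%.2f\n", (tbw * 1000) / (cap * 365 * years) }'
}
```

For the 480GB model, `dwpd 475 480 5` gives 0.54, which lines up with the roughly 0.5 DWPD rating quoted above.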

rodrigoaguilera · 2023-08-24T12:44:01Z

rodrigoaguilera
Aug 24, 2023

I think I'm suffering from this on an 8TB Corsair MP600 PRO NH used as additional storage on a Proxmox 8 host. rsync in particular seems to trigger it.

The sledgehammer solution:

echo 1 > /sys/bus/pci/devices/0000\:12\:00.0/remove
echo 1 > /sys/bus/pci/rescan

Brings back the device for me but the zfs pool doesn't come back. I think it is because proxmox creates the pool with a /dev/nvme0nX and the X changes with every "resurrection".

I'm going to try ext4 next on that device and see how it goes.

I wanted to post here in case there are more people with the same device and similar problems.

2 replies

rodrigoaguilera Aug 28, 2023

Been stressing the drive with ext4 for a few days with fio, rsync and various file copying operations and no problem so far, 4 days uptime. With ZFS the controller died after 15-20 minutes of IO.

In the post above I forgot to mention that I was on the latest firmware 51.3

I won't be testing more on that drive with ZFS so I can't provide more info.

kftsehk Oct 18, 2023

have you tried force fsync on the test with ext4? rsync --fsync or so.

for the /dev/nvme0nX change, use /dev/disk/by-id/<find-your-disk-partition-id>, this id won't change when unplugged or resurrected
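To find the stable name mentioned above, one option is to list every symlink under `/dev/disk/by-id` that resolves to the current kernel device node. A rough sketch; `by_id_for` is an illustrative helper, not an existing tool, and it assumes `readlink -f` is available (GNU coreutils):

```shell
# by_id_for DEVICE [DIR]
# Prints every symlink in DIR (default /dev/disk/by-id) that points at DEVICE.
by_id_for() {
    local dev dir link
    dev=$(readlink -f "$1") || return 1
    dir="${2:-/dev/disk/by-id}"
    for link in "$dir"/*; do
        [ -L "$link" ] || continue
        if [ "$(readlink -f "$link")" = "$dev" ]; then
            printf '%s\n' "$link"
        fi
    done
}
```

For example, `by_id_for /dev/nvme0n1` prints the by-id aliases of that namespace. An exported pool can then usually be re-imported against those names with `zpool import -d /dev/disk/by-id <pool>`, so a later "resurrection" under a different `nvme0nX` number no longer matters.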

agrenott · 2023-10-14T19:53:49Z

agrenott
Oct 14, 2023

Just FYI, I had the exact same issue with a brand new WD BLACK SN770, and swapping my PSU solved the issue (while my previous one seemed perfectly fine)...

5 replies

agrenott Dec 6, 2023

Sad news: it is in fact not (only?) the power supply.
I just hit the issue on the exact same physical config after updating to the latest Proxmox version (so I'm not sure whether this is kernel and/or ZFS version related).
Kernel Linux proxmox 6.5.11-6-pve #1 SMP PREEMPT_DYNAMIC PMX 6.5.11-6 (2023-11-29T08:32Z) x86_64 GNU/Linux
ZFS

zfs-2.2.0-pve4
zfs-kmod-2.2.0-pve4

Skaronator Dec 6, 2023

Just offtopic, but make sure to update to 2.2.1 due to the data corruption bug "in" 2.2.0

agrenott Dec 6, 2023

Thanks! According to release notes it has been back ported into zfs-kmod-2.2.0-pve4.

justinclift Dec 6, 2023

Pretty sure there was some kind of serious bug found in 2.2.1 as well, so a 2.2.2 release should be out in short order.

fmagin Dec 7, 2023

Yes 2.2.1 had another similar looking issue, but it only showed up if you were using 4k sectors with LUKS #15533

kftsehk · 2023-10-18T21:47:00Z

kftsehk
Oct 18, 2023

[430771.216723] nvme nvme2: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
[430771.216727] nvme nvme2: Does your device have a faulty power saving mode enabled?

The last time I saw this, it was a firmware or hardware issue. An RMA sometimes solves it, if they return a unit with a newer firmware version or with a known internal defect fixed.

I would suggest not buying the same brand and model from the same batch for all vdevs in a pool; that puts you at risk of faulting all disks at once if there is ever a hardware, firmware, or manufacturing issue.
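For anyone who wants to see how often a controller drops, the kernel messages quoted at the top of this post can be filtered out of the log. A sketch only: `nvme_reset_events` is a made-up helper that reads log text on stdin, and the "unreachable" label rests on the assumption that an all-ones CSTS read means the device has stopped answering on the PCIe bus:

```shell
# nvme_reset_events
# Reads kernel log text on stdin and prints one line per
# "controller is down" event: "<controller> <CSTS> <verdict>".
nvme_reset_events() {
    sed -n -E 's/.*nvme (nvme[0-9]+): controller is down; will reset: CSTS=(0x[0-9a-fA-F]+).*/\1 \2/p' |
    while read -r ctrl csts; do
        # All-ones register reads usually mean the device fell off the bus.
        if [ "$csts" = "0xffffffff" ]; then
            printf '%s %s unreachable\n' "$ctrl" "$csts"
        else
            printf '%s %s responding\n' "$ctrl" "$csts"
        fi
    done
}
```

Typical use would be `dmesg | nvme_reset_events` or `journalctl -k | nvme_reset_events`.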

0 replies

Xalaxis · 2024-11-19T19:55:15Z

Xalaxis
Nov 19, 2024

An update on this issue, could this strictly be a firmware bug to do with 4096 byte sector sizes (on the SN770) and nothing else? I've recently given my problematic SN770 to someone else who suddenly started having similar looking dropout issues in Windows. After reformatting to 512 byte mode the issues went away.

Someone else here reports issues with just the 4096 bytes mode: https://community.wd.com/t/sn770-nvme-controller-reset-when-formatted-with-4096-byte-sectors/282532

4 replies

admnd Nov 20, 2024
Author

Thank you for the hint @Xalaxis. That would probably explain why no one (except power users who switched to native 4k "sectors") encounters the issue: everyone else relies on the default "stable" 512b configuration (Windows might apply some kind of quirk not yet applied on Linux/FreeBSD, but I assume that is not the case here).

A thing that could be tested: a zpool using a single NVMe module in both 512b and 4k configurations. If drop-offs happen only in 4k mode, that would simply mean WD has some serious undocumented hardware issue here. As the performance is already crippled by relying on a small memory buffer in the computer's RAM, and no firmware update seems to fix the problem, those modules are nothing but pure cheap garbage. Has WD even tested this scenario? A "no" would be quite surprising, but who knows, eh?

Another way is to use anything else but ZFS (with a significant I/O load) like BTRFS, XFS or EXT4. I am pretty confident to see the same crash.

In all cases, this is not a software (i.e. ZFS or Linux kernel) issue.

Asking WD for a RMA is absolutely useless as the issue is definitely a "by hardware design" one.

marcus905 Nov 20, 2024

Sadly those SSD (SN770M) fill a very specific niche (TLC + 2230 + 2TB + PCIe4) so it's somewhat bad to have this issue.

justinclift Nov 20, 2024

could this strictly be a firmware bug to do with 4096 byte sector sizes and nothing else?

It's doubtful, as there are reports of the problem happening in this GitHub issue even with 512 byte sectors. 😦

Another way is to use anything else but ZFS (with a significant I/O load) like BTRFS ...

There are also reports here (in this GitHub issue) of the crashes happening for people using Btrfs. 😦

mariusmuja Nov 28, 2024

This matches my experience: I initially formatted two SN770 with 4K sectors for better performance and I was getting the controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10 error almost immediately after starting a scrub on the pool.

After reformatting the drives with 512 sectors, I've been using them without issues.

rbranson · 2024-12-20T19:47:54Z

rbranson
Dec 20, 2024

Also ran into this after ~10 weeks of having a pair of 2T SN770s in a raidz1. Kernel is 6.1.118. Both plugged into the chipset m.2 slots on my Supermicro X13SAE-F (W680 chipset). Saw this thread quickly and didn't bother trying to fix through other means. I swapped them with SN850X. Will report back if I see any issues with the SN850X.

kernel: nvme nvme1: Does your device have a faulty power saving mode enabled?
kernel: nvme nvme1: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off"
kernel: nvme 0000:04:00.0: enabling device (0000 -> 0002)
kernel: nvme nvme1: Removing after probe failure status: -19
kernel: nvme1n1: detected capacity change from 3907029168 to 0

2 replies

Roydee34 Jul 1, 2025

Hi how was the stability with the SN850X?

Also, which drive size did you use? I was considering getting 3 x 8TB SN850X drives to run in a ZFS raid1 pool on Unraid.

rbranson Jul 1, 2025

Zero issues with SN850X.

Thaodan · 2024-12-22T22:56:40Z

Thaodan
Dec 22, 2024

Rick Branson writes:

Also ran into this after ~10 weeks of having a pair of 2T SN770s in a raidz1. Kernel is 6.1.118. Both plugged into the chipset m.2 slots on my Supermicro X13SAE-F (W680 chipset). Saw this thread quickly and didn't bother trying to fix through other means. I swapped them with SN850X. Will report back if I see any issues with the SN850X.

I'm using the SN850X (WD_BLACK SN850X 2000GB) in 4k LBA mode with BTRFS (and LUKS) for about a year or longer now. No issues with it so far.

0 replies

xsmile · 2025-01-10T22:45:34Z

xsmile
Jan 10, 2025

I noticed the same issue while running fio benchmarks on two SN770 2TB drives configured with 4K.

A sequential read test with multiple jobs is enough to trigger this and can be reproduced each time, even on Windows with similar fio options.

I could not trigger a crash with 512B yet.

Disks

root@pve:~# nvme list
Node                  Generic               SN                   Model                                    Namespace Usage                      Format           FW Rev
--------------------- --------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme1n1          /dev/ng1n1            <    sn    >         WD_BLACK SN770 2TB                       1           2.00  TB /   2.00  TB    512   B +  0 B   731130WD
/dev/nvme2n1          /dev/ng2n1            <    sn    >         WD_BLACK SN770 2TB                       1           2.00  TB /   2.00  TB      4 KiB +  0 B   731130WD

Filesystems (default settings)

root@pve:~# df -hT
Filesystem           Type      Size  Used Avail Use% Mounted on
/dev/nvme1n1p1       ext4      1.8T   28K  1.7T   1% /mnt/1
/dev/nvme2n1p1       ext4      1.8T   28K  1.7T   1% /mnt/2

fio

root@pve:~# fio --name=test --filename=/mnt/2/fio --size=1g --direct=1 --rw=read --ioengine=libaio --numjobs=2 --group_reporting
test: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
...
fio-3.33
Starting 2 processes
test: Laying out IO file (1 file / 1024MiB)
fio: io_u error on file /mnt/2/fio: Input/output error: read offset=650371072, buflen=4096
fio: io_u error on file /mnt/2/fio: Input/output error: read offset=650366976, buflen=4096
fio: pid=2184, err=5/file:io_u.c:1876, func=io_u error, error=Input/output error
fio: pid=2185, err=5/file:io_u.c:1876, func=io_u error, error=Input/output error

test: (groupid=0, jobs=2): err= 5 (file:io_u.c:1876, func=io_u error, error=Input/output error): pid=2184: Fri Jan 10 19:31:10 2025
  read: IOPS=9047, BW=35.3MiB/s (37.1MB/s)(1240MiB/35100msec)
    slat (nsec): min=1818, max=27657, avg=2075.13, stdev=241.31
    clat (usec): min=12, max=518, avg=28.15, stdev=18.09
     lat (usec): min=16, max=520, avg=30.23, stdev=18.10
    clat percentiles (usec):
     |  1.00th=[   16],  5.00th=[   16], 10.00th=[   16], 20.00th=[   17],
     | 30.00th=[   17], 40.00th=[   17], 50.00th=[   18], 60.00th=[   22],
     | 70.00th=[   28], 80.00th=[   53], 90.00th=[   60], 95.00th=[   60],
     | 99.00th=[   72], 99.50th=[   73], 99.90th=[  100], 99.95th=[  119],
     | 99.99th=[  351]
   bw (  KiB/s): min=173160, max=270936, per=100.00%, avg=254050.40, stdev=14263.76, samples=20
   iops        : min=43290, max=67734, avg=63512.60, stdev=3565.94, samples=20
  lat (usec)   : 20=57.93%, 50=18.48%, 100=23.49%, 250=0.08%, 500=0.02%
  lat (usec)   : 750=0.01%
  cpu          : usr=0.50%, sys=1.70%, ctx=317570, majf=0, minf=45
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.1%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=317565,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=35.3MiB/s (37.1MB/s), 35.3MiB/s-35.3MiB/s (37.1MB/s-37.1MB/s), io=1240MiB (1301MB), run=35100-35100msec

Logs

root@pve:~# dmesg
[  354.824794] nvme nvme2: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[  354.824799] nvme nvme2: Does your device have a faulty power saving mode enabled?
[  354.824800] nvme nvme2: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off" and report a bug
[  354.842844] nvme 0000:03:00.0: Unable to change power state from D3cold to D0, device inaccessible
[  354.842939] nvme nvme2: Disabling device after reset failure: -19
[  354.848845] EXT4-fs (nvme2n1p1): shut down requested (2)
[  354.848856] Aborting journal on device nvme2n1p1-8.
[  354.848862] Buffer I/O error on dev nvme2n1p1, logical block 243826688, lost sync page write
[  354.848865] JBD2: I/O error when updating journal superblock for nvme2n1p1-8.

0 replies

ForsakenRei · 2025-02-02T23:25:59Z

ForsakenRei
Feb 2, 2025

I recently encoutered something similar with SN550 2TB on my Proxmox box.

I have a 4x 2TB raidz1 pool. No matter how I swap different slot, or different drives, as long as some intense I/O start a random drive of the raidz1 pool will become removed, the pool is still usable though but I never tried if I left the degraded pool for another intense I/O. I have tried different PCIe carrier card and different MB but nothing really helps. I have to reboot the machine to bring everything back online then zpool clear but the next time it will happen again.

I felt it might just be the controller for SN550 cannot handle ZFS? Another 970 EVO+ pool works just fine with whatever workload I threw at it.

0 replies

ctag · 2025-03-20T23:06:05Z

ctag
Mar 20, 2025

I think I just ran into this with WD BLACK SN850X 2000GB on TrueNAS Scale 24.10.2

Mar 20 10:06:46 bns-citadel kernel: zfs: module license taints kernel.
Mar 20 10:06:46 bns-citadel kernel: ZFS: Loaded module v2.2.99-1, ZFS pool version 5000, ZFS filesystem version 5
Mar 20 10:06:46 bns-citadel kernel: random: crng init done
Mar 20 10:06:46 bns-citadel kernel: nvme nvme2: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Mar 20 10:06:46 bns-citadel kernel: nvme nvme2: Does your device have a faulty power saving mode enabled?
Mar 20 10:06:46 bns-citadel kernel: nvme nvme2: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off" and report a bug
Mar 20 10:06:46 bns-citadel kernel: nvme2n1: I/O Cmd(0x2) @ LBA 3907028864, 8 blocks, I/O Error (sct 0x3 / sc 0x71) 
Mar 20 10:06:46 bns-citadel kernel: I/O error, dev nvme2n1, sector 3907028864 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
Mar 20 10:06:46 bns-citadel kernel: nvme 0000:24:00.0: enabling device (0000 -> 0002)
Mar 20 10:06:46 bns-citadel kernel: nvme nvme2: Disabling device after reset failure: -19
Mar 20 10:06:46 bns-citadel kernel: Buffer I/O error on dev nvme2n1p1, logical block 488378352, async page read
Mar 20 10:06:46 bns-citadel kernel: Not activating Mandatory Access Control as /sbin/tomoyo-init does not exist.
Mar 20 10:06:46 bns-citadel systemd[1]: Inserted module 'autofs4'
Mar 20 10:06:46 bns-citadel systemd[1]: systemd 252.26-1~deb12u2 running in system mode (+PAM +AUDIT +SELINUX +APPARMOR +IMA +SMACK +SECCOMP +GCRYPT -GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 >
Mar 20 10:06:46 bns-citadel systemd[1]: Detected architecture x86-64.

└─WD BLACK SN850X 2000GB:
      Device ID:          XXX
      Summary:            NVM Express solid state drive
      Current version:    620361WD
      Vendor:             Sandisk Corp (PCI:0x15B7)
      Serial Number:      XXX
      Problems:           • Device requires AC power to be connected
      GUIDs:              69749ddc-cd7a-542e-981f-xxx ← NVME\VEN_15B7&DEV_5030
                          53d3c0f5-42b2-5cc9-8a52-xxx ← NVME\VEN_15B7&DEV_5030&SUBSYS_15B75030
                          c6686f27-9161-5f60-a26e-xxx ← WD_BLACK SN850X 2000GB
      Device Flags:       • Internal device
                          • System requires external power source
                          • Needs a reboot after installation
                          • Device is usable for the duration of the update
                          • Updatable
                          • Can tag for emulation

5 replies

justinclift Mar 22, 2025

If you haven't already, then it's probably not a bad idea to try adding the kernel command line parameters suggested in line 6 of that kernel output message:

nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off
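Before rebooting, it can be worth checking which of these three parameters the running kernel already has. A minimal sketch; `missing_nvme_params` is a hypothetical helper and the optional file argument exists only so it can be tried against a saved command line:

```shell
# missing_nvme_params [FILE]
# Prints each suggested parameter that is absent from FILE
# (default: /proc/cmdline). Prints nothing when all three are set.
missing_nvme_params() {
    local cmdline_file param
    cmdline_file="${1:-/proc/cmdline}"
    for param in nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off; do
        # One token per line, then an exact fixed-string match on the whole token.
        tr ' ' '\n' < "$cmdline_file" | grep -qxF -- "$param" || printf '%s\n' "$param"
    done
}
```

How the parameters get added depends on the bootloader; on GRUB-based systems they typically go into `GRUB_CMDLINE_LINUX_DEFAULT` in `/etc/default/grub`, followed by regenerating the GRUB configuration.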

admnd Mar 22, 2025
Author

If you haven't already, then it's probably not a bad idea to try adding the kernel command line parameters suggested in line 6 of that kernel output message:
nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off

Showed no improvement for me. Perhaps there is some unknown quirk somewhere? In the meantime, running SN770/750 in 4K mode is simply not a good idea as their controller is totally unstable. :(

ctag Mar 22, 2025

The crashes have stopped (for now) for me after removing one of the drives that was reporting 0B capacity after the pool was lost. Two of these SN850X drives left in the system, though as a striped array instead of Zraid this time, so that's another change.

Thaodan Mar 24, 2025

I have the exact same drive as you running in 4k mode without the issues you mention.
Although I'm not using ZFS but BTRFS it should be still comparable.

I wonder if the device is faulty or if the kernel version makes a difference. (6.13.5 in my case).

ctag Mar 24, 2025

Thanks for the input. I'm on kernel 6.6.44.

I haven't made the kernel parameter changes yet, and haven't had any more issues since switching from 3x raidz1 to 2x striped.

JiiPee74 · 2025-04-13T22:18:01Z

JiiPee74
Apr 13, 2025

This has nothing to do with ZFS. I have a WD SN770 2TB drive that I bought in the last Black Friday sale. I was running it on Windows, and around New Year I reformatted the drive to use 4K LBA. That's when all the issues started. Remember that there was a Windows 24H2 issue with this drive, which WD did fix. At first I thought that issue had not been fixed because the symptoms were much the same, but then I started to think this must be a different issue: I am not getting data corruption, just BSODs.

Sometimes I was able to boot just fine; other times I had to boot 10 times before I could use my computer again. Sometimes a BSOD hit before I could even log in, but usually it happened after login or when I started a game.
Other times, when I was able to log in and start a game, the system worked fine, sometimes for hours, sometimes for days, even a week. But at some point it always ended in a BSOD.

I started to google around and found this and many other threads about it. What they all seem to have in common is that the drive was reformatted to 4K LBA, and I am very positive that this is an issue with the drive itself. I do not know if WD/Sandisk can fix this via firmware, but I rather doubt it, because this has been going on for 2 years now and WD knows about it.
So I think this is a hardware issue that cannot be fixed by firmware. I am going to try to get this junk drive returned because it is not working as intended.

Oh, and I bought a Kingston Fury Renegade and cloned the SN770 onto it. My Windows has been stable as bedrock since then.
It's funny that you can sequentially read/write the whole drive and it's fine, and then some small I/O operation crashes it, like the 'fio' example in this thread. I have been able to replicate that crash with fio.

5 replies

toastal Apr 14, 2025

Would be awesome if we got a recall. But I doubt it. My anecdote was that WD said they don’t offer any support for HDDs for laptop OEMs. & the OEM, Lenovo, is acting like it’s not their problem—instead asking why I would dare format my drive & wipe Microsoft Windows (despite other markets shipping the device with Linux instructions).

bademux Apr 15, 2025

Thank you so much for that information. I had the same problem with another WD NVMe.
Never again... I will not buy any other WD product. I spent several times more time debugging I/O errors caused by the "4K LBA problem" than this shit was ever worth.
In my case a really funny bug occurs: the computer won't boot if it is warm, but if/when it boots it works normally.
512 LBA magically cures the bug. 100% reproducible.

justinclift Apr 16, 2025

I will not buy any other WD product

Yeah, that's been my personal policy for years now. The only trouble is that WD has bought other brands (ie Sandisk) and dropped their quality through the floor as well. 😦

Thaodan Apr 20, 2025

The bug seems to "only" affect the DRAM-less versions of WD SSDs. Personally I would also have chosen another brand, but in some instances WD is the only one available, e.g. for single-sided 2230/2242 SSDs with 2TB.

ButterBarTheGr8 Aug 15, 2025

No, incorrect, ZFS with parity based pools in concert with WD SN750, SN770, SN850 drives experience this problem. I can't say definitively if non parity based topologies might also crash - but I haven't seen that yet. The other quirk is that even with a parity based pool, you might not see the crash if your transactions are small and spotty.

cburroughs · 2025-04-16T14:07:45Z

cburroughs
Apr 16, 2025

Thank you everyone SO MUCH for the detailed writeups in this thread. I wish I had seen it before I had made the purchase! But it saved me untold time and frustration afterwards.

0 replies

Boysa22 · 2025-05-04T16:12:03Z

Boysa22
May 4, 2025

Hello,
I was just reading this thread because I purchased SN580 drives. Just a long shot... but my drives came formatted with a "sector size" of 512 bytes. Maybe if ZFS ashift is set to 12 (a 4096-byte "sector size") while the drive is formatted with 512-byte sectors, this stresses the NVMe controller and it crashes. Theoretically it should work fine... but maybe in this case it does not. So if someone wants, you can check your drives too by issuing nvme id-ns -H /dev/nvme0n1 | grep "Relative Performance" to see which format is currently in use. Then check your ZFS ashift value and see if it matches. It is clear that SSDs do not use "sectors" but pages; this is just how it is reported to the system. Maybe if the two match, these SSDs will behave more stably... or maybe using them in 512-byte format with an ashift of 9 will be more stable too.
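The comparison suggested above can be sketched in two small helpers. Both names are invented for illustration, and `in_use_lba_size` assumes that `nvme id-ns -H` prints lines of the form `LBA Format  0 : Metadata Size: 0   bytes - Data Size: 512 bytes - Relative Performance: 0x2 Good (in use)`, which may differ between nvme-cli versions:

```shell
# in_use_lba_size
# Reads `nvme id-ns -H` output on stdin and prints the data size in bytes
# of the LBA format flagged "(in use)".
in_use_lba_size() {
    sed -n -E '/\(in use\)/ s/.*Data Size: *([0-9]+) *bytes.*/\1/p'
}

# ashift_for BYTES
# Prints log2(BYTES), i.e. the ashift that corresponds to that sector size.
ashift_for() {
    local size shift_bits
    size="$1"
    shift_bits=0
    while [ "$size" -gt 1 ]; do
        size=$((size / 2))
        shift_bits=$((shift_bits + 1))
    done
    printf '%s\n' "$shift_bits"
}
```

Something like `ashift_for "$(nvme id-ns -H /dev/nvme0n1 | in_use_lba_size)"` then gives the ashift matching the drive's current format (9 for 512 bytes, 12 for 4096 bytes), which can be compared with the output of `zpool get ashift <pool>`.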

14 replies

MarkOnDuty Jun 6, 2025

Having never heard of Teamgroup, I wouldn't likely trust my data to them.

Not a recommendation, but Teamgroup drives tend to rank well on tier lists. They have a wide variety of drives with different quality/performance/reliability/value.

I tend to put emphasis on the I in RAID.

charettepa Jun 6, 2025

ohhhhh crap. I ended up pulling the trigger and buying a pair of 990 Pros. I hope it works out. How long does it take for it to fail when it does? My TimeTech module never went more than 3 hours without dropping offline. They did tell me I have 15 days for a full refund.

MarkOnDuty Jun 9, 2025

How long does it take for it to fail when it does? My TimeTech module never went more than 3 hours without dropping offline. They did tell me I have 15 days for a full refund.

We weren't really paying attention when we first deployed the drives, and didn't catch the drive being down for about three months. When the drive went down in that time-frame is anybody's guess. Once we started paying attention, and set up ZFS to send emails when a failure occurred, the drive lasted about two weeks. Over the following couple of months, it seemed to drop to about four days before we finally replaced the drive.

But we also have another 990 Pro that doesn't have any problems, both before and after we updated the firmware.
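For anyone who also wants to notice a dropped drive sooner, a cron-style check around `zpool status -x` is one low-tech option. This is only a sketch with a made-up function name; it assumes `zpool status -x` prints exactly `all pools are healthy` when nothing is wrong, and it is not the setup described above (ZFS's own event daemon, ZED, can send mail once `ZED_EMAIL_ADDR` is set in its `zed.rc`):

```shell
# pool_alert
# Reads `zpool status -x` output on stdin. Stays silent and returns 0 when
# every pool is healthy; otherwise prints the report and returns 1.
pool_alert() {
    local report
    report=$(cat)
    if [ "$report" = "all pools are healthy" ]; then
        return 0
    fi
    printf 'ZFS pool problem detected:\n%s\n' "$report"
    return 1
}
```

Run from cron as `zpool status -x | pool_alert`, it produces output (which cron normally mails) only when a pool is degraded or faulted.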

no-usernames-left Jun 9, 2025

Does the 990 have a different controller than the 980? I have two 980 Pro in ZFS mirror for root on Proxmox and they have never given a hint of trouble.

MarkOnDuty Jun 9, 2025

Does the 990 have a different controller than the 980? I have two 980 Pro in ZFS mirror for root on Proxmox and they have never given a hint of trouble.

I'm not sure, but probably. We've run a smattering of Samsung and WD drives without issue (other than the wear bug on the Samsung drives). But the 2TB and 4TB 990 Pro with Heatsink have been a lot of trouble. Two out of three had this issue.

varlesh · 2025-06-21T17:10:12Z

varlesh
Jun 21, 2025

Join the team. WD_BLACK SN770 1TB
The disk was used for data, the EXT4 file system
I didn't think that such an eminent brand as Sandisk was capable of such terrible quality! I'm disappointed...
SSD is only 7 months old, there was a failure when downloading a 7gb file, the temperatures were fine, the wear was 1%.
Now it can be thrown in the trash.

smartctl 7.5 2025-04-30 r5714 [x86_64-linux-6.15.2-arch1-1] (local build)
Copyright (C) 2002-25, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       WD_BLACK SN770 1TB
Serial Number:                      241150809636
Firmware Version:                   731120WD
PCI Vendor/Subsystem ID:            0x15b7
IEEE OUI Identifier:                0x001b44
Total NVM Capacity:                 1 000 204 886 016 [1,00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      0
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1 000 204 886 016 [1,00 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            001b44 8b47c619af
Local Time is:                      Sat Jun 21 15:56:18 2025 +04
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x00df):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp Verify
Log Page Attributes (0x7e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg Log0_FISE_MI Telmtry_Ar_4
Maximum Data Transfer Size:         256 Pages
Warning  Comp. Temp. Threshold:     84 Celsius
Critical Comp. Temp. Threshold:     88 Celsius
Namespace 1 Features (0x02):        NA_Fields

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     5.00W    5.00W       -    0  0  0  0        0       0
 1 +     3.30W    3.00W       -    0  0  0  0        0       0
 2 +     2.20W    2.00W       -    0  0  0  0        0       0
 3 -   0.0150W       -        -    3  3  3  3     1500    2500
 4 -   0.0050W       -        -    4  4  4  4    10000    6000
 5 -   0.0033W       -        -    5  5  5  5   176000   25000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
- NVM subsystem reliability has been degraded

SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning:                   0x04
Temperature:                        41 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    1%
Data Units Read:                    63 452 580 [32,4 TB]
Data Units Written:                 32 885 998 [16,8 TB]
Host Read Commands:                 387 780 429
Host Write Commands:                537 618 483
Controller Busy Time:               1 995
Power Cycles:                       304
Power On Hours:                     3 443
Unsafe Shutdowns:                   99
Media and Data Integrity Errors:    5 004
Error Information Log Entries:      5 004
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               61 Celsius
Temperature Sensor 2:               41 Celsius

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged

Self-test Log (NVMe Log 0x06, NSID 0xffffffff)
Self-test status: No self-test in progress
Num  Test_Description  Status                       Power_on_Hours  Failing_LBA  NSID Seg SCT Code
 0   Extended          Completed: failed segments             3427            -     -   2   -    -
 1   Short             Completed: failed segments             3417            -     -   2   -    -
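The `Critical Warning: 0x04` field in the SMART section above is a bit mask. A small decoder sketch, with bit meanings as I understand them from the NVMe base specification (the function name is hypothetical):

```shell
# decode_critical_warning VALUE
# Prints one line per set bit of the NVMe SMART "Critical Warning" byte.
# VALUE may be decimal or hex (e.g. 0x04).
decode_critical_warning() {
    local value bit name
    value=$(( $1 ))
    for bit in 0 1 2 3 4 5; do
        [ $(( value & (1 << bit) )) -ne 0 ] || continue
        case "$bit" in
            0) name="available spare below threshold" ;;
            1) name="temperature outside threshold" ;;
            2) name="NVM subsystem reliability degraded" ;;
            3) name="media placed in read-only mode" ;;
            4) name="volatile memory backup device failed" ;;
            5) name="persistent memory region read-only or unreliable" ;;
        esac
        printf 'bit %d: %s\n' "$bit" "$name"
    done
}
```

`decode_critical_warning 0x04` reports bit 2, which matches the "NVM subsystem reliability has been degraded" line smartctl printed for this drive.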

1 reply

no-usernames-left Jun 21, 2025

SSD is only 7 months old [...] Now it can be thrown in the trash.

Whatever happened to filing a warranty claim?!

Roydee34 · 2025-07-01T10:20:34Z

Roydee34
Jul 1, 2025

Thanks for this great, detailed thread. I am new to NAS but was considering using ZFS raid1 or similar with all NVMe drives (maybe 3 or 4) on Unraid or TrueNAS. Is it still better to avoid WD NVMe drives and stick with Samsung/Crucial/Kingston?

I did have my eye on the SN850X 8TB nvme drives due to lowest price.

7 replies

ForsakenRei Jul 3, 2025

I used a mix of SK Hynix and Crucial SSDs, no problem with mirror/stripe and raidz1/z2

JiiPee74 Jul 3, 2025

A Samsung 990 Pro (with heat sink) that regularly fails is sitting on my desk right now. I use it in a USB-to-M.2 adapter as a 4TB USB drive. It works fine for this, but that isn't what it was bought to do.

I think Samsung drives do not support 4K LBA at all, so if you want to use 4K LBA then Samsung is not an option.

charettepa Jul 8, 2025

I have 2 990 pros in a raid z1 and its been working fine

wiesl Jul 9, 2025

I have 2x Samsung 990 Pro 4TB in a Linux RAID 1 configuration without any problem (one SSD did not have the latest firmware version for around a year, until I recently updated it).

Maybe you have another hardware issue?

MarkOnDuty Jul 9, 2025

I have 2x Samsung 990 Pro 4TB in a Linux RAID 1 configuration without any problem (one SSD did not have the latest firmware version for around a year, until I recently updated it).

Maybe you have another hardware issue?

It's definitely the drives.

I also have 2x Samsung 990 Pro 4TB in a Linux (ZFS) RAID 1 configuration that work, but I also have 2 that don't. I've tested the drives in (wildly) different systems with the same result. They work fine in USB-to-M.2 adapters, but that's not what we bought them for.

There was also an excessive wear problem with Samsung M.2 drives in the last couple of years. That one made a bunch of work for us. With that problem too, some drives were impacted, some weren't.

Before these two problems hit us, I was tremendously happy with Samsung's drives. Over a decade of use, I never had a problem with their SATA drives. The M.2 drives are another story and I'm now leery. It will be some time before I try them again.

codgician · 2025-08-11T10:04:01Z

codgician
Aug 11, 2025

I also observed the same behavior on my 1TB SN5000, even with the latest firmware, 291020WD. I have a mirrored zpool, and a full-disk ZFS resilver is highly likely to cause an I/O hang on both of my drives. I am also using 4K LBA and haven't tested whether this issue exists under 512B LBA. I ended up replacing these drives with another model (and had a tough time resilvering the data).

0 replies

a1bert01 · 2025-08-13T08:49:53Z

a1bert01
Aug 13, 2025

FYI: SN580 1TB under windows10/11 with latest firmware 281040wd, the same behaviour after switching to 4096 lba , reproducible by fio with numjobs=4

0 replies

ButterBarTheGr8 · 2025-08-15T23:15:32Z

ButterBarTheGr8
Aug 15, 2025

Posting here to A) Thank the OP, B) spread the word....

TL/DR_SUMMARY: The OP is 100% correct, this IS some kind of problem between ZFS, the WD drives (SN770 and SN850S, SN850XE), and maybe even the underlying hardware. Better said, it's a particular chemistry of calamity that ultimately results in the problems everyone is describing. A drive will randomly drop out of the zpool, write errors will be seen, and generally nothing other than a reboot will reset the drive controller and thus allow ZFS to resilver and heal the pool. I've spent WAY too much time on this and ultimately, switching filesystems was the fix. So here is how I got there and maybe some help for you.

DETAILS: I started with ol' trusty mdadm to build an array from 12 x 4TB SN850XE drives shucked from USB3 cases. Before you say shucked drives are the problem, just know I verified that the controller, controller firmware, clock mechanism, and memory chips are identical to those of the SN850X, which is available as a standalone drive. For some time I thought the shucking trick was my enemy. Nope. I used some PCIe 4.0 x16 to 4x(x4) adapters found here to place the drives in three of the five PCIe x16 slots available. Supermicro H12SSL-i, AMD Epyc 7352, 256GB of memory. I wasn't happy with the contact mechanism between thermal pad and drive, but more on that later. An mdadm RAID 5 array would fail while building itself around the 80-90% mark, every time, across about ten attempts. RAID 0, 1, 10 were all fine, but not when distributed parity was a player. I changed build flags and settings, sector sizes, and tried an array of partitions instead of disks. I went through lvmraid (which relies on the md subsystem) and snapraid. Failed every time.

Another factor here is heat: these little things get HOT. So I switched to these drive carriages, which, because of the screws in the middle of the heatsink, had better contact with the thermal pads used. More mdadm attempts, more failures.

Enter ZFS. I've always been a little shaky with ZFS because of its proximity to the kernel, but building a ZFS pool doesn't carry the bitmap overhead and drive geometry mapping that mdadm has. Building a RAID 5 zpool was a snap and I was mounted with encrypted and unencrypted datasets immediately. But just like all the others above me in this thread, large file transfers and even sustained small file transfers would kill the system. So next I started digging through dmesg.

Since this is a Proxmox box and I use SR-IOV and PCI passthrough religiously, PCI Advanced Error Reporting (AER) and PCI Access Control Services (ACS) had to be enabled. That instantly produced the output below.

kernel: nvme 0000:87:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
kernel: {204}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 514
kernel: {204}[Hardware Error]: It has been corrected by h/w and requires no further action
kernel: {204}[Hardware Error]: event severity: corrected
kernel: {204}[Hardware Error]:  Error 0, type: corrected
kernel: {204}[Hardware Error]:   section_type: PCIe error
kernel: {204}[Hardware Error]:   port_type: 0, PCIe end point
kernel: {204}[Hardware Error]:   version: 0.2
kernel: {204}[Hardware Error]:   command: 0x0407, status: 0x0010
kernel: {204}[Hardware Error]:   device_id: 0000:c6:00.0
kernel: {204}[Hardware Error]:   slot: 0
kernel: {204}[Hardware Error]:   secondary_bus: 0x00
kernel: {204}[Hardware Error]:   vendor_id: 0x15b7, device_id: 0x5030
kernel: {204}[Hardware Error]:   class_code: 010802
kernel: {204}[Hardware Error]:   bridge: secondary_status: 0x0000, control: 0x0000
kernel: nvme 0000:c6:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
kernel: nvme 0000:c6:00.0:    [ 0] RxErr                  (First)
kernel: nvme 0000:c6:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID

These errors will show up every few seconds, so make sure you've got log rotation turned on or you've turned off AER logging with a boot flag, else you're going to exhaust drive capacity in a few hours. Please let me save you countless hours of digging through kernel dev forums and just tell you this is a complete red herring. AMD Epyc series processors are very "chatty" about the PCIe bus. The slightest re-ordering of bus transaction data results in messages like the ones above, which most CPUs do NOT report to the kernel. You may even wind up on a forum where an AMD engineer calls this a firmware erratum that has since been corrected in later generations of Epyc and Ryzen CPUs. It's also quite dependent on the underlying northbridge controller in the CPU.

Alas, it's a red herring. And since the hardware error (the WD drive is complaining about transactional re-ordering) is corrected by the drive controller, it's quite "normal" and NOT contributory to the problem.
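
If you want to see how noisy each device is before deciding whether to silence the messages, a rough tally is enough. The sketch below is only an illustration: it assumes the kernel lines look like the excerpt above and counts the `aer_status` lines per PCI address. The regular expression and the function name are mine, not anything from the kernel or ZFS tooling.

```python
import re
from collections import Counter

# Matches lines such as:
#   kernel: nvme 0000:c6:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
AER_STATUS = re.compile(
    r"nvme (?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): AER: "
    r"aer_status: (?P<status>0x[0-9a-fA-F]+)"
)


def count_aer_events(lines):
    """Return a Counter mapping PCI address -> number of aer_status lines seen."""
    counts = Counter()
    for line in lines:
        match = AER_STATUS.search(line)
        if match:
            counts[match.group("bdf")] += 1
    return counts


sample = [
    "kernel: nvme 0000:87:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID",
    "kernel: nvme 0000:c6:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000",
    "kernel: nvme 0000:c6:00.0:    [ 0] RxErr                  (First)",
]
print(count_aer_events(sample))
```

To run it against a real log, pass the function an open file or `sys.stdin` instead of the sample list, for example with the kernel log piped in.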

BACK TO THE STORY: I went through another PCIe card from Dell that can handle the 22110 drives, but still got the same failures with ZFS. Sometimes I could get 8 or 10 TB transferred (I used straight CIFS, rsync, NFS, and others), and sometimes just a few GB. Sometimes the pool would pause and the transfer would continue for a while, and then after a hard failure and reboot, the pool would resilver itself and heal for the amount of data I was able to transfer. Eventually I switched to SFF-8654 carrier cards and Silverstone active cooler carriages. Heat would not beat me! But still the same problem, nearly repeatable for every zpool flag, feature, anything I could switch on or off.

I downgraded the PCIe negotiation rate to gen 3.0 - same problem.
I thought ASPM might be my enemy so I disabled that first in the OS, then in BIOS - same problem.
I disabled SR-IOV and ACS because maybe the drives needed a direct DMA-based communication pathway to other PCIe devices - same problem.
I tried different sector sizes (ashift) and even found a way to emulate 512-byte based partitions - same problem.
I changed OS families to RedHat based distros (Proxmox is Debian) thinking the ZFS modules were borked - same problem.
I rate limited rsync because when transferring from a completely SATA array, I was getting 750MB/s, faster than the SATA III spec. I still can't explain why an rsync would show that number, but it seemed anomalous. Even with rsync limited to 100MB/s, the ZFS pool would still fail - same problem.
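
On the 750MB/s rsync figure: one possible explanation is that the SATA III ceiling applies per link, not per array. The arithmetic below is a back-of-the-envelope sketch that assumes the source array spreads reads across several drives, each on its own SATA III link, and that nothing upstream (HBA, network) is the bottleneck.

```python
import math

# SATA III signals at 6 Gbit/s and uses 8b/10b encoding:
# every 10 bits on the wire carry 8 bits of payload.
LINE_RATE_BITS_PER_S = 6_000_000_000


def sata3_link_limit_mb_s():
    """Payload ceiling of one SATA III link in MB/s (1 MB = 10**6 bytes)."""
    payload_bits_per_s = LINE_RATE_BITS_PER_S * 8 // 10
    return payload_bits_per_s / 8 / 1_000_000


def links_needed(target_mb_s):
    """Smallest number of SATA III links whose combined ceiling reaches the target."""
    return math.ceil(target_mb_s / sata3_link_limit_mb_s())


print(sata3_link_limit_mb_s())  # 600.0
print(links_needed(750))        # 2
```

So a single SATA III drive tops out at 600MB/s of payload, but two or more drives read in parallel can exceed that in aggregate, which would make 750MB/s from a multi-drive array plausible rather than anomalous.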

I placed thermal sensors on the drives and the Silverstone carriages are won-dee-ful. They were keeping the drives around 50 degrees, or lower. Heat was thus not a factor.

So in a final effort to maintain allegiance to ZFS, I swapped the drives to a completely Samsung platform (980 Pro). Two things happened. One, the number of AER messages got cut in half. Two, no more pool crashes and removed drives!!! All other things unchanged, that told me that some bit of chemistry between the CPU, the board, the drives, and ZFS was the problem. So I then tested on a Supermicro X10SDV board, albeit with a single PCIe card, bifurcated x4x4x4x4 and running at PCIe gen 3.0 speeds. Nope, ZFS and the WD drives still broke. Samsung drives were A-ok. That's an Intel board with a completely different IOMMU, AER, and ACS structure.

So the final conclusion here, after all that testing, is that ZFS pools, definitively when the pool uses a parity structure (RAIDz.*), are not compatible with current generation WD M.2 NVMe drives. The burst writes hypothesized by the OP might be the culprit.

FINALITY: With ZFS and mdadm cooked, I switched to RAID 5 BTRFS. Not a single problem. rsync transfer rates are 600MB/s from a pure SATA ZFS array of 24 x 2TB M.2 drives. That's less than what rsync reported on the ZFS pool, but it's also realistic. SMART load tests show 6000MB/s, on par for these drives. Nothing special, no unique flags for the BTRFS RAID 5 array, I don't even use commit=120. But I can copy hundreds of TBs back and forth with not a single problem. So here's what I know to be true:

1. If you think ZFS isn't a player here - you're wrong. Cue the trolls and their war drums! Single drives using ext4, XFS, NTFS were rock solid for me. SMART heavy load tests, sustained transfers etc. If your drive is failing in these scenarios, it's a bigger issue.
2. RAID 0, 1, 10 don't seem to have this problem with WD M.2 NVMe drives. Only raidz, z2 and the parity flavors of ZFS. YMMV though. I never stress tested these configurations as I like me some parity.
3. You might NEVER see this problem manifest if all you have is small and infrequent file transfers.
4. You can't reset a drive and re-add/replace/clear the drive once ZFS removes it. The drive controller needs a reset signal or power cycle.
5. Use BTRFS with these drives. The write-hole problem was fixed.

2 replies

mariusmuja Aug 15, 2025

For me the issue was fixed by reformatting the drives from 4K sectors to 512-byte sectors. The same drives that were immediately failing before have been working fine in a ZFS pool for 1+ years.
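
To check which LBA format a drive is currently using before and after such a reformat, the namespace's format list can be inspected. The sketch below is a hedged illustration: it assumes the human-readable listing looks like the two hypothetical sample lines in the code (modelled on typical nvme-cli output, not captured from these drives), and it also shows the smallest ashift that covers each sector size.

```python
import re

# Assumed line shape (hypothetical sample, not output from the drives discussed here):
#   LBA Format  0 : Metadata Size: 0   bytes - Data Size: 512 bytes - ... (in use)
LBAF_LINE = re.compile(
    r"LBA Format\s+(?P<index>\d+)\s*:.*?Data Size:\s*(?P<size>\d+)\s*bytes"
)

SAMPLE = """\
LBA Format  0 : Metadata Size: 0   bytes - Data Size: 512 bytes - Relative Performance: 0x2 Good (in use)
LBA Format  1 : Metadata Size: 0   bytes - Data Size: 4096 bytes - Relative Performance: 0x1 Better
"""


def parse_lba_formats(text):
    """Extract index, data size and in-use flag from each LBA format line."""
    formats = []
    for line in text.splitlines():
        match = LBAF_LINE.search(line)
        if match:
            formats.append({
                "index": int(match.group("index")),
                "data_size": int(match.group("size")),
                "in_use": "(in use)" in line,
            })
    return formats


def min_ashift(sector_bytes):
    """Smallest ashift whose block size (2**ashift) covers one logical sector."""
    return (sector_bytes - 1).bit_length()


for fmt in parse_lba_formats(SAMPLE):
    marker = "in use" if fmt["in_use"] else "available"
    print(fmt["index"], fmt["data_size"], marker, "ashift >=", min_ashift(fmt["data_size"]))
```

Keep in mind that switching the LBA format means reformatting the namespace, which destroys its contents, so the pool member has to be resilvered or restored afterwards.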

Use BTRFS ...

No way (not ready to trust BTRFS with my data after it ate it last time...)

ButterBarTheGr8 Aug 16, 2025

Fair - there's still massive skepticism over BTRFS after the write hole and RAID5/6 debacle. Arch still vilifies it, so goes the Internet. For me, it's all that's left and it works fine. I back up everything in triplicate across three different filesystems. I tried the 512 trick. Though the SN850 series is 4K native, you can program the controller to 512e. Same problem. Since I traffic in 30GB+ files, I didn't want to hamstring a 4K native drive with a 512e sector size. To each their own though...

wiesl · 2025-08-16T04:02:56Z

wiesl
Aug 16, 2025

Did you all try the latest ZFS versions? 2.2.8 or 2.3.3?

Maybe you are hit by:

Today's update to ca0141f325ec706d38a06f9aeb8e5eb6c6a8d09a (almost identical to current 2.3.0 RC) caused permanent pool corruption #16631
CKSUM and WRITE errors when receiving snapshots or scrubbing (2.2.4, LUKS) #15646

fixed by #16687

0 replies
