-
Found something interesting in a proposed patch from a discussion titled "[PATCH] nvme-pci: fix host memory buffer allocation size", dated May 10th, 2022. The discussion starts here => https://www.spinics.net/lists/kernel/msg4339024.html At some point (https://www.spinics.net/lists/kernel/msg4352567.html), it is mentioned that:
In a subsequent message ( https://www.spinics.net/lists/kernel/msg4372632.html ) it is also mentioned that the situation has improved drastically with the patch. Another point of the discussion is about having a Host Memory Buffer of just 32MB. According to my logs, I have the same allocation:
For the record, here are excerpts of some messages:
Current parameters for the nvme kernel modules on my system are at their defaults:
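For anyone wanting to compare their own settings, here is a minimal sketch for dumping the live parameter values. It assumes the usual `/sys/module/<module>/parameters` sysfs layout; the helper name `read_module_params` is made up for illustration, and the set of parameters you get depends on your kernel.

```python
from pathlib import Path


def read_module_params(module, sysfs_root="/sys/module"):
    """Return {name: value} for every readable parameter of a loaded kernel module."""
    params_dir = Path(sysfs_root) / module / "parameters"
    params = {}
    if not params_dir.is_dir():
        # Module not loaded, module without parameters, or sysfs not mounted.
        return params
    for entry in sorted(params_dir.iterdir()):
        try:
            params[entry.name] = entry.read_text().strip()
        except OSError:
            # Some parameters are not readable by the current user: skip them.
            continue
    return params


if __name__ == "__main__":
    for module in ("nvme", "nvme_core"):
        for name, value in read_module_params(module).items():
            print(f"{module}.{name}={value}")
```

If the driver exposes it, `max_host_mem_size_mb` should show up in the output for `nvme`, which makes it easy to check whether a value passed on the kernel command line was actually taken into account.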
Going through the code of
The patch in question is mentioned at the very beginning of the discussion and is this one:
Another related thread is here => https://lore.kernel.org/linux-nvme/[email protected]/
-
I tried the above patch but, in my case, it worsens the issue :( The crash happens much earlier than before.
-
Basically at this point, I am out of options with those sticks. They are a replacement for a trio of ADATA Gammix S70 Blade which were also problematic because their namespaces had a bad EUI64 value: all of them were set to eui64=0000000000000000, which made the system totally confused about who was who. So my only option at this point is to get another model :/ Perhaps I will keep them for a much less intensive use. The reality is that not all NVMe hardware plays nicely with ZFS. It seems that investing in higher-end hardware is not optional, especially with ZFS. I won't ever consider switching them back to 512b sectors: I don't think this will solve the issue and, if it ever does, there is a significant performance penalty. I hope my hours of investigation will save someone from wasting money on junk hardware. It is a bit disappointing that this junk is coming from a well-known brand. PS: Feel free to elaborate further. I will post if I get something new on this.
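To illustrate the EUI64 problem mentioned above, here is a small sketch that flags identifiers which are all-zero or shared between namespaces. The device names and values in the example are hypothetical, and how you collect the identifiers (sysfs, nvme-cli, ...) is left out on purpose.

```python
from collections import defaultdict


def find_ambiguous_eui64(namespaces):
    """Return {eui64: [devices]} for identifiers that are all-zero or shared."""
    by_id = defaultdict(list)
    for device, eui64 in namespaces.items():
        by_id[eui64.strip().lower()].append(device)
    return {
        eui: sorted(devices)
        for eui, devices in by_id.items()
        # An all-zero (or empty) EUI64 identifies nothing; a shared one is ambiguous.
        if len(devices) > 1 or set(eui) <= {"0"}
    }


# Hypothetical example mirroring the situation described above.
example = {
    "nvme0n1": "0000000000000000",
    "nvme1n1": "0000000000000000",
    "nvme2n1": "0000000000000000",
}
print(find_ambiguous_eui64(example))
```

Anything this reports cannot be told apart reliably through an EUI64-based name, so another identifier (serial number, for instance) is needed to know which stick is which.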
-
I would try replacing the PSU with another one, probably a 1000W one.
Mysterious problems are often solved by replacing a faulty PSU.
…On Wed, Apr 26, 2023 at 9:23 AM admnd ***@***.***> wrote:
Above patch tried, but in my case, worsens the issue :( The crash happens
much more early than before.
Fiddling around with parameters of nvme.ko, I managed to have a higher
allocation of 200 MB with nvme.max_host_mem_size_mb=512 + the above patch
applied.
-
This might be a long shot, but where have you connected your NVMe drives? Did you use the onboard slots or a riser card with bifurcation? And if you used the onboard slots, which ones? From the manual you can see that one of the slots shares bandwidth with the SATA ports; if there is anything plugged in there, it could cause a problem. Furthermore, X670 daisy-chains two chipsets to give more connectivity. My guess is that this issue could be caused by limited bandwidth between the chipsets and the CPU, which might make the controller look like it is dropping. My suggestion to troubleshoot this is to get a bifurcating riser card, put it in the x16 slot and have all the NVMe drives directly connected to the CPU. This would eliminate going through the chipsets. Unfortunately, ASUS provides no block diagram of the board showing which PCIe lanes go where and at which speed. I would also check whether limiting the link speed of the drives helps, since link speed switching could also be causing this issue: it caused me a lot of headaches with my RX 5700 XT GPU, with weird issues such as the card disconnecting and crashing the drivers. So pretty similar to what you experience. Those two would be my guesses for this issue.
-
It's interesting you're having issues with the SN770. I was having issues with mine (2TB as well) in my laptop. With ZFS, Btrfs on LVM/LUKS, even ext4, my drive would reset just like yours. Whether during boot, while sitting there doing nothing, or while doing something: seemingly random. I took it to my computer store to get it replaced. The drive passed all of their tests, so they did not replace it. I believe they were testing with Windows. I am going to RMA it with WD; hopefully my replacement performs better. I have the exact same drive in my desktop (X570, 5950X), using a single ZFS vdev as root, and I have not experienced these issues there. I could try putting the desktop drive in my laptop (XPS 9560) to see if it has issues, but that would be quite an inconvenience to me, so I am just going to RMA it. The previous drive in my laptop did not have these issues. This occurred with both 512b and 4KB sectors, I believe.
-
Other pointers (FreeBSD):
At this point, I have opened a case with WD; perhaps something can be done at their level. As I should have some free time tomorrow, I will try to exchange modules between my two machines.
-
SN770 swapped out for 3x WD SN 850 configured in 4K. Night and day! My 7950X is literally breathing again! Over 100K IOPS while emerging GCC 13, and zpool scrubs easily reach 5-6 GB/s. Earlier this afternoon, I tried to swap one module at a time. Guess what? One SN 770 quit the pool seconds after the resilvering started, and the second one reset in the middle. I had thousands of checksum errors reported. Fortunately I have daily snapshots stored on a TrueNAS box, so not an issue. This junk is not even able to sustain a pool resilvering. So, gentlemen, moral of the story: don't use DRAM-less NVMe stuff with ZFS. I will give news on what happens with my now-famous SN 770 when I have some :) Perhaps they will do better in my secondary machine or in the junk box. Thank you, again, for jumping in and taking some of your time to put suggestions here. This is greatly appreciated.
-
Stumbled over this while searching for the consequences of my pool crash.
-
Hello @admnd, I'm experiencing the same problems on my server infrastructure. I recently added this WD NVMe (SN850X) just for some low-spec VMs that I preferred not to run on my main NVMe storage, which is composed of several PM9A3 drives.
-
I don't know if it's related somehow, but here are my 2 cents. I had an SN570 500GB (DRAM-less) NVMe, which was actually quite new (less than 1 year old). I never had any issues initially with ZFS and Gentoo on it, and I had been using ZFS for the last 5 months. Then, recently, I started noticing random kernel crashes and ZFS status reporting permanent errors while scrubbing. My RAM was perfectly fine, judging from the fact that memtest86+ passed twice in a row. To my surprise, upon rebooting to Windows, WD Dashboard reported that "NVM subsystem reliability has degraded" with 99% lifetime remaining. Even SMART tests started failing. Unfortunately, the drive had to be replaced.
-
Would be cool for a "ZFS NVMe Recommendations List" to come out of this discussion. I imagine SLC and MLC NVMes would be above the rest. What are the other criteria of which ZFS users should be aware when identifying the best SSD hardware? |
-
I think I'm suffering from this on an 8TB Corsair MP600 PRO NH used as additional storage for a Proxmox 8 host. rsync especially seems to trigger it. The sledgehammer solution:
brings back the device for me, but the ZFS pool doesn't come back. I think it is because Proxmox creates the pool with a /dev/nvme0nX and the X changes with every "resurrection". I'm going to try ext4 next on that device and see how it goes. I wanted to post here in case there are more people with the same device and similar problems.
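If the changing /dev/nvme0nX name really is what prevents the pool from coming back (an assumption, not something verified here), the stable symlinks under /dev/disk/by-id are the usual way around it, and `zpool import -d /dev/disk/by-id` can scan that directory. As a rough sketch, this lists which by-id names currently point at a given kernel device name; the function name `stable_names` is made up for illustration.

```python
import os


def stable_names(kernel_name, by_id_dir="/dev/disk/by-id"):
    """Return the by-id symlinks that currently resolve to e.g. 'nvme0n1'."""
    matches = []
    try:
        entries = sorted(os.listdir(by_id_dir))
    except FileNotFoundError:
        # No udev-managed by-id directory on this system.
        return matches
    for entry in entries:
        link = os.path.join(by_id_dir, entry)
        # Follow the symlink and compare only the final device node name.
        if os.path.basename(os.path.realpath(link)) == kernel_name:
            matches.append(link)
    return matches


if __name__ == "__main__":
    for path in stable_names("nvme0n1"):
        print(path)
```

Running it before and after a "resurrection" would show whether the by-id names stay the same while the kernel name moves.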
-
Just FYI, I had the exact same issue with a brand new WD BLACK SN770, and swapping my PSU solved the issue (while my previous one seemed perfectly fine)... |
-
The last time I saw this, it was a firmware or hardware issue. An RMA sometimes solves it, if they return a unit with a newer firmware version or with a known internal defect fixed. I would suggest not buying the same brand and model from the same batch for all vdevs in a pool: that puts you at risk of faulting all disks if there is ever a hardware, firmware or manufacturing issue.
-
An update on this issue: could this strictly be a firmware bug to do with 4096-byte sector sizes (on the SN770) and nothing else? I recently gave my problematic SN770 to someone else, who suddenly started having similar-looking dropout issues in Windows. After reformatting to 512-byte mode the issues went away. Someone else here reports issues with just the 4096-byte mode: https://community.wd.com/t/sn770-nvme-controller-reset-when-formatted-with-4096-byte-sectors/282532
-
Also ran into this after ~10 weeks of having a pair of 2T SN770s in a raidz1. Kernel is 6.1.118. Both plugged into the chipset m.2 slots on my Supermicro X13SAE-F (W680 chipset). Saw this thread quickly and didn't bother trying to fix through other means. I swapped them with SN850X. Will report back if I see any issues with the SN850X.
-
Rick Branson ***@***.***> writes:
Also ran into this after ~10 weeks of having a pair of 2T SN770s in a
raidz1. Kernel is 6.1.118. Both plugged into the chipset m.2 slots on
my Supermicro X13SAE-F (W680 chipset). Saw this thread quickly and
didn't bother trying to fix through other means. I swapped them with
SN850X. Will report back if I see any issues with the SN850X.
I'm using the SN850X (WD_BLACK SN850X 2000GB) in 4k LBA mode with BTRFS
(and LUKS) for about a year or longer now. No issues with it so far.
-
I noticed the same issue while running fio benchmarks on two SN770 2TB drives configured with 4K. A sequential read test with multiple jobs is enough to trigger this and can be reproduced each time, even on Windows with similar fio options. I could not trigger a crash with 512B yet.
Disks
Filesystems (default settings)
fio
Logs
-
I recently encountered something similar with SN550 2TB drives on my Proxmox box. I have a 4x 2TB raidz1 pool. No matter how I swap slots or drives, as soon as some intense I/O starts, a random drive of the raidz1 pool becomes removed. The pool is still usable, though I never tried leaving the degraded pool through another round of intense I/O. I have tried different PCIe carrier cards and a different motherboard, but nothing really helps. I have to reboot the machine to bring everything back online. So I feel it might just be that the SN550 controller cannot handle ZFS? Another pool made of 970 EVO+ drives works just fine with whatever workload I throw at it.
-
I think I just ran into this with
-
This has nothing to do with ZFS. I have a WD SN770 2TB drive that I bought in last Black Friday's sale. I was running it on Windows, and around New Year I reformatted the drive to use 4K LBA. That is when all the issues started. Remember that there was a Windows 24H2 issue with this drive, which WD did fix. At first I thought that issue had not been fixed because the symptoms were kind of the same, but then I started to think that this must be a different issue: I am not having data corruption, just BSODs. Sometimes I was able to boot just fine, other times I had to boot 10 times before I was able to use my computer again. Sometimes I wasn't even able to log in before the BSOD happened, but usually it happened after login or when I started a game. I started to google around and found this and many other topics about it. They all seem to have in common that the drive has been reformatted to 4K LBA, and I am very positive that this is the issue with the drive. I do not know whether WD/SanDisk can fix this via firmware, but I kind of doubt it, because this has been going on for 2 years now and WD does know about it. Oh, and I bought a Kingston Fury Renegade and cloned the SN770 to it. My Windows has been stable as bedrock since then.
-
Thank you everyone SO MUCH for the detailed writeups in this thread. I wish I had seen it before I had made the purchase! But it saved me untold time and frustration afterwards. |
-
Hello, |
-
Join the team. WD_BLACK SN770 1TB
-
Thanks for this great detailed thread, am new to NAS but was considering using ZFS raid1/similar with all NVMEs (maybe 3 or 4) on unraid or truenas, is it still better to avoid WD NVME drives and stick with Samsung/Crucial/Kingston? I did have my eye on the SN850X 8TB nvme drives due to lowest price. |
-
I also observed the same behavior on my 1TB SN5000, even with the latest firmware, 291020WD. I have a mirrored zpool, and a full-disk ZFS resilvering is highly likely to cause an I/O hang on both of my drives. I am also using 4K LBA and haven't tested whether this issue exists under 512B LBA. I ended up replacing these drives with another model (and had a tough time resilvering the data).
-
FYI: SN580 1TB under windows10/11 with latest firmware 281040wd, the same behaviour after switching to 4096 lba , reproducible by fio with numjobs=4 |
-
Posting here to A) thank the OP, B) spread the word...

TL/DR_SUMMARY: The OP is 100% correct, this IS some kind of a problem between ZFS, the WD drives (SN770 and SN850S, SN850XE), and maybe even the underlying hardware. Better said, it's a particular chemistry of calamity that ultimately results in the problems everyone is describing. A drive will randomly drop out of the zpool, write errors will be seen, and generally nothing other than a reboot will reset the drive controller, thus allowing ZFS to resilver and heal the pool. I've spent WAY too much time on this and ultimately, switching filesystems was the fix. So here is how I got there and maybe some help for you.

DETAILS: I started with the ol' trusty mdadm to build an array from 12 x 4TB SN850XE drives shucked from USB3 cases. Before you say shucked drives are the problem: just know I verified that the controller, controller firmware, clock mechanism, and memory chips are identical to the SN850X, available as a standalone drive. For some time I thought the shucking trick was my enemy. Nope. I used some PCIe 4.0 x16 to 4x(x4) adapters found here to place the drives in three of the five PCIe x16 slots available. Supermicro H12SSL-i, AMD Epyc 7352, 256GB of memory. I wasn't happy with the contact mechanism between thermal pad and drive, but more on that later. A mdadm RAID 5 array would fail building itself around the 80-90% mark, every time, over about ten different attempts. RAID 0, 1, 10 were all fine, but not when distributed parity was a player. I changed build flags and settings, sector sizes, tried an array of partitions instead of disks. I went through lvmraid and snapraid (both of which rely on the md subsystem). Failed every time.

Another factor here is heat, these little things get HOT. So I switched to these drive carriages, which, because of the screws in the middle of the heatsink, had better contact with the thermal pads used. More mdadm attempts, more failures. Enter ZFS.
I've always been a little shaky with ZFS because of its proximity to the kernel, but building a ZFS pool doesn't carry the bitmap overhead and drive geometry mapping that mdadm has. Building a RAID 5 zpool was a snap and I was mounted with encrypted and unencrypted datasets immediately. But just like all the others above me in this thread, large file transfers and even sustained small file transfers would kill the system. So next I started digging through dmesg. Since this is a Proxmox box and I use SR-IOV and PCI passthrough religiously, PCI Advanced Error Reporting (AER) and PCI Access Control (ACS) had to be enabled. That instantly produces the below.
These errors will show up every few seconds, so make sure you've got log rotation turned on, or that you've turned off AER logging with a boot flag, else you're going to be exhausting drive capacity in a few hours. Please let me save you countless hours of digging through kernel dev forums and just tell you this is a complete red herring. AMD Epyc series processors are very "chatty" about the PCIe bus. The slightest re-ordering of bus transaction data results in messages similar to the above, which most CPUs DO NOT flag to the kernel. You may even wind up on a forum where an AMD engineer calls this a firmware erratum since corrected in later generations of Epyc and Ryzen CPUs. It's also quite dependent on the underlying northbridge controller in the CPU. Alas, it's a red herring. And since the hardware error (the WD drive is complaining about transactional re-ordering) is corrected by the drive controller, it's quite "normal" and NOT contributory to the problem.

BACK TO THE STORY: I went through another PCIe card from Dell that can handle the 22110 drives, but still the same failures with ZFS. Sometimes I could get 8 or 10 TB transferred (I used straight CIFS, rsync, NFS, and others), and sometimes just a few GB. Sometimes the pool would pause and the transfer would continue for a while, and then, after a hard failure and reboot, resilver itself and heal for the amount of data I was able to transfer. Eventually I switched to SFF-8654 carrier cards and Silverstone active cooler carriages. Heat would not beat me! But still the same problem, nearly repeatable for every zpool flag, feature, anything I could switch on or off.
I placed thermal sensors on the drives and the Silverstone carriages are won-dee-ful. They were keeping the drives around 50 degrees, or lower. Heat was thus not a factor. So in a final effort to maintain allegiance to ZFS, I swapped the drives to a completely Samsung platform (980 Pro). Two things happened... One, the number of AER messages got cut in half. Two, no more pool crashes and removed drives!!! All other things unchanged, that told me that some bit of chemistry between the CPU, the board, the drives, and ZFS was the problem.

So I then tested on a SuperMicro X10SDV board, albeit with a single PCIe card, bifurcated x4x4x4x4 and running at PCIe gen 3.0 speeds. Nope, ZFS and the WD drives still broke. Samsung drives were A-ok. That's an Intel board with a completely different IOMMU, AER, and ACS structure. So the final conclusion here, after all that testing, is that ZFS pools, definitively when the pool uses a parity structure (RAIDz.*), are not compatible with current-generation WD M.2 NVMe drives. The OP's hypothesis of burst writes might be the culprit.

FINALITY: With ZFS and mdadm cooked, I switched to RAID 5 BTRFS. Not a single problem. rsync transfer rates are 600MB/s from a pure SATA ZFS array of 24 x 2TB M.2 drives. That's less than what rsync reported on the ZFS pool, but it's also realistic. SMART load tests show 6000MB/s, on par for these drives. Nothing special, no unique flags for the BTRFS RAID 5 array, I don't even use commit=120. But I can copy hundreds of TBs back and forth with not a single problem. So here's what I know to be true:
5. Use BTRFS with these drives. The write-hole problem was fixed. |
-
Did you all try latest ZFS versions? 2.2.8 or 2.3.3? Maybe you are hit by:
fixed by #16687 |
-
Originally started as a bug report, but after investigation and comments it is definitely more a hardware issue related to ZFS than a ZFS bug, so I am opening a general discussion here. Feel free to post constructive observations/ideas/workarounds/suggestions.
TL;DR: Some NVMe sticks just crash with ZFS, probably because they are unable to sustain I/O bursts. It is not clear why this happens: the controller might just crash, or a combination of firmware/BIOS/hardware might make it unstable or crash when used in a ZFS pool.
Hardware
Issue observed
My system zpool is composed of a single RAID-Z1 VDEV made of 3x WD Black SN770 2TB, themselves configured with 4K logical sectors (I did not test with 512b sectors to see if the issue still happens... yet). The VDEV uses LZ4 compression and is not encrypted, and neither are the underlying modules (they do not support that); standard 128K stripes are used. No L2ARC cache is used. The system has plenty of free RAM, so there is no memory pressure.
Under "normal" daily usage I did not experience anything: the zpool is regularly scrubbed with nothing to report. No checksum errors, no frozen tasks, no crashes, nothing; the pool completes all scrubs wonderfully well. The machine also experiences no freezes or kernel crashes/"oopses" and no stuck tasks (I reported an issue with auditd here a couple of weeks ago, but auditd is now inactive, see bug #14697). Even "emerging" big stuff like dev-qt/qtwebengine with 32 CMake jobs in parallel, or re-emerging the whole system from scratch with 32 parallel tasks and heavy packages being rebuilt at the same time, succeeds. No crashes.
However, if I use zfs send to make a backup of the system datasets on a local TrueNAS box over a 10GbE link, it is another story: most of the time one of the NVMe modules randomly crashes. The issue also happens at different points in the data transfer: sometimes after 12Gb, sometimes after 78Gb, sometimes after 93Gb and so on. If I am lucky, the operation sometimes completes successfully (less than a quarter of the time). Itchy and annoying. I have also managed to reproduce it by rsync-ing a dataset to a new empty one in the same pool, although this happens more rarely. The TrueNAS box and the network are not the cause, as they run smoothly and I can reproduce the issue locally by sending the ZFS stream to /dev/null (zfs send .... | cat > /dev/null).
When the crash happens, the following trace appears in the kernel logs:
At this point, if I am lucky enough, I can manage to bring it back to life using a sledgehammer:
If the faulted device reappears, the zpool becomes ONLINE again and completes its resilvering (a couple of KB or MB). In the worst case, another NVMe also drops off the pool, which becomes suspended, so I have to power-cycle the machine or push its reset button. Of course, running nvme list at this point either completely freezes or lists the two remaining NVMe modules, depending on what is alive.
My best guess so far is that the controller of the Western Digital SN 770 modules is not beefy enough to handle a burst of I/O requests (knowing they have no DRAM cache), so it is brought to its knees and becomes so unresponsive that it is unable to complete a reset request on its own (no AER reported in the logs, BTW). As it is not always the same module that crashes, they do not all seem to be defective, unless I am extremely unlucky. Pool scrubbing might be a bit lighter on the controller, so scrubs/resilvers work without any issue (the maximum observed speed is around 4.5~5 GB/s when scrubbing the pool, according to zpool status).
What has been tried so far
Several things! Without any improvement, unfortunately:
- nvme_core.default_ps_max_latency_us=0 pcie_aspm=off on the kernel command-line;
- zfs kernel module parameters: lowering the values of zfs_vdev_sync_read_min_active, zfs_vdev_sync_read_max_active and their async counterparts (I used the same values as the defaults for zfs_vdev_scrub_min_active and zfs_vdev_scrub_max_active);
- throttle: zfs send ... | throttle -M 300 | ...
- blkio cgroup
- zfs send from a FreeBSD live media: FreeBSD allocates a 200MB host buffer for each module, but unfortunately no more success, and a zfs send also hangs :/
Some thoughts / ideas of tests to try
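One cheap sanity check before running further tests is to confirm that the two kernel command-line options tried earlier were actually applied to the running kernel. A minimal sketch, assuming /proc/cmdline is available; the helper name `missing_cmdline_options` is made up for illustration:

```python
def missing_cmdline_options(cmdline, wanted):
    """Return the options from `wanted` that do not appear verbatim in `cmdline`."""
    present = set(cmdline.split())
    return [option for option in wanted if option not in present]


# The two options mentioned in this post.
WANTED = ["nvme_core.default_ps_max_latency_us=0", "pcie_aspm=off"]

if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as handle:
            current = handle.read()
    except OSError:
        current = ""
    print(missing_cmdline_options(current, WANTED))
```

An empty list means both options are on the command line of the running kernel; it does not prove that the driver honoured them.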
Is there a "ZFS native" way to throttle I/O operations when doing a zfs send?
Has anybody here experienced something like this? If so, what are the other brands/models subject to a similar issue?
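For reference, this is roughly what a userspace throttle placed in the pipe does: a minimal sketch of a rate-limited copy, in the same spirit as the throttle -M 300 attempt above (which did not help in my case). It only slows down the consumer side of the pipe and is not a ZFS-level control; the function name and its defaults are made up for illustration.

```python
import time


def throttled_copy(src, dst, max_bytes_per_sec, chunk_size=1 << 20,
                   clock=time.monotonic, sleep=time.sleep):
    """Copy src to dst, sleeping so the average rate stays under the limit."""
    start = clock()
    copied = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        copied += len(chunk)
        # Earliest moment at which `copied` bytes are allowed at the target rate.
        earliest = start + copied / max_bytes_per_sec
        delay = earliest - clock()
        if delay > 0:
            sleep(delay)
    return copied
```

Wired between sys.stdin.buffer and sys.stdout.buffer with a limit of 300 * 1024 * 1024, it would sit in the pipeline where throttle sits. Since zfs send blocks once the pipe is full, the read rate on the pool follows the limit only indirectly, which may be why bursts still reach the drives.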