Conversation

@yizhanglinux
Contributor

$./check nvme/050
nvme/050 => nvme1n1 (test nvme-pci timeout with fio jobs)    [failed]
    runtime  94.236s  ...  62.734s
    --- tests/nvme/050.out	2025-11-17 00:23:56.086469327 -0500
    +++ /root/blktests/results/nvme1n1/nvme/050.out.bad	2025-11-19 03:17:45.389644408 -0500
    @@ -1,2 +1,3 @@
     Running nvme/050
    -Test complete
    +Test failed
    +tests/nvme/050: line 50: /sys/bus/pci/devices//remove: Permission denied
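
For context, the doubled slash in the error path indicates that ${pdev} expanded to an empty string when the test tried to remove the PCI device:

	# From tests/nvme/050: when _get_pci_dev_from_blkdev cannot resolve a
	# PCI address for TEST_DEV (see the multipath discussion below), pdev
	# ends up empty and the write targets "/sys/bus/pci/devices//remove".
	pdev=$(_get_pci_dev_from_blkdev)
	echo 1 > "/sys/bus/pci/devices/${pdev}/remove"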

Signed-off-by: Yi Zhang <[email protected]>
@igaw
Contributor

igaw commented Nov 25, 2025

There are enterprise PCI disks with multipath. I think it would be good to test these devices as well. It seems you are testing with such a device; maybe we could get this test also working with these types?

I think it's possible to make _get_pci_dev_from_blkdev a bit smarter so that it returns the correct PCI device?

kawasaki added a commit to kawasaki/blktests that referenced this pull request Jan 3, 2026
The test case nvme/032 sets the global variable nvme_trtype to the value
"pci" to ensure that the test case runs only when TEST_DEV is an NVME
device using the PCI transport. However, this approach was not working
as intended since the global variable is never referred to. The test
case was run for NVME devices using non-PCI transports and reported
false-positive failures.

Commit c634b8a ("nvme/032: skip on non-PCI devices") introduced the
helper function _require_test_dev_is_nvme_pci(). This function ensures
that the test case nvme/032 is skipped when TEST_DEV is not an NVME
device with PCI transport. Despite this improvement, the unused global
variable nvme_trtype continued to be set. Remove the unnecessary
assignment.

In the same manner, the test case nvme/050 is expected to run only
when TEST_DEV is an NVME device with PCI transport. It also sets the
global variable nvme_trtype, but this caused an unexpected failure as
reported at the Link below. Modify the test case to use
_require_test_dev_is_nvme_pci() to enforce the requirement.

Fixes: c634b8a ("nvme/032: skip on non-PCI devices")
Link: linux-blktests#214
Signed-off-by: Shin'ichiro Kawasaki <[email protected]>
@kawasaki
Collaborator

kawasaki commented Jan 3, 2026

IIUC, what @igaw suggests is:

  • nvme/050 requires a TEST_DEV which is an NVME device with PCI transport, and
  • nvme/050 does not require TEST_DEV to be a non-multipath device

Assuming this guess is correct, I created a patch. It just checks that TEST_DEV has PCI transport.
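
Roughly, per the commit message above, the patch relies on the _require_test_dev_is_nvme_pci() helper; assuming the check lives in the test case's device_requires() as it does for nvme/032, the sketch looks like:

	# Hypothetical placement in tests/nvme/050: skip unless TEST_DEV is an
	# NVME device with PCI transport.
	device_requires() {
		_require_test_dev_is_nvme_pci
	}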

@yizhanglinux , does this patch avoid the failure you face?

@yizhanglinux
Contributor Author

@kawasaki This would skip the test when the disk is an enterprise PCI disk with multipath.
I think what @igaw suggests is that we should still test such enterprise PCI disks with multipath.

Maybe something like the change below, which returns the PCI device when the disk supports multipath:

diff --git a/tests/nvme/050 b/tests/nvme/050
index 91f3564..ba09956 100755
--- a/tests/nvme/050
+++ b/tests/nvme/050
@@ -25,8 +25,7 @@ test_device() {
        local i

        echo "Running ${TEST_NAME}"
-
-       pdev=$(_get_pci_dev_from_blkdev)
+       pdev=$(_nvme_get_pci_from_dev_sysfs)
        nvme_ns="$(basename "${TEST_DEV}")"
        echo 1 > /sys/block/"${nvme_ns}"/io-timeout-fail

diff --git a/tests/nvme/rc b/tests/nvme/rc
index a8f80d8..e25cda2 100644
--- a/tests/nvme/rc
+++ b/tests/nvme/rc
@@ -87,6 +87,24 @@ _require_test_dev_is_not_nvme_multipath() {
        return 0
 }

+_nvme_dev_support_native_multipath() {
+       if [[ "$(readlink -f "$TEST_DEV_SYSFS/device")" =~ /nvme-subsystem/ ]]; then
+               return 0
+       fi
+       return 1
+}
+
+_nvme_get_pci_from_dev_sysfs() {
+       if _nvme_dev_support_native_multipath; then
+               readlink -f /sys/block/$(basename "${TEST_DEV}")/multipath/nvme*c*n*/device | \
+                       grep -Eo '[0-9a-f]{4,5}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]' | \
+                       tail -1
+       else
+               readlink -f "$TEST_DEV_SYSFS/device" | \
+                       grep -Eo '[0-9a-f]{4,5}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]' | \
+                       tail -1
+       fi
+}
 _require_test_dev_support_sed() {
        if ! nvme sed discover "$TEST_DEV" &> /dev/null; then
                SKIP_REASONS+=("$TEST_DEV doesn't support SED operations")

@kawasaki
Collaborator

kawasaki commented Jan 4, 2026

@yizhanglinux Thanks for the clarification. Now I have a better understanding.

@igaw Please comment if the suggested change suits your comment.

As to the change by @yizhanglinux , I have a few comments:

  • The readlink short option "-f" should be replaced with the long option "--canonicalize".
  • In _nvme_get_pci_from_dev_sysfs(), "/sys/block/$(basename "${TEST_DEV}")" can be replaced with "${TEST_DEV_SYSFS}".
  • In _nvme_get_pci_from_dev_sysfs(), the "else" block can probably be replaced with a call to _get_pci_dev_from_blkdev().

@yizhanglinux
Contributor Author

> @yizhanglinux Thanks for the clarification. Now I have a better understanding.
>
> @igaw Please comment if the suggested change suits your comment.
>
> As to the change by @yizhanglinux , I have a few comments:
>
>   • The readlink short option "-f" should be replaced with the long option "--canonicalize".

OK, we can replace all the "-f" occurrences with "--canonicalize" across all files in one patch.

>   • In _nvme_get_pci_from_dev_sysfs(), "/sys/block/$(basename "${TEST_DEV}")" can be replaced with "${TEST_DEV_SYSFS}".
>   • In _nvme_get_pci_from_dev_sysfs(), the "else" block can probably be replaced with a call to _get_pci_dev_from_blkdev().

How about the below change:

diff --git a/tests/nvme/050 b/tests/nvme/050
index 91f3564..b6eba8b 100755
--- a/tests/nvme/050
+++ b/tests/nvme/050
@@ -26,7 +26,7 @@ test_device() {

        echo "Running ${TEST_NAME}"

-       pdev=$(_get_pci_dev_from_blkdev)
+       pdev=$(_nvme_get_pci_from_dev_sysfs)
        nvme_ns="$(basename "${TEST_DEV}")"
        echo 1 > /sys/block/"${nvme_ns}"/io-timeout-fail

diff --git a/tests/nvme/rc b/tests/nvme/rc
index a8f80d8..9314671 100644
--- a/tests/nvme/rc
+++ b/tests/nvme/rc
@@ -87,6 +87,25 @@ _require_test_dev_is_not_nvme_multipath() {
        return 0
 }

+_nvme_dev_support_native_multipath() {
+       if [[ "$(readlink -f "$TEST_DEV_SYSFS/device")" =~ /nvme-subsystem/ ]]; then
+               return 0
+       fi
+       return 1
+}
+
+_nvme_get_pci_from_dev_sysfs() {
+       if _nvme_dev_support_native_multipath; then
+               readlink -f $TEST_DEV_SYSFS/multipath/nvme*c*n* | \
+                       grep -Eo '[0-9a-f]{4,5}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]' | \
+                       tail -1
+       else
+              _get_pci_dev_from_blkdev
+       fi
+}
+

@igaw
Contributor

igaw commented Jan 5, 2026

Yes, this looks like what I had in mind. Thanks for taking care!

@kawasaki
Collaborator

kawasaki commented Jan 5, 2026

@yizhanglinux Thanks. Overall, the suggested change looks good. Posting it as a proper patch or PR would be appreciated. Again, please use "--canonicalize" instead of "-f". Also, I think it would be good to have another patch that replaces "-f" with "--canonicalize" in the other places, which can be done in the same PR/series or later.
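
For what it's worth, the tree-wide replacement could be done roughly like this (assuming GNU sed, and that every "readlink -f" under tests/ and common/ should switch to the long option; the resulting diff still needs review):

	grep -rl 'readlink -f' tests common | \
		xargs sed -i 's/readlink -f/readlink --canonicalize/g'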

@yizhanglinux
Contributor Author

@igaw @kawasaki

I just found that the test case still fails because the "Input/output error" message never shows up in the output checked by [1], and there is also no error in the kernel log [2].

[1]

	nvme_ns="$(basename "${TEST_DEV}")"
	echo 1 > /sys/block/"${nvme_ns}"/io-timeout-fail

	echo 100 > /sys/kernel/debug/fail_io_timeout/probability
	echo   1 > /sys/kernel/debug/fail_io_timeout/interval
	echo  -1 > /sys/kernel/debug/fail_io_timeout/times
	echo   0 > /sys/kernel/debug/fail_io_timeout/space
	echo   1 > /sys/kernel/debug/fail_io_timeout/verbose

	fio --bs=4k --rw=randread --norandommap --numjobs="$(nproc)" \
	    --name=reads --direct=1 --filename="${TEST_DEV}" --group_reporting \
	    --time_based --runtime=1m >& "$FULL"

	if grep -q "Input/output error" "$FULL"; then
		echo "Test complete"
	else
		echo "Test failed"
	fi

[2]

# ./check nvme/050
nvme/050 => nvme0n1 (test nvme-pci timeout with fio jobs)    [failed]
    runtime    ...  62.913s
    --- tests/nvme/050.out	2026-01-05 01:05:11.924877002 -0500
    +++ /root/blktests/results/nvme0n1/nvme/050.out.bad	2026-01-05 07:41:41.764764187 -0500
    @@ -1,2 +1,2 @@
     Running nvme/050
    -Test complete
    +Test failed
nvme/050 => nvme3n1 (test nvme-pci timeout with fio jobs)    [failed]
    runtime  63.098s  ...  63.110s
    --- tests/nvme/050.out	2026-01-05 01:05:11.924877002 -0500
    +++ /root/blktests/results/nvme3n1/nvme/050.out.bad	2026-01-05 07:42:46.482290283 -0500
    @@ -1,2 +1,2 @@
     Running nvme/050
    -Test complete
    +Test failed
# dmesg
[ 7090.147822] run blktests nvme/050 at 2026-01-05 07:40:40
[ 7152.050970] pci 0000:41:00.0: [144d:a826] type 00 class 0x010802 PCIe Endpoint
[ 7152.051113] pci 0000:41:00.0: BAR 0 [mem 0xa4a00000-0xa4a07fff 64bit]
[ 7152.467124] pci 0000:41:00.0: VF BAR 0 [mem 0x00000000-0x00007fff 64bit]
[ 7152.467147] pci 0000:41:00.0: VF BAR 0 [mem 0x00000000-0x001fffff 64bit]: contains BAR 0 for 64 VFs
[ 7152.467907] pci 0000:41:00.0: 31.504 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x4 link at 0000:40:03.1 (capable of 126.028 Gb/s with 32.0 GT/s PCIe x4 link)
[ 7152.604706] pci 0000:41:00.0: Adding to iommu group 48
[ 7152.641542] pcieport 0000:40:03.1: bridge window [io  size 0x1000]: can't assign; no space
[ 7152.641558] pcieport 0000:40:03.1: bridge window [io  size 0x1000]: failed to assign
[ 7152.641664] pcieport 0000:40:03.1: bridge window [io  size 0x1000]: can't assign; no space
[ 7152.641675] pcieport 0000:40:03.1: bridge window [io  size 0x1000]: failed to assign
[ 7152.641864] pci 0000:41:00.0: BAR 0 [mem 0xa4a00000-0xa4a07fff 64bit]: assigned
[ 7152.641895] pci 0000:41:00.0: VF BAR 0 [mem size 0x00200000 64bit]: can't assign; no space
[ 7152.641906] pci 0000:41:00.0: VF BAR 0 [mem size 0x00200000 64bit]: failed to assign
[ 7152.656355] nvme nvme0: pci function 0000:41:00.0
[ 7152.753246] nvme nvme0: D3 entry latency set to 10 seconds
[ 7152.865881] nvme nvme0: 16/0/0 default/read/poll queues
[ 7154.555876] run blktests nvme/050 at 2026-01-05 07:41:44
[ 7216.661760] pcieport 0000:40:03.1: bridge window [io  size 0x1000]: can't assign; no space
[ 7216.661779] pcieport 0000:40:03.1: bridge window [io  size 0x1000]: failed to assign
[ 7216.661888] pcieport 0000:40:03.1: bridge window [io  size 0x1000]: can't assign; no space
[ 7216.661898] pcieport 0000:40:03.1: bridge window [io  size 0x1000]: failed to assign
[ 7216.663032] pci 0000:c7:00.0: [1e0f:0013] type 00 class 0x010802 PCIe Endpoint
[ 7216.663162] pci 0000:c7:00.0: BAR 0 [mem 0xdb500000-0xdb50ffff 64bit]
[ 7216.663238] pci 0000:c7:00.0: ROM [mem 0xffff0000-0xffffffff pref]
[ 7217.083827] pci 0000:c7:00.0: VF BAR 0 [mem 0x00000000-0x0000ffff 64bit]
[ 7217.083849] pci 0000:c7:00.0: VF BAR 0 [mem 0x00000000-0x001fffff 64bit]: contains BAR 0 for 32 VFs
[ 7217.084456] pci 0000:c7:00.0: 31.504 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x4 link at 0000:c0:03.3 (capable of 126.028 Gb/s with 32.0 GT/s PCIe x4 link)
[ 7217.217478] pci 0000:c7:00.0: Adding to iommu group 13
[ 7217.253368] pci 0000:c7:00.0: BAR 0 [mem 0xdb500000-0xdb50ffff 64bit]: assigned
[ 7217.253404] pci 0000:c7:00.0: ROM [mem 0xdb510000-0xdb51ffff pref]: assigned
[ 7217.253416] pci 0000:c7:00.0: VF BAR 0 [mem size 0x00200000 64bit]: can't assign; no space
[ 7217.253428] pci 0000:c7:00.0: VF BAR 0 [mem size 0x00200000 64bit]: failed to assign
[ 7217.268252] nvme nvme3: pci function 0000:c7:00.0
[ 7217.502216] nvme nvme3: D3 entry latency set to 10 seconds
[ 7217.610849] nvme nvme3: 16/0/0 default/read/poll queues

@igaw
Contributor

igaw commented Jan 5, 2026

First, is the line below correct now?

 echo 1 > "/sys/bus/pci/devices/${pdev}/remove"

If so, then there is another problem. But from the kernel logs, it looks like the remove works.

@igaw
Contributor

igaw commented Jan 5, 2026

Actually, I wonder why the fio jobs are supposed to succeed at all. The device is removed at the PCI level, so the block device should also be gone. This is not a reset, where the block layer would not see any device remove/add operation.

@igaw
Contributor

igaw commented Jan 5, 2026

Ah wait, the test expects fio to fail, but it doesn't. And fio doesn't fail because of the nature of the multipath device. In this configuration the head nvme device might not be removed, and the block layer buffers the IOs until the driver is ready again. Though I am not totally sure how nvme-pci works here; I need to check the source.

Could you check the output of fio? If it doesn't fail, then it is very likely that all in-flight IOs are buffered at the block layer. If so, then we have to figure out what we want to test here.

@yizhanglinux
Contributor Author

> First, is the line below correct now?
>
>  echo 1 > "/sys/bus/pci/devices/${pdev}/remove"
>
> If so, then there is another problem. But from the kernel logs, it looks like the remove works.

Yes, the disk was removed and initialized again after the PCI rescan.

@yizhanglinux
Contributor Author

fio passes with no errors, as shown in the full log:

# cat results/nvme0n1/nvme/050.full
reads: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
...
fio-3.36
Starting 16 processes

reads: (groupid=0, jobs=16): err= 0: pid=33838: Mon Jan  5 10:14:28 2026
  read: IOPS=4525, BW=17.7MiB/s (18.5MB/s)(1061MiB/60005msec)
    clat (usec): min=784, max=86139, avg=3533.19, stdev=815.08
     lat (usec): min=784, max=86139, avg=3533.39, stdev=815.09
    clat percentiles (usec):
     |  1.00th=[ 2507],  5.00th=[ 2769], 10.00th=[ 2933], 20.00th=[ 3097],
     | 30.00th=[ 3195], 40.00th=[ 3294], 50.00th=[ 3359], 60.00th=[ 3458],
     | 70.00th=[ 3556], 80.00th=[ 3687], 90.00th=[ 4293], 95.00th=[ 5080],
     | 99.00th=[ 6587], 99.50th=[ 7177], 99.90th=[ 8848], 99.95th=[ 9503],
     | 99.99th=[12780]
   bw (  KiB/s): min=16118, max=19067, per=100.00%, avg=18111.24, stdev=28.86, samples=1904
   iops        : min= 4023, max= 4760, avg=4521.34, stdev= 7.24, samples=1904
  lat (usec)   : 1000=0.01%
  lat (msec)   : 2=0.04%, 4=86.78%, 10=13.14%, 20=0.03%, 50=0.01%
  lat (msec)   : 100=0.01%
  cpu          : usr=0.05%, sys=19.27%, ctx=291366, majf=0, minf=215
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=271550,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=17.7MiB/s (18.5MB/s), 17.7MiB/s-17.7MiB/s (18.5MB/s-18.5MB/s), io=1061MiB (1112MB), run=60005-60005msec

Disk stats (read/write):
  nvme0n1: ios=270757/0, sectors=2166056/0, merge=0/0, ticks=412618/0, in_queue=412618, util=99.26%

@igaw
Contributor

igaw commented Jan 7, 2026

Ah, I remember what's happening with the PCI device removal and the handling of the nvme head device. When we have a multipath device, the lifetime of the nvme head device is coupled to the PCI subsystem hotplug behavior. That is, nvme_remove might not be executed and thus the nvme head device will not be removed (depending on the hardware hotplug support of the architecture). Only the nvme devices which represent the paths are removed.

As shown, fio will not fail in such configurations, as the block layer will requeue the failing IOs instead of reporting an error to the upper layers. So this test is written for single-path devices.

This means we could either disable it for multipath nvme devices or, better, extend it and expect no IO errors when it is a multipath device. I'd prefer the second approach. WDYT?

EDIT: note, the behavior could be different on different architectures, though I think it would be good to collect this information and document it. We could make the test take the architecture into account if necessary. Or maybe recent kernels have addressed this problem after all :)

@kawasaki
Collaborator

kawasaki commented Jan 8, 2026

@igaw Thanks for the clarification.

  • The explanation about the device removal dependency between the PCI device and the NVME device was interesting to me (multipath capability and architecture are relevant!), and I think it's worth documenting, somewhere under drivers/nvme/*/*.c as a block comment or under Documentation/nvme/*.
  • Also, it's worth testing, so I think the ultimate goal of this test case should be to "extend it and expect no IO errors when it is a multipath device".
  • If it takes time to reach the ultimate goal, I think it's better to take a two-step approach so that the failure @yizhanglinux is facing gets suppressed soon: 1) disable it for multipath nvme devices, 2) extend it to check for no IO errors on multipath devices. Step 1) is sketched below.
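
A minimal sketch of step 1), assuming the existing _require_test_dev_is_not_nvme_multipath() helper in tests/nvme/rc can simply be added to nvme/050's device_requires() alongside the PCI transport requirement (the actual patch may differ):

	# Hypothetical device_requires() for tests/nvme/050: keep the PCI
	# transport requirement and additionally skip multipath (head) devices
	# until step 2) teaches the test to expect no IO errors for them.
	device_requires() {
		_require_test_dev_is_nvme_pci
		_require_test_dev_is_not_nvme_multipath
	}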

@igaw
Contributor

igaw commented Jan 8, 2026

FWIW, an nvme-pci multipath device behaves similarly to a fabrics device. The paths can go away, and as long as the ctrl loss timeout doesn't expire (or, in this case, the PCI device is not removed), the block device will be around.

I had something like this in mind:

diff --git a/tests/nvme/050 b/tests/nvme/050
index 91f356422f63..4320c00d0a81 100755
--- a/tests/nvme/050
+++ b/tests/nvme/050
@@ -19,10 +19,22 @@ requires() {
 	_have_kernel_options FAIL_IO_TIMEOUT FAULT_INJECTION_DEBUG_FS
 }
 
+is_multipath_device() {
+	local nvme_ns cmic
+
+	nvme_ns="$1"
+
+	cmic="$(nvme id-ctrl "$nvme_ns" --output-format=json | jq -r '.cmic')"
+
+	if (( cmic & 0x1 )); then
+		return 0
+	fi
+
+	return 1
+}
+
 test_device() {
-	local nvme_ns
-	local pdev
-	local i
+	local nvme_ns pdev io_error i
 
 	echo "Running ${TEST_NAME}"
 
@@ -40,10 +52,13 @@ test_device() {
 	    --name=reads --direct=1 --filename="${TEST_DEV}" --group_reporting \
 	    --time_based --runtime=1m >& "$FULL"
 
-	if grep -q "Input/output error" "$FULL"; then
-		echo "Test complete"
+	io_error=false
+	grep -q "Input/output error" "$FULL" && io_error=true
+
+	if is_multipath_device "$nvme_ns"; then
+		$io_error && echo "Test complete" || echo "Test failed"
 	else
-		echo "Test failed"
+		$io_error && echo "Test failed" || echo "Test complete"
 	fi
 
 	# Remove and rescan the NVME device to ensure that it has come back

Or, if we don't want to trust that a set cmic bit means it's a multipath device, then we should check what sysfs is telling us.
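
For example, a sysfs-based check could follow the same convention as the _nvme_dev_support_native_multipath() helper sketched earlier in this thread (a rough idea rather than a tested patch):

	is_multipath_device() {
		local nvme_ns="$1"

		# A native multipath namespace resolves under the virtual
		# nvme-subsystem device rather than a specific controller.
		[[ "$(readlink --canonicalize "/sys/block/${nvme_ns}/device")" =~ /nvme-subsystem/ ]]
	}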
