-
Also, you don't have to remove the disk for a non-destructive badblocks run.
It will just read the data once. But if the cable between the disk and the
motherboard is bad, it will not help you much.
On Tue, Aug 30, 2022 at 2:05 PM Ivan Volosyuk ***@***.***>
wrote:
This looks like cabling might be a problem (UDMA_CRC_Error_Count), not
the surface of the disk.
On Tue, Aug 30, 2022 at 4:30 AM Jon Crall ***@***.***>
wrote:
> I'm in a situation where I have a RAID-10 zpool that is reporting
> checksum errors on one of the disks.
>
> I'm trying to debug it, but something making this difficult is that I
> need my machine to be online because it's crunching some important numbers
> right now, so I can't schedule any downtime for another week or so.
>
> My next step is I want to run badblocks in non-destructive mode, but when
> I tried:
>
> DEV=/dev/sda
> # Parse out the correct blocksize for the disk in question.
> BLOCK_SIZE=$(lsblk -o NAME,PHY-SEC,TYPE "$DEV" --json | jq -r '.blockdevices[0]["phy-sec"]')
> echo "BLOCK_SIZE = $BLOCK_SIZE"
> BLOCKS_PER_TEST=1024
> sudo badblocks -nsv -b "$BLOCK_SIZE" -c "$BLOCKS_PER_TEST" "$DEV"
>
> I got:
>
> /dev/sda is apparently in use by the system; it's not safe to run badblocks!
>
> My question is: is it possible to temporarily remove the disk with
> checksum errors, run badblocks on it, and then re-add it to the pool?
> Steps Taken So Far
>
> Scrubs and Status
>
> When I first noticed the errors, I just did a zpool clear data (and I
> regret naming my zpool data, because the way that command reads
> scares me), and then ran a second scrub. But the checksum errors persisted.
>
> zpool status
>
> pool: data
> state: DEGRADED
> status: One or more devices has experienced an unrecoverable error. An
> attempt was made to correct the error. Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
> using 'zpool clear' or replace the device with 'zpool replace'.
> see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
> scan: scrub repaired 36.8M in 13:45:40 with 0 errors on Sun Aug 28 16:12:25 2022
> config:
>
> NAME STATE READ WRITE CKSUM
> data DEGRADED 0 0 0
> mirror-0 DEGRADED 0 0 0
> wwn-0x5000c5009399acab DEGRADED 94 0 259 too many errors
> wwn-0x5000c500a4d78d92 ONLINE 0 0 0
> mirror-1 ONLINE 0 0 0
> wwn-0x5000c500a4e45aa5 ONLINE 0 0 0
> wwn-0x5000c500a3d4e682 ONLINE 0 0 0
> cache
> nvme-Samsung_SSD_970_EVO_Plus_2TB_S59CNM0RB05113H ONLINE 0 0 0
>
> errors: No known data errors
>
> I don't have ECC memory, and running a full memtest is still on my TODO
> list for when I can power off the machine for a few hours.
>
> SMARTCTL Extended Tests
>
> In the meantime there are other tests I can do. I ran a full extended SMART
> test, but that returned no issues.
>
> sudo smartctl -a /dev/disk/by-id/wwn-0x5000c5009399acab
>
> smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-46-generic] (local build)
> Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
>
> === START OF INFORMATION SECTION ===
> Model Family: Seagate BarraCuda 3.5
> Device Model: ST10000DM0004-1ZC101
> Serial Number: ZA20VNPT
> LU WWN Device Id: 5 000c50 09399acab
> Firmware Version: DN01
> User Capacity: 10,000,831,348,736 bytes [10.0 TB]
> Sector Sizes: 512 bytes logical, 4096 bytes physical
> Rotation Rate: 7200 rpm
> Form Factor: 3.5 inches
> Device is: In smartctl database [for details use: -P show]
> ATA Version is: ACS-3 T13/2161-D revision 5
> SATA Version is: SATA 3.1, 6.0 Gb/s (current: 1.5 Gb/s)
> Local Time is: Mon Aug 29 14:20:26 2022 EDT
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> General SMART Values:
> Offline data collection status: (0x82) Offline data collection activity
> was completed without error.
> Auto Offline Data Collection: Enabled.
> Self-test execution status: ( 0) The previous self-test routine completed
> without error or no self-test has ever
> been run.
> Total time to complete Offline
> data collection: ( 575) seconds.
> Offline data collection
> capabilities: (0x7b) SMART execute Offline immediate.
> Auto Offline data collection on/off support.
> Suspend Offline collection upon new
> command.
> Offline surface scan supported.
> Self-test supported.
> Conveyance Self-test supported.
> Selective Self-test supported.
> SMART capabilities: (0x0003) Saves SMART data before entering
> power-saving mode.
> Supports SMART auto save timer.
> Error logging capability: (0x01) Error logging supported.
> General Purpose Logging supported.
> Short self-test routine
> recommended polling time: ( 1) minutes.
> Extended self-test routine
> recommended polling time: ( 907) minutes.
> Conveyance self-test routine
> recommended polling time: ( 2) minutes.
> SCT capabilities: (0x50bd) SCT Status supported.
> SCT Error Recovery Control supported.
> SCT Feature Control supported.
> SCT Data Table supported.
>
> SMART Attributes Data Structure revision number: 10
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
> 1 Raw_Read_Error_Rate 0x000f 064 064 044 Pre-fail Always - 2762760
> 3 Spin_Up_Time 0x0003 089 089 000 Pre-fail Always - 0
> 4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 18
> 5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
> 7 Seek_Error_Rate 0x000f 084 060 045 Pre-fail Always - 281197730
> 9 Power_On_Hours 0x0032 097 097 000 Old_age Always - 3458 (38 95 0)
> 10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
> 12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 18
> 184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
> 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
> 188 Command_Timeout 0x0032 100 099 000 Old_age Always - 14 14 14
> 189 High_Fly_Writes 0x003a 097 097 000 Old_age Always - 3
> 190 Airflow_Temperature_Cel 0x0022 070 063 040 Old_age Always - 30 (Min/Max 28/35)
> 191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 609
> 192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 8
> 193 Load_Cycle_Count 0x0032 097 097 000 Old_age Always - 6200
> 194 Temperature_Celsius 0x0022 030 040 000 Old_age Always - 30 (0 24 0 0 0)
> 195 Hardware_ECC_Recovered 0x001a 030 002 000 Old_age Always - 2762760
> 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
> 198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
> 199 UDMA_CRC_Error_Count 0x003e 200 196 000 Old_age Always - 483
> 200 Pressure_Limit 0x0023 100 100 001 Pre-fail Always - 0
> 240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 1110h+53m+10.289s
> 241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 19614334254
> 242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 42413745764
>
> SMART Error Log Version: 1
> No Errors Logged
>
> SMART Self-test log structure revision number 1
> Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
> # 1 Extended offline Completed without error 00% 3453 -
> # 2 Short offline Completed without error 00% 3414 -
>
> SMART Selective self-test log data structure revision number 1
> SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
> 1 0 0 Not_testing
> 2 0 0 Not_testing
> 3 0 0 Not_testing
> 4 0 0 Not_testing
> 5 0 0 Not_testing
> Selective self-test flags (0x0):
> After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
>
> DMESG
>
> I don't really know what I'm reading, but I've seen posts saying that
> dmesg contains useful debugging information. I think the relevant section
> for this error is:
>
> [Aug28 14:00] ata3.00: exception Emask 0x10 SAct 0x40000220 SErr 0x280100 action 0x6 frozen
> [ +0.000020] ata3.00: irq_stat 0x08000000, interface fatal error
> [ +0.000004] ata3: SError: { UnrecovData 10B8B BadCRC }
> [ +0.000008] ata3.00: failed command: READ FPDMA QUEUED
> [ +0.000005] ata3.00: cmd 60/00:28:38:92:18/01:00:0a:04:00/40 tag 5 ncq dma 131072 in
> res 40/00:f0:38:8a:18/00:00:0a:04:00/40 Emask 0x10 (ATA bus error)
> [ +0.000015] ata3.00: status: { DRDY }
> [ +0.000006] ata3.00: failed command: READ FPDMA QUEUED
> [ +0.000003] ata3.00: cmd 60/00:48:38:82:18/08:00:0a:04:00/40 tag 9 ncq dma 1048576 in
> res 40/00:f0:38:8a:18/00:00:0a:04:00/40 Emask 0x10 (ATA bus error)
> [ +0.000013] ata3.00: status: { DRDY }
> [ +0.000004] ata3.00: failed command: READ FPDMA QUEUED
> [ +0.000003] ata3.00: cmd 60/00:f0:38:8a:18/08:00:0a:04:00/40 tag 30 ncq dma 1048576 in
> res 40/00:f0:38:8a:18/00:00:0a:04:00/40 Emask 0x10 (ATA bus error)
> [ +0.000011] ata3.00: status: { DRDY }
> [ +0.000007] ata3: hard resetting link
> [ +0.314835] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
> [ +0.036454] ata3.00: configured for UDMA/133
> [ +0.000033] sd 2:0:0:0: [sda] tag#5 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
> [ +0.000007] sd 2:0:0:0: [sda] tag#5 Sense Key : Illegal Request [current]
> [ +0.000005] sd 2:0:0:0: [sda] tag#5 Add. Sense: Unaligned write command
> [ +0.000005] sd 2:0:0:0: [sda] tag#5 CDB: Read(16) 88 00 00 00 00 04 0a 18 92 38 00 00 01 00 00 00
> [ +0.000003] blk_update_request: I/O error, dev sda, sector 17349251640 op 0x0:(READ) flags 0x700 phys_seg 2 prio class 0
> [ +0.000017] zio pool=data vdev=/dev/disk/by-id/wwn-0x5000c5009399acab-part1 error=5 type=1 offset=8882815791104 size=131072 flags=1808b0
> [ +0.000031] sd 2:0:0:0: [sda] tag#9 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
> [ +0.000003] sd 2:0:0:0: [sda] tag#9 Sense Key : Illegal Request [current]
> [ +0.000005] sd 2:0:0:0: [sda] tag#9 Add. Sense: Unaligned write command
> [ +0.000003] sd 2:0:0:0: [sda] tag#9 CDB: Read(16) 88 00 00 00 00 04 0a 18 82 38 00 00 08 00 00 00
> [ +0.000003] blk_update_request: I/O error, dev sda, sector 17349247544 op 0x0:(READ) flags 0x700 phys_seg 16 prio class 0
> [ +0.000010] zio pool=data vdev=/dev/disk/by-id/wwn-0x5000c5009399acab-part1 error=5 type=1 offset=8882813693952 size=1048576 flags=40080cb0
> [ +0.000018] sd 2:0:0:0: [sda] tag#30 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
> [ +0.000004] sd 2:0:0:0: [sda] tag#30 Sense Key : Illegal Request [current]
> [ +0.000004] sd 2:0:0:0: [sda] tag#30 Add. Sense: Unaligned write command
> [ +0.000003] sd 2:0:0:0: [sda] tag#30 CDB: Read(16) 88 00 00 00 00 04 0a 18 8a 38 00 00 08 00 00 00
> [ +0.000003] blk_update_request: I/O error, dev sda, sector 17349249592 op 0x0:(READ) flags 0x700 phys_seg 16 prio class 0
> [ +0.000006] zio pool=data vdev=/dev/disk/by-id/wwn-0x5000c5009399acab-part1 error=5 type=1 offset=8882814742528 size=1048576 flags=40080cb0
> [ +0.000011] ata3: EH complete
>
> And if I'm reading it right, it's pointing out which sectors on the disk
> are having trouble. I also read you could test sectors with hdparm, so
> I ran:
>
> sudo hdparm --read-sector 17349251640 /dev/sda
> sudo hdparm --read-sector 17349247544 /dev/sda
>
> It seems to succeed without error, so either I did it wrong, or that isn't
> the error.
>
> Summary
>
> So now I'm in the situation where zpool status says /dev/sda has a
> checksum error, smartctl says everything is ok, dmesg is flagging some
> sectors, but hdparm says they are fine.
>
> The next thing I want to try is running badblocks in non-destructive
> mode, but to do that I need to remove the zfs device first, and I'm unsure
> about how to do that correctly in this use-case.
>
> I will say that all of the data I'm working with is backed up or
> recomputable.
>
-
My bad. Rich is right. There is a difference between the
non-destructive and read-only modes. One reads and writes back, the
other only reads. It should be safe to read, but with writes involved
there would be a potential race condition with ZFS, which can corrupt
the information written. Running badblocks will not help much, because if
the problem is in the electronics (controllers or cables), the errors can
occur in random places on the disk. It is hard to tell what the cause is:
vibration, a loose connection, or a bad cable. The easiest approach would be to
replace the cable and watch for more errors in the raw value of the
UDMA_CRC_Error_Count S.M.A.R.T. attribute.
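One way to watch that raw value is to pull it out of the smartctl attribute table. This is only a minimal sketch: the function name crc_error_count is made up for illustration, and it assumes the ten-column attribute layout shown in the smartctl output quoted in this thread.

```shell
# Print the RAW_VALUE column of the UDMA_CRC_Error_Count attribute.
# Expects `smartctl -A` style output on stdin, where the attribute name
# is the 2nd whitespace-separated field and the raw value is the 10th.
crc_error_count() {
    awk '$2 == "UDMA_CRC_Error_Count" { print $10 }'
}

# Usage (device path taken from the original post):
#   sudo smartctl -A /dev/disk/by-id/wwn-0x5000c5009399acab | crc_error_count
```

If the number keeps climbing after the cable has been replaced, the controller or port would be the next suspect.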
On Wed, Aug 31, 2022 at 5:55 AM Rich Ercolani ***@***.***> wrote:
Non-destructive RW would not be safe with the disk in use. RO is, though.
And yeah, UDMA_CRC_Error_Count is (almost?) exclusively IME bad cabling or controller, not disk surface issues.
You may also find #13801 interesting.
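For reference, the two modes differ only in the flags passed to badblocks: read-only is the default, while -n selects the non-destructive read-write test that the original command used. A small sketch that prints a read-only command line rather than running it; the helper name badblocks_readonly_cmd and its default values are assumptions for illustration, not anything from the thread.

```shell
# Print (do not run) a read-only badblocks command for the given device.
# Read-only is the default mode: neither -n (non-destructive read-write)
# nor -w (destructive write) is passed, so nothing is written to the disk.
badblocks_readonly_cmd() {
    local dev="$1"
    local block_size="${2:-4096}"       # assumed default: the drive reports 4096-byte physical sectors
    local blocks_per_test="${3:-1024}"  # same batch size as in the original post
    printf 'sudo badblocks -sv -b %s -c %s %s\n' "$block_size" "$blocks_per_test" "$dev"
}

badblocks_readonly_cmd /dev/sda
```

Printing the command first makes it easy to confirm that -n and -w are absent before anything touches a disk that is still part of the pool.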
-
This looks like cabling might be a problem (UDMA_CRC_Error_Count), not the
surface of the disk.
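The smartctl output in the original post also reports "6.0 Gb/s (current: 1.5 Gb/s)", meaning the link is running below the speed the drive supports, which can happen after repeated link errors and would fit the cabling theory. A rough sketch for pulling both speeds out of that line; the function name sata_link_speeds is hypothetical, and it assumes the line format shown in the quoted output.

```shell
# Print "<max> <current>" SATA link speeds in Gb/s.
# Expects `smartctl -i` style output on stdin containing a line such as:
#   SATA Version is: SATA 3.1, 6.0 Gb/s (current: 1.5 Gb/s)
sata_link_speeds() {
    sed -n -E 's|^SATA Version is:.*, ([0-9.]+) Gb/s \(current: ([0-9.]+) Gb/s\).*|\1 \2|p'
}

# Usage (device path taken from the original post):
#   sudo smartctl -i /dev/disk/by-id/wwn-0x5000c5009399acab | sata_link_speeds
```

If the two numbers differ, the link has negotiated down; after swapping the cable and rebooting, they should match again if the cable was the cause.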
-
I'm in a situation where I have a RAID-10 zpool that is reporting checksum errors on one of the disks.
I'm trying to debug it, but something making this difficult is that I need my machine to be online because it's crunching some important numbers right now, so I can't schedule any downtime for another week or so.
My next step is I want to run badblocks in non-destructive mode, but when I tried:
I got:
My question is: is it possible to temporarily remove the disk with checksum errors, run badblocks on it, and then re-add it to the pool?
Steps Taken So Far
Scrubs and Status
When I first noticed the errors, I just did a zpool clear data (and I regret naming my zpool data, because the way that command reads scares me), and then ran a second scrub. But the checksum errors persisted.
zpool status
I don't have ECC memory, and running a full memtest is still on my TODO list for when I can power off the machine for a few hours.
SMARTCTL Extended Tests
In the meantime there are other tests I can do. I ran a full extended SMART test, but that returned no issues.
sudo smartctl -a /dev/disk/by-id/wwn-0x5000c5009399acab
DMESG
I don't really know what I'm reading, but I've seen posts saying that dmesg contains useful debugging information. I think the relevant section for this error is:
And if I'm reading it right, it's pointing out which sectors on the disk are having trouble. I also read you could test sectors with hdparm, so I ran:
It seems to succeed without error, so either I did it wrong, or that isn't the error.
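As a cross-check on the sector numbers, the zio offsets and the blk_update_request sectors in the dmesg excerpt quoted earlier in the thread appear to describe the same reads. A minimal sketch of the conversion, assuming 512-byte logical sectors and a partition that starts at sector 2048; both are assumptions, although they are consistent with the logged numbers. The function name zio_offset_to_sector is made up for illustration.

```shell
# Convert a zio byte offset (as logged for the -part1 vdev) into an
# absolute 512-byte sector number on the whole disk.
# The default partition start of 2048 sectors is an assumption; check
# the real value with `lsblk -o NAME,START` or `fdisk -l`.
zio_offset_to_sector() {
    local offset_bytes="$1"
    local part_start_sector="${2:-2048}"
    echo $(( offset_bytes / 512 + part_start_sector ))
}

zio_offset_to_sector 8882815791104
```

With the first logged offset this gives 17349251640, the same sector that blk_update_request reports for that read, so both log lines point at the same location on the disk.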
Summary
So now I'm in the situation where zpool status says /dev/sda has a checksum error, smartctl says everything is ok, dmesg is flagging some sectors, but hdparm says they are fine.
The next thing I want to try is running badblocks in non-destructive mode, but to do that I need to remove the zfs device first, and I'm unsure about how to do that correctly in this use-case.
I will say that all of the data I'm working with is backed up or recomputable.