diff --git a/pages/dedibox-hardware/troubleshooting/diagnose-defective-disk.mdx b/pages/dedibox-hardware/troubleshooting/diagnose-defective-disk.mdx index ac6ee32cdc..0ed5245f6d 100644 --- a/pages/dedibox-hardware/troubleshooting/diagnose-defective-disk.mdx +++ b/pages/dedibox-hardware/troubleshooting/diagnose-defective-disk.mdx @@ -10,7 +10,7 @@ dates: validation: 2025-02-06 posted: 2021-11-02 categories: - - dedibox-servers + - dedibox-hardware --- `Smartmontools` is a set of tools that controls and monitors a disk using the **SMART** standard (Self-Monitoring, Analysis, and Reporting Technology System). @@ -56,7 +56,7 @@ On these servers, the physical disks are referred to as `sg*` devices. As the devices can be positioned a little further away, do not hesitate to test up to `sg5` if you do not have conclusive results. -### Dell PERC H310 controller +### Dell PERC controller (H310, H700, H710, H730-P, LSI9361) Two possibilities exist for this type of controller: @@ -83,7 +83,7 @@ The first one displays the status of the RAID volume, whilst the second one disp smartctl -s on -a -d megaraid,${i} ${DEVICE} -T permissive done ``` -## How to check an HP multi-disk server +## How to check an HP multi-disk server (P410, P420, P222) 1. Log into your server using SSH. 2. Run the following command to display the status of the RAID: @@ -121,7 +121,7 @@ The first one displays the status of the RAID volume, whilst the second one disp ### How to configure SMARTD -Below, you find an example of a single-disk server installed on a Debian-like machine. +Below, you will find an example of a single-disk server installed on a Debian-like machine. The following commands are to be executed as `root` or via `sudo`. @@ -193,4 +193,260 @@ Local Time is: Fri Oct 29 11:20:27 2010 CEST For more information on Smartmontools, refer to the [official documentation](https://www.smartmontools.org/wiki/TocDoc). - \ No newline at end of file + + + + + The example below shows SMART data for the HDD storage type: + + ``` + === START OF INFORMATION SECTION === + Model Family: Seagate Constellation ES.3 + Device Model: ST1000NM0033-9ZM173 + Serial Number: Z1W2P3WL + LU WWN Device Id: 5 000c50 0790721c5 + Add. Product Id: DELL(tm) + Firmware Version: GA0A + User Capacity: 1 000 204 886 016 bytes [1,00 TB] + Sector Size: 512 bytes logical/physical + Rotation Rate: 7200 rpm + Form Factor: 3.5 inches + Device is: In smartctl database [for details use: -P show] + ATA Version is: ACS-2 (minor revision not indicated) + SATA Version is: SATA 3.0, 3.0 Gb/s (current: 3.0 Gb/s) + Local Time is: Wed Jan 22 11:26:49 2025 CET + SMART support is: Available - device has SMART capability. + SMART support is: Enabled + + === START OF READ SMART DATA SECTION === + SMART overall-health self-assessment test result: PASSED + + General SMART Values: + Offline data collection status: (0x82) Offline data collection activity + was completed without error. + Auto Offline Data Collection: Enabled. + Self-test execution status: ( 0) The previous self-test routine completed + without error or no self-test has ever + been run. + Total time to complete Offline + data collection: ( 90) seconds. + Offline data collection + capabilities: (0x7b) SMART execute Offline immediate. + Auto Offline data collection on/off support. + Suspend Offline collection upon new + command. + Offline surface scan supported. + Self-test supported. + Conveyance Self-test supported. + Selective Self-test supported. + SMART capabilities: (0x0003) Saves SMART data before entering + power-saving mode. + Supports SMART auto save timer. + Error logging capability: (0x01) Error logging supported. + General Purpose Logging supported. + Short self-test routine + recommended polling time: ( 2) minutes. + Extended self-test routine + recommended polling time: ( 115) minutes. + Conveyance self-test routine + recommended polling time: ( 3) minutes. + SCT capabilities: (0x50bd) SCT Status supported. + SCT Error Recovery Control supported. + SCT Feature Control supported. + SCT Data Table supported. + + SMART Attributes Data Structure revision number: 10 + Vendor Specific SMART Attributes with Thresholds: + ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE + 1 Raw_Read_Error_Rate 0x010f 079 063 044 Pre-fail Always - 90441339 + 3 Spin_Up_Time 0x0103 096 095 000 Pre-fail Always - 0 + 4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 26 + 5 Reallocated_Sector_Ct 0x0133 100 100 010 Pre-fail Always - 0 + 7 Seek_Error_Rate 0x000f 093 060 030 Pre-fail Always - 2198492836 + 9 Power_On_Hours 0x0032 094 011 000 Old_age Always - 5442 + 10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0 + 12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 18 + 184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0 + 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0 + 188 Command_Timeout 0x0032 100 100 000 Old_age Always - 1 + 189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0 + 190 Airflow_Temperature_Cel 0x0022 071 061 045 Old_age Always - 29 (Min/Max 27/34) + 191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0 + 192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 9 + 193 Load_Cycle_Count 0x0032 094 094 000 Old_age Always - 12859 + 194 Temperature_Celsius 0x0022 029 040 000 Old_age Always - 29 (0 22 0 0 0) + 195 Hardware_ECC_Recovered 0x001a 046 015 000 Old_age Always - 90441339 + 196 Reallocated_Event_Count 0x0032 000 000 000 Old_age Always - 65535 + 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0 + 198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0 + 199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0 + 240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 62209 (42 197 0) + 241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 75618145300 + 242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 528734761477 + + SMART Error Log Version: 1 + No Errors Logged + ``` + If `total_uncorrected_errors` or `errors_corrected_by_rereads_rewrites` is > 0, the disk is out of order. + + + The example below shows SMART data for the SSD storage type: + + ``` + === START OF INFORMATION SECTION === + Model Family: Crucial/Micron MX1/2/300, M5/600, 1100 Client SSDs + Device Model: Micron_1100_MTFDDAK512TBN + Serial Number: 1709160C2354 + LU WWN Device Id: 5 00a075 1160c2354 + Firmware Version: M0MU031 + User Capacity: 512 110 190 592 bytes [512 GB] + Sector Size: 512 bytes logical/physical + Rotation Rate: Solid State Device + Form Factor: 2.5 inches + Device is: In smartctl database [for details use: -P show] + ATA Version is: ACS-3 T13/2161-D revision 5 + SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s) + Local Time is: Wed Jan 22 11:24:34 2025 CET + SMART support is: Available - device has SMART capability. + SMART support is: Enabled + + === START OF READ SMART DATA SECTION === + SMART overall-health self-assessment test result: PASSED + + General SMART Values: + Offline data collection status: (0x03) Offline data collection activity + is in progress. + Auto Offline Data Collection: Disabled. + Self-test execution status: ( 0) The previous self-test routine completed + without error or no self-test has ever + been run. + Total time to complete Offline + data collection: ( 913) seconds. + Offline data collection + capabilities: (0x7b) SMART execute Offline immediate. + Auto Offline data collection on/off support. + Suspend Offline collection upon new + command. + Offline surface scan supported. + Self-test supported. + Conveyance Self-test supported. + Selective Self-test supported. + SMART capabilities: (0x0003) Saves SMART data before entering + power-saving mode. + Supports SMART auto save timer. + Error logging capability: (0x01) Error logging supported. + General Purpose Logging supported. + Short self-test routine + recommended polling time: ( 2) minutes. + Extended self-test routine + recommended polling time: ( 7) minutes. + Conveyance self-test routine + recommended polling time: ( 3) minutes. + SCT capabilities: (0x0035) SCT Status supported. + SCT Feature Control supported. + SCT Data Table supported. + + SMART Attributes Data Structure revision number: 16 + Vendor Specific SMART Attributes with Thresholds: + ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE + 1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 11 + 5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 10 + 9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 63309 + 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 12 + 171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 1 + 172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0 + 173 Ave_Block-Erase_Count 0x0032 060 060 000 Old_age Always - 610 + 174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 6 + 183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0 + 184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0 + 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0 + 194 Temperature_Celsius 0x0022 068 047 000 Old_age Always - 32 (Min/Max 24/53) + 196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 10 + 197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0 + 198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0 + 199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0 + 202 Percent_Lifetime_Used 0x0030 060 060 001 Old_age Offline - 40 + 206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 1 + 246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 72065906327 + 247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 2254963742 + 248 Bckgnd_Program_Page_Cnt 0x0032 100 100 000 Old_age Always - 15919135484 + 180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2459 + 210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 44 + + SMART Error Log Version: 1 + No Errors Logged + ``` + If the `RAW_VALUE` column for `Reallocated_Sector_Ct` or ` Runtime_Bad_Block` or `Current_Pending_Sector` is > 5, the disk can already be considered as unhealthy. If it is > 20, the disk is out of order. + + + The example below shows SMART data for the NVMe storage type: + + ``` + === START OF INFORMATION SECTION === + Model Number: SKHynix_HFS512GEJ9X164N + Serial Number: 4YC8N008713108B48 + Firmware Version: 51770C30 + PCI Vendor/Subsystem ID: 0x1c5c + IEEE OUI Identifier: 0xace42e + Controller ID: 1 + NVMe Version: 1.4 + Number of Namespaces: 1 + Namespace 1 Size/Capacity: 512,110,190,592 [512 GB] + Namespace 1 Formatted LBA Size: 512 + Namespace 1 IEEE EUI-64: ace42e 003abd04e2 + Local Time is: Wed Jan 22 11:21:05 2025 CET + Firmware Updates (0x16): 3 Slots, no Reset required + Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test + Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp + Log Page Attributes (0x1e): Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg + Maximum Data Transfer Size: 64 Pages + Warning Comp. Temp. Threshold: 86 Celsius + Critical Comp. Temp. Threshold: 87 Celsius + + Supported Power States + St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat + 0 + 4.5000W - - 0 0 0 0 100 100 + 1 + 3.0000W - - 1 1 1 1 200 200 + 2 + 0.6000W - - 2 2 2 2 400 400 + 3 - 0.0150W - - 3 3 3 3 2000 2000 + 4 - 0.0030W - - 4 4 4 4 5000 10000 + + Supported LBA Sizes (NSID 0x1) + Id Fmt Data Metadt Rel_Perf + 0 + 512 0 0 + + === START OF SMART DATA SECTION === + SMART overall-health self-assessment test result: PASSED + + SMART/Health Information (NVMe Log 0x02) + Critical Warning: 0x00 + Temperature: 42 Celsius + Available Spare: 100% + Available Spare Threshold: 10% + Percentage Used: 1% + Data Units Read: 5,718,407 [2.92 TB] + Data Units Written: 9,717,865 [4.97 TB] + Host Read Commands: 43,061,485 + Host Write Commands: 142,156,172 + Controller Busy Time: 5,906 + Power Cycles: 1,315 + Power On Hours: 2,261 + Unsafe Shutdowns: 56 + Media and Data Integrity Errors: 0 + Error Information Log Entries: 0 + Warning Comp. Temperature Time: 0 + Critical Comp. Temperature Time: 0 + Temperature Sensor 1: 44 Celsius + Temperature Sensor 2: 42 Celsius + + Error Information (NVMe Log 0x01, 16 of 256 entries) + No Errors Logged + + Read Self-test Log failed: Invalid Field in Command (0x002) + ``` + + + + +If you encounter **Health status: Failed** or **Failing Now**, the disk is considered out of order. Make sure that you have backups, then open a [support ticket](/account/how-to/open-a-support-ticket/) and ask for the disk to be replaced, indicating the serial number with the result of the `smartctl` command. +