Commit d68c7b8

nerda-codes authored and bene2k1 committed
docs(dedibox-hardware): add missing content (#4422)
1 parent 053743b commit d68c7b8

1 file changed

+261
-5
lines changed

pages/dedibox-hardware/troubleshooting/diagnose-defective-disk.mdx

Lines changed: 261 additions & 5 deletions
@@ -10,7 +10,7 @@ dates:
1010
validation: 2025-02-06
1111
posted: 2021-11-02
1212
categories:
13-
- dedibox-servers
13+
- dedibox-hardware
1414
---
1515

1616
`Smartmontools` is a set of tools that controls and monitors a disk using the **SMART** standard (Self-Monitoring, Analysis, and Reporting Technology System).
@@ -56,7 +56,7 @@ On these servers, the physical disks are referred to as `sg*` devices.
5656
The disks may be mapped to higher-numbered devices, so test up to `sg5` if the first devices do not give conclusive results.
5757
</Message>
5858

59-
### Dell PERC H310 controller
59+
### Dell PERC controller (H310, H700, H710, H730-P, LSI9361)
6060

6161
Two possibilities exist for this type of controller:
6262

@@ -83,7 +83,7 @@ The first one displays the status of the RAID volume, whilst the second one disp
8383
smartctl -s on -a -d megaraid,${i} ${DEVICE} -T permissive
8484
done
8585
```
86-
## How to check an HP multi-disk server
86+
## How to check an HP multi-disk server (P410, P420, P222)
8787

8888
1. Log into your server using SSH.
8989
2. Run the following command to display the status of the RAID:
@@ -121,7 +121,7 @@ The first one displays the status of the RAID volume, whilst the second one disp
121121

122122
### How to configure SMARTD
123123

124-
Below, you find an example of a single-disk server installed on a Debian-like machine.
124+
Below, you will find an example of a single-disk server installed on a Debian-like machine.
125125

126126
<Message type="note">
127127
The following commands are to be executed as `root` or via `sudo`.
@@ -193,4 +193,260 @@ Local Time is: Fri Oct 29 11:20:27 2010 CEST
193193

194194
<Message type="tip">
195195
For more information on Smartmontools, refer to the [official documentation](https://www.smartmontools.org/wiki/TocDoc).
196-
</Message>
196+
</Message>
197+
198+
<Tabs id="Smart data examples">
199+
<TabsTab label="HDD example">
200+
The example below shows SMART data for the HDD storage type:
201+
202+
```
203+
=== START OF INFORMATION SECTION ===
204+
Model Family: Seagate Constellation ES.3
205+
Device Model: ST1000NM0033-9ZM173
206+
Serial Number: Z1W2P3WL
207+
LU WWN Device Id: 5 000c50 0790721c5
208+
Add. Product Id: DELL(tm)
209+
Firmware Version: GA0A
210+
User Capacity: 1 000 204 886 016 bytes [1,00 TB]
211+
Sector Size: 512 bytes logical/physical
212+
Rotation Rate: 7200 rpm
213+
Form Factor: 3.5 inches
214+
Device is: In smartctl database [for details use: -P show]
215+
ATA Version is: ACS-2 (minor revision not indicated)
216+
SATA Version is: SATA 3.0, 3.0 Gb/s (current: 3.0 Gb/s)
217+
Local Time is: Wed Jan 22 11:26:49 2025 CET
218+
SMART support is: Available - device has SMART capability.
219+
SMART support is: Enabled
220+
221+
=== START OF READ SMART DATA SECTION ===
222+
SMART overall-health self-assessment test result: PASSED
223+
224+
General SMART Values:
225+
Offline data collection status: (0x82) Offline data collection activity
226+
was completed without error.
227+
Auto Offline Data Collection: Enabled.
228+
Self-test execution status: ( 0) The previous self-test routine completed
229+
without error or no self-test has ever
230+
been run.
231+
Total time to complete Offline
232+
data collection: ( 90) seconds.
233+
Offline data collection
234+
capabilities: (0x7b) SMART execute Offline immediate.
235+
Auto Offline data collection on/off support.
236+
Suspend Offline collection upon new
237+
command.
238+
Offline surface scan supported.
239+
Self-test supported.
240+
Conveyance Self-test supported.
241+
Selective Self-test supported.
242+
SMART capabilities: (0x0003) Saves SMART data before entering
243+
power-saving mode.
244+
Supports SMART auto save timer.
245+
Error logging capability: (0x01) Error logging supported.
246+
General Purpose Logging supported.
247+
Short self-test routine
248+
recommended polling time: ( 2) minutes.
249+
Extended self-test routine
250+
recommended polling time: ( 115) minutes.
251+
Conveyance self-test routine
252+
recommended polling time: ( 3) minutes.
253+
SCT capabilities: (0x50bd) SCT Status supported.
254+
SCT Error Recovery Control supported.
255+
SCT Feature Control supported.
256+
SCT Data Table supported.
257+
258+
SMART Attributes Data Structure revision number: 10
259+
Vendor Specific SMART Attributes with Thresholds:
260+
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
261+
1 Raw_Read_Error_Rate 0x010f 079 063 044 Pre-fail Always - 90441339
262+
3 Spin_Up_Time 0x0103 096 095 000 Pre-fail Always - 0
263+
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 26
264+
5 Reallocated_Sector_Ct 0x0133 100 100 010 Pre-fail Always - 0
265+
7 Seek_Error_Rate 0x000f 093 060 030 Pre-fail Always - 2198492836
266+
9 Power_On_Hours 0x0032 094 011 000 Old_age Always - 5442
267+
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
268+
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 18
269+
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
270+
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
271+
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 1
272+
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
273+
190 Airflow_Temperature_Cel 0x0022 071 061 045 Old_age Always - 29 (Min/Max 27/34)
274+
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
275+
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 9
276+
193 Load_Cycle_Count 0x0032 094 094 000 Old_age Always - 12859
277+
194 Temperature_Celsius 0x0022 029 040 000 Old_age Always - 29 (0 22 0 0 0)
278+
195 Hardware_ECC_Recovered 0x001a 046 015 000 Old_age Always - 90441339
279+
196 Reallocated_Event_Count 0x0032 000 000 000 Old_age Always - 65535
280+
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
281+
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
282+
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
283+
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 62209 (42 197 0)
284+
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 75618145300
285+
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 528734761477
286+
287+
SMART Error Log Version: 1
288+
No Errors Logged
289+
```
290+
If `total_uncorrected_errors` or `errors_corrected_by_rereads_rewrites` is greater than 0, consider the disk faulty.
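
The ATA output above does not contain these two counters. As a minimal sketch, assuming they correspond to the `rereads/rewrites` and `Total uncorrected errors` columns of the `Error counter log` table that `smartctl` typically prints for SAS/SCSI disks, the check could be scripted as follows. The function name `check_error_counters`, the assumed row layout, and the labels it prints are illustrative, not part of Smartmontools.

```shell
#!/bin/sh
# Reads smartctl output on stdin and inspects the "Error counter log" rows.
# Assumed row layout (an assumption, verify against your own output):
#   read: <ecc fast> <ecc delayed> <rereads/rewrites> <total corrected>
#         <invocations> <gigabytes processed> <total uncorrected>
check_error_counters() {
  awk '
    BEGIN { bad = 0 }
    # Only rows labelled read:, write: or verify: with all seven counters.
    $1 ~ /^(read|write|verify):$/ && NF >= 8 {
      # Field 4 = rereads/rewrites, field 8 = total uncorrected errors.
      if ($4 + 0 > 0 || $8 + 0 > 0) bad = 1
    }
    END { print (bad ? "errors-found" : "clean") }
  '
}
```

It could then be used as `smartctl -a ${DEVICE} | check_error_counters`, run as `root` or via `sudo`.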
291+
</TabsTab>
292+
<TabsTab label="SSD example">
293+
The example below shows SMART data for the SSD storage type:
294+
295+
```
296+
=== START OF INFORMATION SECTION ===
297+
Model Family: Crucial/Micron MX1/2/300, M5/600, 1100 Client SSDs
298+
Device Model: Micron_1100_MTFDDAK512TBN
299+
Serial Number: 1709160C2354
300+
LU WWN Device Id: 5 00a075 1160c2354
301+
Firmware Version: M0MU031
302+
User Capacity: 512 110 190 592 bytes [512 GB]
303+
Sector Size: 512 bytes logical/physical
304+
Rotation Rate: Solid State Device
305+
Form Factor: 2.5 inches
306+
Device is: In smartctl database [for details use: -P show]
307+
ATA Version is: ACS-3 T13/2161-D revision 5
308+
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
309+
Local Time is: Wed Jan 22 11:24:34 2025 CET
310+
SMART support is: Available - device has SMART capability.
311+
SMART support is: Enabled
312+
313+
=== START OF READ SMART DATA SECTION ===
314+
SMART overall-health self-assessment test result: PASSED
315+
316+
General SMART Values:
317+
Offline data collection status: (0x03) Offline data collection activity
318+
is in progress.
319+
Auto Offline Data Collection: Disabled.
320+
Self-test execution status: ( 0) The previous self-test routine completed
321+
without error or no self-test has ever
322+
been run.
323+
Total time to complete Offline
324+
data collection: ( 913) seconds.
325+
Offline data collection
326+
capabilities: (0x7b) SMART execute Offline immediate.
327+
Auto Offline data collection on/off support.
328+
Suspend Offline collection upon new
329+
command.
330+
Offline surface scan supported.
331+
Self-test supported.
332+
Conveyance Self-test supported.
333+
Selective Self-test supported.
334+
SMART capabilities: (0x0003) Saves SMART data before entering
335+
power-saving mode.
336+
Supports SMART auto save timer.
337+
Error logging capability: (0x01) Error logging supported.
338+
General Purpose Logging supported.
339+
Short self-test routine
340+
recommended polling time: ( 2) minutes.
341+
Extended self-test routine
342+
recommended polling time: ( 7) minutes.
343+
Conveyance self-test routine
344+
recommended polling time: ( 3) minutes.
345+
SCT capabilities: (0x0035) SCT Status supported.
346+
SCT Feature Control supported.
347+
SCT Data Table supported.
348+
349+
SMART Attributes Data Structure revision number: 16
350+
Vendor Specific SMART Attributes with Thresholds:
351+
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
352+
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 11
353+
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 10
354+
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 63309
355+
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 12
356+
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 1
357+
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
358+
173 Ave_Block-Erase_Count 0x0032 060 060 000 Old_age Always - 610
359+
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 6
360+
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
361+
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
362+
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
363+
194 Temperature_Celsius 0x0022 068 047 000 Old_age Always - 32 (Min/Max 24/53)
364+
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 10
365+
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
366+
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
367+
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
368+
202 Percent_Lifetime_Used 0x0030 060 060 001 Old_age Offline - 40
369+
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 1
370+
246 Total_Host_Sector_Write 0x0032 100 100 000 Old_age Always - 72065906327
371+
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 2254963742
372+
248 Bckgnd_Program_Page_Cnt 0x0032 100 100 000 Old_age Always - 15919135484
373+
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 2459
374+
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 44
375+
376+
SMART Error Log Version: 1
377+
No Errors Logged
378+
```
379+
If the `RAW_VALUE` column for `Reallocated_Sector_Ct`, `Runtime_Bad_Block`, or `Current_Pending_Sector` is greater than 5, the disk can already be considered unhealthy. If it is greater than 20, the disk is faulty.
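
The rule above can be sketched as a small filter over the attribute table. This is only an illustration: the function name `classify_smart_attributes` and the labels it prints are hypothetical, and it matches only the three attribute names listed above, so vendor-specific names such as `Reallocate_NAND_Blk_Cnt` in the example output are not taken into account.

```shell
#!/bin/sh
# Reads the attribute table of smartctl output on stdin and applies the
# thresholds from the text above (more than 5: unhealthy, more than 20: faulty).
classify_smart_attributes() {
  awk '
    BEGIN { worst = 0 }
    # Attribute rows start with a numeric ID; RAW_VALUE is the 10th field.
    $1 ~ /^[0-9]+$/ && $2 ~ /^(Reallocated_Sector_Ct|Runtime_Bad_Block|Current_Pending_Sector)$/ {
      raw = $10 + 0
      if (raw > worst) worst = raw
    }
    END {
      if (worst > 20)     print "faulty"
      else if (worst > 5) print "unhealthy"
      else                print "ok"
    }
  '
}
```

For example, `smartctl -A ${DEVICE} | classify_smart_attributes` prints one of the three labels for the given device.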
380+
</TabsTab>
381+
<TabsTab label="NVMe example">
382+
The example below shows SMART data for the NVMe storage type:
383+
384+
```
385+
=== START OF INFORMATION SECTION ===
386+
Model Number: SKHynix_HFS512GEJ9X164N
387+
Serial Number: 4YC8N008713108B48
388+
Firmware Version: 51770C30
389+
PCI Vendor/Subsystem ID: 0x1c5c
390+
IEEE OUI Identifier: 0xace42e
391+
Controller ID: 1
392+
NVMe Version: 1.4
393+
Number of Namespaces: 1
394+
Namespace 1 Size/Capacity: 512,110,190,592 [512 GB]
395+
Namespace 1 Formatted LBA Size: 512
396+
Namespace 1 IEEE EUI-64: ace42e 003abd04e2
397+
Local Time is: Wed Jan 22 11:21:05 2025 CET
398+
Firmware Updates (0x16): 3 Slots, no Reset required
399+
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
400+
Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
401+
Log Page Attributes (0x1e): Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg
402+
Maximum Data Transfer Size: 64 Pages
403+
Warning Comp. Temp. Threshold: 86 Celsius
404+
Critical Comp. Temp. Threshold: 87 Celsius
405+
406+
Supported Power States
407+
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
408+
0 + 4.5000W - - 0 0 0 0 100 100
409+
1 + 3.0000W - - 1 1 1 1 200 200
410+
2 + 0.6000W - - 2 2 2 2 400 400
411+
3 - 0.0150W - - 3 3 3 3 2000 2000
412+
4 - 0.0030W - - 4 4 4 4 5000 10000
413+
414+
Supported LBA Sizes (NSID 0x1)
415+
Id Fmt Data Metadt Rel_Perf
416+
0 + 512 0 0
417+
418+
=== START OF SMART DATA SECTION ===
419+
SMART overall-health self-assessment test result: PASSED
420+
421+
SMART/Health Information (NVMe Log 0x02)
422+
Critical Warning: 0x00
423+
Temperature: 42 Celsius
424+
Available Spare: 100%
425+
Available Spare Threshold: 10%
426+
Percentage Used: 1%
427+
Data Units Read: 5,718,407 [2.92 TB]
428+
Data Units Written: 9,717,865 [4.97 TB]
429+
Host Read Commands: 43,061,485
430+
Host Write Commands: 142,156,172
431+
Controller Busy Time: 5,906
432+
Power Cycles: 1,315
433+
Power On Hours: 2,261
434+
Unsafe Shutdowns: 56
435+
Media and Data Integrity Errors: 0
436+
Error Information Log Entries: 0
437+
Warning Comp. Temperature Time: 0
438+
Critical Comp. Temperature Time: 0
439+
Temperature Sensor 1: 44 Celsius
440+
Temperature Sensor 2: 42 Celsius
441+
442+
Error Information (NVMe Log 0x01, 16 of 256 entries)
443+
No Errors Logged
444+
445+
Read Self-test Log failed: Invalid Field in Command (0x002)
446+
```
447+
</TabsTab>
448+
</Tabs>
449+
450+
<Message type="note">
451+
If the output reports **Health status: Failed** or **Failing Now**, the disk is considered faulty. Make sure that you have backups, then open a [support ticket](/account/how-to/open-a-support-ticket/) and ask for the disk to be replaced, providing its serial number and the output of the `smartctl` command.
452+
</Message>
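
To gather the information for the ticket, the serial number and any failure marker can be pulled out of the `smartctl` output. The sketch below is a hedged example: the function name `summarize_smart_report`, the `status=` labels, and the exact strings it searches for (`FAILED` or `Failed` on a health line, `FAILING_NOW` or `Failing Now` anywhere) are assumptions based on the wording above and on typical output, so adjust them to what your disk reports.

```shell
#!/bin/sh
# Reads smartctl output on stdin and prints the serial number together with
# a flag telling whether a failure marker was found.
summarize_smart_report() {
  awk '
    # "Serial Number:" (ATA/NVMe) or "Serial number:" (SCSI); value is field 3.
    /^Serial [Nn]umber:/ { serial = $3 }
    # Health line reporting a failure (the WHEN_FAILED header is not matched).
    /[Hh]ealth.*(FAILED|Failed)/ { failed = 1 }
    # Attribute currently below its threshold.
    /FAILING_NOW|Failing Now/ { failed = 1 }
    END {
      printf "serial=%s status=%s\n", (serial != "" ? serial : "unknown"), (failed ? "failure-reported" : "no-failure-marker")
    }
  '
}
```

A possible use is `smartctl -a ${DEVICE} | summarize_smart_report`. Attach the full `smartctl` output to the ticket as well, since this summary only flags the markers listed above.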
