Linux: Disk diagnostics using smartctl

Disk problems such as corruption or defective blocks are thankfully rare. When they do occur, however, diagnosing them can feel like a real struggle. It is easy to forget that disks have integrated S.M.A.R.T. technology for years. Better still, it can easily be queried from the Linux command line with smartctl.

 

What is S.M.A.R.T. exactly?

S.M.A.R.T., which stands for Self-Monitoring, Analysis and Reporting Technology (also written SMART), is a monitoring system built into hard disks that detects and reports various indicators of reliability.

SMART is the successor of several earlier disk monitoring solutions such as PFA (IBM, 1992) and IntelliSafe (Compaq, 1995). Joint work between Compaq, IBM, Seagate, Quantum, Conner and Western Digital resulted in the SMART standard.

SMART has been part of the AT Attachment (ATA) standard since 2004 and provides several kinds of information about a disk, such as:

  • Status
  • Manufacturer information (brand, serial number…)
  • Inability to read or write data

In addition, SMART collects data through offline monitoring and uses thresholds to warn about potential disk failures. Finally, it allows you to run different disk integrity tests.

More detailed information about the SMART standard can be found here: http://en.wikipedia.org/wiki/S.M.A.R.T
Getting information out of SMART under Linux is easy, as the smartctl binary is available.

Basic operations

As with any Linux tool, you can get help on the usage of smartctl:

srvoeltest1:~ # smartctl --help
smartctl 5.39 2008-10-24 22:33 [x86_64-suse-linux-gnu] (openSUSE RPM)
Copyright (C) 2002-8 by Bruce Allen, http://smartmontools.sourceforge.net

Usage: smartctl [options] device
  
 ============================================ SHOW INFORMATION OPTIONS =====
  
 -h, --help, --usage
 Display this help and exit
  
 -V, --version, --copyright, --license
 Print license, copyright, and version information and exit
  
 -i, --info
 Show identity information for device
  
 -a, --all
 Show all SMART information for device
  
 ================================== SMARTCTL RUN-TIME BEHAVIOR OPTIONS =====
  
 -q TYPE, --quietmode=TYPE (ATA)
 Set smartctl quiet mode to one of: errorsonly, silent, noserial
...
...
...

It is also a good idea to look at the smartctl man page: http://smartmontools.sourceforge.net/man/smartctl.8.html
The first basic operation is then to get the basic information about a disk. To do so, run: smartctl -i /dev/<disk>

srvoeltest1:~ # smartctl -i /dev/sda
 smartctl 5.39 2008-10-24 22:33 [x86_64-suse-linux-gnu] (openSUSE RPM)
 Copyright (C) 2002-8 by Bruce Allen, http://smartmontools.sourceforge.net
  
 === START OF INFORMATION SECTION ===
 Device Model: GB1000EAMYC
 Serial Number: WMATV9306345
 Firmware Version: HPG3
 User Capacity: 1,000,204,886,016 bytes
 Device is: Not in smartctl database [for details use: -P showall]
 ATA Version is: 7
 ATA Standard is: ATA/ATAPI-7 T13 1532D revision 4a
 Local Time is: Wed Sep 5 13:24:31 2012 CEST
 SMART support is: Available - device has SMART capability.
 SMART support is: Enabled

Using the -i option, we can see information such as the model, serial number and capacity and, more interestingly, we can check whether SMART is available and enabled for the disk.

If it is not enabled yet, you can enable it with the command: smartctl -s on /dev/<disk>
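
The check-then-enable step can also be scripted. Here is a minimal sketch, assuming the "SMART support is:" lines look like the ones in the output above. The function name smart_enabled is made up for this illustration and is not part of smartmontools.

```shell
# Hypothetical helper: reads `smartctl -i` output on stdin and succeeds
# (exit status 0) only if a "SMART support is: Enabled" line is present.
smart_enabled() {
  grep -q 'SMART support is:[[:space:]]*Enabled'
}

# Example use (needs root and a real device, so it is left commented out):
#   smartctl -i /dev/sda | smart_enabled || smartctl -s on /dev/sda
```

The helper only inspects text, so it can be tried on a saved copy of the output before being pointed at a real disk.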

Check disk health

Now that we have the basic information about the disks, we may want to check their state. Let’s first check whether we can still see them and access them. This is done using the command: smartctl -H /dev/<disk>

srvoeltest1:~ # smartctl -H /dev/sda
 smartctl 5.39 2008-10-24 22:33 [x86_64-suse-linux-gnu] (openSUSE RPM)
 Copyright (C) 2002-8 by Bruce Allen, http://smartmontools.sourceforge.net
  
 === START OF READ SMART DATA SECTION ===
 SMART overall-health self-assessment test result: PASSED
  
 srvoeltest1:~ # smartctl -H /dev/sdd
 smartctl 5.39 2008-10-24 22:33 [x86_64-suse-linux-gnu] (openSUSE RPM)
 Copyright (C) 2002-8 by Bruce Allen, http://smartmontools.sourceforge.net
Smartctl open device: /dev/sdd failed: No such device

Remember that this check is a simple test of disk availability. A PASSED result doesn’t mean that the disk is healthy. On the other hand, any result other than PASSED tells you that you should replace the disk.
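
When several disks have to be checked, the result line can be extracted with a small filter. This is only a sketch, assuming the "test result:" line is formatted as in the output above. The name health_status is hypothetical.

```shell
# Hypothetical helper: reads `smartctl -H` output on stdin and prints the
# word after "test result:" (PASSED in the example above).
health_status() {
  awk -F': *' '/overall-health self-assessment test result/ { print $2 }'
}

# Example use on several disks (needs root; device names are placeholders):
#   for d in /dev/sda /dev/sdb; do
#     printf '%s: %s\n' "$d" "$(smartctl -H "$d" | health_status)"
#   done
```

An empty result means that no health line was found, which is what happens when the device cannot be opened, as with /dev/sdd above.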

As a second step, you may want to check the disk capabilities using: smartctl -c /dev/<disk>

srvoeltest1:~ # smartctl -c /dev/sda
 smartctl 5.39 2008-10-24 22:33 [x86_64-suse-linux-gnu] (openSUSE RPM)
 Copyright (C) 2002-8 by Bruce Allen, http://smartmontools.sourceforge.net
  
 === START OF READ SMART DATA SECTION ===
 General SMART Values:
 Offline data collection status: (0x84) Offline data collection activity
 was suspended by an interrupting command from host
 Auto Offline Data Collection: Enabled.
 Self-test execution status: ( 0) The previous self-test routine complete
 without error or no self-test has ever
 been run.
 Total time to complete Offline
 data collection: (18000) seconds.
 Offline data collection
 capabilities: (0x7b) SMART execute Offline immediate.
 Auto Offline data collection on/off support.
 Suspend Offline collection upon new
 command.
Offline surface scan supported.
 Self-test supported.
 Conveyance Self-test supported.
 Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
 Supports SMART auto save timer.
 Error logging capability: (0x01) Error logging supported.
 General Purpose Logging supported.
 Short self-test routine
 recommended polling time: ( 2) minutes.
 Extended self-test routine
 recommended polling time: ( 208) minutes.
 Conveyance self-test routine
 recommended polling time: ( 5) minutes.
 SCT capabilities: (0x303f) SCT Status supported.
 SCT Feature Control supported.
 SCT Data Table supported.

The capabilities tell us several things:

  • Offline data collection status
  • Last self-test status
  • Types of test available and their duration
    In my example above, for instance, a short self-test would take approximately 2 minutes.
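
To read the test durations without scanning the whole listing, the "recommended polling time" lines can be paired with the test name on the line before them. The following is a rough sketch, assuming the two-line layout shown above. The name polling_times is made up for this example.

```shell
# Hypothetical helper: reads `smartctl -c` output on stdin and prints one
# "<test> <minutes>" pair per self-test type.
polling_times() {
  awk '
    /self-test routine[ \t]*$/ { name = $1 }   # e.g. "Short self-test routine"
    /recommended polling time:/ {
      minutes = $0
      gsub(/[^0-9]/, "", minutes)              # keep only the digits
      print name, minutes
    }'
}

# Example use (needs root):
#   smartctl -c /dev/sda | polling_times
```

With the capabilities shown above, this would print Short 2, Extended 208 and Conveyance 5, one pair per line.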

Going deeper into the disk health check leads us to the so-called disk attributes. These are metrics collected for the disk, each with its current value, the worst value recorded and the vendor-defined threshold. They include counters that help to detect or anticipate potential failures, such as:

  • Raw Read Error Rate
  • Reallocated Sector Count
  • Seek Error Rate
  • Reallocated Event Count

The current state of the attributes can be obtained using the command: smartctl -A /dev/<disk>

srvoeltest1:~ # smartctl -A /dev/sdb
 smartctl 5.39 2008-10-24 22:33 [x86_64-suse-linux-gnu] (openSUSE RPM)
 Copyright (C) 2002-8 by Bruce Allen, http://smartmontools.sourceforge.net
  
 === START OF READ SMART DATA SECTION ===
 SMART Attributes Data Structure revision number: 10
 Vendor Specific SMART Attributes with Thresholds:
 ID# ATTRIBUTE_NAME VALUE WORST THRESH TYPE UPDATED RAW_VALUE
 1 Raw_Read_Error_Rate 080 063 044 Pre-fail Always 115187504
 3 Spin_Up_Time 093 093 000 Pre-fail Always 0
4 Start_Stop_Count 100 100 020 Old_age Always 26
 5 Reallocated_Sector_Ct 100 100 036 Pre-fail Always 1
 7 Seek_Error_Rate 087 060 030 Pre-fail Always 534389509
 9 Power_On_Hours 087 087 000 Old_age Always 12084
 10 Spin_Retry_Count 100 100 097 Pre-fail Always 0
 12 Power_Cycle_Count 100 100 020 Old_age Always 26
 180 Unknown_Attribute 100 100 000 Pre-fail Always 1535031962
 184 Unknown_Attribute 100 100 003 Old_age Always 0
 187 Reported_Uncorrect 100 100 000 Old_age Always 0
 188 Unknown_Attribute 100 100 000 Old_age Always 0
 189 High_Fly_Writes 100 100 000 Old_age Always 0
 190 Airflow_Temp_Cel 073 065 045 Old_age Always 27
 191 G-Sense_Error_Rate 100 100 000 Old_age Always 0
 192 Power-Off_Retract_Ct 100 100 000 Old_age Always 22
 193 Load_Cycle_Count 100 100 000 Old_age Always 26
 194 Temperature_Celsius 027 040 000 Old_age Always 27
 195 Hardware_ECC_Reco 034 025 000 Old_age Always 115187504
 196 Reallocated_Event_Ct 100 100 036 Pre-fail Always 1
 197 Curr_Pending_Sector 100 100 000 Old_age Always 0
 198 Offline_Uncorrectable 100 100 000 Old_age Offline 0
 199 UDMA_CRC_Error_Count 200 200 000 Old_age Always 0
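
A table like this is easier to scan when it is filtered down to a few counters. The sketch below assumes that the raw value is the last column, as in the table above. The choice of attributes 5, 196, 197 and 198 is mine, and the name suspect_attributes is hypothetical.

```shell
# Hypothetical helper: reads `smartctl -A` output on stdin and prints
# "<id> <name> <raw>" for sector-related attributes with a non-zero raw value.
# Assumes the raw value is the last column of each attribute line.
suspect_attributes() {
  awk '
    ($1 == 5 || $1 == 196 || $1 == 197 || $1 == 198) && $NF + 0 > 0 {
      print $1, $2, $NF
    }'
}

# Example use (needs root):
#   smartctl -A /dev/sdb | suspect_attributes
```

On the table above, this would keep only Reallocated_Sector_Ct and Reallocated_Event_Ct, both with a raw value of 1.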

According to the attributes provided by SMART, this disk doesn’t look to be in good shape. So why not look at its SMART error log? Let’s run smartctl -l error /dev/<disk>

srvoeltest1:~ # smartctl -l error /dev/sdc
 smartctl 5.39 2008-10-24 22:33 [x86_64-suse-linux-gnu] (openSUSE RPM)
 Copyright (C) 2002-8 by Bruce Allen, http://smartmontools.sourceforge.net
  
 === START OF READ SMART DATA SECTION ===
 SMART Error Log Version: 1
 Error 14048 occurred at disk power-on lifetime: 5037 hours (209 days + 21 hours)
 When the command that caused the error occurred, the device was active or
 idle.
 
 After command completion occurred, registers were:
ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 40 51 08 95 58 37 e2 Error: UNC 8 sectors at LBA = 0x02375895 = 37181589
  
 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
 -- -- -- -- -- -- -- -- ---------------- --------------------
 c8 00 08 8e 58 37 e2 08 3d+10:30:59.131 READ DMA
 c8 00 08 86 58 37 e2 08 3d+10:30:59.042 READ DMA
 b0 d1 01 00 4f c2 00 08 3d+10:30:59.039 SMART READ ATTRIBUTE THRESHOLDS [OBS-4]
 c8 00 08 7e 58 37 e2 08 3d+10:30:58.713 READ DMA
 ec 00 00 00 00 00 a0 08 3d+10:30:58.688 IDENTIFY DEVICE
  
 Error 14046 occurred at disk power-on lifetime: 5037 hours (209 days + 21 hours)
 When the command that caused the error occurred, the device was active or idle.
  
 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 40 51 08 7d 58 37 e2 Error: UNC 8 sectors at LBA = 0x0237587d = 37181565
 
 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
 -- -- -- -- -- -- -- -- ---------------- --------------------
 c8 00 08 76 58 37 e2 08 3d+10:30:54.293 READ DMA
 ec 00 00 00 00 00 a0 08 3d+10:30:54.269 IDENTIFY DEVICE
 ef 03 46 00 00 00 a0 08 3d+10:30:54.262 SET FEATURES [Set transfer mode]
 ec 00 00 00 00 00 a0 08 3d+10:30:54.253 IDENTIFY DEVICE
 c8 00 08 76 58 37 e2 08 3d+10:30:52.335 READ DMA
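
The error log can get long, so it may help to reduce it to the list of failing addresses. The filter below is a sketch that assumes the "Error: UNC … at LBA = 0x… = …" wording shown above. The name failing_lbas is made up for this example.

```shell
# Hypothetical helper: reads `smartctl -l error` output on stdin and prints
# the decimal LBA of every uncorrectable (UNC) error, sorted, without duplicates.
failing_lbas() {
  sed -n 's/.*Error: UNC .* at LBA = 0x[0-9a-fA-F]* = \([0-9][0-9]*\).*/\1/p' |
    sort -n -u
}

# Example use (needs root):
#   smartctl -l error /dev/sdc | failing_lbas
```

For the two errors listed above, this would print 37181565 and 37181589, which shows at a glance that both failures sit in the same small region of the disk.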

Running self-test checks

If you have issues with a disk, it is a good idea to run a self-test. Different types of test are available with SMART:

  • short
    Basic tests
  • long
    Extended SMART tests. These usually run for tens of minutes, or longer on large disks (the drive above reports 208 minutes)
  • conveyance
    Test dedicated to detecting damage incurred during transport
  • select
    Selective self-test that covers a range of disk Logical Block Addresses (LBA)

Run the test with the command: smartctl -t <type> /dev/<disk>

srvoeltest1:~ # smartctl -t short /dev/sdc
 smartctl 5.39 2008-10-24 22:33 [x86_64-suse-linux-gnu] (openSUSE RPM)
 Copyright (C) 2002-8 by Bruce Allen, http://smartmontools.sourceforge.net
  
 === START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
 Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
 Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
 Please wait 2 minutes for test to complete.
 Test will complete after Wed Sep 5 11:44:26 2012
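
For the select type, the LBA range is given as part of the type argument, in the form select,N-M. As a sketch only, the helper below builds such a command around a suspicious address, for example one reported in the error log. It prints the command instead of running it. The name selective_test_cmd and the idea of a margin around the address are my own assumptions.

```shell
# Hypothetical helper: prints (does not run) a selective self-test command
# covering MARGIN blocks on each side of LBA.
#   $1 = device, $2 = LBA, $3 = margin in blocks
selective_test_cmd() {
  first=$(( $2 > $3 ? $2 - $3 : 0 ))   # never go below LBA 0
  last=$(( $2 + $3 ))                  # not clamped to the disk size
  printf 'smartctl -t select,%s-%s %s\n' "$first" "$last" "$1"
}

# Example use:
#   selective_test_cmd /dev/sdc 37181589 100
```

Printing the command first leaves you the chance to review the range before starting the test on a disk that is already struggling.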

Once started, the test runs in the background. To check its results, use the command:

smartctl -l selftest /dev/<disk>

srvoeltest1:~ # smartctl -l selftest /dev/sdc
 smartctl 5.39 2008-10-24 22:33 [x86_64-suse-linux-gnu] (openSUSE RPM)
 Copyright (C) 2002-8 by Bruce Allen, http://smartmontools.sourceforge.net
  
 === START OF READ SMART DATA SECTION ===
 SMART Self-test log structure revision number 1
 Num Test_Description   Status                   Remaining LifeTime(hours) LBA_of_first_error
 # 1 Short offline      Completed: read failure  90%       6091            356245779

In the example above, we can see that the test failed after completing 10% (90% remaining), at Logical Block Address 356245779.
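
The same reading of the log can be automated. This is a minimal sketch, assuming that log entries start with "#", contain the remaining percentage as a separate field and end with the LBA column, as in the output above. The name selftest_progress is hypothetical.

```shell
# Hypothetical helper: reads `smartctl -l selftest` output on stdin and, for
# each log entry, prints "<percent completed> <LBA of first error>".
selftest_progress() {
  awk '
    $1 ~ /^#/ {
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^[0-9]+%$/) {
          remaining = $i
          sub(/%/, "", remaining)        # "90%" becomes "90"
          print 100 - remaining, $NF     # completed share and last column
        }
      }
    }'
}

# Example use (needs root):
#   smartctl -l selftest /dev/sdc | selftest_progress
```

For the log entry above, the result would be 10 followed by 356245779.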

One last trick: the command that displays the test results doesn’t refresh automatically, so you may have to run it several times until the test has finished. An easy workaround is to run it as follows: watch smartctl -l selftest /dev/<disk>

This refreshes the output every 2 seconds.
I hope that this smartctl overview will help.

Posted - Wed, Jul 25, 2018 2:07 PM.
Online URL: http://kb.ictbanking.net/article.php?id=333
