Disk problems such as corruption or defective blocks are thankfully rare. When they do happen, however, diagnosing them can feel like a real struggle. It is easy to forget that disks have included S.M.A.R.T. technology for years. The good news is that it can easily be queried from the Linux command line with smartctl.
What is S.M.A.R.T. exactly?
S.M.A.R.T. (also written SMART) stands for Self-Monitoring, Analysis and Reporting Technology. It is a monitoring system built into hard disks that detects and reports various indicators of reliability.
SMART is the successor of several earlier disk monitoring solutions such as PFA (IBM, 1992) and IntelliSafe (Compaq, 1995). The joint work of Compaq, IBM, Seagate, Quantum, Conner and Western Digital resulted in the SMART standard.
SMART has been part of the AT Attachment (ATA) standard since 2004 and provides information about the disk such as:
- Status
- Manufacturer information (brand, serial number…)
- Inability to read or write data
In addition, SMART collects data through offline monitoring and uses thresholds to warn about potential disk failures. Finally, it lets you run various disk integrity tests.
More detailed information about the SMART standard can be found here: http://en.wikipedia.org/wiki/S.M.A.R.T
Getting information out of SMART under Linux is pretty easy, as the smartctl binary is available.
Basic operations
As with any tool under Linux, you can get help on the usage of smartctl.
srvoeltest1:~ # smartctl --help
smartctl 5.39 2008-10-24 22:33 [x86_64-suse-linux-gnu] (openSUSE RPM)
Usage: smartctl [options] device
============================================ SHOW INFORMATION OPTIONS =====
  -h, --help, --usage
         Display this help and exit
  -V, --version, --copyright, --license
         Print license, copyright, and version information and exit
  -i, --info
         Show identity information for device
  -a, --all
         Show all SMART information for device
================================== SMARTCTL RUN-TIME BEHAVIOR OPTIONS =====
  -q TYPE, --quietmode=TYPE                                           (ATA)
         Set smartctl quiet mode to one of: errorsonly, silent, noserial
...
...
...
It is also a good idea to take a look at the smartctl man page: http://smartmontools.sourceforge.net/man/smartctl.8.html
The first basic operation is then to get the basic information about a disk. To do so, run: smartctl -i /dev/<disk>
srvoeltest1:~ # smartctl -i /dev/sda
smartctl 5.39 2008-10-24 22:33 [x86_64-suse-linux-gnu] (openSUSE RPM)
=== START OF INFORMATION SECTION ===
Device Model:     GB1000EAMYC
Serial Number:    WMATV9306345
Firmware Version: HPG3
User Capacity:    1,000,204,886,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 4a
Local Time is:    Wed Sep  5 13:24:31 2012 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
The -i option shows information such as the model, serial number and capacity. More interestingly, it also tells us whether SMART is available and enabled for the disk.
If it is not enabled yet, you can enable it with the command: smartctl -s on /dev/<disk>
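If you want to script this check, a minimal sketch could look like the following. The helper name smart_state is my own invention (it is not part of smartmontools), and the sketch assumes the "SMART support is:" lines look like the ones in the output above.

```shell
#!/bin/sh
# Sketch only: smart_state is a hypothetical helper, not part of smartmontools.
# It reads "smartctl -i" output on stdin and prints one of:
#   enabled | disabled | unavailable | unknown
smart_state() {
    awk '
        /^SMART support is *:/ {
            if ($0 ~ /Unavailable/)
                state = "unavailable"
            else if (state != "unavailable" && $0 ~ /Enabled/)
                state = "enabled"
            else if (state != "unavailable" && $0 ~ /Disabled/)
                state = "disabled"
        }
        END { print (state == "" ? "unknown" : state) }
    '
}

# Demo on canned text, so the sketch runs without a disk or root rights.
printf '%s\n' \
    'SMART support is: Available - device has SMART capability.' \
    'SMART support is: Disabled' | smart_state    # prints: disabled

# On a real system (assumption: you are root and /dev/sda is your disk):
#   [ "$(smartctl -i /dev/sda | smart_state)" = disabled ] && smartctl -s on /dev/sda
```

Parsing the text output is fragile by nature, so treat this as a convenience for a quick audit of several disks rather than as a robust tool.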
Checking disk health
Now that we have the basic information about the disks, we may want to check their state. So let’s first check whether we can still see and access them. This is done using the command: smartctl -H /dev/<disk>
srvoeltest1:~ # smartctl -H /dev/sda
smartctl 5.39 2008-10-24 22:33 [x86_64-suse-linux-gnu] (openSUSE RPM)
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
srvoeltest1:~ # smartctl -H /dev/sdd
smartctl 5.39 2008-10-24 22:33 [x86_64-suse-linux-gnu] (openSUSE RPM)
Smartctl open device: /dev/sdd failed: No such device
Remember that this is only a coarse check: it confirms that the disk can be reached and reports the drive’s own overall self-assessment. A PASSED result doesn’t mean that the disk is healthy. On the other hand, any result other than PASSED tells you that the disk should be replaced.
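Besides the text output, smartctl also reports problems through its exit status, which is a bit mask. The sketch below decodes it; the helper name explain_smartctl_status is hypothetical, and the bit meanings are paraphrased from the return values section of the smartctl man page, so please double-check them against the man page of your version.

```shell
#!/bin/sh
# Sketch only: explain_smartctl_status is a hypothetical helper name.
# It decodes the bit mask that smartctl returns as its exit status.
# Bit meanings are paraphrased from the smartctl man page.
explain_smartctl_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no problem reported"
        return 0
    fi
    bit=0
    for msg in \
        "command line did not parse" \
        "device open failed" \
        "a SMART or ATA command failed, or a checksum error occurred" \
        "SMART status check returned DISK FAILING" \
        "pre-fail attributes are at or below their threshold" \
        "attributes were at or below their threshold in the past" \
        "the device error log contains errors" \
        "the device self-test log contains errors"
    do
        # Test bit number $bit of the status value.
        if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
            echo "bit $bit: $msg"
        fi
        bit=$((bit + 1))
    done
}

explain_smartctl_status 72
# prints:
#   bit 3: SMART status check returned DISK FAILING
#   bit 6: the device error log contains errors

# On a real system (assumption: /dev/sda is your disk):
#   smartctl -H /dev/sda >/dev/null; explain_smartctl_status $?
```

The exit status is handy in scripts and cron jobs because it avoids parsing the PASSED line at all.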
As a second step, you may want to check the disk capabilities using: smartctl -c /dev/<disk>
srvoeltest1:~ # smartctl -c /dev/sda
smartctl 5.39 2008-10-24 22:33 [x86_64-suse-linux-gnu] (openSUSE RPM)
=== START OF READ SMART DATA SECTION ===
General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine complete
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (18000) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 208) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
The capabilities output tells us several things:
- Offline data collection status
- Last self-test status
- Types of tests available and their duration
In my example above, for instance, a short self-test would take approximately 2 minutes.
Going deeper into the health check leads us to the so-called disk attributes. These are metrics collected for the disk, each with its current normalized value, the worst value recorded so far and a vendor-defined threshold. They include counters that help to detect or anticipate potential failures, such as:
- Raw Read Error Rate
- Reallocated Sector Count
- Seek Error Rate
- Reallocated Event Count
The current state of the attributes can be retrieved using the command: smartctl -A /dev/<disk>
srvoeltest1:~ # smartctl -A /dev/sdb
smartctl 5.39 2008-10-24 22:33 [x86_64-suse-linux-gnu] (openSUSE RPM)
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          VALUE WORST THRESH TYPE      UPDATED  RAW_VALUE
  1 Raw_Read_Error_Rate     080   063   044    Pre-fail  Always   115187504
  3 Spin_Up_Time            093   093   000    Pre-fail  Always   0
  4 Start_Stop_Count        100   100   020    Old_age   Always   26
  5 Reallocated_Sector_Ct   100   100   036    Pre-fail  Always   1
  7 Seek_Error_Rate         087   060   030    Pre-fail  Always   534389509
  9 Power_On_Hours          087   087   000    Old_age   Always   12084
 10 Spin_Retry_Count        100   100   097    Pre-fail  Always   0
 12 Power_Cycle_Count       100   100   020    Old_age   Always   26
180 Unknown_Attribute       100   100   000    Pre-fail  Always   1535031962
184 Unknown_Attribute       100   100   003    Old_age   Always   0
187 Reported_Uncorrect      100   100   000    Old_age   Always   0
188 Unknown_Attribute       100   100   000    Old_age   Always   0
189 High_Fly_Writes         100   100   000    Old_age   Always   0
190 Airflow_Temp_Cel        073   065   045    Old_age   Always   27
191 G-Sense_Error_Rate      100   100   000    Old_age   Always   0
192 Power-Off_Retract_Ct    100   100   000    Old_age   Always   22
193 Load_Cycle_Count        100   100   000    Old_age   Always   26
194 Temperature_Celsius     027   040   000    Old_age   Always   27
195 Hardware_ECC_Reco       034   025   000    Old_age   Always   115187504
196 Reallocated_Event_Ct    100   100   036    Pre-fail  Always   1
197 Curr_Pending_Sector     100   100   000    Old_age   Always   0
198 Offline_Uncorrectable   100   100   000    Old_age   Offline  0
199 UDMA_CRC_Error_Count    200   200   000    Old_age   Always   0
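Reading such a table by eye gets tedious with many disks, so here is a minimal sketch that filters it. The helper name check_attributes and the short watch list of sector-related counters are my own choices, not something smartctl provides. Column positions are taken from the header row, so it should cope with the table layout shown above as well as with layouts that carry extra columns.

```shell
#!/bin/sh
# Sketch only: check_attributes is a hypothetical helper name.
# It reads "smartctl -A" output on stdin and prints
#   FAILING <name> ...  when the normalized VALUE is at or below a non-zero THRESH
#   WATCH <name> ...    when one of a few sector-related raw counters is non-zero
check_attributes() {
    awk '
        $1 == "ID#" {
            # Header row: remember the position of every column by name.
            for (i = 1; i <= NF; i++) col[$i] = i
            in_table = 1
            next
        }
        in_table && $1 ~ /^[0-9]+$/ {
            name   = $(col["ATTRIBUTE_NAME"])
            value  = $(col["VALUE"]) + 0
            thresh = $(col["THRESH"]) + 0
            raw    = $(col["RAW_VALUE"])
            if (thresh > 0 && value <= thresh)
                print "FAILING " name " value=" value " thresh=" thresh
            if (name ~ /^(Reallocated_Sector_Ct|Reallocated_Event_Ct|Curr_Pending_Sector|Offline_Uncorrectable)$/ && raw + 0 > 0)
                print "WATCH " name " raw=" raw
        }
    '
}

# Demo on three rows taken from the output above.
check_attributes <<'EOF'
ID# ATTRIBUTE_NAME          VALUE WORST THRESH TYPE      UPDATED  RAW_VALUE
  1 Raw_Read_Error_Rate     080   063   044    Pre-fail  Always   115187504
  5 Reallocated_Sector_Ct   100   100   036    Pre-fail  Always   1
197 Curr_Pending_Sector     100   100   000    Old_age   Always   0
EOF
# prints: WATCH Reallocated_Sector_Ct raw=1

# On a real system (assumption: /dev/sdb is your disk):
#   smartctl -A /dev/sdb | check_attributes
```

Raw values are vendor specific, which is why the sketch only looks at raw counters for a handful of attributes where a non-zero count is commonly read as a warning sign.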
According to the attributes provided by SMART, this disk doesn’t look to be in good shape. So why not look at its SMART error log? Let’s run smartctl -l error /dev/<disk>
srvoeltest1:~ # smartctl -l error /dev/sdc
smartctl 5.39 2008-10-24 22:33 [x86_64-suse-linux-gnu] (openSUSE RPM)
=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
Error 14048 occurred at disk power-on lifetime: 5037 hours (209 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 95 58 37 e2  Error: UNC 8 sectors at LBA = 0x02375895 = 37181589
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 8e 58 37 e2 08   3d+10:30:59.131  READ DMA
  c8 00 08 86 58 37 e2 08   3d+10:30:59.042  READ DMA
  b0 d1 01 00 4f c2 00 08   3d+10:30:59.039  SMART READ ATTRIBUTE THRESHOLDS [OBS-4]
  c8 00 08 7e 58 37 e2 08   3d+10:30:58.713  READ DMA
  ec 00 00 00 00 00 a0 08   3d+10:30:58.688  IDENTIFY DEVICE
Error 14046 occurred at disk power-on lifetime: 5037 hours (209 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 7d 58 37 e2  Error: UNC 8 sectors at LBA = 0x0237587d = 37181565
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 76 58 37 e2 08   3d+10:30:54.293  READ DMA
  ec 00 00 00 00 00 a0 08   3d+10:30:54.269  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08   3d+10:30:54.262  SET FEATURES [Set transfer mode]
  ec 00 00 00 00 00 a0 08   3d+10:30:54.253  IDENTIFY DEVICE
  c8 00 08 76 58 37 e2 08   3d+10:30:52.335  READ DMA
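With thousands of entries in the log, the interesting part is usually the list of affected block addresses. The following sketch pulls the decimal LBAs out of the "Error: ... at LBA = ..." lines; the helper name failing_lbas is hypothetical, and the pattern assumes the line format shown above.

```shell
#!/bin/sh
# Sketch only: failing_lbas is a hypothetical helper name.
# It reads "smartctl -l error" output on stdin and prints each distinct
# decimal LBA found in lines like
#   "... Error: UNC 8 sectors at LBA = 0x02375895 = 37181589"
failing_lbas() {
    sed -n 's/.*Error: .* at LBA = 0x[0-9a-fA-F]* = \([0-9][0-9]*\).*/\1/p' |
        sort -n -u
}

# Demo on the two error lines from the log above.
printf '%s\n' \
    '  40 51 08 95 58 37 e2  Error: UNC 8 sectors at LBA = 0x02375895 = 37181589' \
    '  40 51 08 7d 58 37 e2  Error: UNC 8 sectors at LBA = 0x0237587d = 37181565' |
    failing_lbas
# prints:
#   37181565
#   37181589

# On a real system (assumption: /dev/sdc is your disk):
#   smartctl -l error /dev/sdc | failing_lbas
```

If the addresses cluster closely together, as they do here, that hints at one damaged area of the disk rather than scattered failures.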
Running self-test checks
If you have issues with a disk, it is a good idea to run a self-test. Different types of tests are available with SMART:
- short
Basic tests
- long
Extended SMART tests. These usually run for tens of minutes or more (208 minutes according to the capabilities output above)
- conveyance
Test dedicated to detection of damage during transport
- select
Selective self-test to test a range of disk Logical Block Addresses (LBA)
Run the test with the command: smartctl -t <type> /dev/<disk>
srvoeltest1:~ # smartctl -t short /dev/sdc
smartctl 5.39 2008-10-24 22:33 [x86_64-suse-linux-gnu] (openSUSE RPM)
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Wed Sep  5 11:44:26 2012
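The select type from the list above needs an LBA range, written as select,START-END. As a small sketch, the helper below builds such a range around a suspicious block, for example one reported in the error log. The name select_span and the idea of a fixed margin on both sides are my own assumptions; the upper bound is not checked against the disk size.

```shell
#!/bin/sh
# Sketch only: select_span is a hypothetical helper name.
# Usage: select_span LBA MARGIN
# Prints a "select,START-END" argument for smartctl -t covering MARGIN
# blocks on each side of LBA (START is clamped at 0, END is not clamped).
select_span() {
    lba=$1
    margin=$2
    start=$((lba - margin))
    if [ "$start" -lt 0 ]; then
        start=0
    fi
    end=$((lba + margin))
    echo "select,${start}-${end}"
}

select_span 37181589 1000    # prints: select,37180589-37182589

# On a real system (assumption: /dev/sdc is your disk):
#   smartctl -t "$(select_span 37181589 1000)" /dev/sdc
```

A selective test over a small range finishes much faster than a long test, which makes it convenient for re-checking a known bad area.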
Once started, the test runs in the background. To check its results, use the command:
smartctl -l selftest /dev/<disk>
srvoeltest1:~ # smartctl -l selftest /dev/sdc
smartctl 5.39 2008-10-24 22:33 [x86_64-suse-linux-gnu] (openSUSE RPM)
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%      6091         356245779
In the example above, the test stopped with 90% of the run remaining, that is after about 10%, because of a read failure at Logical Block Address (LBA) 356245779.
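The same arithmetic can be scripted. The sketch below reads the most recent entry of the self-test log (the line starting with "# 1") and prints how far the test got and the first failing LBA. The helper name selftest_summary is hypothetical, and the parsing assumes the remaining percentage appears as a single field such as 90%.

```shell
#!/bin/sh
# Sketch only: selftest_summary is a hypothetical helper name.
# It reads "smartctl -l selftest" output on stdin and, for the most recent
# entry ("# 1"), prints the completed percentage (100 minus Remaining)
# and the last column (LBA_of_first_error, "-" when there is none).
selftest_summary() {
    awk '
        $1 == "#" && $2 == "1" {
            for (i = 3; i <= NF; i++)
                if ($i ~ /^[0-9]+%$/) {
                    remaining = substr($i, 1, length($i) - 1) + 0
                    print "done=" (100 - remaining) "% first_error_lba=" $NF
                    exit
                }
        }
    '
}

# Demo on the log line from the output above.
printf '%s\n' \
    '# 1  Short offline       Completed: read failure       90%      6091         356245779' |
    selftest_summary
# prints: done=10% first_error_lba=356245779

# On a real system (assumption: /dev/sdc is your disk):
#   smartctl -l selftest /dev/sdc | selftest_summary
```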
One last tip: the command that shows the test results doesn’t refresh automatically, so you may have to run it several times until the test has finished. An easy workaround is to run it as follows: watch smartctl -l selftest /dev/<disk>
This refreshes the output every 2 seconds. I hope this smartctl overview helps.