Most home users don't build systems with RAID arrays, and the pre-configured systems from vendors such as Dell and HP ship with a single drive unless additional drives are selected and purchased. Even when multiple drives are shipped, RAID is not typically configured at the factory. This means that when a home user's hard drive fails, everything from documents, games, music, and movies to family photos and other items with a lot of sentimental value can be lost irreversibly. In some cases a data recovery specialist can recover the data, but what comes back might be incomplete or corrupt.
The hard drive industry has had a number of years to work on this problem and has made great strides both in reducing the number of drive failures (increasing the mean time between failures, MTBF) and in predictive analysis that may indicate a drive is close to failing. Predictive analysis may not help in the case of a sudden catastrophic loss of the drive. One useful tool for predictive analysis is the Self-Monitoring, Analysis, and Reporting Technology (SMART) functionality that is built into most modern hard drives. Note that a study performed by Google demonstrated that only a subset of the SMART attributes is useful for predictive analysis; the others are purely informational.
Each hard drive manufacturer (Seagate, Western Digital, Intel, Hitachi, Samsung, OCZ, etc.) defines its own metrics to track and expose through SMART, but a number of attributes are standardized. Vendors also have the flexibility to add specific logs that can be checked through SMART. If logs and metrics aren't enough, SMART can also run self-tests of the hard drive.
Viewing these attributes and logs requires a tool such as smartctl (part of smartmontools). I demonstrate smartmontools here because there are ports for a number of different platforms (Windows, Linux, UNIX, Mac OS, etc.) and the source code is freely available to compile on new platforms. Note that most hardware RAID arrays do not expose the drives directly to the operating system, but most have vendor-supported tools for viewing SMART statistics on each of the attached drives.
Note that for this demonstration, I am using the Windows build of smartmontools version 5.42-1. If the command line arguments presented here don't work, see if you need slightly different parameters by running smartctl -h.
To start out, I opened a command prompt (cmd.exe) and navigated to the binary install path for smartmontools (for me, C:\Program Files (x86)\smartmontools\bin). From here, I used the --scan option of smartctl to identify which drives the operating system sees:
C:\Program Files (x86)\smartmontools\bin>smartctl --scan
/dev/sda -d ata # /dev/sda, ATA device
/dev/sdb -d ata # /dev/sdb, ATA device
/dev/sdc -d ata # /dev/sdc, ATA device
In this case, my system has three SATA drives, and all of them are visible to the operating system. Note that smartctl reports the devices in a Linux-style notation (/dev/xxx) instead of the SCSI notation [port, bus, target, logical unit] that Windows uses internally for SATA drives.
[Screenshot taken from the msinfo32 utility]
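The scan output is regular enough to read from a script. The following is a minimal Python sketch, assuming the output keeps the "device -d type # comment" layout shown above; the names parse_scan and SAMPLE are hypothetical helpers for illustration and are not part of smartmontools.

```python
import re

# One line of `smartctl --scan` output, e.g.
#   /dev/sda -d ata # /dev/sda, ATA device
# The layout is assumed from the output shown in this post.
SCAN_LINE = re.compile(
    r"^(?P<device>\S+)\s+-d\s+(?P<type>\S+)(?:\s+#\s*(?P<comment>.*))?$"
)


def parse_scan(output):
    """Return one dict per device line found in `smartctl --scan` output."""
    devices = []
    for line in output.splitlines():
        match = SCAN_LINE.match(line.strip())
        if match:
            devices.append(match.groupdict())
    return devices


# Sample text copied from the scan shown in this post.
SAMPLE = """\
/dev/sda -d ata # /dev/sda, ATA device
/dev/sdb -d ata # /dev/sdb, ATA device
/dev/sdc -d ata # /dev/sdc, ATA device
"""

if __name__ == "__main__":
    for dev in parse_scan(SAMPLE):
        print(dev["device"], dev["type"])
```

Lines that do not match the assumed layout are skipped rather than treated as errors, so a banner or blank line in the output would not break the parse.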
We can select a disk and retrieve all of its SMART information:
C:\Program Files (x86)\smartmontools\bin>smartctl -a /dev/sda
The first section of the output identifies the hard drive:
smartctl 5.42 2011-10-20 r3458 [i686-w64-mingw32-2008r2(64)-sp1] (sf-win32-5.42-1)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.10
Device Model:     ST3500630AS
Serial Number:    9QG3ZZZ9
Firmware Version: 3.AAK
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Dec 12 13:30:19 2011 MST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
The second section reports the overall health self-assessment and identifies the drive's capabilities with regard to SMART.
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 430) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 163) minutes.
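Two items in this section are the ones a script is most likely to want: the overall health verdict and the recommended polling times for the self-tests. The following is a rough Python sketch of pulling them out of the text, assuming the wording matches the output above; overall_health and polling_minutes are hypothetical names made up for this example.

```python
import re

# Patterns assume the wording of the smartctl 5.42 output shown in this post.
HEALTH = re.compile(
    r"SMART overall-health self-assessment test result:\s*(\S+)"
)
POLLING = re.compile(
    r"(Short|Extended) self-test routine\s+"
    r"recommended polling time:\s*\(\s*(\d+)\)\s*minutes"
)


def overall_health(report):
    """Return the health verdict (e.g. 'PASSED'), or None if it is absent."""
    match = HEALTH.search(report)
    return match.group(1) if match else None


def polling_minutes(report):
    """Map self-test kind ('Short', 'Extended') to its polling time in minutes."""
    return {kind: int(minutes) for kind, minutes in POLLING.findall(report)}


# Excerpt copied from the section above.
SAMPLE = """\
SMART overall-health self-assessment test result: PASSED
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 163) minutes.
"""

if __name__ == "__main__":
    print(overall_health(SAMPLE))
    print(polling_minutes(SAMPLE))
```

The polling pattern uses whitespace matching between "routine" and "recommended", so it should tolerate the label being wrapped across two lines as it is in the console output.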
The next section lists the vendor-specific attributes (metrics). This is an important section for identifying predictive failures. Several values are reported for each attribute: the normalized value (VALUE column), the worst normalized value recorded while SMART has been enabled (WORST column), the threshold (THRESH column), and finally the raw value (RAW_VALUE column). The threshold requires more explanation. VALUE and WORST are scaled to fit in a single byte (smartmontools documents the normalized range as 1 to 254) and are typically reported so that lower is worse. If VALUE or WORST is at or below the threshold, the attribute has failed, and the WHEN_FAILED column shows FAILING_NOW or In_the_past. For an attribute whose TYPE is Pre-fail, this may be a sign that the disk needs to be replaced immediately; some of these failures indicate that the disk is expected to fail within 24 hours. Attributes that are more informational (such as temperature) may not indicate an impending failure, but may indicate increased wear on the drive. In my case, I had some short-term cooling problems with the PC, so SMART reports a past failure for Airflow_Temperature_Cel (a WORST of 044 against a threshold of 045).
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   114   067   006    Pre-fail  Always       -       83089801
  3 Spin_Up_Time            0x0003   094   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       161
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   086   060   030    Pre-fail  Always       -       449282521
  9 Power_On_Hours          0x0032   062   062   000    Old_age   Always       -       33680
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       163
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   064   044   045    Old_age   Always   In_the_past 36 (Min/Max 24/39)
194 Temperature_Celsius     0x0022   036   056   000    Old_age   Always       -       36 (0 21 0 0 0)
195 Hardware_ECC_Recovered  0x001a   062   056   000    Old_age   Always       -       55988229
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 Data_Address_Mark_Errs  0x0032   100   253   000    Old_age   Always       -       0
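The threshold comparison is easy to automate. Below is a minimal Python sketch that parses rows of the attribute table and applies the "at or below the threshold" rule to VALUE and WORST. It assumes the ten-column layout shown above and treats a threshold of 000 as "this attribute never fails", which is an assumption of the sketch. The names Attribute, parse_attributes, and status are hypothetical, and the sample mixes rows from my drive with the Spin_Up_Time row from the failing drive posted in the comments.

```python
from typing import NamedTuple


class Attribute(NamedTuple):
    id: int
    name: str
    value: int
    worst: int
    thresh: int
    type: str
    when_failed: str
    raw: str


def parse_attributes(table):
    """Parse attribute rows; lines that do not start with a numeric ID are skipped."""
    attrs = []
    for line in table.splitlines():
        # Split into at most 10 fields so RAW_VALUE keeps its embedded spaces.
        fields = line.split(None, 9)
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attrs.append(Attribute(
            id=int(fields[0]), name=fields[1],
            value=int(fields[3]), worst=int(fields[4]), thresh=int(fields[5]),
            type=fields[6], when_failed=fields[8], raw=fields[9].strip(),
        ))
    return attrs


def status(attr):
    """Classify one attribute by comparing VALUE and WORST with THRESH."""
    if attr.thresh == 0:
        return "ok"  # assumption: a zero threshold means the attribute cannot fail
    if attr.value <= attr.thresh:
        return "FAILING_NOW"
    if attr.worst <= attr.thresh:
        return "In_the_past"
    return "ok"


# Rows 1, 190 and 194 are from the drive in this post; row 3 is from the
# failing drive a reader posted in the comments.
SAMPLE = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate 0x000f 114 067 006 Pre-fail Always - 83089801
  3 Spin_Up_Time 0x0023 007 007 025 Pre-fail Always FAILING_NOW 28395
190 Airflow_Temperature_Cel 0x0022 064 044 045 Old_age Always In_the_past 36 (Min/Max 24/39)
194 Temperature_Celsius 0x0022 036 056 000 Old_age Always - 36 (0 21 0 0 0)
"""

if __name__ == "__main__":
    for attr in parse_attributes(SAMPLE):
        print(attr.id, attr.name, attr.type, status(attr))
```

A result other than "ok" on a Pre-fail attribute is the case to act on first; the same result on an Old_age attribute points to wear or, as with the airflow temperature here, an environmental problem.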
The next pieces are the error log (also very important for determining impending failure) and the self-test log.
SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
From the data, this drive is not expected to fail immediately, and all of the metrics indicate that the disk should continue to function. Tests can be run on a drive by using smartctl -t <test_name> <drive>. See smartctl -h for more details.
If your drive is starting to fail or becomes unbootable, it may be necessary to rescue the files from the failing hard drive.
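For anyone who wants to start a self-test from a script instead of typing the command, here is a small Python sketch. It only builds and runs the smartctl -t command described above; the helper names self_test_command and start_self_test are hypothetical, the list of test names is limited to ones I am confident smartctl accepts, and it assumes smartctl is on the PATH and the script runs with administrator rights.

```python
import subprocess

# Test names accepted by `smartctl -t`. This is a deliberately short list;
# `smartctl -h` shows everything your build supports.
KNOWN_TESTS = {"offline", "short", "long", "conveyance"}


def self_test_command(device, test="short", smartctl="smartctl"):
    """Build the argument list for `smartctl -t <test> <device>`."""
    if test not in KNOWN_TESTS:
        raise ValueError("unknown self-test name: %r" % (test,))
    return [smartctl, "-t", test, device]


def start_self_test(device, test="short"):
    """Start a self-test and return smartctl's console output.

    Assumes smartctl is on the PATH and that the caller has the rights
    needed to talk to the drive.
    """
    result = subprocess.run(
        self_test_command(device, test),
        capture_output=True, text=True, check=False,
    )
    return result.stdout


if __name__ == "__main__":
    # Only print the command here; call start_self_test() to actually run it.
    print(" ".join(self_test_command("/dev/sda", "short")))
```

Keep in mind that the drive in this post reports "No Conveyance Self-test supported", so a conveyance test would be refused there. The short test on this drive has a recommended polling time of 1 minute and the extended test 163 minutes, after which the outcome should appear in the self-test log section of smartctl -a.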
See Also
How To Rescue Files From a Damaged System
Windows Crash Dump Analysis
Identifying Cooling Issues
Troubleshooting Memory Errors
Stress Testing a CPU To Detect Hardware Failure
Stress Testing a Video Card
Try using Victoria for Windows.
I recently bought a new ASUS A53E-ES92 notebook and sometimes I hear a clicking sound inside.
Can it be failing?
Ian.
Thanks for your summary. I am posting the values from my failing drive so your blog visitors get an idea of how it would show. I had time to offload the data. Check the Spin_Up_Time values and the FAILING_NOW remark on the right.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       5
  2 Throughput_Performance  0x0026   252   252   000    Old_age   Always       -       0
  3 Spin_Up_Time            0x0023   007   007   025    Pre-fail  Always   FAILING_NOW 28395
  4 Start_Stop_Count        0x0032   092   092   000    Old_age   Always       -       8487
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       10232
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       2
 12 Power_Cycle_Count       0x0032   097   097   000    Old_age   Always       -       3975
191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       66
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   064   050   000    Old_age   Always       -       31 (Min/Max 12/51)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       261
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       2
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       8508