Handling Disk Failures
The best way to deal with disk failures is to assume they will happen and be prepared. See Reliable data storage for best practices.
The course of action is to:
- Run a SMART self-test (if that hasn't been done yet) to find which sector on the disk is failing.
- Write directly to that sector with dd. This forces the drive to relocate the sector to one of its spare sectors.
- Scrub the file system on the disk. If you have a mirrored copy, everything should be restored to normal.
- If you don't have a mirror, restore the data from your backup disk.
- Decide whether you want to replace the disk to prevent future failures.
A bad block may be reported as follows:
Device: /dev/ada0, 1 Currently unreadable (pending) sectors
Device: /dev/ada0, Self-Test Log error count increased from 0 to 1
Bad blocks are best detected by running a SMART self-test, assuming that the disk has S.M.A.R.T. support.
To examine the status (including an indication of whether a test is still running):
# smartctl -c /dev/ada0
which will include one of these results:
Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run.
Self-test execution status: ( 249) Self-test routine in progress... 90% of test remaining.
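The value smartctl prints in parentheses is the raw self-test execution status byte: the high nibble is the status code (15 means a test is in progress) and the low nibble is the remaining fraction in units of 10%. A quick shell check of the 249 example above:

```shell
# Decode the self-test execution status byte:
# 249 = 0xF9 -> high nibble 15 (test in progress), low nibble 9 -> 90% remaining
status=249
echo "$(( (status & 15) * 10 ))% of test remaining"
```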
To get the log of the last completed test, run:
# smartctl -l selftest /dev/ada0
To get all information about the disk, including the last test results:
# smartctl -a /dev/ada0
To start a test:
# smartctl -t long /dev/ada0
# smartctl -t short /dev/ada0
To abort a test:
# smartctl -X /dev/ada0
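Rather than starting tests by hand, smartd(8) can schedule them. A sketch of an entry for smartd.conf (installed as /usr/local/etc/smartd.conf by the FreeBSD port; the device name and schedule are assumptions) that runs a short test every night at 02:00 and a long test every Saturday at 03:00:

```
/dev/ada0 -a -m root -s (S/../.././02|L/../../6/03)
```

The -s schedule regex uses the T/MM/DD/d/HH format from smartd.conf(5): S is a short test, L a long test, and `..` matches any month, day, or weekday.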
Keep an eye on the following parameters:
| ID | Attribute mnemonic | Attribute name | Desired value | Meaning |
|----|--------------------|----------------|---------------|---------|
| 1 | Raw_Read_Error_Rate | Read Error Rate | Low (<20) | Usually fixed by error correction, so not very worrisome. For Seagate, only look at the first 32 bits of this 64-bit number. |
| 2 | Throughput_Performance | Throughput Performance | High (>100) | Performance criterion only. |
| 3 | Spin_Up_Time | Spin-Up Time | Low (<1000) | Value in ms. Performance criterion only. |
| 5 | Reallocated_Sector_Ct | Reallocated Sectors Count | Low (<5) | An error that was successfully fixed. If it starts increasing over time, consider replacing the disk. |
| 7 | Seek_Error_Rate | Seek Error Rate | Low (<10) | Not immediately an issue, but could be a sign of beginning mechanical failure. For Seagate, only look at the most significant 32 bits of this 64-bit number. The least significant 32 bits indicate the number of seeks. |
| 10 | Spin_Retry_Count | Spin Retry Count | Low (<5) | Not immediately an issue, but could be a sign of beginning failure of the motor. |
| 22 | N/A | Current Helium Level | 100 (>95) | Not an immediate issue. Specific to He8 drives from HGST. This value measures the helium level inside the drive. |
| 181 | Program_Fail_Count | Non-4K Aligned Access Count | Zero (0) | Not a disk issue, but if >0, an indication that your file system partitions are not aligned with the disk blocks. This is a performance issue that can be fixed by repartitioning the disk. |
| 182 | Erase_Fail_Count | Erase Fail Count | Low (<10) | Not an immediate issue, but could be a sign of wear on an SSD. |
| 184 | N/A | End-to-End error / IOEDC | Low (<10) | I/O error detected by a flipped parity bit. Seen on SSDs. |
| 187 | Reported_Uncorrect | Reported Uncorrectable Errors | Zero (0) | If non-zero, replace your disk. It has failed. |
| 196 | Reallocated_Event_Count | Reallocation Event Count | Low (<10) | A block was reallocated, either successfully or unsuccessfully. Should remain low. If it increases over time, replace your disk. |
| 197 | Current_Pending_Sector | Current Pending Sector Count | Zero (<2) | A bad block was detected but not yet relocated, for example because it could not be read and the disk does not know what to write at the new location. This may still be fixed, in which case this value should decrease to 0. Do another file system test, followed by a SMART check, perhaps a reboot. If it remains high after that, or if the uncorrectable sector count increases, replace the disk. |
| 198 | Offline_Uncorrectable | Uncorrectable Sector Count | Zero (0) | If non-zero, replace your disk. It has failed. |
| 199 | UDMA_CRC_Error_Count | UltraDMA CRC Error Count | Zero (0) | Errors in data transfer via the interface cable. If non-zero, check (or replace) the cable and connectors first; this attribute usually points at the link, not the platters. |
| 200 | Multi_Zone_Error_Rate | Multi-Zone Error Rate | Zero (0) | Errors found when writing a sector. If non-zero, replace your disk. It has failed. |
| 201 | Unc_Soft_Read_Err_Rate | Soft Read Error Rate | Zero (0) | Uncorrectable software read errors. If non-zero, replace your disk. It has failed. |
| 204 | Soft_ECC_Correct_Rate | Soft ECC Correction | Low (<20) | Errors corrected by the internal error correction software. |
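When checking a disk by hand, it can help to filter the attribute listing down to just the IDs from the table. A small awk sketch; the sample lines are excerpted from the hard-disk output below, and on a live system you would pipe `smartctl -A /dev/ada0` in instead:

```shell
# Keep only the attribute IDs worth watching; print ID, name, and raw value.
awk '$1 ~ /^(5|187|196|197|198|199|200)$/ { print $1, $2, $NF }' <<'EOF'
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       5
EOF
```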
Hard Disk (spinning platters)
# smartctl -a /dev/ada0
=== START OF INFORMATION SECTION ===
...
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
...
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   155   150   021    Pre-fail  Always       -       9225
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       91
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   070   070   000    Old_age   Always       -       22013
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       89
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       74
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       914156
194 Temperature_Celsius     0x0022   108   107   000    Old_age   Always       -       44
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1    <---- this ought to be 0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       5    <---- this is worrying

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      22010         -
# 2  Extended offline    Completed without error       00%      21999         -
# 3  Short offline       Completed without error       00%      21966         -
# 4  Short offline       Completed without error       00%      21965         -
# 5  Short offline       Completed without error       00%      21964         -
# 6  Short offline       Completed without error       00%      21963         -
# 7  Short offline       Completed without error       00%      21962         -
# 8  Short offline       Completed without error       00%      21961         -
# 9  Short offline       Completed without error       00%      21960         -
#10  Short offline       Completed without error       00%      21959         -
#11  Short offline       Completed without error       00%      21958         -
#12  Short offline       Completed: read failure       90%      21957         310949139
#13  Short offline       Completed without error       00%      21956         -
#14  Short offline       Completed without error       00%      21955         -
#15  Short offline       Completed without error       00%      21954         -
#16  Short offline       Completed without error       00%      21953         -
#17  Short offline       Completed without error       00%      21952         -
#18  Short offline       Completed without error       00%      21951         -
#19  Short offline       Completed without error       00%      21950         -
#20  Short offline       Completed without error       00%      21949         -
#21  Short offline       Completed without error       00%      21948         -
1 of 1 failed self-tests are outdated by newer successful extended offline self-test # 1
A few things to notice:
- The disk has a sector size of 512 bytes logical and 4096 bytes physical. We need this information later.
- SMART support is enabled. Good.
- SMART still passes. So the disk is still usable. As long as it lasts, that is.
- There is 1 Current_Pending_Sector. So one "bad block" has been detected but not yet fixed. It may still be fixed, and perhaps it was only just detected (during the SMART test), but it is not good, and the count should decrease to 0.
- The disk has 5 Multi_Zone_Error_Rate. Anything larger than 0 is worrying, and an indication that the disk is starting to fail. For me, this would be a trigger to buy a replacement disk.
- The first sector that is failing is 310949139.
Note that the logical block is reported. Logical (512 byte) block #310949139 is failing, which lies in (4096 byte) physical block #38868642 (310949139 divided by 8, rounded down), which spans logical blocks 310949136-310949143.
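This arithmetic can be done with plain shell arithmetic (numbers taken from the smartctl output above):

```shell
# Map a failing 512-byte logical block to its 4096-byte physical block,
# and list the range of logical blocks that physical block spans.
lba=310949139                 # failing LBA reported by smartctl
ratio=$((4096 / 512))         # logical sectors per physical sector (8)
phys=$((lba / ratio))         # integer division rounds down
first=$((phys * ratio))
last=$((first + ratio - 1))
echo "physical=$phys logical=$first-$last"
```

On a 4K-sector drive, rewriting any one of these logical sectors makes the firmware rewrite (and, if needed, relocate) the whole physical sector.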
SSD (flash memory)
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   116   116   050    Pre-fail  Always       -       0/133289824
  5 Retired_Block_Count     0x0033   100   100   003    Pre-fail  Always       -       0    <---- this seems fine
  9 Power_On_Hours_and_Msec 0x0032   000   000   000    Old_age   Always       -       41791h+30m+02.160s
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       128
171 Program_Fail_Count      0x0032   000   000   000    Old_age   Always       -       0    <---- this seems fine
172 Erase_Fail_Count        0x0032   000   000   000    Old_age   Always       -       0    <---- this seems fine
174 Unexpect_Power_Loss_Ct  0x0030   000   000   000    Old_age   Offline      -       100
177 Wear_Range_Delta        0x0000   000   000   000    Old_age   Offline      -       5
181 Program_Fail_Count      0x0032   000   000   000    Old_age   Always       -       0    <---- this seems fine
182 Erase_Fail_Count        0x0032   000   000   000    Old_age   Always       -       0    <---- this seems fine
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   128   129   000    Old_age   Always       -       128 (0 127 0 129 0)
195 ECC_Uncorr_Error_Count  0x001c   120   120   000    Old_age   Offline      -       0/133289824
196 Reallocated_Event_Count 0x0033   100   100   003    Pre-fail  Always       -       0    <---- this seems fine
201 Unc_Soft_Read_Err_Rate  0x001c   120   120   000    Old_age   Offline      -       0/133289824
204 Soft_ECC_Correct_Rate   0x001c   120   120   000    Old_age   Offline      -       0/133289824
230 Life_Curve_Status       0x0013   100   100   000    Pre-fail  Always       -       100
231 SSD_Life_Left           0x0013   100   100   010    Pre-fail  Always       -       0
233 SandForce_Internal      0x0000   000   000   000    Old_age   Offline      -       2086
234 SandForce_Internal      0x0032   000   000   000    Old_age   Always       -       2395
241 Lifetime_Writes_GiB     0x0032   000   000   000    Old_age   Always       -       2395
242 Lifetime_Reads_GiB      0x0032   000   000   000    Old_age   Always       -       1668
This is example output of a (seemingly) healthy SSD. For SSDs there is less field experience on which parameters matter, but as a first indication, check that the values listed above are low. If any of them grows beyond 5 or 10, I would suggest replacing the SSD.
Relocate a bad sector
Modern disks will automatically relocate a bad sector to one of the spare sectors when it is written to.
To write directly to a disk block, use
# sysctl kern.geom.debugflags=16
# dd if=/dev/ada0 of=/dev/ada0 bs=512 count=1 iseek=310949139 oseek=310949139 conv=noerror,sync
# sysctl kern.geom.debugflags=0
That's all. However, be careful not to make mistakes: it is easy to ruin your disks.
kern.geom.debugflags is a protection against accidental (or malicious) alterations to a disk. It must be set to 16 to allow direct writes to the disk with dd. The default is 0, which prevents this.
Make sure to specify the correct disk. The command is written so that it reads from and writes to the same sector; this limits the damage in case you accidentally specify the wrong disk or sector.
conv=noerror,sync ensures that dd continues even if the read fails (noerror), padding the short read with zeros (sync).
The iseek and oseek parameters specify the bad sector as reported by smartctl.
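The read-and-rewrite-in-place pattern can be tried safely on a small file standing in for the disk (a hypothetical demo, not the real command; FreeBSD's iseek/oseek are spelled skip/seek in portable dd). Note that for a regular file, conv=notrunc is needed to stop dd truncating the output; on a raw device there is nothing to truncate:

```shell
# Rewrite "sector" 1 of a 3-sector stand-in file with its own contents;
# the file is left unchanged, just as the disk sector's data is preserved.
disk=$(mktemp)
printf 'AAAABBBBCCCC' > "$disk"    # three 4-byte "sectors"
dd if="$disk" of="$disk" bs=4 count=1 skip=1 seek=1 conv=noerror,notrunc 2>/dev/null
cat "$disk"                         # still AAAABBBBCCCC
rm "$disk"
```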
If the bad block is relocated, its content may still be lost, and the file system is not aware of that. Scrubbing the file system verifies the checksum of each block. This allows the file system to mark the affected file as bad and repair it (if a mirror disk is present).
To start a scrub on ZFS:
# zpool scrub poolname
To check the status of a scrub (progress and result):
# zpool status -v poolname
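On FreeBSD, scrubs can also be run on a schedule via periodic(8); a sketch for /etc/periodic.conf, assuming the stock daily scrub knobs (the threshold is the number of days between scrubs):

```
daily_scrub_zfs_enable="YES"
daily_scrub_zfs_default_threshold="35"
```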
Before taking a disk into production, some people suggest testing the disk for speed and reliability. As a consumer, I don't take these steps.
jgreco on the FreeNAS forum recommends the following steps:
- Start with a SMART conveyance test.
- Then you move on to a SMART extended test.
- Then move on to what we call burn-in, prior to making any filesystems or anything: read all the data off each disk with dd, write zeros to each entire disk with dd, then re-read all that data off each disk with dd.
- Do each set of tests in parallel and watch to see if any of the disks are unnaturally slower than the others, a warning flag.
- Then you make your filesystem(s).
- Then you run iozone in a seek-heavy manner. Then you keep that running a few weeks (no, seriously, weeks is on the short end).
If your system is healthy at the end of that, you've probably done as much as you can to ensure that the hardware is good.
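The dd passes of the burn-in step can be sketched as follows, using a small scratch file as a stand-in so the commands are safe to try. On a real burn-in, substitute the raw device (e.g. /dev/ada0) for the file, drop the count limits, and note that the zero pass destroys all data on the disk:

```shell
# Burn-in sketch: read pass, zero pass, re-read pass.
disk=$(mktemp)                                            # stand-in for /dev/ada0
dd if=/dev/zero of="$disk" bs=1k count=1024 2>/dev/null   # create a 1 MB "disk"
dd if="$disk" of=/dev/null bs=1k 2>/dev/null              # 1. read everything
dd if=/dev/zero of="$disk" bs=1k count=1024 2>/dev/null   # 2. write zeros everywhere
dd if="$disk" of=/dev/null bs=1k 2>/dev/null              # 3. re-read everything
echo "burn-in passes completed"
rm "$disk"
```

Timing each pass per disk (e.g. with time(1)) is what exposes the "unnaturally slower" disks the list above warns about.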
Replacing a Disk with ZFS
Replace the faulty disk with a new one, and use the following commands to ensure you know which disk it is.
# camcontrol devlist
# glabel status
# gpart show -l
Assuming that you are now convinced the new disk is ada0, verify that there is indeed no data on it:
# gpart show -l ada0
gpart: No such geom: ada0.
If there is no data, create a new partition table. In this example, a swap partition and a ZFS partition whose size matches that of another disk.
# gpart create -s gpt ada0
ada0 created
# gpart add -t freebsd-swap -a 4k -s 2G ada0
ada0p1 added
# gpart add -t freebsd-zfs -a 4k -s 5856338696 ada0
ada0p2 added
# gpart show ada0
=>        40  5860533088  ada0  GPT  (2.7T)
          40     4194304     1  freebsd-swap  (2.0G)
     4194344  5856338696     2  freebsd-zfs  (2.7T)
  5860533040          88        - free -  (44K)
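The sizes in this output can be sanity-checked from the sector counts (gpart sizes are in 512-byte sectors here):

```shell
# Partition 1: 4194304 sectors * 512 bytes = exactly 2 GiB of swap.
echo "swap: $((4194304 * 512 / 1073741824)) GiB"
# Partition 2: 5856338696 sectors * 512 bytes, i.e. roughly 2.7 TiB for ZFS.
echo "zfs:  $((5856338696 * 512)) bytes"
```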
Then, determine the GPTID of the partition:
# glabel status
                                      Name  Status  Components
gptid/b2fa344b-91bc-11e7-b516-bc5ff40dd410     N/A  ada0p1
gptid/ba96c94b-91bc-11e7-b516-bc5ff40dd410     N/A  ada0p2
Finally, replace the old disk with the new file system:
# zpool status freenas-data
  pool: freenas-data
 state: DEGRADED
[...]
config:

        NAME                                            STATE     READ WRITE CKSUM
        freenas-data                                    DEGRADED     0     0     0
          mirror-0                                      DEGRADED     0     0     0
            60860516858591446                           UNAVAIL      0     0     0  was /dev/gptid/a38a0556-b182-11e5-894a-bc5ff40dd410
            gptid/a4fb997e-b182-11e5-894a-bc5ff40dd410  ONLINE       0     0     0

# zpool replace freenas-data /dev/gptid/a38a0556-b182-11e5-894a-bc5ff40dd410 gptid/ba96c94b-91bc-11e7-b516-bc5ff40dd410

# zpool status freenas-data
  pool: freenas-data
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Sep  5 00:03:36 2017
        5.60G scanned out of 1008G at 19.0M/s, 14h58m to go
        5.59G resilvered, 0.56% done
config:

        NAME                                              STATE     READ WRITE CKSUM
        freenas-data                                      DEGRADED     0     0     0
          mirror-0                                        DEGRADED     0     0     0
            replacing-0                                   UNAVAIL      0     0     0
              60860516858591446                           UNAVAIL      0     0     0  was /dev/gptid/a38a0556-b182-11e5-894a-bc5ff40dd410
              gptid/ba96c94b-91bc-11e7-b516-bc5ff40dd410  ONLINE       0     0     0  (resilvering)
            gptid/a4fb997e-b182-11e5-894a-bc5ff40dd410    ONLINE       0     0     0
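The "to go" figure is just remaining data divided by the current scan rate. A quick check of zpool's estimate with the numbers from the status output above (zpool itself reported 14h58m; the small difference comes from the rate fluctuating while it measures):

```shell
# Remaining data / scan rate: 1008G total, 5.60G scanned, at 19.0M/s.
awk 'BEGIN {
    secs = (1008 - 5.60) * 1024 / 19.0          # GiB -> MiB, divided by MiB/s
    printf "%dh%02dm to go\n", secs / 3600, (secs % 3600) / 60
}'
```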