Handling Disk Failures

The best way to deal with disk failures is to assume they will happen and be prepared. See Reliable data storage for best practices.

The course of action is:

  1. Run a SMART test (if that hasn't been done) to find which sector on the disk is failing.
  2. Write directly to the sector with dd. This forces the drive to relocate the sector to one of its spare sectors.
  3. Scrub the file system on the disk. If you have a mirrored copy, everything should be restored to normal.
  4. If you don't have a mirror, use your backup disk to restore the data.
  5. Decide if you want to replace the disk to prevent future failures.

SMART self-tests

A bad block may be reported as follows:

Device: /dev/ada0, 1 Currently unreadable (pending) sectors
Device: /dev/ada0, Self-Test Log error count increased from 0 to 1

Bad blocks are best detected by running a SMART self-test, assuming that the disk has S.M.A.R.T. support.

To examine the status (including whether a test is still running):

# smartctl -c /dev/ada0

which will include one of these results:

Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.

or

Self-test execution status:      ( 249) Self-test routine in progress...
                                        90% of test remaining.

To get the log of the last completed test, run:

# smartctl -l selftest /dev/ada0

(or use smartctl -a /dev/ada0 to get all information about the disk, including the last test results).
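If you want to pull the failing sector out of that log in a script, a minimal sketch could look like this. It assumes the log layout that smartctl -l selftest prints, where the last column (LBA_of_first_error) is a dash for tests that found no error. The function name first_error_lba is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: read a SMART self-test log on stdin and print the
# LBA of the first error of the most recent failed test (if any).
first_error_lba() {
    awk '
        # Log entries start with "#"; the last field is LBA_of_first_error,
        # which is "-" when the test found no error.
        /^#/ && $NF ~ /^[0-9]+$/ { print $NF; exit }
    '
}

# Intended use (needs root and smartmontools):
#   smartctl -l selftest /dev/ada0 | first_error_lba
```

The log lists the newest test first, so the function stops at the first entry that carries a sector number. It prints nothing if no test reported a failing sector.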

To start a test:

# smartctl -t long /dev/ada0

or

# smartctl -t short /dev/ada0

To abort a test:

# smartctl -X /dev/ada0

Example Output

SMART parameters

Keep an eye on the following parameters:

ID  | Attribute mnemonic      | Attribute name                | Desired value | Meaning
----|-------------------------|-------------------------------|---------------|--------
1   | Raw_Read_Error_Rate     | Read Error Rate               | Low (<20)     | Usually fixed by error correction, so not very worrisome. For Seagate, only look at the first 32 bits of this 64-bit number.
2   | Throughput_Performance  | Throughput Performance        | High (>100)   | Performance criterion only.
3   | Spin_Up_Time            | Spin-Up Time                  | Low (<1000)   | Value in ms. Performance criterion only.
5   | Reallocated_Sector_Ct   | Reallocated Sectors Count     | Low (<5)      | Error that was successfully fixed. If it starts increasing over time, consider replacing the disk.
7   | Seek_Error_Rate         | Seek Error Rate               | Low (<10)     | Not immediately an issue, but could be a sign of starting mechanical failure. For Seagate, only look at the most significant 32 bits of this 64-bit number. The least significant 32 bits indicate the number of seeks.
10  | Spin_Retry_Count        | Spin Retry Count              | Low (<5)      | Not immediately an issue, but could be a sign of starting failure of the motor.
22  | N/A                     | Current Helium Level          | 100 (>95)     | Not an immediate issue. Specific to He8 drives from HGST. This value measures the helium inside the drive.
181 | Program_Fail_Count      | Non-4K Aligned Access Count   | 0             | Not a disk issue, but if >0, an indication that your file system partitions are not aligned with disk blocks. This is a performance issue that can be fixed by repartitioning your disk.
182 | Erase_Fail_Count        | Erase Fail Count              | Low (<10)     | Not an immediate issue, but could be a sign of wear of an SSD.
184 | N/A                     | End-to-End error / IOEDC      | Low (<10)     | I/O error detected by a flipped parity bit. SSD.
187 | Reported_Uncorrect      | Reported Uncorrectable Errors | Zero (0)      | If non-zero, replace your disk. It has failed.
196 | Reallocated_Event_Count | Reallocation Event Count      | Low (<10)     | A block was reallocated, either successfully or unsuccessfully. Should remain low. If it increases over time, replace your disk.
197 | Current_Pending_Sector  | Current Pending Sector Count  | Zero (<2)     | A bad block was detected, but not yet relocated. For example, because it could not be read, and the disk does not know what to write at the new location. This may still be fixed, in which case this value should decrease to 0. Do another file system test, followed by a SMART check. Perhaps a reboot. If it remains high after that, or if the uncorrectable sector count increases, replace the disk.
198 | Offline_Uncorrectable   | Uncorrectable Sector Count    | Zero (0)      | If non-zero, replace your disk. It has failed.
199 | UDMA_CRC_Error_Count    | UltraDMA CRC Error Count      | Zero (0)      | Errors in data transfer via the interface cable. If non-zero and increasing, check or replace the cable and connectors before blaming the disk.
200 | Multi_Zone_Error_Rate   | Multi-Zone Error Rate         | Zero (0)      | Errors found when writing a sector. If non-zero, replace your disk. It has failed.
201 | Unc_Soft_Read_Err_Rate  | Soft Read Error Rate          | Zero (0)      | Uncorrectable software read errors. If non-zero, replace your disk. It has failed.
204 | Soft_ECC_Correct_Rate   | Soft ECC Correction           | Low (<20)     | Errors corrected by the internal error correction software.
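To watch the "should be zero" attributes without reading the whole listing, something along these lines may help. It is only a sketch: it assumes the column layout of the smartctl attribute table (ID in column 1, raw value in column 10), it treats any raw value above 0 as worth reporting, and the function name check_zero_attrs is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read smartctl attribute output on stdin and report
# attributes 187, 197, 198, 199, 200 and 201 whose raw value is above 0.
# Exit status is 1 if anything was reported, 0 otherwise.
check_zero_attrs() {
    awk '
        BEGIN { bad = 0 }
        # Attribute rows start with a numeric ID; column 10 is RAW_VALUE.
        $1 ~ /^(187|197|198|199|200|201)$/ {
            if ($10 + 0 > 0) {
                printf "%s %s raw=%s\n", $1, $2, $10
                bad = 1
            }
        }
        END { exit bad }
    '
}

# Intended use (needs root and smartmontools):
#   smartctl -A /dev/ada0 | check_zero_attrs
```

Raw values are vendor specific, so treat a report from this check as a prompt to look at the full output, not as a verdict.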

Hard Disk (spinning platters)

# smartctl -a /dev/ada0

=== START OF INFORMATION SECTION ===
...
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
...
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED


SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   155   150   021    Pre-fail  Always       -       9225
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       91
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   070   070   000    Old_age   Always       -       22013
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       89
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       74
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       914156
194 Temperature_Celsius     0x0022   108   107   000    Old_age   Always       -       44
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1   <---- this ought to be 0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       5   <---- this is worrying.

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     22010         -
# 2  Extended offline    Completed without error       00%     21999         -
# 3  Short offline       Completed without error       00%     21966         -
# 4  Short offline       Completed without error       00%     21965         -
# 5  Short offline       Completed without error       00%     21964         -
# 6  Short offline       Completed without error       00%     21963         -
# 7  Short offline       Completed without error       00%     21962         -
# 8  Short offline       Completed without error       00%     21961         -
# 9  Short offline       Completed without error       00%     21960         -
#10  Short offline       Completed without error       00%     21959         -
#11  Short offline       Completed without error       00%     21958         -
#12  Short offline       Completed: read failure       90%     21957         310949139
#13  Short offline       Completed without error       00%     21956         -
#14  Short offline       Completed without error       00%     21955         -
#15  Short offline       Completed without error       00%     21954         -
#16  Short offline       Completed without error       00%     21953         -
#17  Short offline       Completed without error       00%     21952         -
#18  Short offline       Completed without error       00%     21951         -
#19  Short offline       Completed without error       00%     21950         -
#20  Short offline       Completed without error       00%     21949         -
#21  Short offline       Completed without error       00%     21948         -
1 of 1 failed self-tests are outdated by newer successful extended offline self-test # 1

A few things to notice:

  • The disk has a sector size of 512 bytes logical, 4096 bytes physical. We need this info later.
  • SMART support is enabled. Good.
  • SMART still passes. So the disk is still usable. As long as it lasts, that is.
  • There is 1 Current_Pending_Sector, so 1 "bad block" was detected and has not been fixed yet. This can be fixed, and perhaps it was only just detected (after the SMART test), but it is not good, and the value should decrease to 0.
  • The Multi_Zone_Error_Rate is 5. Anything larger than 0 is worrying, and an indication that the disk is starting to fail. For me, this would be a trigger to buy a replacement disk.
  • The first sector that is failing is 310949139.

Note that the logical block is reported. Logical (512 byte) block #310949139 is failing, which lies in (4096 byte) physical block #38868642 (310949139 divided by 8, rounded down), which spans logical blocks 310949136-310949143.
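The same arithmetic can be done in the shell. This is just a sketch of the calculation above; the function names phys_block and phys_span are made up for the example.

```shell
#!/bin/sh
# Map a logical block address (LBA) to the physical block that contains it.
# Arguments: LBA, logical sector size, physical sector size (both in bytes).
phys_block() {
    ratio=$(( $3 / $2 ))          # logical blocks per physical block (8 here)
    echo $(( $1 / ratio ))        # integer division rounds down
}

# Print the first and last logical block inside that physical block.
phys_span() {
    ratio=$(( $3 / $2 ))
    first=$(( $1 / ratio * ratio ))
    echo "$first $(( first + ratio - 1 ))"
}

phys_block 310949139 512 4096     # prints 38868642
phys_span  310949139 512 4096     # prints 310949136 310949143
```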

SSD (flash memory)

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   116   116   050    Pre-fail  Always       -       0/133289824
  5 Retired_Block_Count     0x0033   100   100   003    Pre-fail  Always       -       0   <---- this seems fine
  9 Power_On_Hours_and_Msec 0x0032   000   000   000    Old_age   Always       -       41791h+30m+02.160s
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       128
171 Program_Fail_Count      0x0032   000   000   000    Old_age   Always       -       0   <---- this seems fine
172 Erase_Fail_Count        0x0032   000   000   000    Old_age   Always       -       0   <---- this seems fine
174 Unexpect_Power_Loss_Ct  0x0030   000   000   000    Old_age   Offline      -       100
177 Wear_Range_Delta        0x0000   000   000   000    Old_age   Offline      -       5
181 Program_Fail_Count      0x0032   000   000   000    Old_age   Always       -       0   <---- this seems fine
182 Erase_Fail_Count        0x0032   000   000   000    Old_age   Always       -       0   <---- this seems fine
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   128   129   000    Old_age   Always       -       128 (0 127 0 129 0)
195 ECC_Uncorr_Error_Count  0x001c   120   120   000    Old_age   Offline      -       0/133289824
196 Reallocated_Event_Count 0x0033   100   100   003    Pre-fail  Always       -       0   <---- this seems fine
201 Unc_Soft_Read_Err_Rate  0x001c   120   120   000    Old_age   Offline      -       0/133289824
204 Soft_ECC_Correct_Rate   0x001c   120   120   000    Old_age   Offline      -       0/133289824
230 Life_Curve_Status       0x0013   100   100   000    Pre-fail  Always       -       100
231 SSD_Life_Left           0x0013   100   100   010    Pre-fail  Always       -       0
233 SandForce_Internal      0x0000   000   000   000    Old_age   Offline      -       2086
234 SandForce_Internal      0x0032   000   000   000    Old_age   Always       -       2395
241 Lifetime_Writes_GiB     0x0032   000   000   000    Old_age   Always       -       2395
242 Lifetime_Reads_GiB      0x0032   000   000   000    Old_age   Always       -       1668

This is example output of a (seemingly) healthy SSD. There is less experience with which parameters matter for SSDs, but as a first indication, check that the values marked above are low. If any of them grows beyond 5 or 10, I would suggest replacing the SSD.

Relocate a bad sector

Modern disks will automatically relocate a bad sector to one of the spare sectors when it is written to.

To write directly to a disk block, use dd.

sysctl kern.geom.debugflags=16
dd if=/dev/ada0 of=/dev/ada0 bs=512 count=1 iseek=310949139 oseek=310949139 conv=noerror,sync
sysctl kern.geom.debugflags=0

That's all. However, be careful: a mistake here can easily destroy the data on your disks.

The kern.geom.debugflags setting protects against accidental (or malicious) alterations to a disk. It must be set to 16 to allow direct writing to disk with dd. The default is 0, which prevents this.

Make sure to specify the correct disk. The command is written so that it reads from and writes to the same sector. If you accidentally specify the wrong disk or sector, the readable data found there is simply written back unchanged.

The conv=noerror,sync option is required to ensure that dd continues even if the read fails: noerror keeps dd going after the error, and sync pads the unreadable input block with zeros, so there is still a full block to write.

The oseek (and iseek) parameters specify the bad sector as reported by smartctl.
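Since a typo in this command can be costly, one option is to let a small function assemble it and print it for review before you run anything. This is a sketch, not a tested tool: rewrite_sector_cmd is a hypothetical name, and the checks only catch obviously wrong arguments.

```shell
#!/bin/sh
# Hypothetical helper: print (do NOT run) the dd command that rewrites one
# block in place. Arguments: disk device, block number, block size in bytes.
rewrite_sector_cmd() {
    disk=$1; lba=$2; bs=$3
    case $disk in
        /dev/*) ;;
        *) echo "not a device path: $disk" >&2; return 1 ;;
    esac
    case $lba in
        ''|*[!0-9]*) echo "invalid block number: $lba" >&2; return 1 ;;
    esac
    case $bs in
        ''|*[!0-9]*) echo "invalid block size: $bs" >&2; return 1 ;;
    esac
    # Same block for input and output, as in the command above.
    echo "dd if=$disk of=$disk bs=$bs count=1 iseek=$lba oseek=$lba conv=noerror,sync"
}

rewrite_sector_cmd /dev/ada0 310949139 512
```

Called with the physical block number and size instead (38868642 and 4096 for the disk above), the same function prints a command that rewrites the whole physical block. Whether that works better than a 512 byte write on your drive is an assumption you would have to verify yourself.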

ZFS scrubbing

If the bad block is relocated, its content may still be lost, and the file system is not aware of that. Scrubbing a file system verifies the checksum of each block. This allows the file system to mark the affected file as bad and repair it (in case a mirror disk is present).

To start a scrub on ZFS:

zpool scrub poolname

To check the status of a scrub (progress and result):

zpool status -v poolname
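A scrub can take hours, so you may want a script to tell you whether it is still running. The sketch below classifies the "scan:" line of the status output. It assumes that line reads "scrub in progress ..." while running and "scrub repaired ..." when done; the exact wording may differ between ZFS versions, and the function name scrub_state is made up.

```shell
#!/bin/sh
# Hypothetical helper: read zpool status output on stdin and print
# "running", "finished" or "none" depending on the scan: line.
scrub_state() {
    awk '
        $1 == "scan:" && /scrub in progress/ { state = "running" }
        $1 == "scan:" && /scrub repaired/    { state = "finished" }
        END { print (state == "" ? "none" : state) }
    '
}

# Intended use:
#   zpool status poolname | scrub_state
```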

Pre-production testing

Before taking a disk into production, some people suggest testing the disk for speed and reliability. As a consumer, I don't take these steps.

jgreco on the FreeNAS forum recommends the following steps:

  1. Start with a SMART conveyance test.
  2. Then you move on to a SMART extended test.
  3. Then move on to what we call burn-in, prior to making any filesystems or anything. Read all the data off each disk with dd. Write zeros to each entire disk with dd. Re-read all that data off each disk with dd.
  4. Do each set of tests in parallel and watch to see if any of the disks are unnaturally slower than the others, a warning flag.
  5. Then you make your filesystem(s).
  6. Then you run iozone in a seek-heavy manner. Then you keep that running a few weeks (no, seriously, weeks is on the short end).

If your system is healthy at the end of that, you've probably done as much as you can to ensure that the hardware is good.

Replacing a Disk with ZFS

Replace the faulty disk with a new one, and use the following commands to ensure you know which disk it is.

# camcontrol devlist
# glabel status
# gpart show -l

Once you are convinced that the new disk is ada0, verify that there is indeed no data on the disk:

# gpart show -l ada0
gpart: No such geom: ada0.

If there is no data, create a new partition table and partitions. This example creates a swap partition and a ZFS partition whose size matches that of a different disk.

# gpart create -s gpt ada0
ada0 created
# gpart add -t freebsd-swap -a 4k -s 2G ada0
ada0p1 added
# gpart add -t freebsd-zfs -a 4k -s 5856338696 ada0
ada0p2 added
# gpart show ada0
=>        40  5860533088  ada0  GPT  (2.7T)
          40     4194304     1  freebsd-swap  (2.0G)
     4194344  5856338696     2  freebsd-zfs  (2.7T)
  5860533040          88        - free -  (44K)
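The numbers in this listing can be checked with a bit of shell arithmetic. This is a sketch that assumes gpart counts in 512 byte sectors, as the 2G swap partition of 4194304 sectors suggests; the helper names are made up.

```shell
#!/bin/sh
SECTOR=512                                   # bytes per sector (assumed)

# Convert GiB to a number of sectors.
gib_to_sectors() { echo $(( $1 * 1024 * 1024 * 1024 / SECTOR )); }

# Succeeds if the given start sector lies on a 4096 byte boundary.
is_4k_aligned() { [ $(( $1 * SECTOR % 4096 )) -eq 0 ]; }

swap_start=40
swap_size=$(gib_to_sectors 2)                # 4194304
zfs_start=$(( swap_start + swap_size ))      # 4194344
zfs_size=5856338696
zfs_end=$(( zfs_start + zfs_size ))          # 5860533040
free=$(( 40 + 5860533088 - zfs_end ))        # sectors left at the end

is_4k_aligned "$zfs_start" && echo "zfs partition is 4K aligned"
echo "zfs starts at $zfs_start, free tail: $free sectors ($(( free * SECTOR / 1024 ))K)"
```

The result matches the last line of the gpart output: 88 free sectors, or 44K.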

Then, determine the GPTID of the partition:

# glabel status
                                      Name  Status  Components
gptid/b2fa344b-91bc-11e7-b516-bc5ff40dd410     N/A  ada0p1
gptid/ba96c94b-91bc-11e7-b516-bc5ff40dd410     N/A  ada0p2

Finally, replace the old disk with the new partition using zpool replace:

# zpool status freenas-data

  pool: freenas-data
 state: DEGRADED
[...]
config:

    NAME                                            STATE     READ WRITE CKSUM
    freenas-data                                    DEGRADED     0     0     0
      mirror-0                                      DEGRADED     0     0     0
        60860516858591446                           UNAVAIL      0     0     0  was /dev/gptid/a38a0556-b182-11e5-894a-bc5ff40dd410
        gptid/a4fb997e-b182-11e5-894a-bc5ff40dd410  ONLINE       0     0     0

# zpool replace freenas-data /dev/gptid/a38a0556-b182-11e5-894a-bc5ff40dd410  gptid/ba96c94b-91bc-11e7-b516-bc5ff40dd410

# zpool status freenas-data
  pool: freenas-data
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Sep  5 00:03:36 2017
        5.60G scanned out of 1008G at 19.0M/s, 14h58m to go
        5.59G resilvered, 0.56% done
config:

    NAME                                              STATE     READ WRITE CKSUM
    freenas-data                                      DEGRADED     0     0     0
      mirror-0                                        DEGRADED     0     0     0
        replacing-0                                   UNAVAIL      0     0     0
          60860516858591446                           UNAVAIL      0     0     0  was /dev/gptid/a38a0556-b182-11e5-894a-bc5ff40dd410
          gptid/ba96c94b-91bc-11e7-b516-bc5ff40dd410  ONLINE       0     0     0  (resilvering)
        gptid/a4fb997e-b182-11e5-894a-bc5ff40dd410    ONLINE       0     0     0