# Repairing A Degraded RAID Array

RAID arrays are one way we make data robust, but what happens when they fail? Learn how to repair a degraded or failing RAID array step by step.

From time to time, you’ll see messages like the following on your servers.

This is an automatically generated mail message from mdadm
running on myserverhostname

A DegradedArray event had been detected on md device /dev/md/1.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md1 : active raid1 sda2[0]
523712 blocks super 1.2 [2/1] [U_]

md2 : active raid1 sda3[0]
1919300672 blocks super 1.2 [2/1] [U_]
bitmap: 10/15 pages [40KB], 65536KB chunk

md0 : inactive sda1[0](S)
33521664 blocks super 1.2

unused devices: <none>

## Checking The Disk

The example above suggests a whole disk needs replacement. The [U_] status shows that one of the two mirror members is missing; a healthy RAID 1 array reports [UU].

Since sda appears throughout the message, the error must be with sdb. Sometimes a report will mention all of the disks, or only a subset. To double-check which disk has the error, we can pull more details from mdadm.

root@myserverhostname ~ # mdadm --detail /dev/md2
/dev/md2:
        Version : 1.2
  Creation Time : Thu Aug 25 09:03:16 2016
     Raid Level : raid1
     Array Size : 1919300672 (1830.39 GiB 1965.36 GB)
  Used Dev Size : 1919300672 (1830.39 GiB 1965.36 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Wed Apr 12 09:41:15 2017
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

           Name : rescue:2
           UUID : 231365e9:2a24b289:5e1dc18c:f48947e3
         Events : 199975

    Number   Major   Minor   RaidDevice State
       0       8        3        0      active sync   /dev/sda3
       1       0        0        1      removed

       1       8       19        -      faulty spare   /dev/sdb3

Towards the bottom of the output, the sda partition is listed as active sync while the sdb partition is marked as a faulty spare.
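
The kernel log is another useful cross-check, since a failing drive usually leaves I/O or ATA errors behind. A quick grep, assuming the suspect drive is sdb as in this example:

# look for recent errors mentioning the suspect drive
dmesg | grep -i sdb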

## Run A Disk Health Check

To check the drive's health, we can read its SMART self-assessment with smartctl.

root@myserverhostname:~ # smartctl -H /dev/sdb
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.2.0-42-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
Failed Attributes:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   001   001   051    Pre-fail  Always   FAILING_NOW 5052
  5 Reallocated_Sector_Ct   0x0033   133   133   140    Pre-fail  Always   FAILING_NOW 2071
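
If the quick health status looks fine but you still suspect the drive, SMART can also run a longer self-test in the background. A sketch reusing this example's device name; the long test can take a few hours on a large disk:

# kick off an extended offline self-test
smartctl -t long /dev/sdb
# check the results once it has finished
smartctl -l selftest /dev/sdb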

## Getting The Drive Serial Number

It’s always important to double-check that you’re about to take out the right disk 😜 You’ll probably need something like the serial number to verify you’ve got the right one.

bernard@myserverhostname:~$ udevadm info --query=all --name=/dev/sdb | grep ID_SERIAL
E: ID_SERIAL=WDC_WD3000FYYZ-01UL1B3_WD-WMC130F4L69U
E: ID_SERIAL_SHORT=WD-WMC130F4L69U
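
Before physically pulling the drive, it’s usually worth making sure its partitions have been removed from their arrays. A minimal sketch, assuming the failed members are sdb1, sdb2 and sdb3 as in this example; mdadm will simply complain if a partition is no longer (or was never) part of an array:

mdadm --manage /dev/md0 --remove /dev/sdb1
mdadm --manage /dev/md1 --remove /dev/sdb2
mdadm --manage /dev/md2 --remove /dev/sdb3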

## Replacing The Disk

Physically replace the disk and ensure it’s recognised by the system.

Before we start, ensure the new disk isn’t in use:

root@myserverhostname:~ # lsblk 
NAME    MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda       8:16   0   1.8T  0 disk  
├─sda1    8:17   0     2G  0 part  
│ └─md0   9:0    0     2G  0 raid1 
├─sda2    8:18   0   512M  0 part  
│ └─md1   9:1    0 511.4M  0 raid1 /boot
└─sda3    8:19   0   1.8T  0 part  
  └─md2   9:2    0   1.8T  0 raid1 /
sdb       8:0    0   1.8T  0 disk

The absence of partitions and md devices under sdb shows the new disk isn’t in use.
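
It’s also a good moment to re-run the serial number check from earlier and confirm that the drive now sitting at /dev/sdb really is the replacement:

udevadm info --query=all --name=/dev/sdb | grep ID_SERIAL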

## Copy The Partition Table From A Healthy Drive

root@myserverhostname:~ # sfdisk -d /dev/sda | sfdisk /dev/sdb
Checking that no-one is using this disk right now ...
OK

Disk /dev/sdb: 243201 cylinders, 255 heads, 63 sectors/track

sfdisk: ERROR: sector 0 does not have an msdos signature
 /dev/sdb: unrecognized partition table type
Old situation:
No partitions found
New situation:
Units = sectors of 512 bytes, counting from 0

   Device Boot    Start       End   #sectors  Id  System
/dev/sdb1          2048   4196351    4194304  fd  Linux raid autodetect
/dev/sdb2       4196352   5244927    1048576  fd  Linux raid autodetect
/dev/sdb3       5244928 3907027119 3901782192  fd  Linux raid autodetect
/dev/sdb4             0         -          0   0  Empty
Warning: partition 1 does not end at a cylinder boundary
Warning: partition 2 does not start at a cylinder boundary
Warning: partition 2 does not end at a cylinder boundary
Warning: partition 3 does not start at a cylinder boundary
Warning: partition 3 does not end at a cylinder boundary
Warning: no primary partition is marked bootable (active)
This does not matter for LILO, but the DOS MBR will not boot this disk.
Successfully wrote the new partition table

Re-reading the partition table ...

If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
to zero the first 512 bytes:  dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)
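
Note that the sfdisk dump shown here is for MBR partition tables. If your disks use GPT, sgdisk from the gdisk package is the usual tool instead; a sketch under that assumption:

# copy the partition table from sda onto the new sdb
# (with -R, the destination disk is the option argument)
sgdisk -R /dev/sdb /dev/sda
# give the new disk its own unique GUIDs
sgdisk -G /dev/sdb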

## Check That Both Partition Tables Are The Same

root@myserverhostname:~ # fdisk -l /dev/sda /dev/sdb

Disk /dev/sda: 2000.4 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders, total 3907029168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000348dc

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1            2048    50333696    25165824+  fd  Linux raid autodetect
/dev/sda2        50335744    51384320      524288+  fd  Linux raid autodetect
/dev/sda3        51386368  3907027120  1927820376+  fd  Linux raid autodetect

Disk /dev/sdb: 2000.4 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders, total 3907029168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000c4467

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1            2048    50333696    25165824+  fd  Linux raid autodetect
/dev/sdb2        50335744    51384320      524288+  fd  Linux raid autodetect
/dev/sdb3        51386368  3907027120  1927820376+  fd  Linux raid autodetect
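
If you’d rather not eyeball the two listings, you can diff the sfdisk dumps after normalising the device names; on newer sfdisk versions the label-id line is the only difference you should expect to see:

diff <(sfdisk -d /dev/sda | sed 's/sda/sdX/g') <(sfdisk -d /dev/sdb | sed 's/sdb/sdX/g')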

## Install The GRUB Bootloader To The New Drive

We’ve replaced the drive, which probably means the bootloader that was installed alongside the operating system is no longer present on it. This isn’t a mandatory step, but if the other drive fails later, a missing bootloader on the new one could leave the system unable to boot.

grub-install /dev/sdb
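
If you’re on a Debian or Ubuntu system (an assumption; adapt to your distro), reconfiguring the grub-pc package achieves the same thing and lets you select both drives, so the package’s own hooks keep them covered on future updates:

dpkg-reconfigure grub-pc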

## Re-Add RAID Partitions

Finally, we need to add the disk back to the RAID array.

root@myserverhostname:~ # mdadm --manage /dev/md0 --add /dev/sdb1
mdadm: added /dev/sdb1
root@myserverhostname:~ # mdadm --manage /dev/md1 --add /dev/sdb2
mdadm: added /dev/sdb2
root@myserverhostname:~ # mdadm --manage /dev/md2 --add /dev/sdb3
mdadm: added /dev/sdb3

The new disk will have to resync with each array. To follow the progress, keep an eye on the /proc/mdstat file.

root@myserverhostname:~ # cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sda1[2] sdb1[1]
      25149312 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sda2[2] sdb2[1]
      523968 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sda3[2] sdb3[1]
      1927689152 blocks super 1.2 [2/1] [_U]
      [=========>...........]  recovery = 49.1% (948001664/1927689152) finish=431.9min speed=37796K/sec

unused devices: <none>
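
Rather than re-running cat by hand, you can keep an eye on the resync with watch, or block until it finishes with mdadm’s --wait option (purely a convenience, not a required step):

watch -n 10 cat /proc/mdstat
# or wait for the resync on a specific array to complete
mdadm --wait /dev/md2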

## Possible Errors

### One of the RAID devices becomes inactive

root@myserverhostname:~ # cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : inactive sdb1[1](S)
      25149440 blocks super 1.2

md1 : active raid1 sdb2[1]
      523968 blocks super 1.2 [2/1] [_U]

md2 : active raid1 sdb3[1]
      1927689152 blocks super 1.2 [2/1] [_U]

unused devices: <none>

It’s a cliché, but the most common solution is to stop and restart the device in question.

root@myserverhostname:~ # mdadm --stop /dev/md0
mdadm: stopped /dev/md0
root@myserverhostname:~ # mdadm --assemble --scan
mdadm: /dev/md/0 has been started with 1 drive (out of 2).

### Adding a partition to a RAID device fails with a “does not have a valid v1.2 superblock” error in dmesg

Stale superblock-like bytes may be hanging around on the partition. Zeroing out the start of the partition clears them and repairs the device. In this case, I was trying to add /dev/sdb2 to /dev/md1.

root@myserverhostname:~ # mdadm --manage /dev/md1 --add /dev/sdb2
mdadm: add new device failed for /dev/sdb2 as 2: Invalid argument
root@myserverhostname:~ # dd of=/dev/sdb2 if=/dev/zero bs=1M count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.00396101 s, 265 MB/s
root@myserverhostname:~ # mdadm --manage /dev/md1 --add /dev/sdb2
mdadm: added /dev/sdb2
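
mdadm also has a dedicated option that wipes only the md superblock rather than the first megabyte of the partition, which is worth trying before reaching for dd:

mdadm --zero-superblock /dev/sdb2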

### One of the partitions of the old drive remains a member of a RAID array

This is most common when the new drive comes up under a different device name (e.g. /dev/sdc), while a partition of the old drive (e.g. /dev/sda3) remains listed as a member of an array (e.g. /dev/md2). This causes a failed disk alert to trigger.

Since /dev/sda3 no longer exists in the following example, we need to use mknod to make a device file in order to remove it from the RAID array.

mdadm --detail /dev/md2
# Find the major and minor numbers for the faulty device.
# In this case they are 8 and 3 respectively.
# We use them with mknod as follows
mknod /dev/sda3 b 8 3
mdadm /dev/md2 --remove --force /dev/sda3
rm /dev/sda3

## Last Note: Intentionally Degrading An Array

Sometimes we need to intentionally degrade an array, for testing or just for the thrill of it. To degrade an array, mark one of its member partitions as faulty; mdadm will then keep that partition out of the array, even when an automatic scan runs (e.g. on reboot).

To mark a partition as faulty:

mdadm --manage --set-faulty /dev/md2 /dev/sda3
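
To bring the partition back afterwards, remove it from the array and add it again. With the internal write-intent bitmap shown earlier, a --re-add should only need to sync the blocks that changed; if it’s refused, a plain --add will trigger a full resync. A sketch using this article’s device names:

mdadm --manage /dev/md2 --remove /dev/sda3
# --re-add resyncs only changed blocks when a bitmap is present;
# fall back to --add (full resync) if it is refused
mdadm --manage /dev/md2 --re-add /dev/sda3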