
RAID arrays are one way we make data robust, but what happens when they fail? Learn how to repair a degraded or failing RAID array step by step.
From time to time, you’ll see messages like the following on your servers.
This is an automatically generated mail message from mdadm
running on myserverhostname
A DegradedArray event had been detected on md device /dev/md/1.
Faithfully yours, etc.
P.S. The /proc/mdstat file currently contains the following:
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid1 sda2[0]
523712 blocks super 1.2 [2/1] [U_]
md2 : active raid1 sda3[0]
1919300672 blocks super 1.2 [2/1] [U_]
bitmap: 10/15 pages [40KB], 65536KB chunk
md0 : inactive sda1[0](S)
33521664 blocks super 1.2
unused devices: <none>
Checking The Disk
The example above suggests that a whole disk needs replacing. You can see in the U_ section that one of the disks is not present; otherwise it would be reporting UU.
Since only sda is being reported throughout the message, the error must be with sdb. Sometimes you’ll have issues that mention all of the disks, or just a subset of them. To double-check which disk has the error, we can pull more details about it from mdadm.
root@myserverhostname ~ # mdadm --detail /dev/md2
/dev/md2:
Version : 1.2
Creation Time : Thu Aug 25 09:03:16 2016
Raid Level : raid1
Array Size : 1919300672 (1830.39 GiB 1965.36 GB)
Used Dev Size : 1919300672 (1830.39 GiB 1965.36 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Wed Apr 12 09:41:15 2017
State : clean, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 1
Spare Devices : 0
Name : rescue:2
UUID : 231365e9:2a24b289:5e1dc18c:f48947e3
Events : 199975
Number Major Minor RaidDevice State
0 8 3 0 active sync /dev/sda3
1 0 0 1 removed
1 8 19 - faulty spare /dev/sdb3
You can see towards the bottom of the output that it lists the sda disk as active but the sdb disk as faulty/spare.
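If you have several arrays, you can loop over them rather than inspecting each one by hand. A minimal sketch, assuming the arrays are exposed as /dev/md0, /dev/md1 and so on:
# Print the overall state plus any faulty/removed members for every array.
for md in /dev/md[0-9]*; do
    echo "== $md =="
    mdadm --detail "$md" | grep -E 'State :|faulty|removed'
done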
Run A Disk Health Check
To fully check the health of the disk, we can use SMART to run a diagnostic check.
root@myserverhostname:~ # smartctl -H /dev/sdb
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.2.0-42-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
Failed Attributes:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 001 001 051 Pre-fail Always FAILING_NOW 5052
5 Reallocated_Sector_Ct 0x0033 133 133 140 Pre-fail Always FAILING_NOW 2071
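If the quick health check isn’t conclusive, smartmontools can also run a longer self-test and dump the full attribute table. A sketch, assuming the suspect drive is still /dev/sdb:
# Kick off an extended self-test; smartctl prints an estimate of how long it will take.
smartctl -t long /dev/sdb
# Once it has finished, review the self-test log and the full SMART attributes.
smartctl -l selftest /dev/sdb
smartctl -a /dev/sdb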
Getting The Drive Serial Number
It’s always important to double-check that you’re about to take out the right disk 😜. You’re probably going to need something like the serial number to verify you’ve got the right one.
bernard@myserverhostname:~$ udevadm info --query=all --name=/dev/sdb | grep ID_SERIAL
E: ID_SERIAL=WDC_WD3000FYYZ-01UL1B3_WD-WMC130F4L69U
E: ID_SERIAL_SHORT=WD-WMC130F4L69U
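If udevadm isn’t to hand, the serial number can usually be read in other ways too. A couple of alternatives, assuming smartmontools and a reasonably recent util-linux are installed:
# Drive identity, including the serial number, straight from SMART.
smartctl -i /dev/sdb
# Or list the model and serial for every block device at once.
lsblk -o NAME,MODEL,SERIAL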
Replacing The Disk
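Before pulling the old drive, it’s worth making sure none of its partitions are still registered as (faulty) members of the arrays. A minimal sketch, assuming the failing drive is /dev/sdb and its partitions back md0, md1 and md2 as above; mdadm will simply complain if a partition has already been dropped:
# Mark each old partition as failed, then remove it from its array.
mdadm --manage /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
mdadm --manage /dev/md1 --fail /dev/sdb2 --remove /dev/sdb2
mdadm --manage /dev/md2 --fail /dev/sdb3 --remove /dev/sdb3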
Physically replace the disk and ensure it’s recognised by the system. Before continuing, check that the new disk isn’t in use:
root@myserverhostname:~ # lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:16 0 1.8T 0 disk
├─sda1 8:17 0 2G 0 part
│ └─md0 9:0 0 2G 0 raid1
├─sda2 8:18 0 512M 0 part
│ └─md1 9:1 0 511.4M 0 raid1 /boot
└─sda3 8:19 0 1.8T 0 part
└─md2 9:2 0 1.8T 0 raid1 /
sdb 8:0 0 1.8T 0 disk
The lack of any md devices under sdb shows the disk isn’t in use by an array.
Copy The Partition Table From A Healthy Drive
root@myserverhostname:~ # sfdisk -d /dev/sda | sfdisk /dev/sdb
Checking that no-one is using this disk right now ...
OK
Disk /dev/sdb: 243201 cylinders, 255 heads, 63 sectors/track
sfdisk: ERROR: sector 0 does not have an msdos signature
/dev/sdb: unrecognized partition table type
Old situation:
No partitions found
New situation:
Units = sectors of 512 bytes, counting from 0
Device Boot Start End #sectors Id System
/dev/sdb1 2048 4196351 4194304 fd Linux raid autodetect
/dev/sdb2 4196352 5244927 1048576 fd Linux raid autodetect
/dev/sdb3 5244928 3907027119 3901782192 fd Linux raid autodetect
/dev/sdb4 0 - 0 0 Empty
Warning: partition 1 does not end at a cylinder boundary
Warning: partition 2 does not start at a cylinder boundary
Warning: partition 2 does not end at a cylinder boundary
Warning: partition 3 does not start at a cylinder boundary
Warning: partition 3 does not end at a cylinder boundary
Warning: no primary partition is marked bootable (active)
This does not matter for LILO, but the DOS MBR will not boot this disk.
Successfully wrote the new partition table
Re-reading the partition table ...
If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
to zero the first 512 bytes: dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)
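The drives in this example use MBR (DOS) partition tables. If yours use GPT instead, the copy can be done with sgdisk from the gdisk package; a sketch (note that the target disk comes first, and each disk needs its own GUIDs):
# Replicate sda's partition table onto sdb, then randomise sdb's GUIDs.
sgdisk -R /dev/sdb /dev/sda
sgdisk -G /dev/sdb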
Check That Both Partition Tables Are The Same
root@myserverhostname:~ # fdisk -l /dev/sda /dev/sdb
Disk /dev/sda: 2000.4 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders, total 3907029168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000348dc
Device Boot Start End Blocks Id System
/dev/sda1 2048 50333696 25165824+ fd Linux raid autodetect
/dev/sda2 50335744 51384320 524288+ fd Linux raid autodetect
/dev/sda3 51386368 3907027120 1927820376+ fd Linux raid autodetect
Disk /dev/sdb: 2000.4 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders, total 3907029168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000c4467
Device Boot Start End Blocks Id System
/dev/sdb1 2048 50333696 25165824+ fd Linux raid autodetect
/dev/sdb2 50335744 51384320 524288+ fd Linux raid autodetect
/dev/sdb3 51386368 3907027120 1927820376+ fd Linux raid autodetect
Install The GRUB Bootloader To The New Drive
We’ve replaced the drive, which probably means that the bootloader installed along with the OS is no longer present. This isn’t a mandatory step, but if you skip it and the other drive fails later, you might have trouble booting into the operating system.
grub-install /dev/sdb
Re-Add RAID Partitions
Finally, we need to add the disk back to the RAID array.
root@myserverhostname:~ # mdadm --manage /dev/md0 --add /dev/sdb1
mdadm: added /dev/sdb1
root@myserverhostname:~ # mdadm --manage /dev/md1 --add /dev/sdb2
mdadm: added /dev/sdb2
root@myserverhostname:~ # mdadm --manage /dev/md2 --add /dev/sdb3
mdadm: added /dev/sdb3
The new disk will have to resync with the array. To check on the progress, you can watch the /proc/mdstat file for updates.
root@myserverhostname:~ # cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sda1[2] sdb1[1]
25149312 blocks super 1.2 [2/2] [UU]
md1 : active raid1 sda2[2] sdb2[1]
523968 blocks super 1.2 [2/2] [UU]
md2 : active raid1 sda3[2] sdb3[1]
1927689152 blocks super 1.2 [2/1] [_U]
[=========>...........] recovery = 49.1% (948001664/1927689152) finish=431.9min speed=37796K/sec
unused devices: <none>
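Rather than re-running cat by hand, you can keep an eye on the rebuild as it runs; a small sketch:
# Refresh the RAID status every ten seconds until interrupted.
watch -n 10 cat /proc/mdstat
# The kernel's resync speed limits (in KB/s) live here if the rebuild seems throttled.
cat /proc/sys/dev/raid/speed_limit_min /proc/sys/dev/raid/speed_limit_max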
Possible Errors
One of the RAID Devices becomes inactive
root@myserverhostname:~ # cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : inactive sdb1[1](S)
25149440 blocks super 1.2
md1 : active raid1 sdb2[1]
523968 blocks super 1.2 [2/1] [_U]
md2 : active raid1 sdb3[1]
1927689152 blocks super 1.2 [2/1] [_U]
unused devices: <none>
It’s a cliché, but the most common solution is to stop and restart the device in question.
root@myserverhostname:~ # /var/log # mdadm --stop /dev/md0
mdadm: stopped /dev/md0
root@myserverhostname:~ # /var/log # mdadm --assemble --scan
mdadm: /dev/md/0 has been started with 1 drive (out of 2).
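If the array comes back with only one of its two members, the missing partition can then be re-added as in the earlier section; a sketch, assuming /dev/sda1 is the absent member here:
mdadm --manage /dev/md0 --add /dev/sda1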
Adding a partition to a RAID device fails with a “does not have a valid v1.2 superblock” error in dmesg
Superblock-like bytes may be hanging around on the partition. These can be zeroed out to repair the device. In this case, I was trying to add /dev/sdb2 to /dev/md1.
root@myserverhostname:~ # mdadm --manage /dev/md1 --add /dev/sdb2
mdadm: add new device failed for /dev/sdb2 as 2: Invalid argument
root@myserverhostname:~ # dd of=/dev/sdb2 if=/dev/zero bs=1M count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.00396101 s, 265 MB/s
root@myserverhostname:~ # mdadm --manage /dev/md1 --add /dev/sdb2
mdadm: added /dev/sdb2
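mdadm also has a dedicated option for wiping stale RAID metadata from a partition, which is more targeted than overwriting the first megabyte with dd; a sketch, assuming the partition isn’t part of any active array:
# Clear any old md superblock from the partition before re-adding it.
mdadm --zero-superblock /dev/sdb2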
One of the partitions of the old drive remains a member of a RAID array
This is most common when the new drive comes up under a different letter (e.g. /dev/sdc) and a partition from the old drive (e.g. /dev/sda3) remains a member of an array (e.g. /dev/md2). This causes a failed disk alert to trigger.
Since /dev/sda3 no longer exists in the following example, we need to use mknod to create a device file so that we can remove it from the RAID array.
mdadm --detail /dev/md2
# Find the major and minor numbers for the faulty device.
# In this case they are 8 and 3 respectively.
# We use them with mknod as follows
mknod /dev/sda3 b 8 3
mdadm /dev/md2 --remove --force /dev/sda3
rm /dev/sda3
Last Note: Intentionally Degrading An Array
Sometimes we need to intentionally degrade an array, whether for testing or just for the thrill of it. To degrade an array, mark one of the partitions on a disk as faulty; mdadm will then refuse to include that partition, even when an automatic scan is done (e.g. on reboot).
To mark a partition as faulty:
mdadm --manage --set-faulty /dev/md2 /dev/sda3
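To bring the array back to a healthy state afterwards, remove the deliberately failed partition and add it back; it will then resync. A sketch:
# Drop the failed member from the array, then re-add it to trigger a resync.
mdadm --manage /dev/md2 --remove /dev/sda3
mdadm --manage /dev/md2 --add /dev/sda3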