RAID arrays are one way we make data robust, but what happens when they fail? Learn how to repair a degraded or failing RAID array, step by step.
From time to time, you’ll see messages like the following on your servers.
This is an automatically generated mail message from mdadm
running on myserverhostname

A DegradedArray event had been detected on md device /dev/md/1.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid1 sda2
      523712 blocks super 1.2 [2/1] [U_]

md2 : active raid1 sda3
      1919300672 blocks super 1.2 [2/1] [U_]
      bitmap: 10/15 pages [40KB], 65536KB chunk

md0 : inactive sda1(S)
      33521664 blocks super 1.2

unused devices: <none>
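These mails come from mdadm’s monitoring mode. If your server isn’t sending them, a minimal sketch of the relevant configuration, assuming a Debian-style /etc/mdadm/mdadm.conf and a working local mail setup:

# /etc/mdadm/mdadm.conf (location is an assumption; some distributions use /etc/mdadm.conf)
# Address that DegradedArray and other events get mailed to
MAILADDR you@example.com
# The monitor itself is normally started by the distribution, roughly equivalent to:
#   mdadm --monitor --scan --daemonise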
The example above looks like a full disk replacement is needed. You can see in the [U_] section that one of the disks is not present; otherwise it would be reporting [UU]. Since sda is being reported throughout the message, the error must be with sdb. Sometimes, you can have issues that mention all or just a subset of disks. To double-check which disk has the error, we can pull more details about the array from mdadm itself:
# mdadm --detail /dev/md2
/dev/md2:
        Version : 1.2
  Creation Time : Thu Aug 25 09:03:16 2016
     Raid Level : raid1
     Array Size : 1919300672 (1830.39 GiB 1965.36 GB)
  Used Dev Size : 1919300672 (1830.39 GiB 1965.36 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Wed Apr 12 09:41:15 2017
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

           Name : rescue:2
           UUID : 231365e9:2a24b289:5e1dc18c:f48947e3
         Events : 199975

    Number   Major   Minor   RaidDevice State
       0       8        3        0      active sync   /dev/sda3
       1       0        0        1      removed

       1       8       19        -      faulty spare   /dev/sdb3
You can see towards the bottom of the output that the sda disk is listed as active but the sdb disk is listed as faulty/spare.
To fully check the health of the disk, we can use SMART to run a diagnostic check.
# smartctl -H /dev/sdb
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.2.0-42-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
Failed Attributes:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   001   001   051    Pre-fail  Always   FAILING_NOW 5052
  5 Reallocated_Sector_Ct   0x0033   133   133   140    Pre-fail  Always   FAILING_NOW 2071
It’s always important to double-check you’re about to take out the right disk 😜 You’re probably going to need something like the serial number to verify you got the right one.
$ udevadm info --query=all --name=/dev/sdb | grep ID_SERIAL
E: ID_SERIAL=WDC_WD3000FYYZ-01UL1B3_WD-WMC130F4L69U
E: ID_SERIAL_SHORT=WD-WMC130F4L69U
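Before pulling the drive, it can also be worth making sure none of its partitions are still listed as active members of the arrays. A minimal sketch based on the layout above (in this example sdb3 had already been marked faulty, so some of these commands may simply report that the device is no longer there):

# Mark the failing disk's partitions as faulty and remove them from their arrays
mdadm --manage /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
mdadm --manage /dev/md1 --fail /dev/sdb2 --remove /dev/sdb2
mdadm --manage /dev/md2 --fail /dev/sdb3 --remove /dev/sdb3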
Physically replace the disk and ensure it’s recognised by the system.
Before we start, ensure the new disk isn’t in use:
# lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda       8:16   0   1.8T  0 disk
├─sda1    8:17   0     2G  0 part
│ └─md0   9:0    0     2G  0 raid1
├─sda2    8:18   0   512M  0 part
│ └─md1   9:1    0 511.4M  0 raid1 /boot
└─sda3    8:19   0   1.8T  0 part
  └─md2   9:2    0   1.8T  0 raid1 /
sdb       8:0    0   1.8T  0 disk
The lack of md-prefixed entries under sdb shows the disk isn’t in use. Next, copy the partition table from the healthy disk (sda) to the new disk (sdb) with sfdisk:
# sfdisk -d /dev/sda | sfdisk /dev/sdb
Checking that no-one is using this disk right now ...
OK

Disk /dev/sdb: 243201 cylinders, 255 heads, 63 sectors/track

sfdisk: ERROR: sector 0 does not have an msdos signature
 /dev/sdb: unrecognized partition table type
Old situation:
No partitions found
New situation:
Units = sectors of 512 bytes, counting from 0

   Device Boot    Start       End   #sectors  Id  System
/dev/sdb1          2048   4196351    4194304  fd  Linux raid autodetect
/dev/sdb2       4196352   5244927    1048576  fd  Linux raid autodetect
/dev/sdb3       5244928 3907027119 3901782192  fd  Linux raid autodetect
/dev/sdb4             0         -          0   0  Empty
Warning: partition 1 does not end at a cylinder boundary
Warning: partition 2 does not start at a cylinder boundary
Warning: partition 2 does not end at a cylinder boundary
Warning: partition 3 does not start at a cylinder boundary
Warning: partition 3 does not end at a cylinder boundary
Warning: no primary partition is marked bootable (active)
This does not matter for LILO, but the DOS MBR will not boot this disk.
Successfully wrote the new partition table

Re-reading the partition table ...

If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
to zero the first 512 bytes:  dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)
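Older versions of sfdisk, like the one shown above, only handle MBR partition tables. If your disks use GPT, sgdisk from the gdisk package can copy the layout instead; a minimal sketch, assuming the same sda-to-sdb direction:

# Replicate sda's GPT partition table onto sdb, then give sdb fresh GUIDs
sgdisk --replicate=/dev/sdb /dev/sda
sgdisk --randomize-guids /dev/sdb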
We can confirm both disks now have matching partition layouts:

# fdisk -l /dev/sda /dev/sdb

Disk /dev/sda: 2000.4 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders, total 3907029168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000348dc

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1            2048    50333696    25165824+  fd  Linux raid autodetect
/dev/sda2        50335744    51384320      524288+  fd  Linux raid autodetect
/dev/sda3        51386368  3907027120  1927820376+  fd  Linux raid autodetect

Disk /dev/sdb: 2000.4 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders, total 3907029168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000c4467

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1            2048    50333696    25165824+  fd  Linux raid autodetect
/dev/sdb2        50335744    51384320      524288+  fd  Linux raid autodetect
/dev/sdb3        51386368  3907027120  1927820376+  fd  Linux raid autodetect
We’ve replaced the drive, which means the bootloader that was installed alongside the OS is no longer present on it. Reinstalling it isn’t a mandatory step, but if the other drive fails later, its absence can cause issues booting into the operating system.
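A minimal sketch of reinstalling it, assuming a BIOS/MBR system booting with GRUB 2 on a Debian-style distribution (UEFI systems and other distributions will differ):

# Install GRUB to the new disk's MBR so the machine can still boot if sda dies
grub-install /dev/sdb
# Regenerate the GRUB configuration (use grub-mkconfig -o /boot/grub/grub.cfg elsewhere)
update-grub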
Finally, we need to add the new disk’s partitions back to the RAID arrays.
# mdadm --manage /dev/md0 --add /dev/sdb1
mdadm: added /dev/sdb1
# mdadm --manage /dev/md1 --add /dev/sdb2
mdadm: added /dev/sdb2
# mdadm --manage /dev/md2 --add /dev/sdb3
mdadm: added /dev/sdb3
The new disk will have to resync with the array. To check on the progress, you can read the /proc/mdstat file.
# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sda1 sdb1
      25149312 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sda2 sdb2
      523968 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sda3 sdb3
      1927689152 blocks super 1.2 [2/1] [_U]
      [=========>...........]  recovery = 49.1% (948001664/1927689152) finish=431.9min speed=37796K/sec

unused devices: <none>
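If you’d rather not keep re-running cat by hand, watch can poll the file for you (a small convenience; assumes the standard procps watch utility is available):

# Re-display the RAID status every five seconds (Ctrl+C to stop)
watch -n 5 cat /proc/mdstat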
Things don’t always go smoothly, though. Occasionally an array will refuse to start and show up as inactive, with its remaining member listed as a spare, like md0 here:

# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : inactive sdb1(S)
      25149440 blocks super 1.2

md1 : active raid1 sdb2
      523968 blocks super 1.2 [2/1] [_U]

md2 : active raid1 sdb3
      1927689152 blocks super 1.2 [2/1] [_U]

unused devices: <none>
It’s a cliché, but the most common solution is to stop and restart the device in question.
# mdadm --stop /dev/md0
mdadm: stopped /dev/md0
# mdadm --assemble --scan
mdadm: /dev/md/0 has been started with 1 drive (out of 2).
Another problem you can hit when re-adding a partition is stale, superblock-like bytes hanging around on it. These can be zeroed out to repair the device. In this case, adding /dev/sdb2 back to /dev/md1 failed with an "Invalid argument" error until the start of the partition was wiped:
# mdadm --manage /dev/md1 --add /dev/sdb2
mdadm: add new device failed for /dev/sdb2 as 2: Invalid argument
# dd of=/dev/sdb2 if=/dev/zero bs=1M count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.00396101 s, 265 MB/s
# mdadm --manage /dev/md1 --add /dev/sdb2
mdadm: added /dev/sdb2
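mdadm also has a dedicated option for wiping its own metadata, which is a more targeted alternative to the dd approach above (a sketch; only run it against a partition that isn’t part of an active array):

# Clear any stale md superblock from the partition, then re-add it
mdadm --zero-superblock /dev/sdb2
mdadm --manage /dev/md1 --add /dev/sdb2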
You might also see a failed disk alert for a device that no longer exists. This is most common when the new drive comes up under a different letter (e.g. /dev/sdc) while a partition from the previous drive (e.g. /dev/sda3) has remained a member of an array (e.g. /dev/md2), which keeps the alert firing. Because /dev/sda3 no longer exists in the following example, we need to use mknod to create a device file for it so that it can be removed from the RAID array.
mdadm --detail /dev/md2
# Find the major and minor numbers for the faulty device.
# In this case they are 8 and 3 respectively.
# We use them with mknod as follows:
mknod /dev/sda3 b 8 3
mdadm /dev/md2 --remove --force /dev/sda3
rm /dev/sda3
Sometimes, we need to intentionally degrade an array for testing or just the thrill of it. To degrade an array, mark one of the partitions on a disk as faulty and mdadm will refuse to include that partition even when an automatic scan is done (e.g. on reboot).
To mark a partition as faulty:
mdadm --manage --set-faulty /dev/md2 /dev/sda3
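To bring the array back to full health afterwards, remove the partition you marked as faulty and then add it back; mdadm will resync it (a sketch using the same devices as above):

# Remove the faulty-marked partition and re-add it to trigger a rebuild
mdadm --manage /dev/md2 --remove /dev/sda3
mdadm --manage /dev/md2 --add /dev/sda3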