Replacing a failed RAID drive

Here's the complete procedure I followed to replace a failed drive from a RAID array on a Debian machine.

Replace the failed drive

After seeing that /dev/sdb had been kicked out of my RAID array, I used smartmontools to identify the serial number of the drive to pull out:

smartctl -a /dev/sdb

Armed with this information, I shutdown the computer, pulled the bad drive out and put the new blank one in.

Initialize the new drive

After booting with the new blank drive in, I copied the partition table using parted.

First, I took a look at what the partition table looks like on the good drive:

$ parted /dev/sda
unit s
print

and created a new empty one on the replacement drive:

$ parted /dev/sdb
unit s
mktable gpt

then I ran mkpart for all 4 partitions and made them all the same size as the matching ones on /dev/sda.

Finally, I ran toggle 1 bios_grub (boot partition) and toggle X raid (where X is the partition number) for all RAID partitions, before verifying using print that the two partition tables were now the same.

Resync/recreate the RAID arrays

To sync the data from the good drive (/dev/sda) to the replacement one (/dev/sdb), I ran the following on my RAID1 partitions:

mdadm /dev/md0 -a /dev/sdb2
mdadm /dev/md2 -a /dev/sdb4

and kept an eye on the status of this sync using:

watch -n 2 cat /proc/mdstat

In order to speed up the sync, I used the following trick:

blockdev --setra 65536 "/dev/md0"
blockdev --setra 65536 "/dev/md2"
echo 300000 > /proc/sys/dev/raid/speed_limit_min
echo 1000000 > /proc/sys/dev/raid/speed_limit_max

Then, I recreated my RAID0 swap partition like this:

mdadm --stop /dev/md1
mdadm /dev/md1 --create --level=0 --raid-devices=2 /dev/sda3 /dev/sdb3
mkswap /dev/md1

Because the swap partition is brand new (you can't restore a RAID0, you need to re-create it), I had to update two things:

  • replace the UUID for the swap mount in /etc/fstab, with the one returned by mkswap (or running blkid and looking for /dev/md1)
  • replace the UUID for /dev/md1 in /etc/mdadm/mdadm.conf with the one returned for /dev/md1 by mdadm --detail --scan

Ensuring that I can boot with the replacement drive

In order to be able to boot from both drives, I made sure that the replacement drive was included in the list from dpkg-reconfigure grub-pc and then reinstalled the grub boot loader manually onto it:

grub-install /dev/sdb

before rebooting with both drives to first make sure that my new config works.

Then I booted without /dev/sda to make sure that everything would be fine should that drive fail and leave me with just the new one (/dev/sdb).

This test obviously gets the two drives out of sync, so I rebooted with both drives plugged in and then had to re-add /dev/sda to the RAID1 arrays:

mdadm /dev/md0 -a /dev/sda2
mdadm /dev/md2 -a /dev/sda4

Once that finished, I rebooted again with both drives plugged in to confirm that everything is fine:

cat /proc/mdstat

Testing the new drive

Finally once everything was configured, I ran a full SMART test over the new replacement drive:

smartctl -t long /dev/sdb