Thursday, 16 September 2010

Using RAID and S.M.A.R.T. to save yourself from data loss and lots of grief

Lately I have been configuring SMART monitoring on a lot of my servers and workstations, both at work and at home. Especially when combined with RAID, SMART can help you avoid the disaster of losing your data when your hard drives break.

Short primer
RAID means "Redundant Array of Independent/Inexpensive Disks" and is mostly thought of as something you do at the hardware level, via a separate RAID controller or a motherboard that supports RAID. RAID is used either to improve the performance of disk operations, which are the biggest bottleneck in modern systems most of the time, or to provide redundancy so that you do not lose data when a drive suffers a hardware failure.

With Linux you can use something called 'software RAID', which requires neither an expensive RAID controller nor support from the motherboard. All you basically need is more than one hard drive. With two hard drives you can already set up RAID0 or RAID1.
RAID0 means that you distribute your data across two (or more) drives and get a big performance benefit, but the downside is that if you lose any one disk to a HW failure, you lose all the data on all the disks.
RAID1 is quite the opposite of RAID0: it mirrors the data on the first drive to the second drive, which means that write operations are somewhat slower, but the data is now duplicated and thus survives the failure of either drive.
There are several other RAID levels too, like RAID5, which requires a minimum of 3 drives, survives the failure of one disk and gives some performance improvement. My personal favorite currently is RAID10, which is a combination of RAID0 and RAID1, meaning that you get quite good redundancy in case of failures and quite a nice performance boost as well. The downside of RAID10 is that you need 4 drives to get started and 'lose' the capacity of two of them to mirroring.
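To make the capacity trade-off concrete, here is a small sketch that computes the usable space for the RAID levels mentioned above. The function name raid_usable is made up for this post, and the sketch assumes all drives are the same size, that RAID1 mirrors every drive in the array, and that RAID10 uses an even number of drives. It does not check minimum drive counts.

```shell
#!/bin/bash
# Usable capacity for N equal-sized drives at a given RAID level.
# Usage: raid_usable LEVEL DRIVES SIZE   (SIZE in any unit, e.g. GB)
# Hypothetical helper for illustration; not part of any RAID tool.
raid_usable() {
    local level="$1" n="$2" size="$3"
    case "$level" in
        0)  echo $(( n * size )) ;;        # striping: all space is usable
        1)  echo "$size" ;;                # mirroring: one drive's worth
        5)  echo $(( (n - 1) * size )) ;;  # one drive's worth goes to parity
        10) echo $(( n / 2 * size )) ;;    # mirrored pairs, striped together
        *)  echo "unsupported RAID level: $level" >&2; return 1 ;;
    esac
}
```

For example, raid_usable 10 4 500 prints 1000: four 500 GB drives in RAID10 give 1000 GB of usable space, which is the 'lost' two drives mentioned above.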

For more information about RAID levels check this wikipage, and for information on how to create RAID the hard way (remember that it is easiest to create during installation with the distro's installer) check out Linux Journal's article.

SMART (or actually S.M.A.R.T.), on the other hand, means "Self-Monitoring, Analysis, and Reporting Technology", a technology that most, if not all, modern hard drives support. With SMART-enabled drives you can gather information straight from the individual drive(s) and use that information to predict when a disk is about to fail of old age or otherwise. Hard drives have gotten a lot better since the infamous times of the IBM 'Deathstar' hard drives, but the fact remains: hard drives will die of old age sooner or later. If you can get an advance warning of this impending doom, you have time to make the necessary preparations, like making backups or even replacing the drive before you run the risk of losing your data.

For more information check out the excellent article by Linux Journal.

Getting started
I am not going to show you how to create a RAID system; it is quite easy to do when you install your operating system. The openSUSE installer, for example, provides a nice tool for creating the RAID on which to install the system.

Note! You will need super user (root) privileges for (almost) all the steps below.

Step 0
Anyone can start using SMART right away, as it requires neither special hardware nor special setup. Some systems might require that you enable SMART support in your computer's BIOS, so check that first. The next step is to make sure you have the package called "smartmontools" installed. Once you do, you can try running a command like this:

smartctl -a /dev/sda

This should output a lot of information about your hard drive, but if you have not enabled SMART testing this information will be of little use. Next you will need to make SMART testing and data collection automatic and regularly scheduled.
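If the full report is too much to read, you can narrow it down. The sketch below defines two small helpers (their names are made up for this post) that read smartctl output from standard input. It assumes the usual smartmontools output: a "test result: PASSED" line for ATA drives or "SMART Health Status: OK" for SCSI drives from smartctl -H, and the ten-column attribute table from smartctl -A with the raw value in the last column.

```shell
#!/bin/bash
# Succeeds only if `smartctl -H` output on stdin reports a healthy drive.
# Hypothetical helper; the matched strings are assumed from typical output.
smart_health_ok() {
    grep -Eq 'test result: PASSED|SMART Health Status: OK'
}

# Prints the raw value of one named attribute from `smartctl -A` output
# on stdin (column 2 is the attribute name, column 10 the raw value).
smart_attr_raw() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Example use (needs root and a real drive):
#   smartctl -s on /dev/sda                      # switch SMART on if it is off
#   smartctl -H /dev/sda | smart_health_ok || echo "check /dev/sda"
#   smartctl -A /dev/sda | smart_attr_raw Reallocated_Sector_Ct
```

Reading from standard input keeps the helpers independent of any particular drive, so you can try them on saved smartctl output first.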

Step 1
The first step is to edit the config file for smartd, usually /etc/smartd.conf. Open it in your favourite text editor and first comment out anything that already exists there, usually this line:

DEVICESCAN -d removable

Now you can add configuration for your drives at the end of the file. Here is an example from my workstation, which has 4 drives in RAID10:

#Run every Sunday offline, Saturday Long and every evening conveyance and morning short test
/dev/sda \
-H \
-l error -l selftest \
-s (O/../../7/02|L/../../6/02|C/../.././20|S/../.././01) \
-m NotUsedNow -M exec /usr/local/bin/smartd.sh -M once

#Run every Sunday offline, Saturday Long and every evening conveyance and morning short test
/dev/sdb \
-H \
-l error -l selftest \
-s (O/../../7/05|L/../../6/04|C/../.././21|S/../.././02) \
-m NotUsedNow -M exec /usr/local/bin/smartd.sh -M once

#Run every Sunday offline, Saturday Long and every evening conveyance and morning short test
/dev/sdc \
-H \
-l error -l selftest \
-s (O/../../7/08|L/../../6/06|C/../.././22|S/../.././03) \
-m NotUsedNow -M exec /usr/local/bin/smartd.sh -M once

#Run every Sunday offline, Saturday Long and every evening conveyance and morning short test
/dev/sdd \
-H \
-l error -l selftest \
-s (O/../../7/11|L/../../6/08|C/../.././23|S/../.././04) \
-m NotUsedNow -M exec /usr/local/bin/smartd.sh -M once

In the above configuration, tests that take a long time to complete are run during weekends and shorter tests are run daily. You can change the starting time by modifying the last number in each expression.

For example, the part "O/../../7/11" follows the pattern "T/MM/DD/d/HH". The first letter is the test type (Offline, Long, Conveyance or Short) and the rest of the fields schedule the test: month, day of month, day of week and hour. The whole thing is a regular expression, so a field of dots matches any value. In this particular case we run the Offline test on day 7 of the week (Sunday) at 11:00 (AM for those with 12-hour time disability).
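If you want to double-check a term before putting it into smartd.conf, a few lines of shell can spell it out. This is only a sketch: the function names are made up for this post, and it handles single terms with plain values or all-dot wildcards, not arbitrary regular expressions.

```shell
#!/bin/bash
# Turn an all-dots field ("." or "..") into the word "any".
wild_or_value() {
    case "$1" in
        .|..) echo "any" ;;
        *)    echo "$1" ;;
    esac
}

# Describe one schedule term of the form T/MM/DD/d/HH.
# Hypothetical helper for illustration; not part of smartmontools.
describe_schedule() {
    local type month day dow hour
    IFS=/ read -r type month day dow hour <<EOF
$1
EOF
    case "$type" in
        O) type="offline" ;;
        L) type="long" ;;
        C) type="conveyance" ;;
        S) type="short" ;;
        *) echo "unknown test type: $type" >&2; return 1 ;;
    esac
    echo "$type test: month=$(wild_or_value "$month") day=$(wild_or_value "$day") weekday=$(wild_or_value "$dow") hour=$(wild_or_value "$hour")"
}
```

Running describe_schedule "O/../../7/11" prints "offline test: month=any day=any weekday=7 hour=11", which matches the reading given above.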

The second interesting part of the above configuration is "-M exec /usr/local/bin/smartd.sh", which tells smartd what to do in case of problems. Here it is set up to run the script /usr/local/bin/smartd.sh, which you will create in Step 2.

Lastly, you only need to create the above configuration for the hard drives you actually have in your system. As mentioned, I have four disks (sda, sdb, sdc and sdd) and you might have fewer. To see your drives you can use the following command:

ls /dev/sd?
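Typing the same entry for every drive gets tedious, so here is a sketch of a generator. The function name smartd_entry is made up for this post. It reproduces the staggering used in my configuration above (offline tests 3 hours apart, long tests 2 hours apart, conveyance and short tests 1 hour apart), which is just my own choice. Hours wrap around at 24, so with more than four drives some tests will start to overlap.

```shell
#!/bin/bash
# Print a smartd.conf entry for one drive. INDEX (0, 1, 2, ...) shifts the
# start hours so that the drives are not all tested at the same time.
# Hypothetical helper for illustration; review its output before using it.
smartd_entry() {
    local dev="$1" i="$2"
    printf '%s \\\n-H \\\n-l error -l selftest \\\n' "$dev"
    printf -- '-s (O/../../7/%02d|L/../../6/%02d|C/../.././%02d|S/../.././%02d) \\\n' \
        $(( (2 + 3 * i) % 24 )) $(( (2 + 2 * i) % 24 )) \
        $(( (20 + i) % 24 )) $(( (1 + i) % 24 ))
    printf -- '-m NotUsedNow -M exec /usr/local/bin/smartd.sh -M once\n\n'
}

# Example use:
#   i=0
#   for dev in /dev/sd?; do
#       smartd_entry "$dev" "$i"
#       i=$((i + 1))
#   done
```

Note that /dev/sd? also matches things like USB sticks, so check the output and remove entries you do not want before appending it to /etc/smartd.conf.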

Step 2

Again with your text editor, create a file /usr/local/bin/smartd.sh, paste in the following and put your own email address in the appropriate place:
#!/bin/bash
# Append the warning from smartd to the log, then mail the whole log.
LOGFILE="/var/log/smartd.log"
echo -e "$(date)\n$SMARTD_MESSAGE\n" >> "$LOGFILE"
mail myemail.address@somehost.com < "$LOGFILE"

After you have created the file above, you need to make it executable by running the following command:

chmod +x /usr/local/bin/smartd.sh

Extra Step for Ubuntu users
Edit the file /etc/default/smartmontools and uncomment the following line:

start_smartd=yes

This will make the SMART daemon start automatically during boot.

Step 3
Either restart the SMART service or reboot your computer. Restarting the service works like this in openSUSE:

/etc/init.d/smartd restart

Or in Ubuntu:

/etc/init.d/smartmontools restart


That's it, almost
Now in theory you will get an email if SMART detects problems with your hard drives, but that requires your mail daemon to be in working order. You can test this simply by executing the script we created:

/usr/local/bin/smartd.sh

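You should receive an email containing the contents of smartd.log. When the script is run by hand, $SMARTD_MESSAGE is not set, so the new log entry holds only a timestamp.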
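For a slightly more realistic test you can hand the script the same kind of environment variables smartd sets when it runs it. The sketch below assumes only the SMARTD_DEVICE and SMARTD_MESSAGE variables. The function name and the message text are made up for this post, and a real warning from smartd will be worded differently.

```shell
#!/bin/bash
# Run a command with a fake smartd warning in its environment.
# Usage: simulate_smartd_alert DEVICE COMMAND [ARGS...]
# Hypothetical helper for illustration; not part of smartmontools.
simulate_smartd_alert() {
    local dev="$1"
    shift
    SMARTD_DEVICE="$dev" \
    SMARTD_MESSAGE="Simulated warning for $dev" \
    "$@"
}

# Example use (as root, so the script can write its log):
#   simulate_smartd_alert /dev/sda /usr/local/bin/smartd.sh
```

This time the mail should contain the simulated warning text. As far as I know smartd also has a "-M test" directive that sends one test warning when the daemon starts, which checks the whole chain including smartd itself.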

If your email notification is not working, you probably need to enable the mail daemon, but that is quite distro specific and you will need to figure it out by yourself.

Another option is to use some kind of monitoring software instead. Ubuntu and Debian users can use the Smart-notifier application and KDE users can use a Plasmoid called Plasmart.

Hopefully that will some day save you from the disaster of losing your precious files. But in the meantime remember that neither RAID nor SMART replaces making backups. They only safeguard you against hardware failures; you will need traditional backups to safeguard against user, application and operating system errors!
