Friday, July 20, 2012

Disaster Recovery and Business Continuity (part 1)

This has been entirely a political blog lately, but that's more because I haven't really had any personal stuff to relate than because it's really intended to be purely a political blog.  Today, though, I want to relate something that might have a more immediate impact on people's lives (and that happens to relate closely to my profession):  computer backups.

A decade and a half ago it was bizarre beyond belief that I backed up my personal computer.  These days it's still probably a minority of people who actually back up their computers, but most people at least think it's a good idea.  Even among people who have backups, though, most of the strategies aren't that well thought out.  For instance, all major desktop OSes these days support RAID out of the box, so I wouldn't be surprised to find that a significant percentage of people are relying on a disk mirror (two disks that get written simultaneously) for backup.  If you're doing that, then you're probably never going to lose all your data (as opposed to your next-door neighbor, who just has one disk and is probably going to suffer complete loss at some point), but you have a badly designed system for a desktop.

RAID is not a substitute for a backup.  If the server gets hacked or somebody accidentally removes stuff that needs to be there or the stars align just wrong and bad data gets copied to the good disk, you're still up a creek.  So server admins also make backups.  And they ship them offsite in case the whole building gets destroyed.

Now maybe that's too much work for a home user.  After all, if your whole house burned down the last thing you're going to be thinking about is recovering your family pictures from two years ago, right?  Hmm, I don't know about you, but if I could take one thing out of my house it would be my family pictures.  So why not do it now so that we don't have to worry about it while it's burning down?

There are two things you need to know about Disaster Recovery (DR) planning:

Recovery Point Objective (RPO) - How far back from an "event" (computer being destroyed) do we have to go on recovery?

Recovery Time Objective (RTO) - How long does it take to get back up and running?
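To make the RPO idea concrete, here is a minimal sketch in Python.  The function name `data_loss` and the backup dates are made up for illustration; the only point is that what you lose is the gap between your last good backup and the event, and the RPO is the worst that gap is allowed to get.

```python
from datetime import datetime, timedelta

def data_loss(backups, event):
    """Return the time between the most recent backup and the event.

    Returns None when no backup predates the event, i.e. "no recovery".
    """
    earlier = [b for b in backups if b <= event]
    if not earlier:
        return None
    return event - max(earlier)

# Hypothetical weekly backups, taken every Sunday at midnight.
backups = [datetime(2012, 7, 1), datetime(2012, 7, 8), datetime(2012, 7, 15)]

# Hypothetical event: the disk dies on Friday morning.
event = datetime(2012, 7, 20, 9, 0)

print(data_loss(backups, event))  # 5 days, 9:00:00
```

With a weekly schedule the loss can never exceed one week, which is what "RPO: one week" means.  RTO is a separate question: it measures how long the restore takes, not how much data is gone.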

I'm going to consider three scenarios for computing our efficacy: File deletion, Single Disk failure, and Total and Catastrophic failure (house burns down).  Let's take a simple RAID first:

File deletion:
no recovery

Single disk failure:
RPO: immediate
RTO: immediate

Total and catastrophic failure:
no recovery

As you can see, RAID is very well situated to handle a disk failure, but if you accidentally delete all the pictures you took in 2008 when you meant to delete something else, you can never recover them.

Another strategy would be to get a USB disk, make a copy to it every week and store it in your office (assuming that's not your house):

File deletion:
RPO: one week
RTO: one day

Single disk failure:
RPO: one week
RTO: one day

Total and catastrophic failure:
RPO: one week
RTO: time to build a new computer

As you can see, making a copy of the disk and sending it offsite every week means we can lose up to a week's work (or irreplaceable pictures, if we've erased our memory card).  But as long as we know the drive is good when we send it offsite, we at least have a backup, and we can recover even if our house burns down.

One backup strategy you'll encounter, which I actually like, is to get two external disks with FireWire/eSATA/Thunderbolt enclosures (not USB, you want fast) and swap them in and out of a mirror, keeping whichever one isn't in the mirror offsite.  This gets you the best of both of the above, but it still has a fatal flaw: it's unbelievably annoying to truck disks back and forth, and thus it isn't really going to happen.

For a long time I used a RAID on my home disks and a set of TR-3 tapes and later CDs for offsite backups, which is sort of like this.  It took about 10 CDs at the time and I managed to actually make a backup maybe once a year.  I had a process for building incrementals so I didn't have to do the full backup all the time, but I still never remembered to make one.

When I switched to a Mac, Time Machine revolutionized how I looked at desktop backups.  RAID was designed for systems that can't go down just because they lose a disk.  Chances are pretty good that if you lose your home desktop for a couple of days while you go buy a new disk and do a restore, it's not the end of the world.  (In fact, unlike every data center, you almost certainly don't have either a complete set of parts to replace other failed components or a contract to have them couriered to you.)  At any rate, you're probably not willing to pay $80 for an extra disk purely to take your RTO down from one day to immediate for a failure that happens roughly every 30-60 years on a single-disk machine.  Time Machine makes incremental backups every hour (or on demand) and keeps them going back practically forever:

File deletion:
RPO: one hour
RTO: nearly immediate

Single disk failure:
RPO: one hour
RTO: time to purchase a disk

Total and catastrophic failure:
no recovery

This is a huge improvement over RAID because accidental file deletion is probably the most common failure state.  And that's really over-estimating the RPO.  If you just dumped pictures of your daughter's wedding in there, you can force a Time Machine backup right now and not erase the memory card until it finishes.  After I saw how this worked and started thinking about it, I got rid of my RAID and started doing Time Machine plus a third disk offsite.  I have been using that for a few years, but I'm now thinking about the best architecture for the present.
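For anyone who would rather see the three comparisons side by side, here is a rough sketch that puts them in one Python structure.  The values are the ones from the lists in this post, read in the order file deletion, single disk failure, house burns down; the names `SCENARIOS`, `STRATEGIES` and `survivors` are just made up for the example.

```python
SCENARIOS = ("file deletion", "single disk failure", "house burns down")

# One (RPO, RTO) pair per scenario, in SCENARIOS order.
# None stands for "no recovery".
STRATEGIES = {
    "RAID mirror": (
        None,
        ("immediate", "immediate"),
        None,
    ),
    "weekly offsite USB disk": (
        ("one week", "one day"),
        ("one week", "one day"),
        ("one week", "time to build a new computer"),
    ),
    "Time Machine": (
        ("one hour", "nearly immediate"),
        ("one hour", "time to purchase a disk"),
        None,
    ),
}

def survivors(scenario):
    """List the strategies that can recover anything in the given scenario."""
    i = SCENARIOS.index(scenario)
    return [name for name, outcomes in STRATEGIES.items()
            if outcomes[i] is not None]

for scenario in SCENARIOS:
    print(f"{scenario}: {', '.join(survivors(scenario))}")
```

Laid out this way, the offsite copy is the only thing on the list that survives the fire, and RAID is the only one with nothing to offer for an accidental deletion.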

1 comment:

Evanda said...

I use the second "USB Disk" option -- I actually have two USB disks and an internal drive. Occasionally (I probably average quarterly), I use Toucan to "equalise" the disks.

One USB disk travels with me in my laptop bag. One stays in my desk at work. One is fixed in my media machine at home.

I can add content to any of the three and eventually they get synced up.

The RPO isn't great - probably a month or two - but the RTO is sufficient. It's good for those leery of cloud backup solutions. My main problem is that it's mostly manual (takes 5 minutes to start up and then 40 minutes to complete), which is why I do it so infrequently.