Tuesday, July 31, 2012

Traffic Shaping (or a chance to show off my Visualization Porn)

On Friday I had some spare time, so I rebuilt my home traffic shaping to better support my online backups.  CrashPlan has three features that help keep it from totally annihilating your home network while it's backing up, and each has its pros and cons.

The first is that you can restrict backups to certain times (e.g. when everyone is asleep anyway).  There are two problems with this: 1) Sometimes my wife or I want to watch Netflix at 3AM.  And 2) If I just got back from vacation and have 20GB to back up (not unheard of), it's going to take a week running full-bore all the time.  Cutting that back to 6 hours a day is going to make it take a month.

The second option is to limit the outbound bandwidth.  This is what I had been doing (and, in fact, what I had been doing with my home-grown online backups before using CrashPlan).  You can limit backups to, say, 2/3 of the upload pipe; then backups only take 50% longer and most things keep working normally.  The problem with this is that once something else starts using the rest of the upload pipe, the internet stalls and nothing works.



Let's say I have a 300kbps upload (yes, I know I could do better, but I generally don't need better, I'm cheap, and for the purposes of this example it doesn't matter; if I had 100Mbps upstream I could fill it) and I have CrashPlan limited to 200kbps.  I then start doing something that requires around 70kbps of upload bandwidth.  Things are still working fine.  Then at the 10 minute mark something starts an upload (let's say I've decided to upload some pictures to Costco for printing) that requires another 100kbps.  Backups will reduce their usage a little because of the packet loss, but the internet is now completely unusable.  (Don't worry, that's not the visualization I teased about.)

I could, of course, combine the two options above and run at 2/3 of the bandwidth only during off hours, but then backups would take forever.

The third option is that CrashPlan can set the IP ToS field on your backup traffic.  By default this doesn't do anything.  I have an OpenWRT router sitting just inside my DSL modem, and in theory it handles interactive traffic first, then unflagged traffic, and lastly high-bandwidth traffic.  In reality, though, the link from the router to the modem is 100Mbps, so the router just throws everything down that 100Mbps link until it overflows the DSL modem's outbound buffer, and then the modem throws things away randomly without consulting the ToS.

The solution, then, is to force the router to do the shaping itself. You can see my config here.  I started by classifying outbound traffic on my network into four classes (a sketch of the matching rules follows the list):
  • Interactive -- traffic with the "lowest latency" bit set in the IP ToS.  This is mainly ssh traffic (including ssh traffic within my VPN back to work).  When I'm working on some remote system I want as little latency as possible
  • High Volume, Low Latency -- currently Google Voice and video chat.  I'd like to add Netflix, but it's hard to identify.  This is stuff where reducing the bandwidth considerably could drop the connection
  • Normal -- everything that didn't get categorized
  • Bulk -- traffic with the "highest bandwidth" bit set in IP ToS.  This is (that I know of) CrashPlan, scp, and rsync over ssh
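
My actual rules live in the router's firewall config; a minimal sketch of the classification, in iptables form, looks something like this (the WAN interface name and the video-chat port range are assumptions, and the class numbers refer to the HTB tree described below):

    # Steer outbound traffic into HTB classes on the WAN interface (name assumed)
    WAN=pppoe-wan
    # Interactive: "minimize delay" ToS bit (ssh and friends) -> class 1:10
    iptables -t mangle -A POSTROUTING -o $WAN -m tos --tos Minimize-Delay -j CLASSIFY --set-class 1:10
    # Bulk: "maximize throughput" ToS bit (CrashPlan, scp, rsync over ssh) -> class 1:40
    iptables -t mangle -A POSTROUTING -o $WAN -m tos --tos Maximize-Throughput -j CLASSIFY --set-class 1:40
    # High volume, low latency: Google voice/video chat; the port range here is purely illustrative -> class 1:20
    iptables -t mangle -A POSTROUTING -o $WAN -p udp --dport 19294:19309 -j CLASSIFY --set-class 1:20
    # Everything else falls through to the HTB default class (1:30, "normal")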

Next I used HTB to set up "token buckets" for each class.  Interactive gets 50k (which it will never use), High 100k, Normal 100k, and Bulk 20k.  After all classes are serviced, any bandwidth left (up to 330kbps, which is artificial but close to my real max) gets handed out in priority order: interactive, high, normal, and then finally bulk, though bulk is rate-limited to 95% of the connection.

Finally, I set up Stochastic Fair Queueing (SFQ) under each class so that even within a class a single connection can't shut everything else down.
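
Put together, the tc side is roughly the following.  This is a sketch rather than my literal config (the interface name is an assumption, and I've left out burst tuning), but the rates and priorities are the ones above:

    WAN=pppoe-wan

    # Root HTB qdisc; anything not explicitly classified lands in 1:30 ("normal")
    tc qdisc add dev $WAN root handle 1: htb default 30

    # Parent class caps total outbound traffic at roughly the DSL uplink rate
    tc class add dev $WAN parent 1: classid 1:1 htb rate 330kbit

    # Guaranteed rate per class; prio decides who gets leftover bandwidth first
    tc class add dev $WAN parent 1:1 classid 1:10 htb rate 50kbit  ceil 330kbit prio 0  # interactive
    tc class add dev $WAN parent 1:1 classid 1:20 htb rate 100kbit ceil 330kbit prio 1  # high volume, low latency
    tc class add dev $WAN parent 1:1 classid 1:30 htb rate 100kbit ceil 330kbit prio 2  # normal
    tc class add dev $WAN parent 1:1 classid 1:40 htb rate 20kbit  ceil 313kbit prio 3  # bulk, ~95% ceiling

    # SFQ under each class so a single connection can't starve its siblings
    tc qdisc add dev $WAN parent 1:10 handle 10: sfq perturb 10
    tc qdisc add dev $WAN parent 1:20 handle 20: sfq perturb 10
    tc qdisc add dev $WAN parent 1:30 handle 30: sfq perturb 10
    tc qdisc add dev $WAN parent 1:40 handle 40: sfq perturb 10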

Having set this up on Friday, I got a chance to test it on Saturday, when I got called in to do a bunch of work while on a video conference.  I ended up running backups (with no internal rate limit), a video conference, a gratuitous photo upload to Costco, and an interactive login to my work machine, and I had only about 500ms of delay in my typing for work.  Then I got the idea to keep stats on it, and that's what generated my Visualization Porn:


[Chart: stacked per-class outbound bandwidth over the whole session.  Click for big.]


The left axis is kbits, the bottom axis is minutes elapsed, and the sampling interval is 5 seconds.  I've done some mangling of the high-class data: Google video chat is a UDP service, so instead of self-scaling like everything else it just had a bunch of its packets dropped on the floor by the router, and the numbers I was collecting were for packets enqueued, not packets actually sent.  For the most part, though, this is just a stack of the four values.
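
This isn't my exact collection script, but a loop along these lines on the router produces a similar per-class time series (the interface name is an assumption, and note that tc's "Sent" counters measure dequeued bytes rather than the enqueued numbers I graphed):

    #!/bin/sh
    # Dump cumulative per-class byte counters every 5 seconds;
    # the deltas between samples become the kbit values in the chart.
    WAN=pppoe-wan
    while true; do
        tc -s class show dev "$WAN" \
            | awk -v t="$(date +%s)" '/^class htb/ {cls=$3} /Sent/ {print t, cls, $2}'
        sleep 5
    done >> /tmp/shaper-stats.log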


What's going on here is that at around 20 minutes I started the video conference; when I did that, the high class started using a lot of bandwidth, but the bulk stream dynamically resized to keep total network usage constant.  I don't know what happened at 40 minutes, but you can see that the higher-priority video stream had to reduce its bandwidth to make nearly 100k available for normal traffic.  You can also see I did an upload at around 157 minutes (the green area), which got to use the full 300k.

I'm quite happy that the more interactive sessions can run with so little latency, but I'm almost as impressed with how quickly the backups scale back up.  Except for the dip at around 30 minutes, the network was 95-100% utilized for the entire sample period, despite massive and rapid shifts in the bandwidth used by particular services.

As I type this my backups are humming along at 288kbps, my wife is watching a Netflix movie, and my interactive traffic has no noticeable lag at all.  Traffic Shaping is a beautiful thing.


Friday, July 20, 2012

BC/DR (Part 2): Or, why I left Time Machine

If you read my last post, it might surprise you to find that I'm in the process of abandoning Time Machine.  I still think Time Machine is a great product.  Not only do I think it's vastly superior to what's probably the most common "backup" mechanism, RAID, and to the even more common lack of any backup at all, I think there are areas where it outshines pretty much every other backup system out there.  Specifically, if you boot a Mac off of a Mac install disk, it will ask you if you have a Time Machine backup you want to restore and just do the restore work for you.  I don't know of any other consumer backup solution that has a bootable restore procedure, and it's getting to be impossible to find an enterprise solution that can do it either.  It's almost impossible for me to overstate how much this lowers your RTO.

Steps to restore from a backup with Time Machine:
1) Install replacement hard drive and stick OS CD in drive
2) Hit "Yes, I want to restore from Time Machine" in boot.
3) Done (I should note I haven't tried this)

Steps to restore from a backup with pretty much anything else:
1) Install replacement hard drive and stick OS CD in drive
2) Install OS
3) Probably install OS patches since your CD is too out of date to run backup software
4) Install backup software
5) Do restore
6) Fix all the stuff that's now broken because the restored libraries aren't compatible with the OS libraries that weren't part of the restore

But for all that, the Pro/Con matrix on Time Machine is still slanted heavily Con for me:

Advantages of Time Machine

  • Backups are stored as normal OS files and thus can be read like normal files
  • Backup/restore software comes with OS, so there's no separate install and restore is extremely easy
  • Setup is nearly trivial; restores are easy and well segregated.  It even respects OS permissions and allows non-admin users to self-restore
  • Self maintains versioning and cleanup

Disadvantages of Time Machine

  • Only runs on Mac
  • You can't change the retention policy
  • De-duplication is done at the file level, not the block level, so if you import 30G of HD video into iMovie and then change the event names (which changes the folder names), Time Machine will create brand new copies.
  • It can't verify a backup is correct, and if one isn't correct, it can't fix it.

My home system has been running Time Machine for 2 years.  I just went and ran diff -qr between the current filesystem and the last Time Machine backup.  There are several files with different contents, and a couple of monitor profiles from May of this year are missing entirely.  None of these particular files is the end of the world, but the problem isn't that these files have incorrect versions; it's that the backup has been wrong for months and I didn't know.  Worse, now that I know it's wrong, the only way to fix it is to modify the real files so that Time Machine notices the change.  There is no command to have Time Machine scan the entire filesystem and compare what's there to what it thinks is there.  This, to me, is a deal killer.
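
For reference, the check was nothing fancier than this (the volume, machine, and user names here are placeholders; a Time Machine drive keeps its snapshots under Backups.backupdb, with "Latest" pointing at the newest one):

    # Compare the live home directory against the most recent Time Machine snapshot
    diff -qr /Users/me \
        "/Volumes/Time Machine Backup/Backups.backupdb/my-mac/Latest/Macintosh HD/Users/me"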


The system I'm currently building has three parts:

  1. A complete, bootable copy of my main hard disk in a USB/SATA enclosure (see the cloning sketch after this list).  In this case I'm particular about the disk: it's the same model as the actual main disk, so if it were removed from the enclosure it could be a drop-in replacement for the real hard disk.
  2. A second internal disk with a local CrashPlan backup
  3. A CrashPlan+ backup to the cloud
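
How the bootable copy gets refreshed is still an open question for me; Carbon Copy Cloner and SuperDuper! are the usual answers, but Apple's own asr can do it too.  A rough sketch (volume names are placeholders, and asr really wants the source volume quiet, which is part of why the dedicated cloning tools are popular):

    # Erase the enclosure volume and copy the live system onto it as a bootable clone
    sudo asr restore --source "/Volumes/Macintosh HD" --target "/Volumes/Clone" --erase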

This is a relatively expensive strategy (about $150 up front for the disks plus $3 per month for cloud storage), but it gives me several things:


In a disk failure or a total failure, I have a bootable, reasonably recent image.  This speeds up recovery tremendously.  Except for a total, catastrophic, and immediate failure while I'm updating the USB backup, I should only have a gigabyte or so to fetch from a real backup (either local for a disk failure or the cloud for a catastrophic one).  Let's say the house burns down.  My recovery procedure is to go to work, fetch my USB disk, build a new computer around it, boot, and then recover the rest from CrashPlan.  RPO: nearly immediate.  RTO: about as long as it takes to get a replacement computer.


I'm not trusting the cloud.  CrashPlan+ is cheap for online backup (about $3 per month), but I don't trust it.  Let's say CrashPlan loses my backups while my house is burning down.  Admittedly, this seems unlikely, but I've seen reports from most of the cloud services that data has been lost for some small number of users.  My recovery goes back a couple of months (more recent if I've dumped a bunch of pictures in and felt like I needed a backup).  RPO: a couple of months.  RTO: getting a new computer.


I'm not trusting a disk that's offline.  As above, you can generally trust a disk sitting on a shelf, but you never know for sure until you actually run the restore, which is too late if it has failed.  If I lose the disk entirely, I have to rebuild from install DVDs and then get the data from CrashPlan (which is $150 to have them ship it to me on a replacement disk).  RPO: immediate.  RTO: getting a computer plus a day or so.

I'm not yet committed to this and would certainly accept suggestions on better or cheaper ways to do it.  At a minimum I insist on having a bootable copy, preferably offsite, and a recent snapshot, also preferably offsite.

Disaster Recovery and Business Continuity (part 1)

This has been entirely a political blog lately, but that's more because I haven't had any personal stuff to relate than because it's intended to be purely a political blog.  Today, though, I want to relate something that might have a more immediate impact on people's lives (and that happens to relate closely to my profession): computer backups.

A decade and a half ago it was bizarre beyond belief that I backed up my personal computer.  These days it's still probably a minority of people who actually back up their computers, but most people at least think it's a good idea.  Even among people who have backups, though, most of the strategies aren't that well thought out.  For instance, all major desktop OSes these days support RAID out of the box, so I wouldn't be surprised to find a significant percentage of people relying on a disk mirror (two disks that get written simultaneously) for backup.  If you're doing that, you're probably never going to lose all your data (as opposed to your next-door neighbor who has just one disk and is probably going to suffer a complete loss at some point), but you still have a badly designed system for a desktop.

RAID is not a substitute for a backup.  If the server gets hacked or somebody accidentally removes stuff that needs to be there or the stars align just wrong and bad data gets copied to the good disk, you're still up a creek.  So server admins also make backups.  And they ship them offsite in case the whole building gets destroyed.

Now maybe that's too much work for a home user.  After all, if your whole house burned down the last thing you're going to be thinking about is recovering your family pictures from two years ago, right?  Hmm, I don't know about you, but if I could take one thing out of my house it would be my family pictures.  So why not do it now so that we don't have to worry about it while it's burning down?

There are two things you need to know about Disaster Recovery (DR) planning:

Recovery Point Objective (RPO) - How far back from an "event" (the computer being destroyed) do we have to go when we recover?

Recovery Time Objective (RTO) - How long does it take to get back up and running?

I'm going to consider three scenarios for computing our efficacy: File deletion, Single Disk failure, and Total and Catastrophic failure (house burns down).  Let's take a simple RAID first:
  • File: no recovery
  • Disk: RPO: immediate, RTO: immediate
  • Total: no recovery

As you can see, RAID is very well situated to handle a disk failure, but if you accidentally deleted all the pictures you took in 2008 when you meant to delete something else you can never recover.

Another strategy would be to get a USB disk, make a copy to it every week and store it in your office (assuming that's not your house):

  • File: RPO: one week, RTO: one day
  • Disk: RPO: one week, RTO: one day
  • Total: RPO: one week, RTO: time to build a new computer


As you can see, in this case making a copy of the disk and sending it offsite every week means we lose up to a week's work (or irreplaceable pictures, if we've erased the memory card), but as long as we know the drive was good when we sent it offsite we at least have a backup; even if our house burns down, we can recover.

One backup strategy you'll encounter, which I actually like, is to get two external disks in FireWire/eSATA/Thunderbolt enclosures (not USB; you want fast) and swap them in and out of a mirror, keeping the other one offsite.  This gets you the best of both of the above, but it still has a fatal flaw: it's unbelievably annoying to truck disks back and forth, so it isn't really going to happen.

For a long time I used a RAID on my home disks and a set of TR-3 tapes, and later CDs, for offsite backups, which is sort of like this.  It took about 10 CDs at the time, and I managed to actually make a backup maybe once a year.  I had a process for building incrementals so I didn't have to do a full backup every time, but I still never remembered to make one.

When I switched to a Mac, Time Machine revolutionized how I looked at desktop backups.  RAID was designed for systems that can't go down just because they lose a disk.  Chances are pretty good that if you lose your home desktop for a couple of days while you go buy a new disk and do a restore, it's not the end of the world (and in fact you almost certainly don't have, as every data center does, either a complete set of spare parts for other failed components or a contract to have them couriered to you).  At any rate, you're probably not willing to pay $80 for an extra disk purely to take your RTO down from one day to immediate for a failure that happens roughly every 30-60 years on a single-disk machine.  Time Machine makes incremental backups every hour (or on demand) and keeps them going back practically forever:

  • File: RPO: one hour, RTO: nearly immediate
  • Disk: RPO: one hour, RTO: time to purchase a disk
  • Total: no recovery

This is a huge improvement over RAID, because accidental file deletion is probably the most common failure state.  And that really overestimates the RPO: if you just dumped pictures of your daughter's wedding in there, you can force a Time Machine backup right now and not delete the memory card until it finishes.  After I saw how this worked and started thinking about it, I got rid of my RAID and started doing Time Machine plus a third disk offsite.  I have been using that for a few years, but I'm now thinking about the best architecture for the present.