Tuesday, July 31, 2012

Traffic Shaping (or a chance to show off my Visualization Porn)

On Friday I had some spare time so I rebuilt my home traffic shaping to better support my online backups.  CrashPlan has three features that are really nice for not totally annihilating your home network while it's doing online backups.  Each has its pros and cons.

The first is that you can restrict backups to certain times (e.g. when everyone is asleep anyway).  There are two problems with this: 1) Sometimes either my wife or I want to watch Netflix at 3AM.  And 2) if I just got back from vacation and have 20GB to back up (not unheard of), it's going to take about a week running full-bore all the time.  Cutting this back to 6 hours a day is going to make it take about a month.
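A quick back-of-the-envelope check of those numbers, as a rough sketch.  It assumes the roughly 300kbps uplink I describe further down, decimal gigabytes, and no protocol overhead; the function name is just something I made up for the illustration.

```python
def backup_days(size_gb: float, uplink_kbps: float, hours_per_day: float = 24.0) -> float:
    """Days needed to upload size_gb at uplink_kbps, running hours_per_day each day."""
    bits = size_gb * 1e9 * 8               # decimal gigabytes to bits
    seconds = bits / (uplink_kbps * 1e3)   # transfer time at the full uplink rate
    return seconds / (hours_per_day * 3600)

print(round(backup_days(20, 300), 1))      # around the clock: 6.2 days
print(round(backup_days(20, 300, 6), 1))   # 6-hour nightly window: 24.7 days
```

So "a week" and "a month" are both in the right ballpark.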

The second option is to limit the outbound bandwidth.  This is what I had been doing (and, in fact, what I had been doing with my home-grown online backups before using CrashPlan).  You can limit it to, say, 2/3 of the upload pipe; backups then take only about 50% longer and most things work normally most of the time.  The problem with this is that once something else needs more than the remaining 1/3 of the upload pipe, the internet stalls and nothing works.

Let's say I have a 300kbps upload (Yes, I know I could do better, but I generally don't need better, I'm cheap, and for the purposes of this example it doesn't matter. If I had 100Mbps upstream I could fill it.) and I have CrashPlan limited to 200kbps.  I then start doing something that requires around 70kbps of upload space.  Things are still working fine.  Then at the 10 minute mark something starts an upload (let's say I've decided to upload some pictures to Costco for printing) that requires another 100kbps.  Backups will reduce their usage a little because of the packet loss, but the internet is now completely unusable.  (Don't worry, that's not the visualization I teased about)
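The arithmetic behind that scenario, as a minimal sketch.  The dictionary keys and the function name are made up for the example; the numbers are the ones from the paragraph above.

```python
UPLINK_KBPS = 300.0

def oversubscription(demands_kbps: dict, capacity_kbps: float = UPLINK_KBPS) -> float:
    """Return how many kbps of demand exceed the link capacity (0.0 if it all fits)."""
    return max(0.0, float(sum(demands_kbps.values())) - capacity_kbps)

before = {"backup": 200, "other": 70}
after = {**before, "photo_upload": 100}

print(oversubscription(before))  # 0.0  -> 270kbps fits in a 300kbps pipe
print(oversubscription(after))   # 70.0 -> 370kbps does not
```

Those extra 70kbps have nowhere to go, so packets from every connection get dropped more or less indiscriminately.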

I could, of course, combine the above two options and run at 2/3 of the bandwidth only during off hours, but then backups would take forever.

The third option is that CrashPlan can set the IP ToS field on your backups.  By default this doesn't do anything.  I have an OpenWRT router sitting just inside my DSL modem and in theory it handles interactive traffic first, then unflagged traffic, and lastly high-bandwidth traffic.  In reality, though, the outbound network from the router is 100Mbps so it just throws everything down the 100Mbps network until it overflows the DSL modem's outbound buffer and then the DSL modem throws things away randomly without consulting the ToS.
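For anyone who hasn't played with ToS before, here is roughly what "setting the ToS field" looks like from an application's point of view.  This is only a sketch using the standard socket API on a Linux-style system; it is not CrashPlan's code, and the helper name is made up.

```python
import socket

IPTOS_LOWDELAY = 0x10    # "minimize delay" bit, the one interactive traffic sets
IPTOS_THROUGHPUT = 0x08  # "maximize throughput" bit, the one bulk transfers set

def make_bulk_socket() -> socket.socket:
    """Create a TCP socket whose outgoing packets are marked as bulk traffic."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, IPTOS_THROUGHPUT)
    return sock

sock = make_bulk_socket()
print(hex(sock.getsockopt(socket.IPPROTO_IP, socket.IP_TOS)))
sock.close()
```

The marking itself costs nothing and changes nothing; it only matters if some box along the path actually looks at it.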

The solution, then, is to force the router to shape the network. You can see my config here.  I started by classifying outbound traffic on my network into four categories:
  • Interactive -- traffic with the "lowest latency" bit set in the IP ToS.  This is mainly ssh traffic (including ssh traffic within my VPN back to work).  When I'm working on some remote system I want as little latency as possible.
  • High Volume, Low Latency -- currently Google Voice and video chat.  I'd like to add Netflix, but it's hard to identify.  This is stuff where reducing the bandwidth considerably could drop the connection.
  • Normal -- everything that didn't get categorized.
  • Bulk -- traffic with the "highest bandwidth" bit set in the IP ToS.  This is (that I know of) CrashPlan, scp, and rsync over ssh.
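The classification logic boils down to something like the sketch below.  The real work happens in the router's firewall rules, not in Python; the `is_realtime_media` flag is a stand-in for however voice and video chat get recognized, and the order of the checks is my assumption for the illustration.

```python
IPTOS_LOWDELAY = 0x10    # the "lowest latency" bit
IPTOS_THROUGHPUT = 0x08  # the "highest bandwidth" bit

def classify(tos: int, is_realtime_media: bool = False) -> str:
    """Map a packet's ToS byte (plus a hypothetical media flag) to a traffic class."""
    if tos & IPTOS_LOWDELAY:
        return "interactive"
    if is_realtime_media:
        return "high"
    if tos & IPTOS_THROUGHPUT:
        return "bulk"
    return "normal"

print(classify(0x10))                          # interactive
print(classify(0x00, is_realtime_media=True))  # high
print(classify(0x08))                          # bulk
print(classify(0x00))                          # normal
```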

Next I used HTB to set up "token buckets" for each class.  Interactive gets 50kbps (which it will never use), High Volume 100kbps, Normal 100kbps, and Bulk 20kbps.  After all classes are serviced, any bandwidth left (up to 330kbps, which is artificial, but close to my real max) gets handed out in priority order (Interactive, High Volume, Normal, and then finally Bulk, though Bulk is rate limited to 95% of the connection).
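To make the guarantee-then-borrow idea concrete, here is a toy model of it.  It is a simplification and not how HTB is actually implemented (the real thing lends tokens in small quanta and works packet by packet); I'm also assuming here that every class except bulk may borrow up to the full 330kbps.

```python
LINK_KBPS = 330

# (name, guaranteed rate, ceiling), listed in priority order
CLASSES = [
    ("interactive", 50, 330),
    ("high", 100, 330),
    ("normal", 100, 330),
    ("bulk", 20, 0.95 * 330),
]

def allocate(demand: dict) -> dict:
    """Split LINK_KBPS among classes: guarantees first, then spare by priority."""
    # Pass 1: every class gets what it asks for, up to its guaranteed rate.
    grant = {name: min(demand.get(name, 0.0), rate) for name, rate, _ in CLASSES}
    spare = LINK_KBPS - sum(grant.values())
    # Pass 2: hand out the spare in priority order, never exceeding a class ceiling.
    for name, _, ceil in CLASSES:
        want = min(demand.get(name, 0.0), ceil) - grant[name]
        extra = max(0.0, min(want, spare))
        grant[name] += extra
        spare -= extra
    return grant

print(allocate({"bulk": 1000}))               # backups alone on the link
print(allocate({"bulk": 1000, "high": 150}))  # backups plus a video call
```

In the second case the video call gets everything it asks for and backups shrink to whatever is left over, which is exactly the behavior I was after.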

Finally, I set up Stochastic Fair Queueing under each class so that even within a class a single connection couldn't shut everything else down.
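The idea behind fair queueing is easy to show in a few lines.  This is a stripped-down sketch only: the real SFQ qdisc balances by bytes rather than by packet count and periodically reshuffles its hash, and the function name and bucket count here are made up.

```python
from collections import deque

def fair_dequeue(packets, buckets: int = 8) -> list:
    """packets: iterable of (flow_id, payload).  Returns the order packets are sent in."""
    queues = [deque() for _ in range(buckets)]
    # Hash each flow into a bucket so one connection's packets share one queue.
    for flow_id, payload in packets:
        queues[hash(flow_id) % buckets].append((flow_id, payload))
    order = []
    # Round-robin over the buckets, taking one packet from each non-empty queue.
    while any(queues):
        for q in queues:
            if q:
                order.append(q.popleft())
    return order

# Flow 1 queued three packets before flow 2 queued one, yet flow 2 goes out second.
print(fair_dequeue([(1, "a1"), (1, "a2"), (1, "a3"), (2, "b1")]))
```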

Having set this up on Friday, I got a chance to test it on Saturday when I got called in to do a bunch of work while on a video conference.  I ended up running backups (with no internal rate limit), a video conference, a photo upload to Costco (gratuitously), and an interactive login to my work machine, all at once, and I had about 500ms of delay in my typing for work.  Then I got the idea to keep stats on it, and that's what generated my Visualization Porn:

[Image: stacked graph of bandwidth by traffic class over time]

Left is kbits, bottom is minutes elapsed, sampling is every 5 seconds.  I've done some mangling of the high data because Google Video chat is a UDP service: instead of self-scaling like everything else, it kept sending and the router just dropped a bunch of its packets on the floor, and the numbers I was collecting were for packets enqueued, not packets actually sent.  For the most part, though, this is just a stack of the four values.
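Turning samples into the numbers on the left axis looks something like this.  It's a sketch that assumes the samples are cumulative per-class byte counters (the kind of thing `tc -s class show` reports) read every 5 seconds; the function name is made up.

```python
SAMPLE_SECONDS = 5

def rates_kbps(byte_counters: list, interval: float = SAMPLE_SECONDS) -> list:
    """Convert successive cumulative byte counts into kbit/s for each interval."""
    return [
        (later - earlier) * 8 / 1000 / interval
        for earlier, later in zip(byte_counters, byte_counters[1:])
    ]

# 187,500 bytes every 5 seconds is a fully used 300kbps uplink.
print(rates_kbps([0, 187500, 375000]))  # [300.0, 300.0]
```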

What's going on here is that at around 20 minutes I started the video conference; when I did that, the high class started using a lot of bandwidth, but the bulk stream dynamically resized to keep total network usage constant.  I don't know what happened at 40 minutes, but you can see that the higher-priority video stream had to reduce its bandwidth to make nearly 100k available for normal traffic.  You can also see I did an upload at around 157 minutes (the green area), which got to use the full 300k.

I'm quite happy that the more interactive sessions can run with so little latency, but I'm almost as impressed with the rate at which backups scale back up.  Except for the dip at around 30 minutes, the network was 95-100% utilized the entire sample period, despite massive and rapid shifts in bandwidth for particular services.

As I type this my backups are humming along at 288kbps, my wife is watching a Netflix movie, and my interactive traffic has no noticeable lag at all.  Traffic Shaping is a beautiful thing.
