Thursday, March 26, 2009

Routing around outages

A fun little article about how Google deals with real-world hardware failures.  Yep, that's part of what I think about day-to-day -- not just "how do I make software do X", but "how can I build a system that will survive when a truck runs into the power substation?"
Unfortunately for my private work, I don't really have great answers for this except to build on other companies' work... I don't have the resources to have multiple computers in different locations with independent redundant power, etc.  One of the cool things about my work is that there are a number of different levels you can design redundancy at, anything from hardware to software to configuration.  For example, let's say you're designing a system to receive email and store it reliably (say, for auditing purposes).
You could buy one machine with redundant hard drives, CPUs, power supplies, etc.  You're still vulnerable to single-site disasters like an earthquake or maybe loss of ISP.  (You could have two independent links on independent fiber to the machine to mitigate this, for example.)  You'd need a machine that could deactivate some of its RAM/CPUs if it detected a fault if you were really concerned about downtime, but the nice thing about SMTP is that the sender will queue the mail until you're ready to receive it, so that's not too much of a worry.
You could buy a couple machines at different hosting locations, and store the data on a SAN or some other synchronized and replicated storage system.  Multiple sites for the frontend protects against some of the SPOFs in the previous design, and you're hoping that your SAN vendor has worked out the replication in the storage so you can rely on that for diversity at the backend.
You could write your own SMTP receiver which wouldn't commit the SMTP transaction until the message has been written to all the destination stores (which could be on separate machines in different locations).  This is a "software" solution to the above, and is probably fairly cheap if you end up buying virtual server space.  You still have to be wary that your VMs are actually on separate machines/locations, and you should probably verify data integrity on files between the machines, since bytes can degrade on disk.
You could configure an off-the-shelf SMTP system to do multiple deliveries, using NFS mounts over IPsec or something to ensure that it does all the deliveries before it returns a 250 or what-not after the end of the DATA. This would be a "configuration" solution. Depending on the difficulties of configuring the disk-sharing, this might be easier or harder than the "software" solution.
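
To make the "software" option a bit more concrete, here's a rough sketch of the storage half of such a receiver. It's not a real SMTP server, just the piece that decides what to reply after DATA. The function names (store_on_all, verify_copies), the one-file-per-message layout named by SHA-256 digest, and the exact reply strings are all my own assumptions for illustration; the store directories stand in for whatever remote mounts you'd actually use.

```python
from __future__ import annotations

import hashlib
import os
import tempfile
from pathlib import Path


def store_on_all(message: bytes, store_dirs: list[Path]) -> str:
    """Write the message to every store and return the reply to send after DATA.

    Hypothetical helper: only returns a 250 once every copy is on disk.
    """
    digest = hashlib.sha256(message).hexdigest()
    try:
        for store in store_dirs:
            final_path = store / f"{digest}.eml"
            # Write to a temp file in the same directory, fsync, then rename,
            # so a crash never leaves a half-written file under the final name.
            fd, tmp_name = tempfile.mkstemp(dir=store)
            with os.fdopen(fd, "wb") as tmp:
                tmp.write(message)
                tmp.flush()
                os.fsync(tmp.fileno())
            os.replace(tmp_name, final_path)
    except OSError:
        # Temporary failure: the sender keeps the mail queued and retries.
        # Copies already written are named by digest, so a retry overwrites them.
        return "451 4.3.0 could not store on all replicas, try again later"
    return f"250 2.0.0 stored as {digest}"


def verify_copies(digest: str, store_dirs: list[Path]) -> list[Path]:
    """Return the stores whose copy is missing or no longer matches its digest."""
    bad = []
    for store in store_dirs:
        path = store / f"{digest}.eml"
        try:
            ok = hashlib.sha256(path.read_bytes()).hexdigest() == digest
        except OSError:
            ok = False
        if not ok:
            bad.append(store)
    return bad
```

Naming each file after its own digest means the periodic integrity check doesn't need a separate index: re-hash the file and compare against the name. A real receiver would also want to clean up stray temp files and repair bad copies from a good one, which this sketch skips.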

Saturday, March 21, 2009

Back from silence

Apparently I've decided that I want to comment on a lot of other blogs, so maybe I should start to keep this place up again.  If nothing else, I can use it to jot down notes on things I'm thinking about.

I'm back to cycling again after the winter.  I should start jogging again soon -- the group thing ended up petering out after a few weeks because jogging in the middle of the workday was a huge scheduling hassle.

Some stuff I've been up to or will be up to soon:
  • Vacation on Maui for a week.  Played with an underwater camera and enclosure.
  • Almost 40k gold in WoW, even after buying up a bunch of expensive recipes.
  • Sarth+3 in WoW (a few months back).
  • Replaced the rear tire on my bike, and replaced it again...  one of the blocks to my cycling.
  • Noodling out some visualization and presentation stuff for work.
  • Trying to find a fun new coding project.
  • Been cooking more, and talking about doing some remodeling.