Thursday, March 26, 2009

Routing around outages

A fun little article about how Google deals with real-world hardware failures.  Yep, that's part of what I think about day-to-day -- not just "how do I make software do X", but "how can I build a system that will survive when a truck runs into the power substation?"
Unfortunately for my private work, I don't really have great answers for this except to build on other companies' work... I don't have the resources to have multiple computers in different locations with independent redundant power, etc.  One of the cool things about my work is that there are a number of different levels you can design redundancy at, anything from hardware to software to configuration.  For example, let's say you're designing a system to receive email and store it reliably (say, for auditing purposes).
You could buy one machine with redundant hard drives, CPUs, power supplies, etc.  You're still vulnerable to single-site disasters like an earthquake or maybe loss of ISP.  (You could have two independent links on independent fiber to the machine to mitigate this, for example.)  You'd need a machine that could deactivate some of its RAM/CPUs if it detected a fault if you were really concerned about downtime, but the nice thing about SMTP is that the sender will queue the mail until you're ready to receive it, so that's not too much of a worry.
You could buy a couple machines at different hosting locations, and store the data on a SAN or some other synchronized and replicated storage system.  Multiple sites for the frontend protects against some of the SPOFs in the previous design, and you're hoping that your SAN vendor has worked out the replication in the storage so you can rely on that for diversity at the backend.
You could write your own SMTP receiver which wouldn't commit on the SMTP transaction until the message has been written to all the destination stores (which could be on separate machines in different locations).  This is a "software" solution to the above, and is probably fairly cheap if you end up buying virtual server space.  You still have to be wary that your VMs are actually on separate machines/locations, and you should probably verify data integrity on files between the machines, since bytes can degrade on disk.
You could configure an off-the-shelf SMTP system to do multiple deliveries, using NFS mounts over IPSec or something to ensure that it does all the deliveries before it returns a 250 or what-not at the end of the DATA command. This would be a "configuration" solution. Depending on the difficulties of configuring the disk-sharing, this might be easier or harder than the "software" solution.
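
To make the "software" option a bit more concrete, here's a rough sketch of the commit logic in Python. It's not a real SMTP server; it only shows the part that decides what to answer once the message body has arrived. The function names (write_durably, accept_message, verify_copies), the SHA-256 file naming, and the idea that each store is just a directory (say, a mount from another machine) are all my own assumptions for illustration. The reply codes are standard SMTP: 250 means accepted, and 451 is a temporary failure, so the sender keeps the mail queued and tries again later.

```python
import hashlib
import os
import tempfile
from pathlib import Path
from typing import Iterable


def message_name(data: bytes) -> str:
    """Name the file after its SHA-256 so it can be re-checked later."""
    return hashlib.sha256(data).hexdigest() + ".eml"


def write_durably(directory: Path, name: str, data: bytes) -> None:
    """Write data to directory/name and force it to disk before returning."""
    directory.mkdir(parents=True, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        # Rename only after the bytes are on disk, so readers never see
        # a half-written message. (A fuller version might also fsync the
        # directory itself; that is skipped here.)
        os.replace(tmp_path, directory / name)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def accept_message(data: bytes, stores: Iterable[Path]) -> str:
    """Return the SMTP reply to send after the message body is received."""
    name = message_name(data)
    try:
        for store in stores:
            write_durably(store, name, data)
    except OSError:
        # Any store failing means we do not take responsibility for the
        # mail; a 4xx reply tells the sender to keep it and retry.
        return "451 Requested action aborted: local error in processing"
    return "250 OK"


def verify_copies(name: str, stores: Iterable[Path]) -> bool:
    """Check that every store still holds an uncorrupted copy."""
    expected = name.split(".")[0]
    for store in stores:
        path = store / name
        if not path.is_file():
            return False
        if hashlib.sha256(path.read_bytes()).hexdigest() != expected:
            return False
    return True
```

Running verify_copies periodically over the stored names is one way to catch the bytes-degrading-on-disk problem; what to do about a bad copy (re-copy from a good store, raise an alert) is left out here.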
