IR Site is *finally* back to normal
(Tuesday, July 31, 2007 - 11:18 EDT)
It was a long and hairy battle, but the IR servers are now back to normal...
Apologies again for our outage last Friday through Sunday. It was testimony that the best-laid plans of mice and men (but particularly the latter) are prone to going astray.
There are a few tidbits in our experience that might be of use to others who manage web servers, so I'll take a moment and share here a bit of what happened. (We run Linux/Apache, but the lessons here are quite general.)
The short version of the story is that a bad drive appears to have taken down the entire RAID array with it. Compounding matters, problems with the DiskSync backup system meant that a restore process that should have taken only a few hours went on for almost 20 hours.
A word of warning to users of RAID 5 systems: The redundancy built into RAID 5 will tolerate the failure of any one drive in the system, but only in the sense that it can reconstruct missing data. This is now the third time in my career that I've seen a RAID 5 system fail completely. (I ran a small systems integration company in a previous life, else I probably wouldn't have seen as many.) In the current case, it appears that one of the drives in the array failed in a way that caused it to interfere with data flowing over the SCSI bus. This meant that data from all drives became corrupt, as did any "reconstructed" data that was written back to fix an apparently failing drive. Because the fault affected data from all drives, the system wasn't able to identify which drive was the actual source of the problem. (It indicated a particular drive in the array, but that drive was actually fine; it just happened to be the drive at the particular position along the SCSI bus that was most affected by the bus problem.)
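For anyone who hasn't looked at how RAID 5 reconstruction works, here is a minimal Python sketch of the idea. It is a toy model, not our controller's actual logic: the three data blocks and the function names `xor_blocks` and `rebuild_missing` are made up for illustration. The parity block is the XOR of the data blocks, so any one missing block can be rebuilt from the rest. The catch is that the rebuild trusts whatever it reads from the surviving drives. If a surviving block gets corrupted in transit, the "reconstructed" block comes out wrong too.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild_missing(surviving_blocks):
    """Rebuild the one missing block from the surviving data blocks plus parity."""
    return xor_blocks(surviving_blocks)

# Hypothetical stripe: three data blocks and their parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Healthy case: drive 1 is lost, and the other blocks read back cleanly.
rebuilt_ok = rebuild_missing([data[0], data[2], parity])

# Faulty-bus case: one surviving block has a flipped bit when it is read,
# so the rebuilt block is silently wrong as well.
corrupted = bytes([data[0][0] ^ 0x01]) + data[0][1:]
rebuilt_bad = rebuild_missing([corrupted, data[2], parity])

print(rebuilt_ok == data[1])   # True
print(rebuilt_bad == data[1])  # False
```

Nothing in the XOR arithmetic can tell which input was bad, which fits what we saw: the array flagged a drive, but not the one that was actually causing the trouble.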
So it ended up that we had to completely wipe the array (actually moving to a new server chassis and full array of entirely new hard drives), reload the OS, and restore from our online backups.
This was where the second major hassle developed: We use a system marketed by our ISP under the name of "DiskSync", and it proved horrendously unreliable. It did indeed preserve all our data (it does seem to do a good job of that), but it kept hanging whenever it encountered a symlink in the directory structure. This meant that the process proceeded by fits and starts, needing to be restarted *many* times.
It's possible that the DiskSync problems were the result of errors on the part of our admins, but the same admins had successfully used it on two prior occasions to restore servers for us. Neither of those instances involved the exact configuration of our primary box, but the OS configuration was quite similar.
Whatever the case, we're now massively unimpressed with DiskSync's reliability, and wouldn't recommend it to anyone for mission-critical backup. (Just our opinion. It may be we just didn't know critical info about how to use it, but a utility that purports to be a backup solution for Linux shouldn't hang whenever it encounters a symlink, even in its default configuration.)
Going forward, in the near term, we're going to make some configuration changes to our servers so one of the secondary boxes will be able to stand in for the main server in a pinch. Performance might be lower on the secondary box, and some of the housekeeping and deployment services on the primary box wouldn't be supported, but the site itself would be able to stay up and running. Longer term, we plan to buy our own hardware and colocate it here in Atlanta, so we can get hands-on access ourselves when we need to. (Datacenter techs generally seem to be spread way too thin to give any one client their full attention during critical outages. Perhaps not true at the highest levels of hosting providers, but definitely the case at the levels we can afford to pay for.) This solution will also involve a 24/7 "hot spare" that will be sync'd with the primary server every couple of hours, so we can fully transfer operations with the flip of a virtual switch. On the face of it, this would be a more expensive solution than our current one, with backup services purchased from DiskSync. When you consider that the revenue we lost as a result of this outage could have paid for the duplicate hardware in one fell swoop, though, the economics change.
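To give a rough idea of what the periodic sync to a hot spare could look like, here is a small Python sketch that wraps rsync. This is not our actual setup, which hasn't been built yet: the host name `spare.example.com`, the `/var/www` path, and the function names are placeholders, and it assumes rsync over ssh with key-based login already working between the two boxes.

```python
import subprocess

def build_rsync_command(source_dir, spare_host, dest_dir):
    """Build the rsync argument list for mirroring source_dir onto the spare."""
    return [
        "rsync",
        "-a",          # archive mode: keeps permissions, times, and symlinks
        "--delete",    # remove files on the spare that are gone from the primary
        "-e", "ssh",   # run the transfer over ssh
        source_dir.rstrip("/") + "/",   # trailing slash: copy contents, not the dir itself
        f"{spare_host}:{dest_dir}",
    ]

def sync_to_spare(source_dir, spare_host, dest_dir):
    """Run one sync pass; raises CalledProcessError if rsync exits non-zero."""
    subprocess.run(build_rsync_command(source_dir, spare_host, dest_dir), check=True)

# Example with placeholder values (prints the command rather than running it):
print(" ".join(build_rsync_command("/var/www", "spare.example.com", "/var/www")))
```

A scheduler such as cron could call `sync_to_spare` every couple of hours. Note that `--delete` makes the spare a mirror rather than a backup: a file deleted by mistake on the primary disappears from the spare at the next pass, so this complements a real backup and doesn't replace one.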
The morals of the story are twofold:
1) Absolutely do not assume that a RAID 5 won't go down. (We were very aware of this, but our recovery plan wasn't as robust as we believed it to be.)
2) View DiskSync with deep suspicion. We don't have an alternate solution to suggest at this point, but we are actively looking for one. If you must use DiskSync, a) try to avoid symlinks anywhere within the directory structure you're using it to protect, and b) keep the total volume of data protected by it to an absolute minimum. Rely on the Linux rsync utility to restore the bulk of your data from a near-line repository. (We did this for the bulk of our content, but will look to doing so for more of it going forward.)
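If you want to check how many symlinks are lurking in a tree before trusting a backup tool with it, a few lines of Python will list them. This is just a sketch using the standard library, nothing specific to DiskSync, and the function name `find_symlinks` and the `/var/www` path are only examples.

```python
import os

def find_symlinks(root):
    """Return the sorted paths of every symlink under root, without following them."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        # Symlinks to directories show up in dirnames; all others in filenames.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                found.append(path)
    return sorted(found)

if __name__ == "__main__":
    # Example path only; point this at the tree you plan to back up.
    for link in find_symlinks("/var/www"):
        print(link, "->", os.readlink(link))
```

An empty result means the tree is free of symlinks. Otherwise you at least know where a restore is likely to stall.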
At the end of the day, these were some expensive lessons. Hopefully some of you reading this will be able to benefit from our painful experience.
- Dave E.