Popular political scientist Thomas Homer-Dixon was on CBC Radio One’s The Current this morning discussing his new book, The Upside of Down. His work warns of a potential downfall of Western civilization along the lines of the demise of the Roman Empire. Although this might seem like a very gloomy topic, he makes the key point that good can often come out of catastrophe, describing 9/11, for example, as a squandered opportunity to reduce America’s dependence on foreign oil. But I’m getting sidetracked — and this is a journal about technology, after all, not politics.
I mention Homer-Dixon only because I wanted to draw a parallel between his valuable observation and the aftermath of our file server crash at CBC.ca: a great deal of good can come out of a disaster. Since the failure, many groups in both CBC Technology and CBC.ca Operations have been working hard to analyze the root cause, prepare appropriate incident reports, and engage in discussions, both internally and externally (with our storage vendor of choice), about how to upgrade the infrastructure to minimize the probability of another failure. I choose my words deliberately; as with all things technology-related, it’s impossible to promise absolute 100% reliability, so the best we can do is make the probability of failure much smaller.
As Tod helpfully explained in his original article about the outage, the catastrophe resulted from a bad situation (a filer in read-only mode) being made worse by the actions taken to correct it (resetting the affected volume, which pushed it offline for two days while an fsck executed). So it makes sense that any solution should be architected around aims like these:
- if a volume goes into read-only mode because the NAS controller detects filesystem corruption, there should be an online replica to which the site can be cut over, in both read-only and read-write mode;
- if an action needs to be taken on a volume that would cause an fsck, that volume should be offline and the site served from a replica;
- whatever actions are taken to restore service must not make the situation worse.
These, among many others, are the requirements against which a solution is being architected.
It’s also clear that having a single 1.2TB filesystem for the entire website is not a smart design, and that we at CBC.ca Operations need to re-architect it in a significantly smarter way. At a high level, we are trying to treat the website content that changes frequently, such as news stories and images, as a small "working set". Like a processor’s L1 cache, the working set will be kept small, and we’ll use an LRU-style eviction strategy, run quarterly, say, to purge stale content from the working set to an archive area. The archive area could be very large, potentially on slower storage, and backed up less frequently; failure of the archive area, while still bad, would not be visible to most of the audience.
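To make the working-set idea concrete, here is a minimal Python sketch of what the quarterly eviction pass could look like. The mount points, the function name, and the use of modification time as a stand-in for true LRU recency are all my illustrative assumptions, not anything we have deployed.

```python
import os
import shutil
import time

# Hypothetical mount points -- the real paths would come from the NAS layout.
WORKING_SET = "/mnt/working"
ARCHIVE = "/mnt/archive"
MAX_AGE_DAYS = 90  # roughly quarterly

def evict_stale_assets(working=WORKING_SET, archive=ARCHIVE,
                       max_age_days=MAX_AGE_DAYS):
    """Move assets untouched for roughly a quarter from the working set
    to the archive, preserving relative paths so logical URLs never change."""
    cutoff = time.time() - max_age_days * 86400
    moved = []
    for dirpath, _dirnames, filenames in os.walk(working):
        for name in filenames:
            src = os.path.join(dirpath, name)
            # mtime is a rough proxy for recency; atime would be closer to
            # true LRU, but whether it is tracked depends on mount options.
            if os.path.getmtime(src) < cutoff:
                rel = os.path.relpath(src, working)
                dst = os.path.join(archive, rel)
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                shutil.move(src, dst)
                moved.append(rel)
    return moved
```

Because the relative path is preserved under the archive root, whatever logic fronts the two tiers can find an asset at the same suffix in either location.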
The challenge, of course, is to make all of this transparent to the end-user; in an ideal solution, the physical path name of an asset would not change, and the move from the working set to the archive area would be totally transparent. Some logic will have to intercept each request and serve it from the working set if the asset is there, or from the archive if it is not. The question is, where can/should that logic live? In order of priority-by-wishful-thinking, the logic could live:
1. In the storage device’s firmware/operating system;
2. In a special filesystem driver implementation on the host server; for example, a logical Union Filesystem overlaying two physical NFS exports;
3. In the userland application, e.g. use of fancy mod_rewrite rules in Apache.
I think that the Union Filesystem idea holds some promise, but we’d want it to be operating-system-independent. A combination of options 1 and 2 would probably be the ideal scenario, with a UnionFS-like layer implemented at the storage device firmware level. Still, I feel that this may be too esoteric a requirement for our storage vendor (I could be wrong!) and we might end up having to go with option 3.
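If we do land on option 3, the userland logic boils down to a two-step lookup: check the working set, fall back to the archive. A minimal Python sketch, with hypothetical paths and names (an Apache handler or rewrite map would play the same role in production):

```python
import os

# Hypothetical mount points for the two storage tiers.
WORKING_SET = "/mnt/working"
ARCHIVE = "/mnt/archive"

def resolve_asset(request_path, working=WORKING_SET, archive=ARCHIVE):
    """Map a logical URL path to a physical file, preferring the working set.

    Returns the physical path, or None if the asset exists in neither tier.
    The caller would then serve the file; the audience never sees which
    tier it came from, so logical URLs stay stable across eviction.
    """
    rel = request_path.lstrip("/")
    for root in (working, archive):
        candidate = os.path.join(root, rel)
        if os.path.isfile(candidate):
            return candidate
    return None
```

The same ordering (working set first, archive second) is exactly what a union filesystem would do at the mount layer; doing it in userland just trades transparency for portability.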
I hope this sheds some light on why fixing a technical problem isn’t quite as simple as some armchair sysadmins would make it out to be. While I’m not happy in the least that we had a serious storage outage, I am delighted that in the aftermath, we’re having some very good technical discussions internally and hopefully will be able to build a resilient and flexible storage infrastructure that will meet the availability requirements of the site.