My colleague Blake recently wrote an article on the occasion of the decommissioning of NewsDelivery, a dynamic content display engine that until recently ran all the news stories on CBC.ca. I can’t speak for any of our alumni, but I think all of us at CBC.ca have learned one lesson:
Large websites should never, ever render content dynamically.
It’s amazing how many content management systems still do not grasp this principle. On a busy site, especially one that is liable to be Slashdotted or visited heavily (say, on 11 September 2001), you do not want to be executing Java/ASP/Smalltalk/FORTRAN/whatever code every time someone visits a story. In short, you do not want CPU usage to rise proportionally to the number of visitors you have.
What you do want is to make the content rendering "system" as simple as possible; in the ideal case, you can barely call it a system. For content rendering, CBC.ca now mostly uses bare Apache instances with server-side includes, meaning that aside from the core Apache engine, no other code needs to be executed every time you view a story.
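To make that concrete, here is a minimal sketch of the kind of Apache configuration this implies. The paths are illustrative only, not CBC.ca's actual setup:

```apache
# Turn on server-side includes for pre-rendered story pages
# (directory and extension are made up for illustration)
<Directory "/var/www/news">
    Options +Includes
    AddType text/html .shtml
    AddOutputFilter INCLUDES .shtml
</Directory>
```

A story page is then just a static file stitched together at serve time with directives like `<!--#include virtual="/fragments/header.html" -->`; Apache does the splicing, and no application code runs per request.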
This seems like a very simple principle, but many other news sites still have not grasped it. I can almost guarantee that if there is another 9/11-scale event, sites like The Globe and Mail and The Toronto Star that use servlet-based dynamic execution systems will fall over under heavy load far sooner than CBC.ca. But I don’t really blame those organizations for choosing, for example, Fatwire Content Server (as in the Star’s case), because a news organization’s primary need is to create content. Displaying it is a separate problem entirely, and the shame should be on the vendor for tying the two so closely together.
They could use Tomcat if they wanted; they just need a good caching reverse proxy farm.
I think you miss my point. Caches are only any good if you have a cacheable working set; that is, if the cache hit rate is likely to be high. On a busy news site, stories are constantly changing, so the cache miss rate is likely to be very high. In the extreme case of a busy news site under heavy load (e.g. 9/11), the cache miss rate would be almost 100% as the same stories are being edited again and again in response to the news event.
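The back-of-envelope arithmetic makes the point. A sketch, with made-up traffic figures:

```python
# Origin load = incoming traffic x cache miss rate.
# All figures below are invented for illustration.

def origin_requests_per_second(incoming_rps: float, hit_rate: float) -> float:
    """Requests per second that fall through the cache to the origin."""
    return incoming_rps * (1.0 - hit_rate)

# A quiet news day: stories change rarely, so the hit rate is high.
quiet = origin_requests_per_second(2000, 0.95)    # about 100 rps reach the origin

# A 9/11-scale event: traffic spikes AND the hit rate collapses,
# because the same stories are re-edited again and again.
crisis = origin_requests_per_second(20000, 0.05)  # about 19,000 rps reach the origin
```

Note that the two effects multiply: the cache protects you least at exactly the moment you need it most.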
My post above addresses the strategy that one should take when satisfying cache misses. Should you execute Java code on every cache miss? I don't believe so. Doing so is a sure guarantee of site failure.
CBC.ca does use heavy caching. In fact, our front line of caching is Akamai, and we also have cache appliances at the origin with which we serve <acronym title="Cascading Style Sheets">CSS</acronym> files and images. Take note that these caches only work because the working set is (almost always) the same. If our journalists were altering stylesheets every minute, all the caches in the world would not help.
So what's the difference?
Option 1 – Static
A story is updated, the server does some processing and a static file is created, ready to be served to the reader.
Option 2 – Dynamic cached
A story is updated, the cache for that page is invalidated, the cached page is served to the reader.
The difference is that on a busy news site running complex logic to display stories, the cost of a cache miss is high compared to the cost of simply serving static pages. Think about what happens when the cache is invalidated:
* application server engine starts running some application code
* database connection needs to be made to retrieve new story body
* application server runs some more application code to render the story
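The two options can be sketched side by side. This is a toy illustration of the trade-off, not anyone's actual code; every name in it is made up:

```python
import pathlib
import tempfile

DOCROOT = pathlib.Path(tempfile.mkdtemp())

def render(story_id: str, body: str) -> str:
    """Stand-in for the expensive work: app code, database round trip, templating."""
    return f"<html><body><h1>{story_id}</h1><p>{body}</p></body></html>"

# Option 1 - Static: the expensive work happens once, at publish time.
def publish_static(story_id: str, body: str) -> None:
    (DOCROOT / f"{story_id}.html").write_text(render(story_id, body))

def serve_static(story_id: str) -> str:
    # Every request is just a file read; no application code runs.
    return (DOCROOT / f"{story_id}.html").read_text()

# Option 2 - Dynamic cached: the expensive work happens on every cache miss.
CACHE: dict[str, str] = {}
DATABASE: dict[str, str] = {}

def update_story(story_id: str, body: str) -> None:
    DATABASE[story_id] = body
    CACHE.pop(story_id, None)  # invalidate; the next reader pays the render cost

def serve_dynamic(story_id: str) -> str:
    if story_id not in CACHE:  # miss: app server + database + templating all run
        CACHE[story_id] = render(story_id, DATABASE[story_id])
    return CACHE[story_id]
```

Under a flood of edits, Option 2's miss path runs constantly while readers wait; Option 1 moves the same cost to write time, where it happens once per edit no matter how many readers arrive.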