
all I want for Christmas are some custom Apache modules

Operating an Apache httpd-based origin in conjunction with a CDN presents some interesting challenges and opportunities. For example, one can actually eliminate a lot of sophisticated cache control directives by trusting that the CDN will Do The Right Thing™ when communicating with client browsers. Furthermore, judicious use of a few Apache modules and mod_expires directives can go a long way towards reducing origin bandwidth and load on the webservers.

However, dynamically-generated web pages (including those generated via SSI) can result in unnecessary cache evictions due to the inability to determine last modification time. In this article I’ll explore exactly why SSIs are so irritating from a CDN-interaction perspective and why all I want for Christmas is a CDN-aware mod_include and/or mod_expires, as per the title of this post.

Server-side includes (SSIs) are a powerful but lightweight mechanism for assembling HTML content on the fly. As a server administrator, I am very annoyed by people who use PHP and its include/require family of statements to do only this. You might as well just use SSI and save the overhead of invoking a PHP interpreter (plus, the fact that many PHP modules are not thread-safe means that you cannot run your Apache using a threaded MPM). Unfortunately, using SSI inhibits Apache’s ability to tell the CDN about the cacheability of SSI pages.

To understand why, let’s review how a CDN like Akamai (a) fetches objects from the origin, and (b) determines whether an object payload should be refetched. Operation (a) is executed when neither the edge server nor its peers have the object in cache; the CDN will execute an “unconditional” HTTP request to the origin:

GET /logo.gif HTTP/1.1
Host: www.cbc.ca

which returns both the HTTP headers and the payload:

HTTP/1.1 200 OK
Server: Apache/2.2.8 (Red Hat)
Last-Modified: Fri, 18 Oct 2002 14:56:54 GMT
ETag: "45b86007-63d-3ad48c6c23980"
Accept-Ranges: bytes
Content-Length: 1597
X-Origin-Server: torlnxdpbdrhapacheweb22
Content-Type: image/gif
Cache-Control: public, max-age=0
Expires: Thu, 06 Nov 2008 04:39:38 GMT
Date: Thu, 06 Nov 2008 04:39:38 GMT
Connection: keep-alive

(with the payload following)

The CDN edge server then stores this object according to the cache key, and also stores the Last-Modified time of this object. The next time a client browser requests this object from the CDN, operation (b) will be executed by the edge server against the origin with an If-Modified-Since request:

GET /logo.gif HTTP/1.1
Host: www.cbc.ca
If-Modified-Since: Fri, 18 Oct 2002 14:56:54 GMT

In most cases, the object will not have changed and so the origin will just respond HTTP/1.1 304 Not Modified. This signals the CDN to just return the payload out of cache and not bother requesting it from the origin, thus saving bandwidth.
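The origin's side of this exchange can be sketched in a few lines of Python. This is only an illustration of the If-Modified-Since comparison, not Apache's actual implementation; the function name `origin_status` is made up for this example.

```python
from email.utils import parsedate_to_datetime


def origin_status(last_modified, if_modified_since=None):
    """Return 304 if the object is unchanged since the date the edge sent, else 200.

    Both arguments are HTTP date strings such as
    "Fri, 18 Oct 2002 14:56:54 GMT". (Hypothetical helper, for illustration.)
    """
    if if_modified_since is None:
        # Unconditional request: always send headers plus payload.
        return 200
    try:
        modified = parsedate_to_datetime(last_modified)
        since = parsedate_to_datetime(if_modified_since)
        unchanged = modified <= since
    except (TypeError, ValueError):
        # Unparseable or incomparable dates: fall back to a full response.
        return 200
    return 304 if unchanged else 200
```

With the logo.gif headers shown above, a request carrying the stored Last-Modified value yields 304, while a request with no If-Modified-Since header yields 200.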

The problem with SSIs is that they result in a "virtual page" assembled from a number of physical objects on disk; thus, the last modified time is undefined, and by default Apache sends no Last-Modified: header for SSI output. In a naive Apache configuration no Expires: header is sent along to the CDN either, which means that the edge servers have neither a date to use in an If-Modified-Since request nor a TTL, and will refetch the payload each and every time they receive a browser’s request. This defeats the whole purpose of a CDN.

The workaround is to forcibly set a TTL value for all SSI pages using mod_expires. This results in a Cache-Control: header being sent along to the edge on a response; for example, in response to a request

GET /news/index.html HTTP/1.1
Host: www.cbc.ca

our origin will respond:

HTTP/1.1 200 OK
Server: Apache/2.2.8 (Red Hat)
X-Origin-Server: torlnxdpbdrhapacheweb24
Content-Type: text/html
Cache-Control: public, max-age=120
Expires: Thu, 06 Nov 2008 04:53:23 GMT
Date: Thu, 06 Nov 2008 04:51:23 GMT
Connection: keep-alive

Cache-Control: public, max-age=120 tells the edge server to cache the payload for a maximum of two minutes (120 seconds), after which it is forcibly evicted from the cache.
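For reference, a two-minute TTL like this could be set with mod_expires roughly as follows. This is a sketch rather than our actual configuration: it assumes mod_expires is loaded and that SSI output is served as text/html. mod_expires emits the Expires: and Cache-Control: max-age headers; the "public" token in the response above would have to come from elsewhere (mod_headers, for instance).

```apache
# Assumption: mod_expires is loaded and SSI pages are served as text/html.
<IfModule mod_expires.c>
    ExpiresActive On
    # Expire HTML 120 seconds after the edge fetches it
    ExpiresByType text/html "access plus 2 minutes"
</IfModule>
```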

Obviously, this is not actually that efficient for SSI pages, which brings me to my Christmas wish. If someone could write a custom mod_cdn_include (let’s call it) that would correctly calculate the "virtual modification time" of an SSI page according to some algorithm, I would be very happy. We would no longer have to add Cache-Control: headers in order to cause the CDN to blindly evict SSI objects from cache after a certain period of time.

The virtual modification time could be trivially calculated as the latest modification time of all sub-requested objects from an SSI object. The only reason I can see why this hasn’t yet been done is that it requires carrying a chunk of data through the SSI processing, to keep track of the maximum modification time. Maybe some enterprising Apache hacker can come up with a solution?
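To make the idea concrete, here is a rough Python sketch of that calculation done outside of Apache. It is not a mod_include patch: it only follows include directives that point at static files on disk, and the names `virtual_mtime`, `last_modified_header` and the `docroot` parameter are invented for this example.

```python
import os
import re
from email.utils import formatdate

# Matches <!--#include virtual="..." --> and <!--#include file="..." -->
INCLUDE_RE = re.compile(r'<!--#include\s+(virtual|file)="([^"]+)"\s*-->')


def virtual_mtime(path, docroot, _seen=None):
    """Newest mtime (epoch seconds) among `path` and every file it includes."""
    seen = _seen if _seen is not None else set()
    real = os.path.realpath(path)
    if real in seen:
        # Guard against include cycles.
        return 0.0
    seen.add(real)

    newest = os.path.getmtime(real)
    with open(real, encoding="utf-8", errors="replace") as fh:
        text = fh.read()

    for kind, target in INCLUDE_RE.findall(text):
        if kind == "virtual" and target.startswith("/"):
            # Absolute virtual paths are resolved against the document root.
            child = os.path.join(docroot, target.lstrip("/"))
        else:
            # Everything else is resolved relative to the including file.
            child = os.path.join(os.path.dirname(real), target)
        if os.path.isfile(child):
            # Carry the running maximum through nested includes.
            newest = max(newest, virtual_mtime(child, docroot, seen))
    return newest


def last_modified_header(path, docroot):
    """Format the virtual modification time as an HTTP date string."""
    return formatdate(virtual_mtime(path, docroot), usegmt=True)
```

The `newest` value threaded through the recursion is exactly the "chunk of data" mentioned above; a real module would also have to decide what to do with includes that resolve to CGI scripts or other dynamic content, which this sketch simply ignores.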

I apologize in advance for making fun of the way some of our server admins would love to name our origin servers. They’re not actually called that.