Having a reverse-proxy web cache as one of the major infrastructure elements brings many benefits for large web applications: it reduces the load on your application servers, lowers average response times on your site, and so on. But there is one problem every developer experiences when working with such a cache – cached content invalidation.
It is a complex problem that usually consists of two smaller ones: invalidation of individual cache elements (you need to keep an eye on your data and invalidate cached pages when related data changes) and full cache purges (sometimes your site layout or page templates change and you need to purge all the cached pages to make sure users will get the new visual elements or layout changes). In this post I’d like to look at a few techniques we use at Scribd to solve cache invalidation problems.
So, the first problem – ongoing cache invalidation when content changes. This is actually a pretty simple task in squid: you just use the HTCP protocol and send CLR requests to your caching farm (we didn’t find any existing HTCP protocol implementations, so we’ve implemented our own simple client that supports just this one command).
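This is not Scribd’s actual client, but a minimal sketch of what such a CLR-only HTCP client looks like, following the wire format described in RFC 2756 (the exact byte layout here is my reading of the RFC, not code from the post):

```python
import struct

HTCP_OP_CLR = 4  # CLR opcode per RFC 2756


def countstr(s: str) -> bytes:
    """HTCP COUNTSTR: a 16-bit length followed by the text itself."""
    data = s.encode("ascii")
    return struct.pack("!H", len(data)) + data


def build_clr(url: str, method: str = "GET", trans_id: int = 1) -> bytes:
    """Build an HTCP CLR request datagram asking a cache to drop `url`."""
    # CLR OP-DATA: 14 reserved bits + a 2-bit REASON, then a SPECIFIER
    # (METHOD, URI, VERSION, REQ-HDRS -- each encoded as a COUNTSTR).
    op_data = struct.pack("!H", 0)  # reason = 0
    op_data += countstr(method) + countstr(url)
    op_data += countstr("HTTP/1.1") + countstr("")
    # DATA section: its own length, OPCODE/RESPONSE nibbles, flags byte,
    # and a 32-bit transaction id, followed by the OP-DATA above.
    data_len = 2 + 1 + 1 + 4 + len(op_data)
    data = struct.pack("!HBBI", data_len, HTCP_OP_CLR << 4, 0, trans_id) + op_data
    auth = struct.pack("!H", 2)  # empty AUTH section (just its own length field)
    total = 4 + len(data) + len(auth)
    header = struct.pack("!HBB", total, 0, 1)  # HTCP version 0.1
    return header + data + auth
```

The resulting datagram would be sent over UDP to the cache’s HTCP port (4827 by default in squid), e.g. `sock.sendto(build_clr("http://example.com/doc/123"), (cache_host, 4827))`.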
Since we use haproxy to balance our traffic in the cluster, it is hard to predict which cache server holds a given page, so we fan purge requests out to all cache servers.
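The fan-out itself can be as simple as the following sketch (the hostnames and the injectable `send` hook are illustrative, not from the post; by default it just fires one UDP datagram per cache node):

```python
import socket

HTCP_PORT = 4827  # squid's default HTCP port


def fan_out_purge(payload: bytes, cache_servers, send=None):
    """Send one purge datagram to every cache server.

    Because the load balancer may have routed the page to any node,
    we can't know which cache holds a copy -- so every node gets the
    purge. `send` is injectable for testing; the default uses UDP.
    """
    if send is None:
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        send = lambda data, addr: sock.sendto(data, addr)
    for host in cache_servers:
        send(payload, (host, HTCP_PORT))
```

Purging a page would then look like `fan_out_purge(clr_packet, ["cache1.internal", "cache2.internal"])` with whatever host list your deployment uses.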
To make sure cache purging won’t slow the site down, especially considering we need to do more than just a simple cache purge (submit documents to search indexes, etc.), we simply spool a “document changed” request to a queue and then have a set of asynchronous processes that do all the work in the background.
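The shape of that producer/worker split can be sketched with an in-process queue (Scribd’s real setup would use a persistent queue and separate worker processes; this is just the pattern):

```python
import queue
import threading

change_queue = queue.Queue()


def on_document_changed(doc_id):
    """Called from the web request path: just spool the event and return."""
    change_queue.put(doc_id)


def worker(handle):
    """Background worker: drain the queue and do the heavy lifting
    (cache purge, search-index update, ...) outside the request cycle."""
    while True:
        doc_id = change_queue.get()
        if doc_id is None:  # sentinel used to shut the worker down
            break
        handle(doc_id)
        change_queue.task_done()
```

The request handler only pays for a queue insert; everything slow happens in `handle` on the worker’s schedule.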
Next, The Hard Problem – handling full cache purges without killing our backend servers with 5x–10x traffic (our normal hit ratio is ~90–95%).
We’ve spent a lot of time thinking about this problem, and the first idea we came up with was to have a loop process somewhere that would iterate over all the documents we have cached and purge them one by one… but that is not a practical solution when you have tens of millions of documents (and a few page versions per document), and it obviously would not scale with a constantly growing document corpus.
So we kept brainstorming and finally got an idea that works just perfectly for us: what if we could take our traffic and define a function f(t) that returns the percentage of the traffic that should be purged at any moment in time? So that is what we did – we’ve implemented an nginx module that versions our cache by assigning every cached page a revision (using a custom HTTP header plus Vary-based caching) and can slowly migrate the cache from one revision to another over a pre-defined period of time.
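The reason the custom header plus `Vary` works is that a Vary-aware cache keys each entry on the varied header’s value, so every revision of a page is a separate cache entry. A toy model of that behavior (the class and `backend` callback are illustrative, not part of the actual nginx module):

```python
class VaryCache:
    """Toy cache keyed on (url, revision), mimicking how a response
    with `Vary: X-Cache-Revision` is stored once per header value."""

    def __init__(self):
        self.store = {}

    def fetch(self, url, revision, backend):
        key = (url, revision)
        if key not in self.store:           # unseen revision => cache miss
            self.store[key] = backend(url)  # refill from the app server
        return self.store[key]
```

Bumping the revision a request carries therefore forces a miss for that page without ever having to delete the old entry explicitly – it simply ages out.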
Here is an overview of the requests/responses flow in our web/cache/application cluster:
The X-Cache-Revision value selection algorithm is the following:
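The original post illustrates the algorithm with a diagram; in lieu of it, here is one plausible sketch of the idea (the linear ramp and the per-URL hash bucketing are my assumptions about how such a selection could work, not the module’s actual code):

```python
import hashlib


def migrated_fraction(now, purge_start, purge_duration):
    """f(t): fraction of traffic that should see the new revision,
    ramping linearly from 0 to 1 over the purge window."""
    if now <= purge_start:
        return 0.0
    return min(1.0, (now - purge_start) / purge_duration)


def select_revision(url, now, purge_start, purge_duration,
                    old_rev=1, new_rev=2):
    """Pick the X-Cache-Revision value for one request.

    Each URL hashes to a stable bucket in [0, 1); buckets below f(t)
    get the new revision (a cache miss, refilled from the backend),
    while the rest keep hitting the old cached copy until the ramp
    reaches them."""
    digest = hashlib.md5(url.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < migrated_fraction(now, purge_start, purge_duration):
        return new_rev
    return old_rev
```

Because the bucket is derived from the URL rather than drawn at random per request, each page flips to the new revision exactly once during the window, and the backend refill load is spread evenly across it.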
Having this kind of logic in place allows us to do so-called “slow” cache purges that can take anywhere from a few minutes (which still helps reduce the load spike generated by the hottest content) up to many hours (this is what we normally use) or even days (we’ve never used this option, but it is definitely possible).
Here is an example 100% cache purge over an 8 hour interval:
As you can see, during those slow purges our cached pages are slowly updated without putting too much pressure on the backend. The cache hit ratio slowly degrades and then slowly returns to its normal levels, but with our normal (6–8 hour) purges the hit ratio never drops below 65–70%, which makes it possible for us to save huge amounts of money by not keeping 90% spare capacity around just for cache-purge load surges (we used to have lots of spare application cluster capacity before introducing this approach).
Update (June 5th, 2010): Request/response flow and cache revision selection algorithm added.