Since the day I joined Scribd, I have been thinking about the fact that 90+% of our traffic goes to the document view pages, which are served by a single action in our documents controller. I kept wondering how we could improve this action’s responsiveness and make our users happier.
A few times I created git branches and hacked on this action, trying to implement some sort of page-level caching to make things faster. But every time the results weren’t as good as I’d have liked. So the branches sat there, waiting for a better idea.
A few months ago a good friend of mine joined Scribd and we started thinking about this problem together. As a result of our brainstorming, we figured out which problems were preventing us from caching efficiently:
- First of all, a lot of code in the action changes the page view if the visitor is a bot (no, not cloaking, just some minor adjustments to the view).
- The second problem was a set of differences in the view between anonymous and logged-in users.
- And finally, the third problem was that the page has a few blocks that change quite dynamically: the document stats pane and the comment lists.
Combined, these problems caused a lot of pain whenever I tried to cache the whole page. Once we had figured them out, we started thinking about how to generalize the possible combinations of those factors and the possible approaches to caching.
There is a well-known idea in web application development: the fastest web app action is one that does not require any code to be executed on your application server. So the first idea we considered was an approach that would definitely reduce the number of hits on our app servers. It was based on the HTTP features related to the Last-Modified and ETag headers. But there was a problem: not many users visit the same page twice, so even if we made the page cacheable, it wouldn’t help much. Still, the idea of full-page caching outside of the application was really good, and we started playing with it to figure out how to use it in production.
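To make the Last-Modified and ETag idea concrete, here is a minimal sketch of how a server can answer a conditional GET with `304 Not Modified` instead of re-sending the page. It is written in Python for illustration only and is not our Rails code; the `respond` and `make_etag` names and the hash-based ETag scheme are assumptions made for this example.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
import hashlib


def make_etag(body: bytes) -> str:
    """Build an ETag from the response body (hypothetical scheme)."""
    return '"' + hashlib.sha1(body).hexdigest() + '"'


def respond(request_headers: dict, body: bytes, last_modified: datetime):
    """Return (status, headers, body), honoring conditional GET headers.

    `last_modified` must be a timezone-aware datetime in UTC.
    """
    etag = make_etag(body)
    headers = {
        "Last-Modified": format_datetime(last_modified, usegmt=True),
        "ETag": etag,
    }

    # If-None-Match takes precedence over If-Modified-Since when both are sent.
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        if etag in candidates or "*" in candidates:
            return 304, headers, b""
        return 200, headers, body

    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparsable date: ignore the header
        if since is not None:
            if since.tzinfo is None:
                since = since.replace(tzinfo=timezone.utc)
            # HTTP dates have one-second resolution, so drop microseconds.
            if last_modified.replace(microsecond=0) <= since:
                return 304, headers, b""

    return 200, headers, body
```

A client (or a cache) that already holds a copy sends back the `ETag` or `Last-Modified` value it received, and the server can skip rendering and transferring the body when nothing has changed.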
A long time ago, when the Internet was slow and expensive, many ISPs and large companies were trying to reduce their traffic without hurting their users’ experience. That is how caching proxy servers were born. The idea of those servers was to handle all web requests coming from a network (an ISP or a company office) and cache as much content as possible, so that when the same or another user requested a cached page, the proxy server would return it really fast. If we implemented support for those Last-Modified headers, all proxy servers would be happy to cache our pages. But there was a problem: no one uses caching proxies in 2008. So we got an idea: why can’t we place such a server in front of our application and make it cache content for all users in the world? (Yes, we knew about caching reverse proxies before; I’m just trying to explain the flow of our thoughts and words while we were brainstorming the problem.)
The only problem with this approach was differentiating between logged-in users, anonymous users and bots. Since our proxy server could be placed between the app and our web servers (nginx), we decided to create an nginx module that would translate the same document page URL into a set of URLs that differ for each of those three kinds of users.
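The translation step can be sketched roughly as follows. This is a Python illustration of the idea only, not the nginx module itself: the bot markers, the `session_id` cookie name, and the `/cache/<kind>` URL prefix are all hypothetical choices made for this example.

```python
from http.cookies import SimpleCookie

# Hypothetical values, chosen only for this illustration.
BOT_MARKERS = ("googlebot", "msnbot", "slurp")
SESSION_COOKIE = "session_id"


def classify(user_agent: str, cookie_header: str) -> str:
    """Classify a request as 'bot', 'user' (logged in) or 'anon'."""
    ua = user_agent.lower()
    if any(marker in ua for marker in BOT_MARKERS):
        return "bot"
    cookies = SimpleCookie()
    cookies.load(cookie_header)
    if SESSION_COOKIE in cookies:
        return "user"
    return "anon"


def cache_url(path: str, user_agent: str = "", cookie_header: str = "") -> str:
    """Rewrite a public document URL into a per-visitor-kind URL.

    The caching proxy then sees three distinct URLs for the same
    document and stores one cached copy per kind of visitor.
    """
    kind = classify(user_agent, cookie_header)
    return f"/cache/{kind}{path}"
```

Because the rewrite happens before the request reaches the proxy, the proxy itself needs no knowledge of cookies or user agents; it simply caches by URL.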
Once the problems with the different kinds of users were solved, we turned to the last one: the non-cacheable dynamic stats pane. The solution was pretty simple: we added a small AJAX call to the page that updates the stats on the cached version of the page for all real users, while bots see the same page with a slightly stale stats pane.
Long story short, the results are really great. Application server load dropped by 50-70%, database server load dropped by 30-60%, and response times went down from 500-750 msec to 150-200 msec. As an additional positive effect of the caching, we managed to remove all fragment caches from the application and free up more memcached resources for data caches. Here are a few cacti graphs of our servers’ load/traffic (the caching was introduced on Oct 9th at night):
Main MySQL command counters:
One of our Application Servers CPU Usage:
One of our Application Servers Load Average:
There is a lot to share about this caching experience, too much for a single post, so I’ve decided to write a series of posts explaining the problems we had and the solutions we found for each of the following parts of the caching system:
- Logged in Users and Complex URLs Handling
- Rails code to support Last-Modified headers and how we purge caches
- Squid Server setup (configs and hardware)
- Nginx module development
So, if you’re interested in details, subscribe to this blog’s RSS feed and in a few days you’ll see the first article from this series.