Advanced Squid Caching in Scribd: Handling Logged-In Users and Complex URLs
21 Jul 2009

It’s been a while since my first post about the way we do document page caching in Scribd, and this approach has definitely proven to be very effective since then. In this second post of the series, I’d like to explain how our caching architecture handles complex document URLs and logged-in users.

First of all, let’s take a look at a typical Scribd document URL: http://www.scribd.com/doc/1/Improved-Statistical-Test.

As we can see, it consists of a document-specific part (/doc/1) and a non-unique, human-readable slug (/Improved-Statistical-Test). When a user comes to the site with a wrong slug in the document URL, we need to make sure we send them to the correct URL with a permanent HTTP 301 redirect. So, obviously, we can’t simply pass these requests to Squid as they are, because that would cause a few problems:

  • When we changed a document’s title, we’d create a new cached item and would not be able to redirect users from the old URL to the new one.
  • When we changed a title, we’d pollute the cache with additional copies of the document page.

One more problem makes the situation even worse – we have three different kinds of users on the site:

  1. Logged-in users – active site users who are logged in and should see their name at the top of the page, all kinds of customized parts of the page, etc. (especially when the page is their own document).
  2. Anonymous users – all users who are not logged in and visit the site with a Flash-enabled browser.
  3. Bots – all kinds of crawlers that can’t read Flash content and need to see a plain-text version of the document.

All three kinds of users should see their own version of the document page, whether the page is cached or not.

So, how do we solve these two problems? Here is how.

First of all, to fix the URL problem, we decided to rewrite the URL before it reaches a Squid server. We change URLs to look like this: http://www.scribd.com/doc/1?__enable_docview_caching=1. This makes the document URL depend only on a unique document ID that never changes, and it sends an additional parameter to the backend to signal that the page could potentially be cached. The slug is sent to the backend in an HTTP header (X-Scribd-Slug) so that the backend can check it and return a redirect if needed.
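Here is a minimal nginx sketch of that rewriting step (the upstream definition and the regex are illustrative assumptions, not our production config):

```nginx
# Assumption: Squid runs locally and fronts the Rails backend.
upstream squid_upstream {
    server 127.0.0.1:3128;
}

server {
    listen 80;
    server_name www.scribd.com;

    location ~ ^/doc/(\d+)/?(.*)$ {
        # Pass the slug to the backend in a header so it can validate it
        # and reply with a 301 redirect to the canonical URL if it's wrong.
        proxy_set_header X-Scribd-Slug $2;

        # The URL Squid sees (and caches) depends only on the immutable
        # document id, so title changes don't create new cache entries.
        proxy_pass http://squid_upstream/doc/$1?__enable_docview_caching=1;
    }
}
```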

To make sure we don’t respond with a cached page to a request with an invalid URL (an invalid slug, basically), we use the Vary: X-Scribd-Slug HTTP header, which is supported by Squid (only in late 2.6 and 2.7 releases) and makes it check the specified request headers before responding with cached content. If the header stored with the cached content differs from the header in the request, Squid proxies the request to the backend, where we can handle the cache miss however we want.
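In HTTP terms, the backend simply marks its cacheable responses with that header; something roughly like this (the max-age value is only an example, not our real setting):

```http
HTTP/1.1 200 OK
Cache-Control: public, max-age=300
Vary: X-Scribd-Slug
```

Squid remembers the X-Scribd-Slug value the response was cached with, so any later request carrying a different slug misses the cache and goes back to the Rails backend, which can then answer with the 301 redirect to the canonical URL.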

Next, to solve the users problem, we created a small nginx module that looks at the request headers and can tell whether the user is a bot, a logged-in user, or an anonymous visitor. The module exposes a $scribd_user_id variable to our configs, and we can use this variable to configure things differently for different kinds of users.
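A sketch of how such a variable can be used in an nginx config (the upstream names and the routing decision are illustrative assumptions, not our actual setup):

```nginx
# $scribd_user_id comes from our custom module: 0 for anonymous
# visitors, -1 for bots, the real user id for logged-in users.
# (URL rewriting from the earlier sketch is omitted for brevity,
# and the upstreams are assumed to be defined elsewhere.)
location /doc/ {
    # Logged-in users are not cached yet, so route them directly
    # to the Rails backend; everyone else goes through Squid.
    if ($scribd_user_id !~ "^(0|-1)$") {
        proxy_pass http://rails_backend;
    }
    proxy_pass http://squid_upstream;
}
```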

At this point we don’t cache document pages for logged-in users, so we basically keep two copies of each page in the cache: the Flash-enabled document page (for anonymous users) and the inline text version (for bots). We do this separation by changing our cache URLs one more time: we add a scribd_user_id=$scribd_user_id parameter (anonymous = 0, bot = -1) to the cache URL: http://www.scribd.com/doc/1?__enable_docview_caching=1&scribd_user_id=0.
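Putting both URL tweaks together, the rewrite from the first sketch just grows one more query parameter (again, only a sketch):

```nginx
location ~ ^/doc/(\d+)/?(.*)$ {
    proxy_set_header X-Scribd-Slug $2;

    # The cache key is now document id + user type, so anonymous users
    # and bots each get their own cached copy of the page.
    proxy_pass http://squid_upstream/doc/$1?__enable_docview_caching=1&scribd_user_id=$scribd_user_id;
}
```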

And last, but not least, we use two really awesome Squid features called stale-while-revalidate and stale-if-error (AFAIR, they were invented at Yahoo! and then described by their Squid admin).

The stale-while-revalidate option lets us quickly serve content from the cache while doing a background re-validation request to the Rails backend. The stale-if-error option basically allows us to keep serving content from the Squid cache when the Rails backend is down/dead/slow.
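As far as I understand it, both features are driven by Cache-Control extensions the backend adds to its responses; a cacheable response could look roughly like this (the lifetimes here are made-up examples, not our real values):

```http
HTTP/1.1 200 OK
Cache-Control: public, max-age=300, stale-while-revalidate=86400, stale-if-error=86400
Vary: X-Scribd-Slug
```

With headers like these, Squid answers from its (possibly stale) copy right away, refreshes it from Rails in the background, and keeps serving the stale copy if Rails errors out or doesn’t answer at all.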

All these changes allowed us to handle more traffic with less hardware and, what is even more important, they helped us improve the user experience on the site: response times dropped 2-3x, and far fewer people see our Ouch! pages (HTTP 50x errors shown when the backend is dead or overloaded). Here is an example of one of our servers’ daily hit ratio and traffic savings graphs:

[Graph: daily cache hit ratio]

[Graph: daily traffic savings]

That’s it for the logged-in users and complex URL handling in Scribd’s caching architecture. The next post in this series will explain how we do cache invalidation in Scribd. Stay tuned.