It’s been a while since I’ve posted my first post about the way we do document pages caching in Scribd and this approach has definitely proven to be really effective since then. In the second post of this series I’d like to explain how we handle our complex document URLs and logged in users in the caching architecture.
First of all, let’s take a look at a typical Scribd’s document URL: http://www.scribd.com/doc/1/Improved-Statistical-Test.
As we can see, it consists of a document-specific part (/doc/1) and a non-unique human-readable slug part (/Improved-Statistical-Test). When a user comes to the site with a wrong slug in the document URL, we need to make sure we send the user to the correct URL with a permanent HTTP 301 redirect. So, obviously we can’t simply send our requests to the squid because it’d cause few problems:
- When we change document’s title, we’d create a new cached item and would not be able to redirect users from the old URL to the new one
- When we change a title, we’d pollute cache with additional document page copies.
One more problem that makes the situation even worse – we have 3 different kinds of users on the site:
- Logged in users – active web site users that are logged in and should see their name at the top of the page, should see all kinds of customized parts of the page, etc (especially when a page is their own document).
- Anonymous users – all users that are not logged in and visit the site with a flash-enabled browser
- Bots – all kinds of crawlers that can’t read flash content and need to see a plain text document version
All three kinds of users should see their own document page versions whether the page is cached or not.