cache Archives - Oleksiy Kovyrin

Advanced Squid Caching in Scribd: Hardware + Software Used

Posted in: Development, Networks
Tags: cache, haproxy, hardware, Nginx, scribd, squid

4 Aug 2009

After the previous post in this caching-related series, I received many questions about the hardware and software configuration of our servers, so in this post I’ll describe our servers’ configs and the motivation behind them.


Advanced Squid Caching in Scribd: Logged In Users and Complex URLs Handling

Posted in: Development, My Projects
Tags: cache, caching, Nginx, rewrite, scribd, squid

21 Jul 2009

It’s been a while since I published the first post about how we cache document pages at Scribd, and the approach has proven really effective since then. In this second post of the series I’d like to explain how we handle our complex document URLs and logged-in users in the caching architecture.

First of all, let’s take a look at a typical Scribd document URL: http://www.scribd.com/doc/1/Improved-Statistical-Test.

As we can see, it consists of a document-specific part (/doc/1) and a non-unique, human-readable slug part (/Improved-Statistical-Test). When a user comes to the site with a wrong slug in the document URL, we need to send the user to the correct URL with a permanent HTTP 301 redirect. So we can’t simply pass our requests straight to Squid, because that would cause a few problems:

When we change a document’s title, we’d create a new cached item and would not be able to redirect users from the old URL to the new one.
When we change a title, we’d pollute the cache with additional copies of the document page.
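
To make the two problems above concrete, here is a minimal Python sketch of slug handling in front of a cache. It is only an illustration of the idea, not the setup used at Scribd: the `CANONICAL_SLUGS` table, the `route` function and the choice of `/doc/<id>` as the cache key are all assumptions made for this example.

```python
import re
from typing import Tuple

# Hypothetical title store: document id -> current canonical slug.
CANONICAL_SLUGS = {1: "Improved-Statistical-Test"}

# Matches /doc/<id> with an optional /<slug> part.
DOC_URL = re.compile(r"^/doc/(\d+)(?:/([^/?#]*))?$")


def route(path: str) -> Tuple[int, str]:
    """Return (301, redirect_path), (200, cache_key) or (404, "")."""
    match = DOC_URL.match(path)
    if match is None:
        return 404, ""
    doc_id = int(match.group(1))
    slug = CANONICAL_SLUGS.get(doc_id)
    if slug is None:
        return 404, ""
    if match.group(2) != slug:
        # Wrong, outdated or missing slug: redirect to the canonical URL.
        return 301, f"/doc/{doc_id}/{slug}"
    # The cache key ignores the slug, so a title change adds no new cache entry.
    return 200, f"/doc/{doc_id}"
```

With this split, a request for /doc/1/Old-Title is answered with a redirect before it reaches the cache, and every valid request for document 1 maps to the same cached item regardless of the title.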

One more problem makes the situation even worse. We have three different kinds of users on the site:

Logged-in users – active users of the site who are logged in and should see their name at the top of the page, all kinds of customized parts of the page, etc. (especially when the page is their own document).
Anonymous users – all users who are not logged in and visit the site with a Flash-enabled browser.
Bots – all kinds of crawlers that can’t read Flash content and need to see a plain-text version of the document.

All three kinds of users should see their own version of the document page, whether the page is cached or not.
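
One way to picture this requirement is a classifier that sorts each request into one of the three groups and produces a separate cache variant for each. The Python sketch below is a rough illustration under stated assumptions: the cookie name `session_id`, the list of bot markers and the `variant_key` format are hypothetical and are not taken from Scribd's actual configuration.

```python
from enum import Enum
from typing import Mapping


class Visitor(Enum):
    LOGGED_IN = "user"
    ANONYMOUS = "anon"
    BOT = "bot"


# Illustrative values only; a real deployment would use its own lists.
BOT_MARKERS = ("googlebot", "bingbot", "slurp")
SESSION_COOKIE = "session_id"  # hypothetical cookie name


def classify(user_agent: str, cookies: Mapping[str, str]) -> Visitor:
    """Sort a request into one of the three visitor kinds."""
    if cookies.get(SESSION_COOKIE):
        return Visitor.LOGGED_IN
    lowered = user_agent.lower()
    if any(marker in lowered for marker in BOT_MARKERS):
        return Visitor.BOT
    return Visitor.ANONYMOUS


def variant_key(doc_id: int, visitor: Visitor) -> str:
    """Build a per-kind cache key so page versions never mix."""
    return f"/doc/{doc_id}#{visitor.value}"
```

Keeping the visitor kind in the key means a crawler never receives the Flash page cached for an anonymous browser, and the reverse. How personalized pages for logged-in users are served is a separate design decision that this sketch does not cover.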


Homo-Adminus Blog

Yet Another Admin’s Blog