Category: Development
Advanced Squid Caching for Rails Applications: Preface
25 Oct 2008

Since day one at Scribd, I have been thinking about the fact that 90+% of our traffic goes to the document view pages, which are served by a single action in our documents controller. I kept wondering how we could make this action more responsive and our users happier.

A few times I created a git branch and hacked on this action, trying to implement some sort of page-level caching to make things faster. But each time the results weren't as good as I'd hoped, so the branches sat there waiting for a better idea.
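
For context, the most basic form of page-level caching in Rails 2.x looks roughly like the sketch below (standard caches_page usage with assumed controller and model names, not necessarily what those branches actually tried):

# A minimal sketch of Rails page caching for the document view action.
# caches_page writes the rendered HTML to public/, so the web server can
# serve repeat hits without touching Rails at all; the hard part is
# expiring those files whenever the underlying document changes.
class DocumentsController < ApplicationController
  caches_page :show

  def show
    @document = Document.find(params[:id])
  end
end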
Read the rest of this entry


Bounces-handler Released
3 Aug 2008

Today I managed to finish the initial version of our bounces-handler package, which we use for mailing-related tasks at Scribd.

The bounces-handler package is a simple set of scripts to automatically process email bounces and ISPs' feedback loop emails and maintain your mailing blacklists, plus a Rails plugin to use those blacklists in your RoR applications.

This piece of software was developed as part of a larger effort to improve mailing quality at Scribd.com, and it was one of the most critical steps after setting up reverse DNS records, DKIM, and SPF.

The package itself consists of two parts:

  • Perl scripts to process incoming email:
    • bounces processor – can be assigned to process all your bounce emails
    • feedback loop messages processor – more specific to Scribd, but it could still be modified for your needs (will be released soon)
  • Rails plugin to work with mailing blacklists

For more information, please check our README file. If you have any questions, comments, or suggestions, please leave them here as comments and I'll try to reply as soon as possible.
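
As a rough illustration of where the Rails side fits in, a blacklist check before a delivery might look like the sketch below (hypothetical model, method, and mailer names; see the README for the plugin's actual interface):

# Hypothetical ActiveRecord model backed by the blacklist table.
class BlacklistedEmail < ActiveRecord::Base
  def self.blocked?(address)
    exists?(:email => address)
  end
end

# Somewhere in the application: skip delivery to blacklisted addresses.
UserMailer.deliver_welcome(user) unless BlacklistedEmail.blocked?(user.email)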


Using Sphinx for Non-Fulltext Queries
19 May 2008

How often do you think about the reasons why your favorite RDBMS sucks? 🙂 I've been doing this quite often over the last few months, and yes, my favorite RDBMS is MySQL. The reason is that one of my recent tasks at Scribd was fixing scalability problems in documents browsing.

The problem with browsing was pretty simple to describe and just as hard to fix: we have a large data set consisting of a few tables with many fields that have really bad selectivity (flag fields like is_deleted, is_private, etc.; file_type, language_id, category_id and others). As a result, it becomes really hard (if possible at all) to display document lists like “most popular 1-10 page PDF documents in Italian from the category ‘Business’” (non-deleted and non-private, of course). If you try to create appropriate indexes for every possible combination of filters, you'll end up with tens or hundreds of indexes, and every INSERT into those tables will take ages.
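
To make the shape of the problem concrete, here is a hypothetical Rails 2.x version of such a browse query (the model, column names, and ids are assumptions for illustration, not our actual schema):

# "Most popular 1-10 page PDF documents in Italian from the Business category".
# Every combination of these low-selectivity filters would need its own
# composite MySQL index to avoid scanning huge ranges of rows.
Document.find(:all,
  :conditions => { :is_deleted  => false,
                   :is_private  => false,
                   :file_type   => 'pdf',
                   :language_id => 11,     # Italian (hypothetical id)
                   :category_id => 3,      # Business (hypothetical id)
                   :page_count  => 1..10 },
  :order => 'views_count DESC',
  :limit => 20)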

Read the rest of this entry


Command Line History
28 Apr 2008

Inspired by the Rail Spikes:

bash-3.2$ history 1000 | awk '{a[$2]++}END{for(i in a){print a[i] " " i}}' | sort -rn | head
228 cd
167 git
10 ssh
10 DEPLOY=production
6 sudo
6 pwd
6 ./script/import_views.rb
5 rm
4 rake
4 mv
bash-3.2$

Really interesting stats: I'd never have guessed that git is used more often than ssh on my desktop (I'm a remote worker and MySQL consultant, so I ssh really often). 🙂


InnoDB Recovery toolset Version 0.3 Released
14 Apr 2008

Even though I didn't go to the MySQL conference this year (which I'm really sad about), this week is going to be the most active one in the community, so I decided to do some community stuff too. 🙂 Today I released version 0.3 of our InnoDB recovery toolkit. It has become much faster, more stable, and more accurate. At this point it is possible to recover almost any table from a corrupted/deleted tablespace without as much effort as before. Here is a short list of changes (since 0.1, announced here):

  • More MySQL data types added: DECIMAL (both old and new), DATE, TIME
  • CHAR data type handling improved in table definitions generator
  • Indexes filtering added to page_parser
  • 64-bit stat() support added to all tools
  • Linux has no isnumber() function, so we define our own (pretty simple) implementation
  • Lots of fixes in the create_defs.pl script – it now generates definitions that can recover your data in 80% of cases without any changes.
  • Min/max record size calculation fixed in constraints-based parser.
  • Nullable fixed-size columns support is fixed.
  • Debug logging is much cleaner now.

As always, if you need any assistance with your recovery, we would love to help.