Using Sphinx for Non-Fulltext Queries
How often do you think about the reasons why your favorite RDBMS sucks?
Last few months I was doing this quite often and yes, my favorite RDBMS is MySQL. The reason why I was thinking so because one of my recent tasks at Scribd was fixing scalability problems in documents browsing.
The problem with browsing was pretty simple to describe and as hard to fix – we have large data set which consists of a few tables with many fields with really bad selectivity (flag fields like is_deleted, is_private, etc; file_type, language_id , category_id and others). As the result of this situation it becomes really hard (if possible at all) to display documents lists like “most popular 1-10 pages PDF documents in Italian language from the category “Business” (of course, non-deleted, non-private, etc). If you’ll try to create appropriate indexes for each possible filters combination, you’ll end up having tens or hundreds of indexes and every INSERT query in your tables will take ages.
So, we were doing many weird things to solve this problem, like having really huge covering indexes for most popular browsing situations, having powerful (and expensive) dedicated servers for browsing queries, etc. Nothing helped and we ended up reducing filters number to 1-2 per query. But with continuously growing documents base we’d really love to help people find interesting content and filters is what we’d really want to add to the project.
Long story short, we’ve decided to try to use Sphinx search engine for really unusual thing – we decided to move all SELECT queries for our documents to a small (proof of concept) sphinx installation on one of our DB servers and use Sphinx’ full-scan feature. After a few hours of development I’ve created a simple to use module which mimics ActiveRecord’s find method interface and allows us to perform queries like the following (sometimes even more powerful than AR):
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 | browser = Scribd::SphinxBrowser.new(:host => 'ip.add.re.ss') # Get top 10 hottest docs result = browser.browse # Get top 10 most commented docs this week result = browser.browse(:order => 'comments_week DESC') # Get top 5 most voted docs of all times result = browser.browse(:order => 'votes_total DESC', :limit => 5) # Get top 10 most viewed docs for today (only docs with language_id = 5 returned) result = browser.browse(:order => 'views_today DESC', :filters => { :language_id => 5 }) # Get 10th to 19th hottest docs with language_id = 1, 2 or 3 result = browser.browse(:filters => { :language_id => [1, 2, 3] }, :limit => 10, :offset => 10) # Get top 10 hottest docs with language_id = 1 and pages count between 1 and 10 (excluding doc with ID = 5) result = browser.browse(:filters => { :language_id => 1, :page_count => 1..10 }, :remove_id => 5) # Get top 10 latest docs with file_type = 1 and word_user_id = 12345 result = browser.browse(:order => 'created_at DESC', :filters => { :file_type => 1, :user_id => 12345 }) |
After a short testing we’ve got our performance benchmarks results and here are some facts:
- documents count – 3000000+
- re-indexing time – 20-30 seconds for an entire data set (4 large joined tables)
- longest query (surprisingly – a query without any filters) takes about 0.5 sec on one core
- generic query with 1-2 filters takes 0.01-0.05 sec on one core
- queries with many filters take 0.01 sec and less on one core
- queries time could be reduced almost linearly when adding more cores (splitting index on many pieces).
- sphinx scales really easily on more than one box if we’ll need it to.
As you can see here, it is really useful sometimes to think “out of the box” when you work with some tools – we were using sphinx for a quite a long time now, but this solution for browsing was like a blessing and really helped us to keep up with service popularity growth.
Related posts:
12 Responses to this entry
We use this technique for filtering searches on categories, price, etc. We just stick cat_39 in a deal and add that to the sphinx query. Works like a charm.
Brian,
There are two different techniques – one is adding pseudo keywords and other is using filters and “Full Table Scan” query.
Pseudo keywords works if there are few of them in the query and they are well selective.
If you will have 10 keywords each with 90% selectivity this would not work well.
Filters w full table scan however can be slower with good selectivity but you can have a lot of non selective filters with very low penalty.
Also note Sphinx “full table scan” has optimizations so it often can skip a lot of “blocks” based on filters which makes it kind of similar to indexed lookup
Good блог at you, cognitive:)
Do you guys use Sphinx to index the contents of your documents, or only metadata about them that is stored in the database?
2W. Andrew Loe III: No, for browse index we use documents’ attributes only.
Hi!
First of all, you blog is very interesting and I like it, salute from Ireland to you!
The problem is with MySQL(like with every DBMS written by folks wo/ corporate experience) still the relative small amount of features to handle something bigger than the average.
Let me show you what i am thinking about:
http://www.ibm.com/developerworks/db2/library/techarticle/dm-0605ahuja2/index.html
http://www.oracle.com/technology/products/oracle9i/datasheets/partitioning.html
http://msdn.microsoft.com/en-us/library/ms345146.aspx
And just for the record, innodb is an Oracle product.
Maybe you can try this:
http://dev.mysql.com/doc/refman/5.1/en/partitioning-limitations.html
Regards,
lix
Hello Alexei,
I have one comment regarding this:
From what I’ve read so far Sphynx is great only for non-constantly-changing-data… Am I correct.
If you have thousands of writes per second and you want the info to be available within 1-3 second span how would you do that with sphynx??? if reindexing takes 20-40 seconds of 3Mil rows as you said…
Andrew has said that online add and remove is coming in 0.9.9. When will it be ready? Don’t know. But that will be a big, big jump in usefulness.
Yes I agree sphinx makes for a great general purpose index. We put a huge database in sphinx for full text, but realised it could be used an index on many of the other fields too!
@Pavel
Sphinx can search multiple indexes in one query, so the basic idea is to create a main index, and the a delta index, you just more regually reindex the delta index. You can even have multiple delta indexes – basically constantly reindexing the changes. You can also use line attribute updates to ‘delete’ items from the main index without reindexing.
спасибо за статью… добавил в ридер
A great idea man , but i would be better if you could post in English for other to understand
Has read with the pleasure, very interesting post, write still, good luck to you!