- Posted in: Databases, Development
Last few days one of our customers (one of the largest Ruby on Rails sites on the Net) was struggling to solve some really strange problem – once upon a time they were getting an error from ActiveRecord on their site:
1 | (ActiveRecord::StatementInvalid) "Mysql::Error: Lock wait timeout exceeded; try restarting transaction: UPDATE some_table..... |
They have innodb_lock_wait_timeout set to 20 seconds. After a few hours of looking for strange transactions we were decided to create s script to dump SHOW INNODB STATUS and SHOW FULL PROCESSLIST commands output to a file every 10 seconds to catch one of those moments when this error occurred.
Today we’ve got next error and started digging in our logs…
After a couple of hours of looking in logs, parsing binary logs from MySQL and thinking we’ve found a following image of the situation:
1 2 3 4 5 6 7 8 9 | # at 804649039 #071130 14:56:25 server id 4 end_log_pos 1011 Query thread_id=71318 exec_time=3 error_code=0 SET TIMESTAMP=1196456185/*!*/; UPDATE some_table .... # at 804650050 #071130 14:57:02 server id 4 end_log_pos 804650077 Xid = 220437663 COMMIT/*!*/; |
This transaction was locking a record and caused all those errors. We’ve took a look at project sources and didn’t found any transactions used there at all. So, it was obvious this problem transaction (with one query in it) was generated by ActiveRecord.
We’ve discussed this problem with our customer and started searching for a reason, why would AR wait for 37 seconds between a query and COMMIT command.
At the end we’ve found a reason. Customer uses acts_as_ferret for search on their site. Site has a huge documents database so ferret index updates became pretty slow. In acts_as_ferret index updates are called from a following AR hooks:
1 2 3 4 5 | ... after_create :ferret_create after_update :ferret_update after_destroy :ferret_destroy ... |
Next step was obvious – we’ve created a test project with a simple test model:
1 2 3 4 5 6 7 8 | class LockTest < ActiveRecord::Base after_update :my_lock def my_lock logger.debug("Starting sleep for 10 sec") sleep(10) logger.debug("Wake up!") end end |
Next – one test update of the data and the results in development.log:
1 2 3 4 5 | SQL (0.000093) BEGIN LockTest Update (0.000374) UPDATE lock_tests SET `created_at` = '2007-11-30 20:14:28', `updated_at` = '2007-11-30 20:14:43', `a` = 2 WHERE `id` = 1 Starting sleep for 10 sec Wake up! SQL (0.010107) COMMIT |
So, maybe we were reading AR docs in a wrong way, or there is no such info, but :after_update callback is called BEFORE the actual data COMMITed to the DB! In our customer’s case, some user saved the document, ferret stated updating its indexes, record was locked for 30+ seconds and every updates to the same record (next tries to click to a Save button, for example) were failing with timeout errors.
So far so good, we’re going to move to Sphinx and hope there will be no such crazy errors in future.