The million dollar speed-up
Monday, March 5, 2007 at 10:00AM The NetFlix prize contest challenges you to write a recommender engine that improves the current CineMatch engine ten percent or more. If you do, you can win a million dollars. Pretty cool.
As part of the challenge, NetFlix provides datasets of 110 million recommendations given by 500K users on 17k movies. While I have little hope of winning the competition, that serious amount of data (~4GB) is very useful for testing any recommendation engine.
I wrote a Rails app around a few tables to capture the data. While Rails’ ActiveRecord is cool as an ORM, I had some issues getting this working. Here is what I found. Remember me if it helped you win that million dollars. :)
MyISAM
InnoDB tables are way too slow. You don’t need transactional row locking. MyISAM is the way to go. In the db migration steps, add option:
create_table :users, \
:options => "ENGINE=MyISAM" do |t| ...
MySQL version
You will need version at least version 5.0.6. This is needed for LOAD INFILE (see below).
Native MySQL bindings
To get best speed, you need to install the native MySQL bindings. I found the only one that works is version 2.5.1 (go figure). Make sure you point it to where your MySQL is installed, for example:
sudo gem update mysql -- \
--with-mysql-dir=/usr/local/mysql
Webrick
Strangely enough Webrick seems faster than mongrel. You should experiment and see. ruby script/server webrick forces webrick in mongrel’s stead.
You should watch for Ruby’s automatic conversion of integers to strings. In many cases this can seriously slow down searches. One way to check what’s going on is to run the Mongrel server (ruby script/server) which spews out the actual SQL by default.
Loading the data
There is a massive amount of data to load. I found going through the files, parsing them and using the obvious Rails object.create unbelievably slow. Using the sequence below, I managed to fill the database in a few hours, which is pretty good for over 110 million entries on a Mac laptop!
Insert users
To find them, go through the 17770 training set data files and extract all users. The following produces a file with all users sorted and unique:
$ find . -name "*txt" -exec cat {} \; | cut -f1 \
-d ',' > all_users.out
$ sort all_users.out | uniq > \
all_users_sorted_uniq.txt
Then insert into the table. Note the use of load data local infile which is by far the fastest way to populate a table:
Find.find(File.join(File.dirname(__FILE__), \
"netflix_data", "all.out.sorted.uniq")) do |f|
execute "load data local infile '#{f}' into \
table users fields terminated by ',' (@id) \
set id=@id, name=@id"
end
Insert recommendations
Again, use load data local infile. Note the way I derive the product_id (the movie id) from the file name.
s = "into table recommendations fields \
terminated by ',' ignore 1 lines (user_id, \
weight, date)"
dir = File.join(File.dirname(__FILE__), \
"netflix_data", "all_data")
Find.find(dir) do |f|
next if FileTest.directory?(f)
next unless File.extname(f) == ".txt"
puts "#{f}"
# mv_0017764.txt -> 17764
prod_id = File.basename(f).\
split(/\./)[0][-5,5].to_i
execute "load data local infile '#{f}' "\
+s+" set product_id=#{prod_id}"
end
Insert movies
Just use the techniques outlined for users above. :)
Common SQL advice
I’ll end with some useful SQL advice for completeness.
Add indexes on columns you need to search on, for example
add_index :recommendations, [:user_id]Limit the lengths of keys. Varchars should be limited to < 10 chars, for example
t.column :name, :string, :limit=>7
That’s it. Now it’s up to you to write a good recommendation engine. Let me know how it goes …
Hans |
Post a Comment | 


Reader Comments