Search
Tweets
Blogroll
Powered by Squarespace
« Java and OpenID | Main | True user-centric identity »
Monday
Mar052007

The million dollar speed-up

The NetFlix prize contest challenges you to write a recommender engine that improves the current CineMatch engine ten percent or more. If you do, you can win a million dollars. Pretty cool.

As part of the challenge, NetFlix provides datasets of 110 million recommendations given by 500K users on 17k movies. While I have little hope of winning the competition, that serious amount of data (~4GB) is very useful for testing any recommendation engine.

I wrote a Rails app around a few tables to capture the data. While Rails’ ActiveRecord is cool as an ORM, I had some issues getting this working. Here is what I found. Remember me if it helped you win that million dollars. :)

MyISAM

InnoDB tables are way too slow. You don’t need transactional row locking. MyISAM is the way to go. In the db migration steps, add option:

    create_table :users, \
      :options => "ENGINE=MyISAM" do |t| ...

MySQL version

You will need version at least version 5.0.6. This is needed for LOAD INFILE (see below).

Native MySQL bindings

To get best speed, you need to install the native MySQL bindings. I found the only one that works is version 2.5.1 (go figure). Make sure you point it to where your MySQL is installed, for example:

    sudo gem update mysql -- \
      --with-mysql-dir=/usr/local/mysql

Webrick

Strangely enough Webrick seems faster than mongrel. You should experiment and see. ruby script/server webrick forces webrick in mongrel’s stead.

You should watch for Ruby’s automatic conversion of integers to strings. In many cases this can seriously slow down searches. One way to check what’s going on is to run the Mongrel server (ruby script/server) which spews out the actual SQL by default.

Loading the data

There is a massive amount of data to load. I found going through the files, parsing them and using the obvious Rails object.create unbelievably slow. Using the sequence below, I managed to fill the database in a few hours, which is pretty good for over 110 million entries on a Mac laptop!

Insert users

To find them, go through the 17770 training set data files and extract all users. The following produces a file with all users sorted and unique:

    $ find . -name "*txt" -exec cat {} \; | cut -f1 \
      -d ',' > all_users.out
    $ sort all_users.out | uniq > \
      all_users_sorted_uniq.txt

Then insert into the table. Note the use of load data local infile which is by far the fastest way to populate a table:

    Find.find(File.join(File.dirname(__FILE__), \
       "netflix_data", "all.out.sorted.uniq")) do |f| 
      execute "load data local infile '#{f}' into \
          table users fields terminated by ',' (@id) \
          set id=@id, name=@id"
    end

Insert recommendations

Again, use load data local infile. Note the way I derive the product_id (the movie id) from the file name.

    s = "into table recommendations fields \
      terminated by ',' ignore 1 lines (user_id, \
      weight, date)"
    dir = File.join(File.dirname(__FILE__), \
      "netflix_data", "all_data") 
    Find.find(dir) do |f| 
      next if FileTest.directory?(f)
      next unless File.extname(f) == ".txt"
      puts "#{f}"
      # mv_0017764.txt -> 17764
      prod_id = File.basename(f).\
        split(/\./)[0][-5,5].to_i  
      execute "load data local infile '#{f}' "\
        +s+" set product_id=#{prod_id}"
    end

Insert movies

Just use the techniques outlined for users above. :)

Common SQL advice

I’ll end with some useful SQL advice for completeness.

  • Add indexes on columns you need to search on, for example

    add_index :recommendations, [:user_id]
    
  • Limit the lengths of keys. Varchars should be limited to < 10 chars, for example

    t.column :name, :string, :limit=>7
    

That’s it. Now it’s up to you to write a good recommendation engine. Let me know how it goes …

Reader Comments

There are no comments for this journal entry. To create a new comment, use the form below.

PostPost a New Comment

Enter your information below to add a new comment.
Author Email (optional):
Author URL (optional):
Post:
 
All HTML will be escaped. Hyperlinks will be created for URLs automatically.