Search
Tweets
Blogroll
Powered by Squarespace
« Passwords in the cloud | Main | Yahoo! and OpenID »
Friday
Jan182008

1,370,056,012 rows later

A while back I wrote about how the wonderful Netflix Prize competition promised a cool million dollars to anyone who could better the Netflix recommendaton engine by ten percent or more.

Back then, my approach was flawed, and the constraints of my project didn’t lend itself well to the Netflix domain. Now, almost a year later, I approached the problem slightly differently. I’ll not get into details, but my results are much better and with a little tweaking, well, maybe I can stand a chance.

MySQL MyISAM tables are suprisingly resilent. I now have over 1.3 billion rows in a single MySQL database. Of those, 1.2 billion rows are in one table. Somehow this blows my mind.

Here are some tips when you have large amounts of data:

Maybe MySQL is not the best choice

Consider non-relational solutions. Check out hadoop, Google MapReduce and BigTable, etc.

Also consider other databases. MonetDB is an interesting alternative. Smaller and more lightweight alternatives like SleepyCat may do the trick.

Do it the right way

If you want to stick with MySQL, remember to maximize performance:

  • Look into partioning your tables. You can reach 10x speedups quite easily. (This functionality is in the 5.1 release candidate.)

  • Use the datatypes properly. Use TINYINT, MEDIUMINT, etc. When you have billions of datarows, choosing the right size of data really matters.

  • Compress and make your read-only tables read-only. myisamcheck and myisampack are your friends. Rebuilding and compressing shrunk an index for me from 17GB to 11GB, and sped up reads by almost 30%.

  • Tweak your configuration. MySQL have gazillions of config options, and I’m sure there are better sources for advice, but tweaks that made an immediate difference for me were (in /etc/my.cnf): key_buffer_size=512M and sort_buffer_size=16M. Both these values are set very low by default.

The Netflix Prize is a fun little project to do on the side. Hopefully I’ll get to blog soon about what I’m doing with my cool $1M.

Reader Comments (1)

If your current approach is still based on database - it is still flawed.
I am a database expert. I tried a database approach with the fastest commercial database around and got the database size up to 70 GB. I can tell you that for any reasonable algorithm any database with any amount of tweaking will be too slow.
All the data fits comfortably withing less than 1GB of memory if you do it right. The only way to get decent performance on this amount of information is to avoid disk I/O.
January 19, 2008 | Unregistered CommenterDR

PostPost a New Comment

Enter your information below to add a new comment.
Author Email (optional):
Author URL (optional):
Post:
 
All HTML will be escaped. Hyperlinks will be created for URLs automatically.