1,370,056,012 rows later
Friday, January 18, 2008 at 01:44PM A while back I wrote about how the wonderful Netflix Prize competition promised a cool million dollars to anyone who could better the Netflix recommendaton engine by ten percent or more.
Back then, my approach was flawed, and the constraints of my project didn’t lend itself well to the Netflix domain. Now, almost a year later, I approached the problem slightly differently. I’ll not get into details, but my results are much better and with a little tweaking, well, maybe I can stand a chance.
MySQL MyISAM tables are suprisingly resilent. I now have over 1.3 billion rows in a single MySQL database. Of those, 1.2 billion rows are in one table. Somehow this blows my mind.
Here are some tips when you have large amounts of data:
Maybe MySQL is not the best choice
Consider non-relational solutions. Check out hadoop, Google MapReduce and BigTable, etc.
Also consider other databases. MonetDB is an interesting alternative. Smaller and more lightweight alternatives like SleepyCat may do the trick.
Do it the right way
If you want to stick with MySQL, remember to maximize performance:
Look into partioning your tables. You can reach 10x speedups quite easily. (This functionality is in the 5.1 release candidate.)
Use the datatypes properly. Use
TINYINT,MEDIUMINT, etc. When you have billions of datarows, choosing the right size of data really matters.Compress and make your read-only tables read-only.
myisamcheckandmyisampackare your friends. Rebuilding and compressing shrunk an index for me from 17GB to 11GB, and sped up reads by almost 30%.Tweak your configuration. MySQL have gazillions of config options, and I’m sure there are better sources for advice, but tweaks that made an immediate difference for me were (in
/etc/my.cnf):key_buffer_size=512Mandsort_buffer_size=16M. Both these values are set very low by default.
The Netflix Prize is a fun little project to do on the side. Hopefully I’ll get to blog soon about what I’m doing with my cool $1M.



Reader Comments (1)
I am a database expert. I tried a database approach with the fastest commercial database around and got the database size up to 70 GB. I can tell you that for any reasonable algorithm any database with any amount of tweaking will be too slow.
All the data fits comfortably withing less than 1GB of memory if you do it right. The only way to get decent performance on this amount of information is to avoid disk I/O.