Thursday, September 3, 2009

How to win the Github contest

That was the question.

I only found out about the contest some three days before it ended. My Bloglines queue was showing 7000+ posts unread (as usual), and during a quick cleanup run--where I try to at least skim the headlines in my favorite folders--I stumbled on it.

The Netflix mention was a little funny, because I remembered downloading a huge file (1TB?) for _their_ contest and never even writing a single line of code to process it. I still have to track it down (now that their contest seems to be over) to free up some drive space :P.

That pretty much sums up my lack-of-free-time and lack-of-resources rant. I still wanted to participate in the Github contest because it reminded me of the AppWatch days. We had a "people who like this app also like these apps" feature going (inspired by the early Amazon features, I think--this was early 1999) and I thought it was really cool and useful. It ran as a perl script in the middle of the night, basically doing a recursive SQL query and fetching the results. It took about 5 minutes to run and didn't need much RAM (we "only" tracked about 2,000 apps and had a few thousand registered users).
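(For the curious: the feature was basically a co-occurrence count. Here's a toy Ruby sketch of the idea--not the original perl/SQL, and the sample data, names and numbers below are made up for illustration. The real thing did its counting inside the database, not in memory.)

```ruby
require "set"

# Toy data: [user_id, app_id] "likes" pairs. Purely illustrative.
likes = [
  [1, "appA"], [1, "appB"],
  [2, "appA"], [2, "appB"], [2, "appC"],
  [3, "appA"], [3, "appC"]
]

# Index: app => set of users who like it.
users_by_app = Hash.new { |h, k| h[k] = Set.new }
likes.each { |user, app| users_by_app[app] << user }

# "People who like this app also like..." = the apps that share
# the most fans with it.
def also_liked(app, users_by_app, top_n = 5)
  fans = users_by_app[app]
  users_by_app
    .reject  { |other, _| other == app }
    .map     { |other, other_fans| [other, (fans & other_fans).size] }
    .reject  { |_, overlap| overlap.zero? }
    .sort_by { |_, overlap| -overlap }
    .first(top_n)
end

p also_liked("appA", users_by_app)
# e.g. [["appB", 2], ["appC", 2]] -- both share 2 fans with appA
```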

So I had a motive, a few rusty tools (an AMD box built in 2001 with 1GB RAM, half of it free--it's an old fileserver; my work machines are overloaded enough without running 100% CPU background tasks) and not much time. I knew that Ruby was not the right language for this, but I wanted to give it a go. After all, what matters is how you solve the problem, not how fast you can process data structures. Right?

Yes, right. After spending a bit of time on the problem I realized a few things:

1) In order to get better scores in the contest, I'd need a faster language (C++ was my next choice if I had the time) and more free RAM;
2) I didn't think that the problem that inspired the contest would be solved, even if I achieved the maximum score (more on this later);
3) I still wanted to get a decent score.

And so I mostly abandoned my initial attempt (I think I found some time to make it somewhat "better", but after that my next idea used way too much RAM) and wondered how I could use the "open" nature of the contest to produce better results (with way less effort). I'm not a big fan of most free-form social classification techniques, because they're easily sabotaged and depend too much on the specific few who have a public or hidden interest in the matter at hand. But crowdsourcing is interesting to me because you get to pick the crowd, i.e. you retain "editorial control" over the results.

To a trained mathematician this is of course ludicrous, but you can't escape the common sense behind crowdsourcing: "all of us is better than any of us". There's got to be a way to take each person's best effort and make the whole better than its best part.

I tried creating a crowdsourced compilation of the top results in the contest. It was very quick-and-dirty: I only ran it 3 or 4 times, with similar data and little tweaking. It got me to 3rd place in the rankings but, IMHO, proved me wrong. The data was too dispersed to be properly weighted and reconstructed while still keeping its core value. In effect, the data was hard to blend, and my only tweaking option was to drift even closer to the top results, not to improve on them.
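Conceptually, a blend like that boils down to a weighted vote across the guess lists. Here's a toy Ruby sketch of the idea (not my actual script; the guess lists, weights and numbers are made up):

```ruby
# Toy blend: each entry maps user_id => ordered list of recommended repo ids;
# each weight says how much to trust that entry. All values are illustrative.
def blend(entries, weights, top_n = 10)
  blended = {}
  entries.flat_map(&:keys).uniq.each do |user|
    scores = Hash.new(0.0)
    entries.each_with_index do |guesses, i|
      (guesses[user] || []).each_with_index do |repo, rank|
        # Earlier positions in a guess list count for more.
        scores[repo] += weights[i] / (rank + 1)
      end
    end
    blended[user] = scores.sort_by { |_, s| -s }.first(top_n).map(&:first)
  end
  blended
end

entry_a = { 42 => [1001, 1002, 1003] }
entry_b = { 42 => [1002, 1004] }
p blend([entry_a, entry_b], [1.0, 0.8])
# => {42=>[1002, 1001, 1004, 1003]}
```

Which is exactly where it fell short for me: when the input lists are that similar, no weighting scheme pulls anything new out of them.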

The few other entries I looked at were pretty nice (code-wise) and used widely different approaches, which _should_ have resulted in a good crowdsourced blend. From a quick read of the related posts after the contest ended, this was not necessarily the case. It seems there were plenty of other people trying indirect techniques similar to mine, which didn't really add any "intelligence" to the mix. Maybe my attempt was based on too many others like it, and that's why the results didn't improve as expected.

My goal was of course not to win the contest; I just wanted to try something out, and having the code right there, available for other people to see and play with, was the main drive for actually doing it. It was fun to see other people talking about it and getting new ideas from it.

So let's get back to "solving the problem". AFAICT the contest was about a data set that was processed to remove a few semi-random parts, and the goal was to guess the missing parts. This is fine and may lead to interesting discussions, but in my mind it doesn't solve the problem. The problem should be "how can we improve our recommendation system", not "how can we anticipate what people will choose on their own".
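(In toy form, my reading of that setup looks something like this: hide a sample of the known user/repo pairs, then score a guess list by how many hidden pairs it recovers. The names, numbers and split fraction below are all made up.)

```ruby
# Toy version of "remove some parts, guess the missing parts".
def split_hidden(watch_pairs, fraction = 0.1, rng = Random.new(42))
  hidden, visible = watch_pairs.partition { rng.rand < fraction }
  # visible is what contestants would see; hidden is what they must guess.
  [visible, hidden]
end

# Fraction of hidden [user, repo] pairs that show up in the guesses.
def score(guesses, hidden_pairs)
  return 0.0 if hidden_pairs.empty?
  hits = hidden_pairs.count { |user, repo| (guesses[user] || []).include?(repo) }
  hits.to_f / hidden_pairs.size
end

watch_pairs = [[1, "repoA"], [1, "repoB"], [2, "repoA"], [2, "repoC"]]
visible, hidden = split_hidden(watch_pairs, 0.5)
guesses = { 1 => ["repoB"], 2 => ["repoC", "repoA"] }
puts score(guesses, hidden)  # fraction of the hidden pairs we recovered
```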

Because people are not good at choosing. The fact that they made a choice doesn't mean it is the right one, even for themselves. There are time constraints involved, false assumptions, preconceptions, etc. I'm acutely aware of the fact that I probably don't use all of the gems and plugins that I should, and I spend _a lot_ of time evaluating tools. If you have a list of the products I currently use, then you know which products I use--that's it. And I already know which products I use :P

My point is, there were a lot of great entries that didn't get good scores. They were based on good ideas that would actually improve overall productivity and maintainability for those who followed their advice. And quite frankly, I think that with that goal in mind, the crowdsourced model would have fared a lot better.

With that in mind, I understand that to come up with a contest you need rules and goals, and you need a scoring system and all of that. I'm not criticizing the contest--I think it was great, and lots of other people and I had a blast participating. I'm just saying that to actually come up with a better recommendation system there's a lot more to consider. And from my experience both trying to figure out solutions to these problems and trying to apply them to my daily tool/product/beer choosing needs, I still think that a self-refining-crowdsourced-over-time approach kicks the most ass!