Daily JD

Friday, June 18, 2010

Get your text back

Playing around with Nokogiri to process a bunch of HTML files, I ended up pushing the results to MongoDB. Remember, MongoDB _always_ stores UTF-8, which is a good thing. Unfortunately XML transforms like to escape anything they can, so I ended up with strings like:


ruby-1.9.1-p378 > a="Jos&#xE9; Saramago, 1922-2010"
=> "Jos&#xE9; Saramago, 1922-2010"

The old CGI module (RIP) could unescape these. Today REXML seems to be more adequate. Here's a tip I picked up:


ruby-1.9.1-p378 > require 'rexml/document'
 => true
ruby-1.9.1-p378 > REXML::Text.unnormalize(a)
 => "José Saramago, 1922-2010" 
ruby-1.9.1-p378 > REXML::Text.unnormalize(a).encoding
 => #<Encoding:UTF-8>
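
If the escaped strings are buried inside nested records (as they tend to be right before a MongoDB insert), the same call can be applied recursively. This is only a minimal sketch: the unescape_entities helper and the sample record are made up for illustration, and it assumes the rexml library is available to your Ruby.

```ruby
require 'rexml/document'

# Hypothetical helper: walk a nested Hash/Array structure and unescape
# XML entities in every String it contains. Other values pass through.
def unescape_entities(value)
  case value
  when String then REXML::Text.unnormalize(value)
  when Array  then value.map { |item| unescape_entities(item) }
  when Hash   then value.transform_values { |item| unescape_entities(item) }
  else value
  end
end

# Made-up record, shaped like something scraped from an HTML page.
record = {
  "author" => "Jos&#xE9; Saramago, 1922-2010",
  "tags"   => ["Nobel &amp; Cam&#245;es"],
  "year"   => 1998
}

clean = unescape_entities(record)
puts clean["author"]
```

The original hash is left untouched; clean is a new structure that can be handed to the database driver.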

Props to the original posters.

Wednesday, December 2, 2009

REE playing nice with rake gems:install

Following the advice Pratik Naik gave during his talk at Rails Summit Latin America, I decided to set up Passenger in development mode. While at it, why not set up Ruby Enterprise Edition (REE) as well? So I did that, but rake gems:install (or in my case /opt/ree/bin/rake gems:install) would not install the gems in the REE environment.

A bit of source spelunking revealed that Rails doesn't use Rubygems in a very elegant way to run the rake tasks, mostly forking a new gem process. Even worse, Rails::GemDependency (a subclass of Gem::Dependency) simply guesses which gem command to run based on the current platform. I monkey-patched my Rakefile to force it to use REE (it's a project-specific file anyway) and that worked for me. It's not elegant and not portable, but it's hard to make a case as to the "right" way of doing it, except for using the loaded Gem class (!!) correctly. If I ever get to doing that I'll submit a patch. For now, you can append this to your Rakefile:


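# Reopen the Rails class that shells out to the gem command
module Rails
  class GemDependency < Gem::Dependency
    private
    # Always use REE's gem binary instead of whatever "gem" is in the PATH
    def gem_command
      '/opt/ree/bin/gem'
      # or File.dirname(File.readlink("/proc/#{$$}/exe")) + '/gem'
    end
  end
end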

and simply running sudo /opt/ree/bin/rake gems:install will work as expected.
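
A slightly less hard-coded variation would be to ask the running interpreter where its binaries live, via RbConfig from the standard library. This is just a sketch: gem_command_for_current_ruby is a name I made up, I have not tried it inside the Rails patch above, and it assumes the gem executable sits in the interpreter's bin directory under the plain name "gem" (some packaged Rubies add a suffix).

```ruby
require 'rbconfig'

# Hypothetical helper: path of the gem executable that lives next to
# the interpreter currently running this code.
def gem_command_for_current_ruby
  File.join(RbConfig::CONFIG['bindir'], 'gem')
end

puts gem_command_for_current_ruby
```

When rake is started as /opt/ree/bin/rake, this should point at REE's bin directory, and unlike the /proc trick it does not depend on Linux.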

Thursday, September 3, 2009

How to win the Github contest

That was the question.

I only found out about the contest about 3 days before it ended. My Bloglines queue was showing 7000+ posts unread (as usual) and during a quick cleanup run--where I try to at least skim over the headers in my favorite folders--I found out about the contest.

The Netflix mention was a little funny, because I remembered downloading a huge file (1TB?) for _their_ contest and never even wrote a single line of code to process it. I still have to try to find it (now that their contest seems to be over) to free up some drive space :P.

This sums up my lack of free time and lack of resources rant. I still wanted to participate in the Github contest because it reminded me of the AppWatch days. We had a "people who like this app also like these apps" thing going on (inspired by the early Amazon features I think--this was early 1999) and I thought it was really cool and useful. That one ran as a perl script in the middle of the night, basically doing a recursive SQL query and fetching the results. It took about 5 min to run and not that much RAM (we "only" tracked about 2,000 apps and had a few thousand registered users).
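
For the curious, the "also like" idea boils down to counting co-occurrences. The original was Perl plus SQL and is long gone, so this is only a rough Ruby sketch of the concept with made-up users and apps, not the AppWatch code.

```ruby
# Hypothetical data: user id => apps that user likes.
LIKES = {
  'ana'   => %w[vim mutt screen],
  'bruno' => %w[vim screen],
  'carla' => %w[vim emacs],
  'davi'  => %w[mutt screen]
}

# Rank the other apps by how many fans of `app` also like them.
def also_liked(likes, app, limit = 3)
  counts = Hash.new(0)
  likes.each_value do |apps|
    next unless apps.include?(app)
    (apps - [app]).each { |other| counts[other] += 1 }
  end
  # Most shared fans first; ties broken alphabetically for stable output.
  counts.sort_by { |other, n| [-n, other] }.first(limit).map(&:first)
end

p also_liked(LIKES, 'vim')
```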

So I had a motive, a few rusty tools (an AMD box built in 2001, 1GB RAM, half of that free--it's an old fileserver, my work machines are overloaded enough without running 100% CPU background tasks) and not much time. I knew that Ruby was not the right language for this but I wanted to give it a go. After all, what matters is how you solve the problem, not how fast you can process data structures. Right?

Yes, right. After spending a bit of time on the problem I realized a few things:

1) In order to get better scores in the contest, I'd need a faster language (C++ was my next choice if I had the time) and more free RAM;
2) I didn't think that the problem that inspired the contest would be solved, even if I achieved the maximum score (more on this later);
3) I still wanted to get a decent score.

And so I mostly abandoned my initial attempt (I think I found some time to make it somewhat "better", but after that my next idea used way too much RAM) and wondered how I could use the "open" nature of the contest to produce better results (with way less effort). I'm not a big fan of most free-form social classification techniques, because they're easily sabotaged and depend too much on the specific few that have a public or hidden interest in the matter at hand. But crowdsourcing is interesting to me because one gets to pick the crowd, i.e. you retain "editorial control" over the results.

To a trained mathematician this is of course ludicrous, but you can't escape the common sense behind crowdsourcing: "all of us is better than any of us". There's got to be a way to take each person's best effort and make the whole better than its best part.

I tried creating a crowdsourced compilation of the top results in the contest. This was very quick-and-dirty, I only ran it 3 or 4 times with similar data and little tweaking. It got me a 3rd spot in the rankings but IMHO proved me wrong. The data was too dispersed to be properly weighted and reconstructed, and still keep its core value. In effect, the data was hard to blend, and my only tweaking option was to get even more similar to the top results, but not improve on them.
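
To give an idea of what blending means here, this is a minimal sketch of a weighted vote over several entries. It is not the script I actually ran: the blend method, the rank-based scoring and the sample entries are all assumptions made for illustration.

```ruby
# Hypothetical blend: each entry maps a user id to a ranked list of repo ids.
# Every entry votes for its picks; earlier positions and higher-weighted
# entries count more. The top `limit` repos per user are kept.
def blend(entries, weights, limit = 10)
  scores = Hash.new { |hash, user| hash[user] = Hash.new(0.0) }
  entries.each_with_index do |entry, i|
    entry.each do |user, repos|
      repos.each_with_index do |repo, rank|
        scores[user][repo] += weights[i] / (rank + 1.0)
      end
    end
  end
  scores.each_with_object({}) do |(user, repo_scores), result|
    # Highest score first; ties broken by repo id for deterministic output.
    ranked = repo_scores.sort_by { |repo, score| [-score, repo] }
    result[user] = ranked.first(limit).map(&:first)
  end
end

# Three made-up entries guessing repos for user 1.
entry_a = { 1 => [10, 20, 30] }
entry_b = { 1 => [20, 40] }
entry_c = { 1 => [20, 10] }

p blend([entry_a, entry_b, entry_c], [1.0, 1.0, 1.0], 3)
```

The weights are the only knob, which is exactly the limitation described above: turning them up for the best entries just makes the blend look more like those entries.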

The few other entries I looked at looked pretty nice (code-wise) and used widely different approaches, which _should_ have resulted in a good crowdsourced blend. From a quick read of the related posts after the contest ended, this was not necessarily the case. It seems that there were plenty of other people trying indirect techniques similar to mine, which didn't really add any "intelligence" to the mix. Maybe my attempt was based on too many others like it, and that's why the results didn't improve as expected.

My goal was of course not to win the contest, I just wanted to try something out, and having the code right there available to other people to see and play with was the main drive for actually doing it. It was fun to see other people talking about it and getting new ideas from it.

So let's get back to "solving the problem". AFAICT the contest was about a data set that was processed to remove a few semi-random parts, and the goal was to guess the missing parts. This is fine and may lead to interesting discussions, but in my mind doesn't solve the problem. The problem should be: "how can we improve our recommendation system" and not "how can we anticipate what people will choose on their own".

Because people are not good at choosing. The fact that they made a choice doesn't mean it is the right one, even for themselves. There are time constraints involved, false assumptions, preconceptions, etc. I'm acutely aware of the fact that I probably don't use all of the gems and plugins that I should, and I spend _a lot_ of time evaluating tools. If you have a list of the products I currently use, then you know which products I use, that's it. And I already know which products I use :P

My point is, there were a lot of great entries that didn't get good scores. They were based on good ideas that would actually improve the overall productivity and maintainability for those who followed their advice. And quite frankly, I think that with that goal in mind, the crowdsource model would have fared a lot better.

With that in mind, I understand that to come up with a contest you need rules and goals, and you need a scoring system and all of that. I'm not criticizing the contest--I think it was great and I and lots of other people had a blast participating. I'm just saying that to actually come up with a better recommendation system there's a lot more that has to be considered. And from my experience both trying to figure out the solution to these problems, and trying to apply them to my daily tool/product/beer choosing needs, I still think that a self-refining-crowdsourced-over-time approach kicks the most ass!

Monday, August 10, 2009

Addiction

Probably the best piece of hardware that I got my hands on recently was the Nokia E71. With it came the knowledge of the Symbian OS (S60), and the revelation that behind its quite non-sexy UI there's a Unix-ish openness to it. Then came the downfall: I installed FreeCell on the damn thing :P.

Not wanting to ruin the keyboard and joystick with action gaming (I need _some_ type of game to play while waiting in line for stuff) like I did with my last phone, I opted for a strategy game. Which became a minor (uh?) addiction. So minor that today I was only mildly inclined to search the interwebs for a freakin' FreeCell solver to convince myself that my current game was unsolvable. Aha! It wasn't :( Anyway at least that humbled me a bit, allowed me to cheat (to save time!) and presented me with some new build environments (ccmake anyone?). The project I'm talking about is Freecell Solver, and here's the game I needed help with (just to save time!):


QD  7H  QH  3C KH 2C 8S
8C  5H  AH  5D AD JH KC
5C  5S  9C  4D QC 8H 6D
AC  3D  7D  KS JS 7S JC
10D 10H 10S 3S KD 9H
4S  9S  4H  8D 6H 2H
10C QS  6S  6C 4C JD
3H  2D  2S  7C 9D AS

Try something like fc-solve -m -snx < <filename> to solve it. If you need to save even more time than I did ;) here's the solution:


8h 3a 4b 43 a3 4c 4d 41 4a 4h
74 a7 18 1h 1a 1h b1 12 1b c1
85 4c b4 58 4b d4 14 7d 7h 71
78 c7 1c b1 c1 5b 75 b7 75 27v3
2c 2h 2b 2h 6h b8 2b 25 b1 15
3b 32 31 c3 b2 6c 26 32v2 3b 13
d1 63 1d a1 21 d2 36 3a 3d b3
36 3h c3 63 2b a2 62 6c 6a c6
32 3c d3 63 6d b6 68 c6 3b 3c
b3 6b c6 3c a3 85 3a b3 23 6b
d6 c6 58 26 36 3d a3 85 3c b3
23 c2 58 2a d2 63 2b 51 5c 5d
52 53 d2 c2 1d 1h 5c 57 c2 65
54 82 8h 8h 8h 8h 3h 2h 3h 2h
3h 2h 3h 2h 3h 2h 2h 4h 2h 4h
2h 4h 2h 4h 2h 4h 2h 4h ah dh
bh 1h 3h 1h 3h 1h 1h 7h 1h 7h
1h 7h 1h 7h

This game is solveable.
Total number of states checked is 9510.
This scan generated 9606 states.

BTW is it just me or is the board layout for fc-solve in CxL notation instead of LxC notation? I keep having to turn my head 90 degrees to make sense of it. And then I can't read the codes :P.
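
If your neck complains too, a few lines of Ruby can flip the deal so that each stack reads top-down as a column. This is a sketch that assumes what I described above, i.e. that each input line is one stack; transpose_deal is my own throwaway helper, not part of Freecell Solver.

```ruby
# The deal from this post: one line per stack, as fed to fc-solve.
DEAL = <<~BOARD
  QD  7H  QH  3C KH 2C 8S
  8C  5H  AH  5D AD JH KC
  5C  5S  9C  4D QC 8H 6D
  AC  3D  7D  KS JS 7S JC
  10D 10H 10S 3S KD 9H
  4S  9S  4H  8D 6H 2H
  10C QS  6S  6C 4C JD
  3H  2D  2S  7C 9D AS
BOARD

# Turn stacks-as-lines into stacks-as-columns, padding the short stacks.
def transpose_deal(text)
  stacks = text.lines.map(&:split)
  height = stacks.map(&:size).max
  (0...height).map do |row|
    stacks.map { |stack| (stack[row] || '').ljust(3) }.join(' ').rstrip
  end
end

puts transpose_deal(DEAL)
```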

Wednesday, July 15, 2009

Fans drive me crazy

No, not that type of fan. The Canonical folks updated the Ubuntu 9.04 kernels and now my Aspire One started blowing its fan all the time again. After a bit of tinkering in ~/tmp and /etc/modules I remembered that the acerhdf module is _not_ included, so it has to be reinstalled every time a new kernel shows up. So grab that and sudo make install.

xrandr to the rescue

Just got a new Samsung 22" 1080p monitor and plugged it into an Acer Aspire One. I have 3 OSes on this machine, so I decided to configure all 3 for the new monitor. Windows XP did a decent job at setting it up (there's an "Advanced" tab in the video preferences that has lots of cool options -- video card specific). MacOS X 10.5.5 was even better: it came up with a bogus config, but as soon as I pulled the new monitor's plug, it reverted to the main netbook screen and recognized the new monitor after plugging it back in. All very easy.

Unfortunately Ubuntu 9.04 was not as friendly. Although Xorg did recognize all valid modelines at boot, for whatever reason the intel driver refused to show any of the higher ones. The same thing happened when I configured other monitors in the past, but I forgot what I had done to fix it, and didn't want to hard code the xorg.conf file because I will be using different external monitors on a regular basis.

Anyway, the solution is quite simple. Just force xrandr to recognize the modeline you want (xrandr --newmode with the modeline copied from Xorg.0.log, then xrandr --addmode). Now go to the display preferences and your shiny new display mode will be available. Log out and log in and all is done. Yay!
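
Since I will be doing this for several monitors, here is a small sketch that turns a logged modeline into the two xrandr commands. Treat it as an illustration only: the sample log line is my assumption of what the intel driver prints (check your own Xorg.0.log), VGA1 is a guess at the output name (plain xrandr lists the real ones), and xrandr_commands is a made-up helper that only builds the command strings without running them.

```ruby
# Sample line in the shape Xorg tends to log modelines (assumed format).
LOG_LINE = '(II) intel(0): Modeline "1920x1080"x60.0  148.50  1920 2008 2052 2200  1080 1084 1089 1125 +hsync +vsync (67.5 kHz)'

# Build the xrandr --newmode / --addmode command lines for a logged modeline.
# Returns nil when the line does not look like a modeline.
def xrandr_commands(log_line, output)
  match = log_line.match(/Modeline\s+"([^"]+)"\S*\s+(.*?)\s*(?:\(.*\))?\s*$/)
  return nil unless match

  name = match[1]
  timings = match[2].split.join(' ') # collapse the log's double spaces
  [
    %(xrandr --newmode "#{name}" #{timings}),
    %(xrandr --addmode #{output} "#{name}")
  ]
end

puts xrandr_commands(LOG_LINE, 'VGA1')
```

Paste the two printed lines into a terminal, or into a small script per monitor, and xorg.conf stays untouched.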

Thursday, May 7, 2009

Swim on boot

I've got my Nokia E71 syncing with Google (contacts, calendar) and Ovi (those and more). As you probably know, the S60 platform doesn't sync by itself on a regular basis, so most people use Swim for that task. According to the Swim page it should run at boot time, but on my E71 it doesn't seem to. I use Powerboot to run a few apps at boot time, but couldn't find the Swim process by name to add it. Turns out that ("use the source, Luke!") it's called SyncServer.