Daily JD

Get your text back

2010-06-18T19:34:00.000-07:00

Playing around with Nokogiri to process a bunch of HTML files, I ended up pushing the results to MongoDB. Remember, MongoDB _always_ stores UTF-8, which is a good thing. Unfortunately XML transforms like to escape anything they can, so I ended up with strings like:


ruby-1.9.1-p378 > a="Jos&#xE9; Saramago, 1922-2010"
=> "Jos&#xE9; Saramago, 1922-2010"

The old CGI module (RIP) could unescape these. Today REXML seems to be mode adequate. Here's a tip I picked up:


ruby-1.9.1-p378 > require 'rexml/document'
 => true
ruby-1.9.1-p378 > REXML::Text.unnormalize(a)
 => "José Saramago, 1922-2010" 
ruby-1.9.1-p378 > REXML::Text.unnormalize(a).encoding
 => #<Encoding:UTF-8>

Props to the original posters.

REE playing nice with rake gems:install

2009-12-02T19:52:00.000-08:00

Following the advice by Pratik Naik given during his talk at Rails Summit Latin America, I decided to setup Passenger in development mode. While at it why not setup Ruby Enterprise Edition (REE) also? So I did that but rake gems:install (or in my case /opt/ree/bin/rake gems:install would not install the gems in the REE environment.

A bit of source spelunking turned out that Rails doesn't use Rubygems in a very elegant wait to run the rake tasks, mostly forking a new gem process. Even worse, Gem::Dependency simply guesses which gem command to run based on the current platform. I monkey-patched my Rakefile to force it to use REE (it's a project-specific file anyway) and that worked for me. It's not elegant and not portable but it's hard to make a case as to the "right" way of doing it, except for using the loaded Gem class (!!) correctly. If I ever get to doing that I'll submit a patch. For now, you can append this to your Rakefile:


module Rails
  class GemDependency < Gem::Dependency
    private
    def gem_command
      '/opt/ree/bin/gem'
      # or File.dirname(File.readlink("/proc/#{$$}/exe")) + '/gem'
    end
  end
end

and simply running sudo /opt/ree/bin/rake gems:install will work as expected.

How to win the Github contest

2009-09-03T16:32:00.000-07:00

That was the question.

I only found out about the contest about 3 days before it ended. My Bloglines queue was showing 7000+ posts unread (as usual) and during a quick cleanup run--where I try to at least skim over the headers in my favorite folders--I found out about the contest.

The Netflix mention was a little funny, because I remembered downloading a huge file (1TB?) for _their_ contest and never even wrote a single line of code to process it. I still have to try to find it (now that their contest seems to be over) to free up some drive space :P.

This sums up my lack of free time and lack of resources rant. I still wanted to participate in the Github contest because it reminded me of the AppWatch days. We had a "people who like this app also like these apps" thing going on (inspired by the early Amazon features I think--this was early 1999) and I thought it was really cool and useful. That one ran as a perl script in the middle of the night, basically doing a recursive SQL query and fetching the results. It took about 5 min to run and not that much RAM (we "only" tracked about 2,000 apps and had a few thousand registered users).

So I had a motive, a few rusty tools (an AMD box built in 2001, 1GB RAM, half of that free--it's an old fileserver, my work machines are overloaded enough without running 100% CPU background tasks) and not much time. I knew that Ruby was not the right language for this but I wanted to give it a go. After all, what matters is how you solve the problem, not how fast you can process data structures. Right?

Yes, right. After spending a bit of time on the problem I realized a few things:

1) In order to get better scores in the contest, I'd need a faster language (C++ was my next choice if I had the time) and more free RAM;
2) I didn't think that the problem that inspired the contest would be solved, even if I achieved the maximum score (more on this later);
3) I still wanted to get a decent score.

And so I mostly abandoned my initial attempt (I think I found some time to make it somewhat "better", but after that my next idea used way too much RAM) and wondered how I could use the "open" nature of the contest to produce better results (with way less effort). I'm not a big fan of most free-form social classification techniques, because they're easily sabotaged and depend too much on the specific few that have a public or hidden interest in the matter at hand. But crowdsourcing is interesting to me because one gets to pick the crowd, i.e. you retain "editorial control" over the results.

To a trained mathematician this is of course ludicrous, but you can't escape the common sense behind crowdsourcing: "all of us is better than any of us". There's got to be a way to take each person's best effort and make the whole better than its best part.

I tried creating a crowdsourced compilation of the top results in the contest. This was very quick-and-dirty, I only ran it 3 or 4 times with similar data and little tweaking. It got me a 3rd spot in the rankings but IMHO proved me wrong. The data was too disperse to be properly weighted and reconstructed, and still keep its core value. In effect, the data was hard to blend, and my only tweaking option was to get even more similar to the top results, but not improve on them.

The few other entries I looked at looked pretty nice (code-wise) and used widely different approaches, which _should_ have resulted in a good crowdsourced blend. From a quick read of the related posts after the constest ended, this was not necessarily the case. It seems that there were plenty of other people trying indirect techniques similar to mine, which didn't really add any "intelligence" to the mix. Maybe my attempt was based on too many others like it, and that's why the results didn't improve as expected.

My goal was of course not to win the contest, I just wanted to try something out, and having the code right there available to other people to see and play with was the main drive for actually doing it. It was fun to see other people talking about it and getting new ideas from it.

So lets get back to "solving the problem". AFAICT the contest was about a data set that was processed to remove a few semi-random parts, and the goal was to guess the missing parts. This is fine and may lead to interesting discussions, but in my mind doesn't solve the problem. The problem should be: "how can we improve our recommendation system" and not "how can we anticipate what people will choose on their own".

Because people are not good at choosing. The fact that they made a choice doesn't mean it is the right one, even for themselves. There's time constraints involved, false assumptions, preconceptions, etc. I'm acutely aware of the fact that I probably don't use all of the gems and plugins that I should, and I spend _a lot_ of time evaluating tools. If you have a list of the products I currently use, then you know which products I use, that's it. And I already know which products I use :P

My point is, there were a lot of great entries that didn't get good scores. They were based on good ideas that would actually improve the overall productivity and maintainability for those who followed their advice. And quite frankly, I think that with that goal in mind, the crowdsource model would have fared a lot better.

With that in mind, I understand that to come up with a contest you need rules and goals, and you need a scoring system and all of that. I'm not criticizing the contest--I think it was great and I and lots of other people had a blast participating. I'm just saying that to actually come up with a better recommendation system there's a lot more that has to be considered. And from my experience both trying to figure out the solution to these problems, and trying to apply them to my daily tool/product/beer choosing needs, I still think that a self-refining-crowdsourced-over-time approach kicks the most ass!

Addiction

2009-08-10T20:41:00.001-07:00

Probably the best piece of hardware that I got my hands on recently was the Nokia E71. With it came the knowledge of the Symbian OS (S60), and the revelation that behind its quite non-sexy UI there's a Unix-ish openness to it. Then came the downfall: I installed FreeCell on the damn thing :P.

Not wanting to ruin the keyboard and joystick with action gaming (I need _some_ type of game to play while waiting in line for stuff) like I did with my last phone, I opted for a strategy game. Which became a minor (uh?) addiction. So minor that today I was only mildly inclined to search the interwebs for a freakin' FreeCell solver to convince myself that my current game was unsolvable. Aha! It wasn't :( Anyway at least that humbled me a bit, allowed me to cheat (to save time!) and presented me with some new build environments (ccmake anyone?). The project I'm talking about is Freecell Solver, and here's the game I needed help with (just to save time!):


QD  7H  QH  3C KH 2C 8S
8C  5H  AH  5D AD JH KC
5C  5S  9C  4D QC 8H 6D
AC  3D  7D  KS JS 7S JC
10D 10H 10S 3S KD 9H
4S  9S  4H  8D 6H 2H
10C QS  6S  6C 4C JD
3H  2D  2S  7C 9D AS

Try something like fc-solve -m -snx < <filename> to solve it. If you need to save even more time than I did ;) here's the solution:


8h 3a 4b 43 a3 4c 4d 41 4a 4h
74 a7 18 1h 1a 1h b1 12 1b c1
85 4c b4 58 4b d4 14 7d 7h 71
78 c7 1c b1 c1 5b 75 b7 75 27v3
2c 2h 2b 2h 6h b8 2b 25 b1 15
3b 32 31 c3 b2 6c 26 32v2 3b 13
d1 63 1d a1 21 d2 36 3a 3d b3
36 3h c3 63 2b a2 62 6c 6a c6
32 3c d3 63 6d b6 68 c6 3b 3c
b3 6b c6 3c a3 85 3a b3 23 6b
d6 c6 58 26 36 3d a3 85 3c b3
23 c2 58 2a d2 63 2b 51 5c 5d
52 53 d2 c2 1d 1h 5c 57 c2 65
54 82 8h 8h 8h 8h 3h 2h 3h 2h
3h 2h 3h 2h 3h 2h 2h 4h 2h 4h
2h 4h 2h 4h 2h 4h 2h 4h ah dh
bh 1h 3h 1h 3h 1h 1h 7h 1h 7h
1h 7h 1h 7h

This game is solveable.
Total number of states checked is 9510.
This scan generated 9606 states.

BTW is it just me or is the board layout for fc-solve in CxL notation instead of LxC notation? I keep having to turn my head 90 degrees to make sense of it. And then I can't read the codes :P.

Fans drive me crazy

2009-07-15T21:55:00.000-07:00

No, not that type of fan. The Canonical folks updated the Ubuntu 9.04 kernels and now my Aspire One started blowing its fan all the time again. After a bit of tinketing in ~/tmp and /etc/modules I remembered that the acerhdf module is _not_ included, so has to be reinstalled every time a new kernel shows up. So grab that and sudo make install.

xrandr to the rescue

2009-07-15T21:48:00.001-07:00

Just got a new Samsung 22" 1080p monitor and plugged it into an Acer Aspire One. I have 3 OSes in this machine, so decided to configure all 3 for the new monitor. Windows XP did a decent job at setting it up (there's an "Advanced" tab in the video preferences that has lots of cool options -- video card specific). MacOS X 10.5.5 was even better, it came up with a bogus config but as soon as I pulled the new monitors plug, it reverted to the main netbook screen and recognized the new monitor after plugging it back in. All very easy.

Unfortunately Ubuntu 9.04 was not as friendly. Although Xorg did recognize all valid modelines at boot, for whatever reason the intel driver refused to show any of the higher ones. The same thing happened when I configured other monitors in the past, but I forgot what I had done to fix it, and didn't want to hard code the xorg.conf file because I will be using different external monitors on a regular basis.

Anyway the solution is quite simple. Just force xrandr to recognize the modeline you want (xrandr --newmode and copy/paste the modeline from Xorg.0.log, then xrandr --addmode). Now go to the display preferences and your shinny new display mode will be available. Logout and login and all is done. Yay!

Swim on boot

2009-05-07T23:04:00.000-07:00

I've got my Nokia E71 syncing with Google (contacts, calendar) and Ovi (those and more). As you probably know the S60 platform doesn't sync by itself on a regular basis so most people use Swim for that task. According to the Swim page it should run at boot time but on my E71 it doesn't seem to. I use Powerboot to run a few apps at boot time, but couldn't find the Swim process by name to add it. Turns out that ("use the source, Luke!") it's called SyncServer.

Alexa cleaning up interfaces

2009-04-03T20:46:00.000-07:00

Now that Alexa's thumbnails are no longer supported via AWS (breaking their own site for a while :P) it seems that they decided to also revamp their remaining services and do so cleaning up. This broke a script of mine used to fetch per-country stats. It's a simple change, the UrlInfo call with RankByCountry now returns a Code property in the XML instead of the old code property (note the case change). So if you're using HPricot (like I am), you'll use a line like:

(h/"aws:country[@Code='#{countryCode}']"/"aws:rank").innerHTML.sub(/\s*$/, '')

to filter out the resulting rank.

jdresolve on GitHub

2008-08-11T01:24:00.000-07:00

I'm trying to force myself to switch to Git from Subversion for my current projects. The safest way to do it seems to be using an older project as a guinea pig :) So I just pushed jdresolve over to GitHub. I'll scavenge for the uncommitted patches and the unanswered emails regarding jdresolve and will commit to the Git repository soon (enough).

Google geocoding charset encoding currently broken

2008-07-27T15:22:00.001-07:00

Update (2008-09-09): Google seems to have finally fixed this.

About 2 weeks ago I started seeing weird characters when geocoding addresses via Google using the YM4R gem. The addresses are outside the US and so contain plenty of accented characters that used to be properly encoded in UTF8. Although Google's XML claims to return UTF8, it currently doesn't, sending what looks like ISO-8859-1 encoded characters in some fields instead. This is more than likely a problem with their outsourced partners not having properly setup UTF8 environments and updating GIS information using local encodings.

I search for the issue and found if mentioned in ep's blog. He ended up seeing the same error message:

#<REXML::ParseException: Missing end tag for 'DependentLocalityName' (got "DependentLocality")

That'll usually result in a nice 500 error in your Rails app (if that's what you use) because it raises an exception.

His solution (forcing further charset translations to occur) worked quite well, except that I had to mention the origin charset in my case. So instead of calling to_utf8 plainly, I pass it the charset with to_utf8('iso-8859-1'). This is a very ugly hack in all so I hope that Google fixes the issue soon. I personally didn't report the bug 'cause I never got any feedback from any information I ever sent their way or any requests that I've made in the past.

Fixing mint tables after a 32bit to 64bit migration

2008-05-21T09:34:00.000-07:00

I just migrated a web site from a regular 32bit processor (Intel(R) Xeon(TM) CPU 2.40GHz) to a 64bit slice (Dual Core AMD Opteron(tm) Processor 265). This made mint v2.14 (or actually the underlying MySQL database) start showing the visitor IPs as mostly 127.255.255.255. A little searching got me to here and here and here. I don't run PhpMyAdmin so here are the raw SQL command for you to run on your console:

alter table mint_visit modify ip_long int(10) unsigned;
alter table mint_visit modify referer_checksum int(10) unsigned;
alter table mint_visit modify domain_checksum int(10) unsigned;
alter table mint_visit modify resource_checksum int(10) unsigned;
alter table mint_visit modify session_checksum int(10) unsigned;

After running these a show columns from mint_visit should return the following:

+--------------------+---------------------+------+-----+---------+----------------+
| Field              | Type                | Null | Key | Default | Extra          |
+--------------------+---------------------+------+-----+---------+----------------+
| id                 | int(11) unsigned    | NO   | PRI | NULL    | auto_increment |
| dt                 | int(10) unsigned    | NO   | MUL | 0       |                |
| referer            | varchar(255)        | NO   |     | NULL    |                |
| referer_checksum   | int(10) unsigned    | YES  | MUL | NULL    |                |
| domain_checksum    | int(10) unsigned    | YES  | MUL | NULL    |                |
| referer_is_local   | tinyint(1)          | NO   | MUL | -1      |                |
| resource           | varchar(255)        | NO   |     | NULL    |                |
| resource_checksum  | int(10) unsigned    | YES  | MUL | NULL    |                |
| resource_title     | varchar(255)        | NO   |     | NULL    |                |
| search_terms       | varchar(255)        | NO   |     | NULL    |                |
| img_search_found   | tinyint(1) unsigned | NO   | MUL | 0       |                |
| browser_family     | varchar(255)        | NO   |     | NULL    |                |
| browser_version    | varchar(15)         | NO   |     | NULL    |                |
| platform           | varchar(255)        | NO   |     | NULL    |                |
| resolution         | varchar(13)         | NO   |     | NULL    |                |
| flash_version      | tinyint(2)          | NO   | MUL | NULL    |                |
| local_search_terms | varchar(255)        | NO   |     | NULL    |                |
| local_search_found | tinyint(1) unsigned | NO   | MUL | 0       |                |
| window_width       | smallint(5)         | NO   | MUL | -1      |                |
| window_height      | smallint(5)         | NO   | MUL | -1      |                |
| ip_long            | int(10) unsigned    | YES  | MUL | NULL    |                |
| session_checksum   | int(10) unsigned    | YES  | MUL | NULL    |                |
| visitor_name       | varchar(255)        | NO   |     | NULL    |                |
+--------------------+---------------------+------+-----+---------+----------------+
23 rows in set (0.00 sec)

The sooner you fix your database the less info you'll lose. Took me a couple of hours to realize this was happening...

Keeping VNC clipboards in sync

2008-03-05T19:05:00.000-08:00

My typical "desktop" consists of two environments: the local machine (usually my notebook) and a VNC session to some fancier hardware. The logic behind this is that I can play around with my local apps and even reboot if I need to without losing any real work, which is usually done on the remote machine.

This works out quite well except for sharing information between environments. Quite often I'm browsing a web site inside the VNC session and want to open it up in my local environment, to have access to multimedia features or simply not to bloat my pristine remote session. Trouble is that X11 has a concept called the "cutbuffer" and another for "selections", which depend on which apps you're running. And, you've guessed it, VNC syncronizes the wrong one (namely the cutbuffer). So to simply copy a URL between sessions I had to select the URL using the mouse, paste it (middle-click) into an xterm window, select it again (!), switch to the other environment, paste it (middle-click again) into another xterm, then re-select it, and finally paste it into my other browser. This took care of syncing all of the related buffers manually.

Today I finally got fed up with it after doing this 3 times within 10 minutes. I found a project called autocutsel that does this automatically for me. Now all I have to do is keep autocutsel running in the background, and every time I copy something inside the local browser (using Ctrl-C) it is instantly available for use inside the VNC session (and vice-versa if I run autocutsel both in my local and my remote VNC environments).

Setting WEP keys in Kamikaze

2008-02-23T11:20:00.000-08:00

I finally had a pressing need to OpenWRT-ize a Linksys WRT54GS v4 I bought over a year ago. The idea is to use the unit for field testing, so I want to be able to constantly change settings without rebooting or running complex scripts. Basically I want to SSH into the WRT and change settings at will as easily as possible.

The version of OpenWRT that I flashed was Kamikaze 7.09 and it has been working beautifully. Very fast boot times, organized filesystem and configuration structure, and plenty of RAM and Flash to spare. The radio in this box is a Broadcom, which I'm not very used to. I have a BCM4318 in my budget Acer Aspire 3002 laptop, and it works quite well, but I basically use it for connecting to a single access point using WPA encryption.

On the WRT box (which I fondly hostnamed 'wart') I wanted to try out WEP (yes, the old, insecure, useless WEP), mostly because a lot of access points in my area use it. Typically one could just use something like iwconfig wl0 enc <WEP key> to accomplish that. Sadly this doesn't work, although no error message is returned. Simply nothing happens, and the status of the interface ends up like:

wl0       IEEE 802.11-DS  ESSID:"MUSIK"
          Mode:Managed  Frequency:2.462 GHz  Access Point: 00:02:2D:0D:1B:39   
          Tx-Power:19 dBm   
          RTS thr:2347 B   Fragment thr:2346 B   
          Encryption key:<too big>
          Link Signal level:-42 dBm  Noise level:-96 dBm
          Rx invalid nwid:0  Rx invalid crypt:0  Rx invalid frag:0
          Tx excessive retries:516  Invalid misc:0   Missed beacon:0

Notice the Encryption key:<too big> line. It seems that the wireless tools don't play too well with the Broadcom's "advanced" settings. Even after I got this to work I still got the same output from iwconfig.

After some searching I found out that the utility that controls most of the internals of the Broadcom hardware is called wlc. Sure enough it was a "wepkey" option that should do the trick. The syntax seemed to be wlc wepkey <WEP key>, only that always returned:

Command 'set wepkey' failed: -1

I couldn't find a docs for wlc by searching online, so I fetched the full Kamikaze source tree to check its source code (hail to Open Source!). It turns out that the syntax expects you to inform the key slot to store to (1 through 4), and to use "=" to assign the "PRIMARY KEY", which is the slot that will be actually used. Thus, the command that worked for me was wlc wepkey =1,<WEP key>. Notice that I'm using an hex WEP key. If you use an ASCII WEP key, prefix it with "s:".

And that did it for me. Now with a simple call to wlc I can change WEP keys on the fly and get instant results.

Tinkering with Amazon's S3 and EC2

2007-07-02T22:28:00.000-07:00

I was really excited about Amazon's EC2 (Elastic Compute Cloud) when it first became available sometime last year. Unfortunately it was (and still is) a "limited beta" service (yeah I know) so all I could do was sign up for it and hope that my day would come.

It did :) (big surprise!).

Anyway I've been playing around with my SliceHost VPS (which they of course call a "slice") for a week and was hoping to have access to a similar service at low cost to compare, so the timing was just about perfect. Note that I also had to wait for my SliceHost slice, but it was a lot quicker (about 8 weeks). I have to get a hold of the specifics for both services to compare price and performance, but my main concerns right now are ease of deployment and flexibility, so that'll have to wait.

My initial guess is that both services will behave pretty much the same after they're setup. You start with a basic OS to play with anyway. I probably won't trust anyone else's image and just roll out my own customizations.

Back to the Amazon services, you can't just use EC2 out of the box. First you need to sign up for S3 (Simple Storage Service), which is pretty awesome on itself but I hadn't used yet (mostly for lack of real need). So I did that (no wait for that one) and decided to explore it a bit before moving on with my EC2 experiments.

At first a nice command-line client to S3 would be really nice, to get a general feel for the service without having to dive into the API. I did a quick search and started using s3cmd. It's written in Python, which I like a lot, and has a very clean feel to it. The only problem was that it didn't use my outgoing HTTP proxy (Squid) like other Python apps usually do.

If you have a http_proxy environment variable, the Python libs will generally detect that and pass your requests onto the proxy. After inspecting s3cmd's code, I realized that due to a multi-step process in generating the webservice requests, that functionality was being sidestepped. This required some minor tweaking to add proxy support to s3cmd's config file and then use that to properly craft the requests for proxy and non-proxy situations. I created a patch and sent it to the maintainer. Of course you can just download the patch and apply it yourself--it's against version 0.9.3 of s3cmd.

Other interesting projects related to S3 are s3sync.rb (a Ruby backup script similar to rsync) and Backup Manager (a general backup tool that I've been told supports S3 now), but I haven't tested them yet.

Having played with S3 I decided to complete my first EC2 setup. After jumping through all the hoops in the Getting Started guide, I chose Paul Dowman's image for my first boot. It worked like a charm, and in less than 1 minute (!!) I was able to login as root to my newly created instance, complete with Rails, MySQL and an up-to-date Ubuntu Feisty install.

BTW, the EC2 tools (in Java) also don't honor the http_proxy environment variable (sigh). Here's the environment variable you have to set to make it all work, assuming your proxy is not password protected:

export EC2_JVM_ARGS="-DproxySet=true -DproxyHost=<host> 
  -DproxyPort=<port> -Dhttps.proxySet=true 
  -Dhttps.proxyHost=<host> -Dhttps.proxyPort=<port>"

Having access to EC2 opens up a lot of possibilities in deployment and scaling. From what I've been reading, the main complaint about EC2 is the lack for persistent storage. I don't see that as too much of a problem because you do have storage as long as the instance is running, and even if it crashes (doubtful) from my understanding the data is intact. I rebooted my instance and it came back unscratched. Only if you forcibly shutdown ("shutdown now -h") do you destroy the image and lose your data. Plus, you can always backup to S3, even the whole image block by block if you'd like. Another common complaint is about the lack of static IPs, but that can also be worked around with dynamic DNS services.

After successfully booting my first EC2 instance I now plan to work on some deprec recipes to deploy rails apps on either SliceHost or EC2, transparently. My target system will be an Ubuntu Feisty base, with PostgreSQL 8.2, Apache 2.2 and Mongrel.

Using the Compaq PA-1 with Linux

2005-03-27T01:34:00.000-08:00

I purchased one of these little guys in 2001 and used it very little at the time. It took forever to transfer songs, only worked under Winblows, and the capacity was very limited (they come with two 32MB MMC cards).

Fast-forward 4 years and podcasting is born. It's really nice to always have new interesting content to listen to while you're comuting or working outside. So I downloaded a bunch of podcasts (in MP3 format for now) and proceeded to install the Compaq-bundled RioPort software inside a Winblows session in VMWare. File transfers worked one third of the time, and drained the batteries pretty badly. The real problem, however, was the time wasted in other parts of the process. Waiting one full minute for RioPort to read the list of files transfered was too much. As with anything else in Winblows, it's not the apps that suck that much, but the OS just makes the user experience a real nightmare.

Being a happy resident of a non-DMCA encumbered country, I decided to reverse engineer the filesystem used to store the files to the flash cards. This way I can transfer files without using the PA-1 itself, which saves on the USB hassles and uses zero battery power. The first step was to dump a working flash image and examine it using a binary editor (bvi in this case).

It turns out that the filesystem was created by a company called Eiger M&C, which doesn't seem to be doing business anymore. I even tried emailing their contact listed on their website (last updated in 2002), but of course got no reply. To make a long story short, I ended up successfully reverse engineering most of the filesystem format, and used a bare bones version of it as the basis for a small Python script.

And so was born jdeigerfs v0.1 (3.4KB) :) It allows you to generate a mm.img file that contains a filesystem image that you can copy to any flash card. I use 32MB and 64MB MMC cards on my device, but your device may use other cards/sizes. All should work, up to 128MB per card. From what I can tell the format used for the FAT reserves 128KB for a 1 to 1024 mapping of the flash card, so anything over 128MB would actually cause the FAT to overwrite the first file in the card.

The script is barely useable. Actually it's a bit better than that, and in a works-for-me state. I decided to release it early so that if anyone else has any use for it I can get feedback at an early stage, although I don't plan on making any major improvements to it. Now I can finally test if the claim to support AAC is true (hard to believe for 2000 hardware). Later.

Multi-DVD backups using zero disk space

2005-03-03T01:38:00.000-08:00

As a few million other people, I have started doing my backups on DVD+RW. With a capacity of 4.7GB (that's 4.7 billion bytes, not 4.7 * 2^20 bytes), fast write speeds (compared to CD-RW) and the ability to reuse the media thousands of times, it's hard to ask for more.

Unfortunately the problem when backing up to DVD starts when you have to choose a format. You could theoretically use any default filesystem that your OS likes and burn that directly to the media, but it would be highly incompatible with any other OS. One of the desired characteristics of a backup is the ability to restore easily under any circumstance (or any OS).

That basically leaves us with ISO-9660 as a format. Virtually every OS supports that. Of course there's your Rock Ridge extensions for Unix, and your Joliet for Windows, but that's easy to implement (most software supports both). The problem is, even with these extension, the ISO-9660 format is pretty limited. It needs a lot of hand-holding in order to solve duplicate filenames (inside different directories, which is quite common in any filesystem), and the most common utility to generate such a filesystem (mkisofs) tends to require _a lot_ of switches to do what you want.

OK, so all we have to do is come up with a script to feed mkisofs with the proper switches, resolve the duplicate filenames, and we're set, right? Not quite.

Making an ISO-9660 image of your data and then burning it would require lots of temporary storage. At least the 4.7GB to be exact. And in lots of situations, that temporary space just won't be available, or your /tmp or /home partitions may be too full to fit that image in. That's why we need to backup in lots of situations--to free up some space. How can you free space up when you need _more_ space to do it? Sounds like asking a bank manager for a loan--he'll want you to prove that you already have the money in order to lend it to you!

Back to software land, we'll need a neat utility called growisofs. It is named like that for historical reasons, but can actually burn the DVD for you, as well as making the ISO-9660 filesystem. The strategy here will be to identify the files that we're backing up, and group them until we reach the media size, then provide that list of files to growisofs so that it can make the filesystem and burn it on the fly, without using temporary storage :)

Another alternative would be to use the mkisofs -stream-media-size switch, but that way we could end up splitting up a file (I think--I didn't actually test this), which is not at least what I personally want with my backups. Notice that my technique here can waste a lot of space if you have lots of huge files, and won't even work at all if you have files larger than 4.7GB. I use this script to backup my pictures, music, and data. For movies and other large files I create a directory, move files to fit nicely inside the 4.7GB, and then backup "." (the current directory) using the same script. Works quite nicely.

Please also note that this script is not for production purposes. It's a hack that I came up with to do simple yet effective backups to DVD. Again, works fine for me. YMMV.

And finally for the script itself. You can find it here. It takes only 2 command line parameters: the volume label prefix, and the directory to backup. The volume label of your burned DVDs will be the prefix appended with "_01", "_02" and so on. There's a bug where when the backup is finished it'll still ask you for one more DVD. Just press Enter and it will quit harmlessly (without turning your DVD into a coaster ;)).

My own DNSBL

2005-01-06T04:13:00.000-08:00

My trash folder used to hold about 2,000 spam (and non-spam) messages. Any mail older than 7 days is automatically deleted. Most of what was there never got to my email client, because I use bogofilter to do bayesian spam filtering.

That worked well on its own until I started getting _tons_ of spam. I wrote a bunch of scripts to identify the offending IPs and compile them into my own DNSBL (DNS Block List). It is publicly available at dnsbl.jdrowell.com. That's not a homepage, but a domain for the reverse IP lookups.

Since I started using this DNSBL, my trash folder trimmed down to about 200 messages (for the week). That includes my legitimate email (which I read and then delete). Not bad :) It also unloads my mail server, and, most importantly, makes spammers really angry. And poor. And suicidal (I wish).

The current count for dnsbl.jdrowell.com is about 70,000 IPs. I don't add blocks, only single IPs. I don't remove IPs unless I feel like it. I don't recommend that anyone use this DNSBL to actually block messages, but instead to flag spam as part of some greater process, such as using SpamAssassin or another similar tool.

That's about it. At a rate of about 2,000 new IPs every day (boy do I get spammed!), I'll probably have over 100,000 spam sources identified by the time you read this! Bring on the zombie botnets!

Update: The database has over 1 million IPs now. Scary.

Fighting SPAM with DNSBL

2004-12-09T03:37:00.001-08:00

I've been getting an average of 20,000 spam emails a day on one of my servers. Apparently some nice spammer included a domain I own as a target for his zombies. That means I kind of get DDOS'ed with spam :P

Most approaches to filtering spam don't work well when you're only getting a spam or two from each IP that connects to your server. For instance, one very nice way of catching spammers is by placing a few honeypots around and then blocking whatever IP sends mail to them. Unfortunately the kind of spam I'm getting is really dumb, in the form of messages to addresses that _don't_ exist. This causes the message to bounce back to the faked originating address. I say it is dumb because the person who actually receives the bounce gets it is "error" form, not as the clean original message, and thus will more than likely not read it. Even if they do, they'll be pretty sure that they didn't send the message, and will not click on the spam link that they supposedly sent someone else. SIGH!

Anyway, some pathetic spammer with a fairly big botnet thinks it's a great idea and decided to bounce some of his trash off my server. I'd really like to block that spam _before_ it gets delivered to my SMTP server (Exim in my case--yes it's very l33t). That being the case I created a tiny Perl script to tail the Exim log files and block access to port 25 from any IP that sent me spam. The idea was to prevent any further spam from that IP from even connecting to my box.

That worked fine and dandy, with only a small problem (or two). Very few IPs returned to spam me again. As I said, this guy's botnet is quite large, and many of his zombies have dialup or dynamic IP DSL/cable. The other problem is that there are just _a lot_ of them. A single day of logging resulted in over 15,000 IPs added to my firewall.

OK, let's go to plan B. Lots of other people are getting this spam, right? Let's see what they're doing about it! Turn out that a very efficient way of dealing with this type of bot is by allowing a pool of servers to rat on the IPs that are delivering spam. That way other servers can block their spam _before_ it's delivered. I guess this is how Vipul's Razor works, but I've never gotten to install it. I just used the lazy approach: filter whatever everyone else is filtering.

Most people don't worry that much about spam because they get only a few messages a day. ISPs and large companies, however, _do_ mind. And so a few "central" facilities for consolidating these spam sources were born. To distribute the data, a very clever approach is used: DNS. The fact that everyone that uses the Internet already uses DNS, and that it is distributed and has built-in caching, and deals with IPs, make it the prime candidate for the job. All that has to be done is to create a dummy (non-authoritative) reverse zone, and then clients can query the database using W.Z.Y.X.dnsbl.domain.tld to check if IP X.Y.Z.W is blacklisted. BTW "DNSBL" simply means DNS Block (or Black) List.

This all sounds quite complicated, but to implement it with Exim takes only a few lines. Exim4 supports ACLs (Access Control Lists), so all you have to do is add an ACL entry:

  deny   hosts = !+relay_from_hosts
         message = $sender_host_address is listed \
                   at $dnslist_domain
         dnslists = dnsbl.njabl.org : \
                    bl.spamcop.net : \
                    dnsbl.sorbs.net : \
                    blackholes.five-ten-sg.com : \
                    cbl.abuseat.org : \
                    psbl.surriel.com : \
                    list.dsbl.org

I chose to not check for spam from anything in my relay_from_hosts variable (for obvious reasons). You basically choose a message to use when rejecting (and logging) an attempt of spam delivery, and specify a list of domains to be used for the reverse mapping checks. Normally these DNS servers will return NXDOMAIN for regular IPs, or 127.0.0.2 for known spam sources.

So there you have it. I came up with my list of DNSBL sources by searching the excelent OpenRBL (a kind of DNSBL meta-search) for the spam sources that reached my box.

Also note that on of my DNSBL sources is psbl.surriel.com. This is Rik van Riel's (of Linux Kernel hacking fame) site, and is powered by Spamikaze, a tool that I plan to run on one of my boxes soon. The plan is to have my own DNSBL based on the spam that still gets through to my box.

I'll end this entry with a big "THANKS!" to all the projects mentioned (this is all free, folks) and look forward to paying them back with some pizza and beer in the future.

If you're a Dilbert fan...

2004-11-19T05:51:00.002-08:00

...then surely you must be a geek, just like me. I can't say that I'm insanely into comic strips or anything like that, but a visit to comics.com made me want to have all Dilbert strips really bad.

While I'm not prepared to pony up for their paid service, I did subscribe to the free "Basic" service to see what it looks like. Not being able to wait to start my collection, I decided to fetch the 30 or so strips that are available in the Archive for free.

Obviously I'm not the first person to have such an urge. The guys at comics.com don't use a very common filename scheme for their content, presumably for the exact reason of making mass-fetching harder. A quick search on Google showed that every geek and his grandma has already written a script to fetch these strips. Instead of making my life easier, this simply proved that every minimally proud Dilbert fan must make his/her own script.

And so I proceeded to code my hack, jdilbert. Don't expect much--it's just a hack--but it does work. So much so that my collection now contains exactly 31 strips :)

Update: I rewrote this in ruby and it now handles more comics. Fetch jdstrips now :) I put it in a weekly cron job so that I never miss any.

Multi-head screenshot

2004-11-07T08:45:00.000-08:00

I was talking to my good friend gzp on AIM, bragging about my new (but made from old parts) box, and the fact that it had RAID1, RAID5, three heads, etc. So he asked me for a screenshot. I tried to use xwd to make it--to no avail--so he told me about scrot.

And what can I say, it's just sweet! Here's the result.