Friday, June 18, 2010

Get your text back

Playing around with Nokogiri to process a bunch of HTML files, I ended up pushing the results to MongoDB. Remember, MongoDB _always_ stores UTF-8, which is a good thing. Unfortunately XML transforms like to escape anything they can, turning non-ASCII characters into character references, so I ended up with strings like:

ruby-1.9.1-p378 > a="Jos&#233; Saramago, 1922-2010"
=> "Jos&#233; Saramago, 1922-2010"

The old CGI module (RIP) could unescape these. Today REXML seems to be more adequate. Here's a tip I picked up:

ruby-1.9.1-p378 > require 'rexml/document'
=> true
ruby-1.9.1-p378 > REXML::Text.unnormalize(a)
=> "José Saramago, 1922-2010"
ruby-1.9.1-p378 > REXML::Text.unnormalize(a).encoding
=> #<Encoding:UTF-8>
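
If you want the same thing outside of irb, here is a minimal sketch as a standalone script. The helper name `unescape_xml` and the sample string are my own, made up for illustration; the only real API involved is `REXML::Text.unnormalize`.

```ruby
require 'rexml/document'

# Hypothetical helper (not part of REXML): decode XML character
# references and the predefined XML entities in a string.
def unescape_xml(text)
  REXML::Text.unnormalize(text)
end

# Sample input with a numeric character reference and an escaped ampersand.
escaped = "Jos&#233; Saramago &amp; friends, 1922-2010"
plain   = unescape_xml(escaped)

puts plain           # the decoded text
puts plain.encoding  # should report UTF-8, ready for MongoDB
```

As far as I can tell this only covers numeric references (decimal or hex) and XML's five predefined entities. Named HTML entities such as `&eacute;` are not part of that set, so don't expect them to be decoded this way.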

Props to the original posters.