Monday, July 2, 2007

Tinkering with Amazon's S3 and EC2

I was really excited about Amazon's EC2 (Elastic Compute Cloud) when it first became available sometime last year. Unfortunately it was (and still is) a "limited beta" service (yeah I know) so all I could do was sign up for it and hope that my day would come.

It did :) (big surprise!).

Anyway, I've been playing around with my SliceHost VPS (which they of course call a "slice") for a week and was hoping to have access to a similar low-cost service to compare it with, so the timing was just about perfect. Note that I also had to wait for my SliceHost slice, but it was a lot quicker (about 8 weeks). I still have to get hold of the specifics for both services to compare price and performance, but that'll have to wait; my main concerns right now are ease of deployment and flexibility.

My initial guess is that both services will behave pretty much the same after they're set up. You start with a basic OS to play with anyway. I probably won't trust anyone else's image and will just roll out my own customizations.

Back to the Amazon services: you can't just use EC2 out of the box. First you need to sign up for S3 (Simple Storage Service), which is pretty awesome in itself but which I hadn't used yet (mostly for lack of real need). So I did that (no wait for that one) and decided to explore it a bit before moving on with my EC2 experiments.

To start with, I figured a command-line client for S3 would be really nice, to get a general feel for the service without having to dive into the API. I did a quick search and started using s3cmd. It's written in Python, which I like a lot, and has a very clean feel to it. The only problem was that it didn't use my outgoing HTTP proxy (Squid) like other Python apps usually do.

If you have an http_proxy environment variable set, the Python libs will generally detect it and pass your requests on to the proxy. After inspecting s3cmd's code, I realized that the multi-step process it uses to generate the webservice requests was sidestepping that functionality. Fixing this required some minor tweaking: adding proxy support to s3cmd's config file and then using that setting to properly craft the requests for proxy and non-proxy situations. I created a patch and sent it to the maintainer. Of course you can just download the patch and apply it yourself--it's against version 0.9.3 of s3cmd.
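To illustrate the detection step, here is a minimal sketch of how Python's standard library picks up http_proxy. It is not s3cmd's code and not my patch; it uses the Python 3 urllib.request module (the 2007-era equivalent lived in urllib/urllib2), and both open_via_proxy and the proxy.example address are hypothetical names made up for the example.

```python
import os
import urllib.request


def detect_proxy(scheme="http"):
    """Return the proxy URL the standard library would use for `scheme`, or None."""
    # getproxies() reads <scheme>_proxy environment variables (e.g. http_proxy).
    return urllib.request.getproxies().get(scheme)


def open_via_proxy(url, proxy_url):
    """Hypothetical helper: fetch `url` through an explicitly configured proxy."""
    # This mirrors the idea of taking the proxy from a config setting
    # instead of relying on automatic detection.
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    return opener.open(url)


if __name__ == "__main__":
    # Placeholder proxy address, for illustration only.
    os.environ.setdefault("http_proxy", "http://proxy.example:3128")
    print(detect_proxy("http"))
```

Code that builds its requests at a lower level never goes through this lookup, which is roughly why a tool can end up ignoring the variable.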

Other interesting projects related to S3 are s3sync.rb (a Ruby backup script similar to rsync) and Backup Manager (a general backup tool that I've been told supports S3 now), but I haven't tested them yet.

Having played with S3, I decided to complete my first EC2 setup. After jumping through all the hoops in the Getting Started guide, I chose Paul Dowman's image for my first boot. It worked like a charm, and in less than 1 minute (!!) I was able to log in as root to my newly created instance, complete with Rails, MySQL and an up-to-date Ubuntu Feisty install.

BTW, the EC2 tools (in Java) also don't honor the http_proxy environment variable (sigh). Here's the environment variable you have to set to make it all work, assuming your proxy is not password protected:

export EC2_JVM_ARGS="-DproxySet=true -DproxyHost=<host> -DproxyPort=<port> -Dhttps.proxySet=true -Dhttps.proxyHost=<host> -Dhttps.proxyPort=<port>"
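If you already have a proxy URL handy (say, the value of http_proxy), a small script can fill in the host and port for you. This is just a sketch under my own assumptions: ec2_jvm_args is a made-up helper name, proxy.example:3128 is a placeholder, and the only flags it emits are the ones in the EC2_JVM_ARGS setting above.

```python
from urllib.parse import urlparse


def ec2_jvm_args(proxy_url):
    """Build an EC2_JVM_ARGS value from a proxy URL such as http://host:port."""
    parsed = urlparse(proxy_url)
    host, port = parsed.hostname, parsed.port
    if host is None or port is None:
        raise ValueError("proxy URL must include a host and a port")
    # Same flags as the export line in the post, with host and port filled in.
    return (
        f"-DproxySet=true -DproxyHost={host} -DproxyPort={port} "
        f"-Dhttps.proxySet=true -Dhttps.proxyHost={host} -Dhttps.proxyPort={port}"
    )


if __name__ == "__main__":
    # Placeholder proxy address, for illustration only.
    print('export EC2_JVM_ARGS="%s"' % ec2_jvm_args("http://proxy.example:3128"))
```

Running it prints an export line you can paste into your shell; like the original setting, it assumes the proxy is not password protected.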

Having access to EC2 opens up a lot of possibilities in deployment and scaling. From what I've been reading, the main complaint about EC2 is the lack of persistent storage. I don't see that as too much of a problem because you do have storage as long as the instance is running, and even if it crashes (doubtful), from my understanding the data is intact. I rebooted my instance and it came back unscathed. Only if you forcibly shut down ("shutdown now -h") do you destroy the instance and lose your data. Plus, you can always back up to S3, even the whole image block by block if you'd like. Another common complaint is about the lack of static IPs, but that can also be worked around with dynamic DNS services.

After successfully booting my first EC2 instance, I now plan to work on some deprec recipes to deploy Rails apps on either SliceHost or EC2, transparently. My target system will be an Ubuntu Feisty base, with PostgreSQL 8.2, Apache 2.2 and Mongrel.