Empowering Knowledge: 02/25/09

What is Hadoop?

Hadoop is a an open source Apache project written in Java and designed to provide users with two things: a distributed file system (HDFS) and a method for distributed computation. It’s based on Google’s published Google File System and MapReduce concept which discuss how to build a framework capable of executing intensive computations across tons of computers. Something that might, you know, be helpful in building a giant search index. Read the Hadoop project description and wiki for more information and background on Hadoop.

What’s the big deal about running it on Windows?

Looking for Linux? If you’re looking for a comprehensive guide to getting Hadoop running on Linux, please check out Michael Noll’s excellent guides: Running Hadoop on Ubuntu Linux (Single Node Cluster) and Running Hadoop on Ubuntu Linux (Multi-Node Cluster). This post was inspired by these very informative articles.a

Hadoop’s key design goal is to provide storage and computation on lots of homogenous “commodity” machines; usually a fairly beefy machine running Linux. With that goal in mind, the Hadoop team has logically focused on Linux platforms in their development and documentation. Their Quickstart even includes the caveat that “Win32 is supported as a development platform. Distributed operation has not been well tested on Win32, so this is not a production platform.” If you want to use Windows to run Hadoop in pseudo-distributed or distributed mode (more on these modes in a moment), you’re pretty much left on your own. Now, most people will still probably not run Hadoop in production on Windows machines, but the ability to deploy on the most widely used platform in the world is still probably a good idea for allowing Hadoop to be used by many of the developers out there that use Windows on a daily basis.

Caveat Emptor

I’m one of the few that has invested the time to setup an actual distributed Hadoop installation on Windows. I’ve used it for some successful development tests. I have not used this in production. Also, although I can get around in a Linux/Unix environment, I’m no expert so some of the advice below may not be the correct way to configure things. I’m also no security expert. If any of you out there have corrections or advice for me, please let me know in a comment and I’ll get it fixed.

This guide uses Hadoop v0.17 and assumes that you don’t have any previous Hadoop installation. I’ve also done my primary work with Hadoop on Windows XP. Where I’m aware of differences between XP and Vista, I’ve tried to note them. Please comment if something I’ve written is not appropriate for Vista.

Bottom line: your mileage may vary, but this guide should get you started running Hadoop on Windows.

A quick note on distributed Hadoop

Hadoop runs in one of three modes:

Standalone: All Hadoop functionality runs in one Java process. This works “out of the box” and is trivial to use on any platform, Windows included.

Pseudo-Distributed: Hadoop functionality all runs on the local machine but the various components will run as separate processes. This is much more like “real” Hadoop and does require some configuration as well as SSH. It does not, however, permit distributed storage or processing across multiple machines.

Fully Distributed: Hadoop functionality is distributed across a “cluster” of machines. Each machine participates in somewhat different (and occasionally overlapping) roles. This allows multiple machines to contribute processing power and storage to the cluster.

The Hadoop Quickstart can get you started on Standalone mode and Psuedo-Distributed (to some degree). Take a look at that if you’re not ready for Fully Distributed. This guide focuses on the Fully Distributed mode of Hadoop. After all, it’s the most interesting where you’re actually doing real distributed computing.

ICFAI papers