Friday, September 16, 2011

Hadoop for Everyone

Apache Hadoop is an open-source platform for running really huge jobs on a whole cluster of machines. Some of the most interesting problems in the world can only be practically solved using the power that Hadoop can harness. The biggest problem for developers like me is that we don't have the time, space, or money to install a cluster of hundreds or even thousands of machines. Even if I did, it would be a big pain to maintain all those machines. And even though I work for a big company with all the resources to build a Hadoop cluster, asking for 100 machines so I can "just try something" is never going to fly.

One of the major pain points of Hadoop is that not all machines are treated the same. Some machines have to function in roles like the NameNode, which requires large amounts of RAM and some level of high availability/redundancy. On the other hand, the vast majority of machines function as workers and can be simple commodity machines for cost savings. This means any efficient Hadoop cluster is going to be a heterogeneous environment, which further increases maintenance costs.

So what can we do? What we need is an on-demand Hadoop cluster where you pay only for what you use. Amazon's EC2 and S3 have typically been used to provide metered web services and data storage. However, deploying a Hadoop cluster with heterogeneous server instances on a remote cloud still requires you to do a lot of the setup yourself. Amazon realized this and created their Elastic MapReduce service. Now you can run your Hadoop jobs on demand with very little configuration and still have complete control over what class of machine you assign to the different roles in your cluster.
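To make the "different class of machine per role" idea concrete, here is a minimal sketch in Python. It only builds a dictionary shaped like the request that boto3's EMR `run_job_flow` call accepts; it does not talk to AWS. The helper name `build_cluster_request`, the cluster name, and the instance types are my own placeholders, and a real request would also need things like a release label and IAM roles, so treat this as an illustration rather than a working deployment.

```python
def build_cluster_request(name, master_type, worker_type, worker_count):
    """Build keyword arguments shaped like an EMR run_job_flow request.

    Hypothetical helper: it only assembles a dict and never calls AWS.
    """
    if worker_count < 1:
        raise ValueError("worker_count must be at least 1")
    return {
        "Name": name,
        "Instances": {
            "InstanceGroups": [
                # One beefier machine for the coordinating (master) role.
                {
                    "Name": "master",
                    "InstanceRole": "MASTER",
                    "InstanceType": master_type,
                    "InstanceCount": 1,
                },
                # Cheaper commodity machines for the worker role.
                {
                    "Name": "workers",
                    "InstanceRole": "CORE",
                    "InstanceType": worker_type,
                    "InstanceCount": worker_count,
                },
            ],
            # Shut the cluster down once the submitted steps finish,
            # so we only pay while the job is running.
            "KeepJobFlowAliveWhenNoStepsRunning": False,
        },
    }


# Placeholder instance types; pick whatever fits your job and budget.
dev_cluster = build_cluster_request("dev-smoke-test", "m5.xlarge", "m5.large", 2)
```

Growing the cluster for a scaling test is then just a matter of calling the helper again with a larger `worker_count` or a different `worker_type`.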

This makes a lot of sense for software developers like me. During development, I may use a small cluster to prove that my stuff is working. During QA, we might vary the size and configuration to see how our solution scales and to do performance tuning. Our marketing and sales teams can have demos ready on the cloud anytime, anywhere in the world, to show prospective customers. We would also have a good solution for customers who do not want a huge IT spending outlay before they are convinced of the value of our product. This might even absolve us of some legal/privacy concerns by letting the customer make their own agreements with Amazon while we provide just the software.
