So, I’m in Israel working with the team here to plan a large infrastructure project, and while I’m here I was asked to give a talk on Hadoop for the team as well as the Israel Tech Talks group.
Last week I attended both Hadoop World 2010 and Cloudera’s Hadoop Admin Training, which gave me a long list of items to append to my already long list of things to do to make my Hadoop cluster add even more value for my employer.
The training was fantastic! I cannot recommend it enough; it was well worth the money. I was very lucky to have Eric Sammer teach it. That guy really knows his stuff inside and out. Absolute rockstar.
(note: Cloudera’s certification exam is pretty subtle and tricky and, boy, do you need to really understand Hadoop to pass it, not just memorize facts)
I’ve had a cluster for about a year and I know my way around it, but the training solidified what I knew, taught me the whys behind the tips I’ve picked up around the web, and filled in all the holes in between. That mattered especially since we built our cluster with no real idea of how we’d end up using it. It was an unplanned beast.
Like I said, I’ve got a long list of tweaks to apply. As I work through them I hope to post about them here, as I’ve been meaning to post more here in general. I’ve got a huge backlog of junk to write up. BTW, Outbrain is hiring me a Co-Pilot here in the NYC Office.
Hive is a pretty nifty data warehousing layer on top of Hadoop that lets you dump structured data into HDFS and query it using a SQL-like language called HiveQL, which runs all the map/reduce junk for you.
It’s pretty darn simple to install, but if you want to get the most out of it you need to do some tweaking.
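To give you a feel for it, here’s a tiny HiveQL session. The table, columns, and path are all made up for illustration, but the define/load/query flow is the whole point: you write something that looks like SQL and Hive turns it into map/reduce jobs.

```sql
-- Define a table over tab-delimited files in HDFS (schema is hypothetical).
CREATE TABLE page_views (user_id STRING, url STRING, view_time STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Pull a file already sitting in HDFS into the table (path is hypothetical).
LOAD DATA INPATH '/logs/page_views.tsv' INTO TABLE page_views;

-- An ordinary-looking query; Hive compiles it down to map/reduce for you.
SELECT url, COUNT(*) AS hits
FROM page_views
GROUP BY url
ORDER BY hits DESC
LIMIT 10;
```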
From Hadoop’s homepage:
Apache Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or reexecuted on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework.
In short, it’s a distributed batch-processing mechanism that stores data across an array of nodes. Computation is done on or near the nodes holding the data, and the results are reported back to a master. You can run other apps on top of that batch-processing interface.
Hadoop provides the tools to:
- store data distributed over several nodes with a configurable level of redundancy
- farm out the processing of that data to the nodes storing it, by mapping the task into a bunch of smaller jobs and then combining the returned results into a coherent answer (see the word-count sketch below)
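The canonical example of that map-then-combine pattern is word count, shown below with Hadoop’s Java MapReduce API (a minimal sketch along the lines of the example in the Hadoop docs, not production code). The map step emits (word, 1) for every word it sees, and the reduce step sums those counts per word; the framework handles the distribution, grouping, and retries.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: split each input line into words and emit (word, 1) for each.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: sum the counts for each word; the framework has already
  // grouped the (word, 1) pairs by word across the cluster.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // pre-sum locally before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The combiner line is the one non-obvious bit: it runs the reducer on each node’s local map output first, so far less data has to cross the network during the shuffle.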
Installing and setting up Hadoop isn’t too difficult, but there are a few pitfalls lurking in the default configuration files.
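The classic one (my pick of which pitfall to call out here, and the paths below are just illustrative): hadoop.tmp.dir defaults to /tmp/hadoop-${user.name}, and HDFS keeps both the NameNode metadata and the DataNode block storage under it by default. A reboot or tmp-cleaner that empties /tmp can wipe out your whole filesystem. Point those directories at real, persistent disk in hdfs-site.xml:

```xml
<?xml version="1.0"?>
<!-- hdfs-site.xml: move HDFS storage off /tmp. The /var/lib/hadoop paths
     are illustrative; use wherever you have real, persistent disk. -->
<configuration>
  <property>
    <name>dfs.name.dir</name>   <!-- NameNode metadata -->
    <value>/var/lib/hadoop/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>   <!-- DataNode block storage -->
    <value>/var/lib/hadoop/dfs/data</value>
  </property>
</configuration>
```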