
From Hadoop’s homepage:
Apache Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or reexecuted on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework.
In short it’s a distributed batch processing mechanism that stores data across an array of nodes. Computing of that data is done on or near the node with the data and is reported back to a master. You can run other apps on top of that batch processing interface.
At Outbrain we are in the process moving our data warehouse over to a Hadoop/Hive setup and we currently use it in production to serve reports to our users.
Hadoop provides the tools to:
- store data distributed over several nodes with a configurable level of redundancy
- farm out that processing of that data to the nodes storing it by mapping the task into a bunch of smaller jobs then combing the results returned into a coherent result.
Installing and setting up Hadoop isn’t too difficult, but there are a few intial pitfalls with the initial provided configuration files.
Continue reading →