Productionizing the Hive Thrift Server.

Ha! First day of my long awaited vacation and what do I do? Write a blog post about stuff I do at work of course!

A good portion of our team prefers to interface with Hive programmatically using the Hive Thrift Server.

The more we rely on it, the more we need to harden it.

It is not really set up or packaged for this, so we need to go to town on it.

Advanced Hadoop NameNode and Hive Metastore Backup Scripts

World Backup Day was last Thursday and in its honor I uploaded a few of my backup scripts to my GitHub repository.

I thought I’d start off with modified versions of the scripts I use in production at Outbrain to back up my Hadoop NameNode and Hive Metastore.

First:  OMFG WTF ARE YOU NOT BACKING UP YOUR NAMENODE AND HIVE METASTORE?

Second:  No really, WTF IS WRONG WITH YOU!?!

Slides and notes from my recent Hadoop talk in Israel.

So, I’m in Israel working with the team here to plan a large infrastructure project, and while here I was asked to give a talk on Hadoop for the team as well as the Israel Tech Talks group.

Daemonizing the Apache Hive Thrift server on CentOS

Earlier I showed you how to set up Hadoop, then how to set up Hive to use a MySQL-backed Metastore.

These notes presume that you have set up your Hive metastore to use MySQL. If you don’t, you’ll only be able to have one Hive instance running at a time (so no CLI while the HWI or Thrift server is a-runnin’).

Got carried away, I daemonized myself :P

Getting the Hive Web Interface (HWI) to work on CentOS

The Hive Web Interface is a pretty sweet deal. It is what it sounds like: a web interface that abstracts the user from the CLI. It allows all your busy little business bees to make data warehouse honey without getting their hands dirty.

The one in the yellow stripes is the COO…

Installing Apache Hive with a MySQL Metastore in CentOS

Hive is a pretty nifty data warehousing extension of Hadoop that lets you dump structured data into HDFS and query it using a SQL-like language called HiveQL which runs all the map/reduce junk for you.

It’s pretty darn simple to install, but if you want to really free it up you need to do some tweaking.

Setting up Cloudera’s Hadoop CDH2 Distribution on CentOS

Hadoop

From Hadoop’s homepage:

Apache Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or reexecuted on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework.

In short it’s a distributed batch processing mechanism that stores data across an array of nodes. Computing of that data is done on or near the node with the data and is reported back to a master. You can run other apps on top of that batch processing interface.

At Outbrain we are in the process of moving our data warehouse over to a Hadoop/Hive setup, and we currently use it in production to serve reports to our users.

Hadoop provides the tools to:

  • store data distributed over several nodes with a configurable level of redundancy
  • farm out the processing of that data to the nodes storing it by mapping the task into a bunch of smaller jobs, then combining the results returned into a coherent result.
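
To make the map-then-combine idea concrete, here is a toy sketch in plain Python. It is not Hadoop code and does not use any Hadoop API: the chunks stand in for blocks of data living on different nodes, and the names `map_chunk`, `combine`, and `word_count` are made up for illustration.

```python
from collections import Counter
from functools import reduce


def map_chunk(lines):
    """Map step: count words in one chunk, the way a single node would."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def combine(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def word_count(chunks):
    # Each chunk is processed independently; on a real cluster this is
    # the part that runs on (or near) the node holding the data.
    partials = [map_chunk(chunk) for chunk in chunks]
    # The partial results are then combined into one coherent result.
    return reduce(combine, partials, Counter())


# Hypothetical sample data: two "nodes", each holding a few lines.
chunks = [
    ["hadoop stores data", "hive queries data"],
    ["hadoop runs jobs"],
]
totals = word_count(chunks)
print(totals.most_common(2))
```

The point of the split is that `map_chunk` never needs to see anyone else's data, so the work scales out by adding nodes; only the small partial results have to travel.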

Installing and setting up Hadoop isn’t too difficult, but there are a few pitfalls in the configuration files it initially ships with.
