I’m such a sad bastard.
I got stuck fixing a production issue and had to miss the inaugural NYC Cassandra Meetup.
To atone, I figured I’d write a quickie Cassandra post.
Earlier I showed you how to set up Hadoop, then how to set up Hive to use a MySQL-backed Metastore.
These notes presume that you have set up your Hive metastore to use MySQL. If you don’t, you’ll only be able to run one Hive instance at a time (so no CLI while the HWI or Thrift server is a-runnin’).
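For reference, pointing the metastore at MySQL comes down to a handful of JDO properties in hive-site.xml. A minimal sketch (the host, database name, and credentials below are placeholders — use your own):

```xml
<!-- hive-site.xml: point the metastore at MySQL instead of the
     default embedded Derby. Host/user/password are placeholders. -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://dbhost:3306/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepass</value>
</property>
```

The embedded Derby default is what limits you to a single Hive instance; swapping in MySQL is what lets the CLI, HWI, and Thrift server all run concurrently.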
Hive is a pretty nifty data warehousing extension of Hadoop that lets you dump structured data into HDFS and query it using a SQL-like language called HiveQL, which runs all the map/reduce junk for you.
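To give a flavor of it, here’s a quick HiveQL sketch — the table and column names are made up for illustration — that defines a table over tab-delimited files in HDFS and runs an aggregate. Hive compiles the SELECT into map/reduce jobs behind the scenes:

```sql
-- Hypothetical table over tab-delimited text files in HDFS
CREATE TABLE page_views (
  ts      STRING,
  url     STRING,
  user_id INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Load a file from HDFS into the table, then query it with
-- plain SQL-ish syntax; the GROUP BY becomes map/reduce jobs.
LOAD DATA INPATH '/logs/views.tsv' INTO TABLE page_views;

SELECT url, COUNT(1) AS hits
FROM page_views
GROUP BY url;
```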
It’s pretty darn simple to install, but if you want to really open it up you’ll need to do some tweaking.
From Hadoop’s homepage:
Apache Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or reexecuted on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework.
In short, it’s a distributed batch-processing mechanism that stores data across an array of nodes. Computation happens on or near the nodes holding the data, and the results are reported back to a master. You can run other apps on top of that batch-processing interface.
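The Map/Reduce paradigm described above can be sketched in a few lines of plain Python — single process, no cluster, all names illustrative — just to show the two phases Hadoop distributes for you:

```python
# Conceptual word-count sketch of Map/Reduce (pure Python, one process).
# Hadoop runs many map tasks and many reduce tasks like these in
# parallel across the cluster; this just shows the shape of each phase.
from collections import defaultdict

def map_phase(document):
    """Emit (word, 1) pairs -- the small, independent fragments of work."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    """Sum the counts per key, as a reducer node would after the shuffle."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["the quick brown fox", "the lazy dog"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
print(reduce_phase(pairs))
```

Because each map fragment is independent, a failed or slow node’s work can simply be re-executed elsewhere — which is exactly the fault-tolerance claim in the quote above.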
Installing and setting up Hadoop isn’t too difficult, but there are a few initial pitfalls in the provided configuration files.
From Cassandra’s site:
The Apache Cassandra Project develops a highly scalable second-generation distributed database, bringing together Dynamo’s fully distributed design and Bigtable’s ColumnFamily-based data model.
It’s one of the more popular NoSQL data stores out there and at Outbrain we’ve been moving some parts of our service from MySQL to it.
I love it as an Ops guy because it just sorta works… I set it up, fire it up, and it goes. Mind you, you need to set it up right from the beginning, but that’s another thing altogether; I won’t get into configuration and implementation here, just deployment.