Thought I’d play a little with Hadoop 0.23 (a.k.a YARN, MR2, NextGen Hadoop) and dump my notes here.
Gotta keep my skillz sharp y’all so I don’t become irrelephant. (Yes, that just happened.)
Below I just set up pseudo-distributed mode and run some examples on it, nothing crazy.
I’m hoping to test and write more on how 0.23 differs from the mainline 0.20.x, 1.0 and CDH3 releases, as well as play with NameNode federation and some other paradigms like MPI, Hama and Spark.
Let me preface this rant/post by stating that I come from the more scrappy, ‘build it out from OSS’ sort of shop, so I am highly biased toward the approach of:
- Thinking first about, and then building, your infrastructure/solution to fit your needs.
- Knowing the software inside and out.
- Relying on the community for the rest.
Over the converse ‘Enterprise’ approach of:
- Building your infrastructure based on someone’s white paper on how you should build an infrastructure to do X.
- Getting your sysadmins a set of meaningless certifications.
- Ultimately relying on commercial support as your last point of escalation.
Yes, yes, yes… this is very snobby and I am in danger of sounding as irreverent as Ted Dziuba. I am also wholly conscious that the OSS approach can be taken to the same extreme as the enterprise approach, insomuch as everyone blindly follows the same design choices Twitter or Facebook make (albeit better than anyone else), or implements everything in node.js or Ruby on Rails because that is what the hot-as-shit hipster developers are doing.
For me it comes down to having operational responsibility for your infrastructure, rather than a support contract. But still, I’m young and work at a hot startup; when I’m CTO of a bank maybe my view will change.
Ha! First day of my long awaited vacation and what do I do? Write a blog post about stuff I do at work of course!
A good portion of our team prefers to interface with Hive programmatically using the Hive Thrift Server.
The more we rely on it, the more we need to harden it.
It is not really set up or packaged for this, so we need to go to town on it.
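The flavor of the hardening is simple: start the Thrift server (`hive --service hiveserver` in stock Hive) under something that notices when it falls over. Here's a minimal sketch of that supervision idea — the restart limit and the commented-out hiveserver invocation are illustrative assumptions, not our production setup; in real life you'd reach for runit or monit instead:

```shell
#!/bin/sh
# Naive supervisor: rerun a command until it exits cleanly or we give up.
run_with_restart() {
    cmd=$1
    max=$2
    n=0
    while [ "$n" -lt "$max" ]; do
        if $cmd; then
            return 0                    # clean exit: we're done
        fi
        n=$((n + 1))
        echo "command died, restart #$n" >&2
    done
    return 1                            # gave up after $max failures
}

# Example (assumes the stock Hive Thrift server on its default port 10000):
# run_with_restart "hive --service hiveserver" 5
```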
So, I started playing with a beta of Brisk this weekend.
The DataStax guys are industrious, energetic and very open to hearing from both the Cassandra and Hadoop communities. You should hit them up in #Datastax-Brisk on Freenode IRC.
I’ll post more on my benchmarks and tests later. I’m still getting comfortable with it, but as an existing Hadoop and Cassandra user it already feels very familiar.
I need to set up the OpsCenter stuff, which looks pretty cool, and put some real data in it.
So far, my favorite thing:
INFO 23:36:22,093 Chose seed 192.168.x.x as jobtracker
My current concern is how to deal with deletes in CFS (CassandraFS), as Hive (and Terasort, for that matter) kicks up a lot of ephemeral data. Cassandra doesn’t delete stuff instantly, so I imagine I’ll need to do some tweaking of GCGraceSeconds to find an optimal setting.
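I haven't settled on numbers yet, but the knob itself lives on the column families backing CFS, so something like this in cassandra-cli is where I'd start poking — the keyspace/column family names and the one-hour value here are illustrative guesses, so check what your Brisk install actually creates:

```
use cfs;
update column family sblocks with gc_grace = 3600;
```

Lower gc_grace means tombstones (and thus all that ephemeral Hive/Terasort scratch data) get compacted away sooner, at the cost of a shorter window for repair to propagate deletes across replicas.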
So, this is my quick 5-minute setup to get going and run benchmarks.
SeaMicro has a pretty sweet looking product with their SM10000-64.
A while back I spoke to some SeaMicro sales guys and engineers and was pretty impressed. These guys know their stuff and many of them were movers and shakers at Force10, Brocade, AMD, Sun, Cisco and Juniper.
With that pedigree it seems a foregone conclusion they’d be able to come up with the new hotness in systems design.
So, what problem are they REALLY trying to solve? In their own words:
“Historically, servers were designed to quickly solve a relatively small number of very hard problems. The Internet, however, changed this. In the Internet data center, the challenge is to handle millions of relatively small, independent tasks like those needed for searching, social networking, viewing web pages, and checking email. Volume servers failed to adapt to this fundamental change.”
What did they do?
With some magic ASIC design they crammed 256 netbooks into 10U to save power, betting that the simpler Atom CPUs would be enough to handle the workloads we see in the serving layer.
DataStax (née Riptano) is to Cassandra as Cloudera is to Hadoop (or Red Hat is to Linux).
Brisk is DataStax’s upcoming Cassandra/Hadoop hybrid distribution. From their site:
DataStax’ Brisk is an enhanced open-source Apache Hadoop and Hive distribution that utilizes Apache Cassandra for many of its core services. Brisk provides integrated Hadoop MapReduce, Hive and job and task tracking capabilities, while providing an HDFS-compatible storage layer powered by the Cassandra DB.
They added Cassandra as an option for the Hadoop storage layer, allowing you to bypass HDFS; however, the implications of that go a whole lot further. You get the strengths of both systems here and lose some of the problems.
I’m pretty jazzed about this and I hope to convince my co-workers to give it a go. I’d like to tell you why.
World Backup Day was last Thursday and in its honor I uploaded a few of my backup scripts to my github repository.
I thought I’d start off with modified versions of the scripts I use in production at Outbrain to backup my Hadoop NameNode and Hive Metastore.
First: OMFG WTF ARE YOU NOT BACKING UP YOUR NAMENODE AND HIVE METASTORE?
Second: No really, WTF IS WRONG WITH YOU!?!
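The scripts in the repo are the real deal; a stripped-down sketch of the idea looks like this (the hostname, port, paths and the MySQL-backed metastore are assumptions — adjust for your setup):

```shell
#!/bin/sh
# Grab a point-in-time copy of the NameNode image and the Hive metastore.
STAMP=$(date +%Y%m%d-%H%M%S)
BACKUP_DIR=${BACKUP_ROOT:-/tmp/hadoop-backups}/$STAMP
mkdir -p "$BACKUP_DIR"

# 1. Fetch fsimage and edits over the NameNode's HTTP servlet (0.20.x era;
#    'namenode' is a placeholder hostname).
curl -sf -m 5 "http://namenode:50070/getimage?getimage=1" -o "$BACKUP_DIR/fsimage"
curl -sf -m 5 "http://namenode:50070/getimage?getedit=1"  -o "$BACKUP_DIR/edits"

# 2. Dump the metastore, assuming it lives in a MySQL database named 'metastore'.
mysqldump --single-transaction metastore 2>/dev/null | gzip > "$BACKUP_DIR/metastore.sql.gz"

echo "backed up to $BACKUP_DIR"
```

From there it's just a matter of cron-ing it and shipping the directory somewhere that isn't the NameNode.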
Over the last few months we’ve been migrating our infrastructure over to the Chef platform for infrastructure automation. It is analogous to Puppet, which I’ve tinkered with in the past.
I’ll skip the debate over which is the better tool. There has been lots of discussion all over about it. Suffice it to say, we chose Chef for a myriad of reasons and this post isn’t a case study.
My first big Chef project was migrating our Hadoop cluster onto it.