Productionizing the Hive Thrift Server.

Ha! First day of my long-awaited vacation and what do I do? Write a blog post about stuff I do at work, of course!

A good portion of our team prefers to interface with Hive programmatically using the Hive Thrift Server.

The more we rely on it, the more we need to harden it.

It is not really set up or packaged for this, so we need to go to town on it.

I’ve previously written about how I daemonized the Hive Thrift server and how to set up MySQL as your Hive Metastore.

Since I run different Hive Thrift daemons on different ports, I add HIVE_PORT=10001 (or whatever port), as suggested in the Hive Wiki, to each init script.

Next we need to add some job control. We use the Fair Scheduler, which allows us to allocate guarantees to jobs.

One problem here is that you cannot pass -hiveconf parameters when you spawn Hive with the hiveserver option, so you cannot set the Fair Scheduler pool name that way. You can only do it by configuring the Fair Scheduler to use the Unix/Linux user name the process runs as for the name of the pool. You can do this in mapred-site.xml (after adding the other Fair Scheduler setup stuff) like so:

<property>
  <name>mapred.fairscheduler.poolnameproperty</name>
  <value>user.name</value>
</property>

Then make sure your Hive Thrift instances run as that user. Just create the user, then set the init scripts from my other article to use that user.

In one of our cases, our highest priority pool (the one with Min Share Preemption enabled) is called hive-primary. So we have a user called hive-primary that the init script runs as, and fair-scheduler.xml has a pool that looks like:

<pool name="hive-primary">
  <minMaps>115</minMaps>
  <minReduces>85</minReduces>
  <minSharePreemptionTimeout>60</minSharePreemptionTimeout>
  <schedulingMode>fair</schedulingMode>
  <weight>2.0</weight>
</pool>

Now, in any production environment, one is never enough, so we need a few of these sprinkled around so that if one goes down we don’t have any serious disruption. I’ve built a Chef recipe that easily sets all this up at the push of a button. I’ll push it to GitHub later on, but for your needs, just install Hive and place your Hive and Hadoop configs on the nodes in question.

Next we need to ensure that they’re up, which is pretty easy using Nagios. Just set up a TCP check for your port(s) like so:

check_command check_tcp!10000
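
If you want to poke at a port by hand before wiring up Nagios, here is a minimal Ruby sketch of what a plain TCP check amounts to. The method name port_open? and the 2 second default timeout are my own assumptions for illustration; this is not how check_tcp itself is implemented. Note that it only proves the port accepts connections, not that Hive can actually run a query.

```ruby
require 'socket'

# Hypothetical helper: returns true if a TCP connection to host:port
# succeeds within `timeout` seconds, false otherwise.
def port_open?(host, port, timeout = 2)
  # Socket.tcp returns the value of the block once the connection is made
  # and closes the socket afterwards.
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue SystemCallError, SocketError, IOError
  # Connection refused, unreachable host, DNS failure, or timeout.
  false
end

# Example (not executed here):
#   port_open?('192.168.1.101', 10000)
```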

Now, the last step: make it highly available with HAProxy.

HAProxy is pretty easy to get going, and there are plenty of tutorials on getting it installed and set up that I will not replicate. We have several HAProxy nodes set up to share an address with some keepalived magic, so we have another layer of redundancy/availability.

Once you’ve got it up, just add your Hive Thrift servers to haproxy.cfg like so:

listen hive-primary :10000
balance leastconn
mode tcp
server hivethrift1 192.168.1.101:10000 check
server hivethrift2 192.168.1.102:10000 check
server hivethrift3 192.168.1.103:10000 check
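
To make the balance leastconn line a bit more concrete, here is a simplified Ruby sketch of the idea: among the backends that pass their health check, send the new connection to the one with the fewest active connections. That tends to suit long-lived sessions like Hive queries. The Backend struct and pick_leastconn are hypothetical names for illustration only, and the real HAProxy algorithm also accounts for things like server weights that this sketch ignores.

```ruby
# Hypothetical in-memory model of the servers in the listen block above.
Backend = Struct.new(:name, :active_connections, :up)

# Pick the healthy backend with the fewest active connections.
# Returns nil when no backend is up.
def pick_leastconn(backends)
  backends.select(&:up).min_by(&:active_connections)
end

backends = [
  Backend.new('hivethrift1', 4, true),
  Backend.new('hivethrift2', 1, false), # failed its check, so it is skipped
  Backend.new('hivethrift3', 2, true)
]

chosen = pick_leastconn(backends)
puts chosen.name
```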

And that is the bulk of it.

Giving it a go is pretty straightforward. If you’re a Ruby person, you should check out RBHive.

Assuming my HAProxy/keepalived address is hivethrift.example.com:

nathan@citadel ~ $ irb
irb(main):001:0> require 'rubygems'
=> true
irb(main):002:0> require 'rbhive'
=> true
irb(main):003:0> RBHive.connect('hivethrift.example.com') do |connection|
irb(main):004:1*   connection.fetch 'show tables'
irb(main):005:1> end
Connecting to hivethrift.example.com on port 10000
Executing Hive Query: show tables
=> [ {:tab_name=>"example_table"}]
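
Since HAProxy can only route new connections away from a dead backend, a client that was mid-connection when a server died will still see an error. One way to handle that is to retry the whole block so the next attempt lands on a healthy server. Below is a minimal sketch of that; with_retries is a hypothetical helper of my own, not part of RBHive, and the attempt count and pause are arbitrary. Only use something like this for queries that are safe to run twice.

```ruby
# Hypothetical helper, not part of RBHive: run the block, and if it raises,
# try again up to `attempts` times in total, pausing between tries.
def with_retries(attempts: 3, pause: 1)
  tries = 0
  begin
    tries += 1
    yield
  rescue StandardError
    # Give up and re-raise the last error once we are out of attempts.
    raise if tries >= attempts
    sleep pause
    retry
  end
end

# Usage (assumes the rbhive gem and the proxy address from the session above):
#   with_retries do
#     RBHive.connect('hivethrift.example.com') do |connection|
#       connection.fetch 'show tables'
#     end
#   end
```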

Hive is kinda messy and leaves lots of junk in /tmp on the local disk and in HDFS’s /tmp. I’ve got some scripts to automate the cleanup. One of them is on GitHub, but it has some flaws. As with the Chef recipe, I’ll fix it, post it all when I get back from vacation, and update this post.
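
In the meantime, here is a rough Ruby sketch of the local-disk half of that cleanup: delete files older than some number of days. It is not the script mentioned above. The clean_old_files name, the glob pattern, and the 7 day default are all assumptions for illustration, and it does not touch HDFS at all. Run it with dry_run: true first to see what it would remove.

```ruby
require 'fileutils'

# Hypothetical helper: delete regular files directly under `dir` that match
# `pattern` and were last modified more than `max_age_days` ago.
# Returns the list of stale paths (deleted unless dry_run is true).
def clean_old_files(dir, pattern: '*', max_age_days: 7, now: Time.now, dry_run: false)
  cutoff = now - (max_age_days * 24 * 60 * 60)
  stale = Dir.glob(File.join(dir, pattern)).select do |path|
    File.file?(path) && File.mtime(path) < cutoff
  end
  stale.each { |path| FileUtils.rm_f(path) } unless dry_run
  stale
end

# Example (not executed here; the pattern is a guess, adjust to what you see in /tmp):
#   clean_old_files('/tmp', pattern: 'hive*', max_age_days: 3, dry_run: true)
```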

In the end, our only single points of failure are the MySQL metastore (can be overcome with replication, but I hope to move it to Cassandra as with Brisk) and the regular Hadoop pain points with the single NameNode and JobTracker.
