Hive is a pretty nifty data-warehousing extension to Hadoop that lets you dump structured data into HDFS and query it with a SQL-like language called HiveQL, which runs all the map/reduce junk for you.
It’s pretty darn simple to install, but if you want to really open it up you need to do some tweaking.
Out of the box, Hive keeps the information about what data lives where in HDFS in an embedded Apache Derby database (a.k.a. the metastore). Embedded Derby only allows a single connection at a time, which is restrictive if you want more than one connection to Hive. If you want two Hive CLI sessions, or want to run the Hive Web Interface (HWI) alongside a CLI session, you’re SOL.
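For reference, the stock config points the metastore at a local Derby database with a JDBC URL along these lines (the exact default may vary by version; this is what we’ll be swapping out):

```xml
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
</property>
```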
This is pretty easy to fix, though, by using MySQL (which can handle multiple connections) to hold the metastore.
First, install Hive itself (here from Cloudera’s CDH repo):

rpm -ivh http://archive.cloudera.com/redhat/cdh/unstable/RPMS/noarch/hadoop-hive-0.5.0+20-1.noarch.rpm
Now let’s modify our /etc/hive/conf/hive-site.xml to use MySQL instead of Derby:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://hivemeta.yourcompany.com:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hadoop</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hadoop</value>
</property>
Now let’s give Hive the MySQL connector:
curl http://mysql.he.net/Downloads/Connector-J/mysql-connector-java-5.1.10.tar.gz | tar zxv
cp mysql-connector-java-5.1.10/mysql-connector-java-5.1.10-bin.jar /usr/lib/hive/lib/
Install the MySQL server (and set a root password) wherever you want to host it.
yum -y install mysql-server
service mysqld start
/usr/bin/mysqladmin -u root password '******'
Log in to MySQL and set up the hadoop user. It’s not the most secure setup, but you get the idea. Hive creates the database automatically, so you can tighten the grants after it’s created.
mysql -u root -p
CREATE USER 'hadoop'@'hivemeta.yourcompany.com' IDENTIFIED BY 'hadoop';
GRANT ALL PRIVILEGES ON *.* TO 'hadoop'@'hivemeta.yourcompany.com' WITH GRANT OPTION;
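Once Hive has created its database, you can come back and tighten things up. A sketch of what that might look like, assuming the metastore database ended up named hive (per the JDBC URL above):

```sql
-- Drop the blanket grant and restrict the hadoop user to just the metastore DB
REVOKE ALL PRIVILEGES, GRANT OPTION FROM 'hadoop'@'hivemeta.yourcompany.com';
GRANT ALL PRIVILEGES ON hive.* TO 'hadoop'@'hivemeta.yourcompany.com';
FLUSH PRIVILEGES;
```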
And now we can give it a go:
(nathan@test1:~)$ hive
Hive history file=/tmp/nathan/hive_job_log_nathan_201006211927_2002309384.txt
hive> select * from dwcl_click limit 10;
OK
a bunch of junk here
Time taken: 1.245 seconds
Once that works, we can open multiple sessions at the same time. I’ll write soon about setting up the Hive Web Interface (HWI), as well as daemonizing Hive’s Thrift server, so we can really get Hive humming.