So, I started playing with a beta of Brisk this weekend.
The Datastax guys are industrious, energentic and are very open to hearing from both the Cassandra and Hadoop communities. You should hit them in #Datastax-Brisk on Freenode IRC.
I’ll post more on my benchmarks and tests later, I’m still getting comfortable with it, but it is still very familiar, already being a Hadoop and Cassandra user.
I need to setup the OpsCenter stuff which looks pretty cool and put some real data in it.
So far, my favorite thing:
INFO 23:36:22,093 Chose seed 192.168.x.x as jobtracker
Magic!
My current concern is how to deal with deletes in CFS (CassandraFS) as Hive (and Terasort for that matter) kicks up a lot of ephemeral data. Cassandra doesn’t delete stuff instantly, so I imagine I’ll need to do some tweaking with GCGraceSeconds to find an optimal setting.
So, this is my quick 5 minute setup to get going and running benchmarks.
You’ll need EPEL:
rpm -Uvh http://download.fedora.redhat.com/pub/epel/5/i386/epel-release-5-4.noarch.rpm
Setup the Datastax Repository:
echo "[datastax]
name=DataStax Repo for Apache Cassandra
baseurl=http://rpm.datastax.com/EL/5/
enabled=1
gpgcheck=0" > /etc/yum.repos.d/datastax.repo
Install Brisk from yum (it gets most everything you might need, even JNA!)
yum -y install brisk-full
I like to use MX4J in production:
wget http://sourceforge.net/projects/mx4j/files/MX4J%20Binary/3.0.2/mx4j-3.0.2.tar.gz/download
tar zxvf mx4j-3.0.2.tar.gz mx4j-3.0.2/lib/mx4j-tools.jar
cp mx4j-3.0.2/lib/mx4j-tools.jar /usr/share/brisk/cassandra/lib/
chown cassandra:cassandra /usr/share/brisk/cassandra/lib/mx4j-tools.jar
chmod 755 /usr/share/brisk/cassandra/lib/mx4j-tools.jar
Actually setup the Cassandra component of the cluster:
Set your initial tokens:
sed -i "s,initial_token:,initial_token: 0," \
/etc/brisk/cassandra/cassandra.yaml
Change the cluster name:
sed -i "s,cluster_name: 'Test Cluster',cluster_name: 'Brisk Test'," \
/etc/brisk/cassandra/cassandra.yaml
Set your seeds:
sed -i 's/- seeds: "127.0.0.1"/- seeds: "192.168.1.1,192.168.1.2"/' \
/etc/brisk/cassandra/cassandra.yaml
Set the addresses you want to bind to:
sed -i "s,listen_address: localhost,listen_address:," \
/etc/brisk/cassandra/cassandra.yaml
sed -i "s,rpc_address: localhost,rpc_address: 0.0.0.0," \
/etc/brisk/cassandra/cassandra.yaml
Now for the Hadoop component setup:
For the sake of this setup I’m not spliting the cluster into two halves, only need a Hadoop half.
Enable the Nodes as Brisk nodes
sed -i "s,HADOOP_ENABLED=0,HADOOP_ENABLED=1," \
/etc/default/brisk
My config is very simple with the data living on a 5 drive stripe. For the terasort you’re going to want to setup a mapred.local.dir that has some room in it, as the default in /tmp will fill up pretty quick. I’m just dumping it on the same place my CFS data is, not perfect for performance, but good enough for my testing.
mkdir -p /var/lib/cassandra/mapred/local
chown cassandra:cassandra /var/lib/cassandra/mapred/local
Also, we’ll want to set more map and reduce slots. Setting these is a balance of the ammount of RAM, drives and cores the node has. More info can be found here.
vim /etc/brisk/hadoop/mapred-site.xml
<property>
<name>mapred.local.dir</name>
<value>/var/lib/cassandra/mapred/local</value>
<final>true</final>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>8</value>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>8</value>
</property>
And that should be enough to get you going, fire it up thus:
/etc/init.d/brisk start
Michael Noll has a fantastic blog article on running some of these benchmarks here.
One of the tests usually I run across my regular Hadoop cluster after a restart or upgrade to make sure it is operating is the pi calc:
brisk hadoop jar /usr/share/brisk/hadoop/lib/hadoop-examples*.jar \
pi 10 10000
And here is how you can run your TeraSort:
TeraGen:
brisk hadoop jar /usr/share/brisk/hadoop/lib/hadoop-examples*.jar \
teragen 1000000000 /tmp/terasort-input
TeraSort:
brisk hadoop jar /usr/share/brisk/hadoop/lib/hadoop-examples*.jar \
terasort /tmp/terasort-input /tmp/terasort-output
TeraValidate:
brisk hadoop jar /usr/share/brisk/hadoop/lib/hadoop-examples*.jar \
teravalidate /tmp/terasort-output /tmp/terasort-validate
Cleanup:
brisk hadoop fs -rmr cfs:///tmp/tera*
And thats it. I’ve done a small 100G terasort and it came off well. Will do the real 1TB terasort in the next few days and maybe mrbench, then setup CDH3 on the same hardware and do the same.
So far, I’m pleased with it. For beta software there are only a few rough edges. Everything works out of the box and is much easier to get going than a regular Hadoop cluster.