
Thought I’d play a little with Hadoop 0.23 (a.k.a. YARN, MR2, NextGen Hadoop) and dump my notes here.
Gotta keep my skillz sharp y’all so I don’t become irrelephant. (Yes, that just happened.)
Below I just set up a pseudo-distributed cluster and run some examples on it, nothing crazy.
I’m hoping to test and write more on how 0.23 differs from the mainline 0.20.x, 1.0 and CDH3 releases, play with NameNode federation, and try some other paradigms like MPI, Hama and Spark.
Grab the tarball. I’ll put it in /opt for now:
cd /opt
curl http://mirror.atlanticmetro.net/apache//hadoop/common/hadoop-0.23.0/hadoop-0.23.0.tar.gz | tar zxv
ln -s hadoop-0.23.0 hadoop
Create the working directories.
mkdir -p /opt/hadoop/dfs/{name,data}
mkdir -p /opt/hadoop/mapred/{temp,local}
Drop some basic configurations.
core-site.xml
cat >/opt/hadoop/conf/core-site.xml <<CORE_EOF
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:8020</value>
</property>
</configuration>
CORE_EOF
hdfs-site.xml
cat >/opt/hadoop/conf/hdfs-site.xml <<HDFS_EOF
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/opt/hadoop/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/opt/hadoop/dfs/data</value>
</property>
</configuration>
HDFS_EOF
mapred-site.xml
cat >/opt/hadoop/conf/mapred-site.xml <<MAPRED_EOF
<?xml version="1.0"?>
<?xml-stylesheet href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.cluster.temp.dir</name>
<value>/opt/hadoop/mapred/temp</value>
</property>
<property>
<name>mapreduce.cluster.local.dir</name>
<value>/opt/hadoop/mapred/local</value>
</property>
</configuration>
MAPRED_EOF
yarn-site.xml
cat >/opt/hadoop/conf/yarn-site.xml <<YARN_EOF
<?xml version="1.0"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce.shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
YARN_EOF
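If you want to double-check what actually landed in those files, here is a rough sketch of a helper that pulls a single property out of a Hadoop-style config. The name get_prop is made up for this post (it is not a Hadoop tool), and it assumes each <name> and <value> sits on its own line, like in the heredocs above. It is not a real XML parser.

```shell
#!/bin/sh
# Print the <value> that follows a given <name> in a Hadoop-style XML file.
# Hypothetical helper; assumes <name> and <value> are on separate lines.
get_prop() {
    file=$1
    name=$2
    awk -v want="<name>$name</name>" '
        index($0, want) { found = 1; next }   # remember we hit the wanted name
        found && /<value>/ {
            sub(/.*<value>/, "")              # strip everything up to the value
            sub(/<\/value>.*/, "")            # strip the closing tag and beyond
            print
            exit
        }
    ' "$file"
}

# Example: show the filesystem URI, but only if the config file exists.
conf=${HADOOP_CONF_DIR:-/opt/hadoop/conf}
if [ -f "$conf/core-site.xml" ]; then
    get_prop "$conf/core-site.xml" fs.default.name
fi
```

Against the core-site.xml above this should print the hdfs://localhost:8020 URI. An unknown property name prints nothing.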
Copy the metrics config over. From here on, the commands are run from /opt/hadoop.
cd /opt/hadoop
cp etc/hadoop/hadoop-metrics* conf/
Set up the environment (your JAVA_HOME may be different; I’m testing on Debian).
export JAVA_HOME=/usr/lib/jvm/java-6-sun/
export HADOOP_HOME=/opt/hadoop/
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export HADOOP_CONF_DIR=${HADOOP_HOME}/conf/
export YARN_CONF_DIR=${HADOOP_HOME}/conf/
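A typo in any of these paths tends to surface much later as a confusing daemon error, so a quick sanity check can save some head scratching. This is only a sketch: check_hadoop_env is a hypothetical helper, and the list of variables it looks at is my assumption about which ones matter most here.

```shell
#!/bin/sh
# Report any Hadoop environment variable that is unset or not a directory.
# Prints one line per problem; prints nothing when everything looks fine.
check_hadoop_env() {
    for var in JAVA_HOME HADOOP_HOME HADOOP_CONF_DIR YARN_CONF_DIR; do
        eval "val=\${$var:-}"             # indirect lookup of the variable
        if [ -z "$val" ]; then
            echo "$var is not set"
        elif [ ! -d "$val" ]; then
            echo "$var points at a missing directory: $val"
        fi
    done
}

check_hadoop_env
```

Run it in the same shell where you did the exports. No output means all four variables point at real directories.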
Format your namenode.
./bin/hdfs namenode -format
Start up the HDFS daemons.
./sbin/hadoop-daemon.sh start namenode
./sbin/hadoop-daemon.sh start datanode
Fire up the YARN daemons.
- ResourceManager is analogous to the JobTracker.
- NodeManager is analogous to the TaskTracker.
- JobHistoryServer gives you a better interface to job histories than the JobTracker did.
More on the YARN architecture here
./bin/yarn-daemon.sh start resourcemanager
./bin/yarn-daemon.sh start nodemanager
./bin/yarn-daemon.sh start historyserver
Make sure everything is up and running.
# jps
12370 Jps
11057 NameNode
11231 DataNode
12053 JobHistoryServer
11875 ResourceManager
12284 NodeManager
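If you would rather not eyeball that list, something like this can tell you which daemons are missing. A sketch only: missing_daemons is a name I made up, and it assumes jps prints "pid ClassName" pairs like the output above.

```shell
#!/bin/sh
# Read `jps` output on stdin and print every expected daemon that is absent.
missing_daemons() {
    awk '
        BEGIN {
            n = split("NameNode DataNode ResourceManager NodeManager JobHistoryServer", want, " ")
        }
        { seen[$2] = 1 }                  # second column is the class name
        END {
            for (i = 1; i <= n; i++)
                if (!(want[i] in seen)) print want[i]
        }
    '
}

# Only query the JVMs if jps is actually on the PATH.
if command -v jps >/dev/null 2>&1; then
    jps | missing_daemons
fi
```

Empty output means all five daemons showed up. Anything printed is a daemon whose log is worth a look.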
Web interfaces:
- NameNode: http://localhost:50070/dfshealth.jsp
- ResourceManager: http://localhost:8088/cluster
- JobHistory: http://localhost:19888/jobhistory
- NodeManager: http://localhost:9999/node
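To poke all four UIs from the terminal instead of a browser, a loop along these lines might do. It is a sketch under a couple of assumptions: probe is a hypothetical helper, curl is installed, and a page counts as "up" only when it answers with a plain HTTP 200 (a redirect would be reported as down).

```shell
#!/bin/sh
# Print "up" if the URL answers with HTTP 200, "down" otherwise.
probe() {
    code=$(curl -s -o /dev/null --max-time 5 -w '%{http_code}' "$1" 2>/dev/null)
    if [ "$code" = "200" ]; then echo up; else echo down; fi
}

# Walk the same four URLs listed above.
if command -v curl >/dev/null 2>&1; then
    while read -r name url; do
        printf '%-16s %s\n' "$name" "$(probe "$url")"
    done <<'EOF'
NameNode http://localhost:50070/dfshealth.jsp
ResourceManager http://localhost:8088/cluster
JobHistory http://localhost:19888/jobhistory
NodeManager http://localhost:9999/node
EOF
fi
```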
Now to do some light work.
Calculate yourself some Pi:
./bin/hadoop jar ./hadoop-mapreduce-examples-0.23.0.jar pi 10 10000
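The two arguments are the number of map tasks and the number of samples per map, so this run throws 100,000 points in total. For a feel of what the job computes, here is a tiny local sketch in awk. As far as I know the bundled example uses a quasi-Monte Carlo sequence; this version swaps in plain pseudo-random sampling, and estimate_pi is a made-up name, so expect the digits to differ from what Hadoop prints.

```shell
#!/bin/sh
# Estimate pi by sampling random points in the unit square and counting
# how many land inside the quarter circle of radius 1.
estimate_pi() {
    maps=$1
    samples=$2
    awk -v maps="$maps" -v samples="$samples" 'BEGIN {
        srand(42)                         # fixed seed so reruns agree on one machine
        total = maps * samples
        for (i = 0; i < total; i++) {
            x = rand(); y = rand()
            if (x * x + y * y <= 1) inside++
        }
        printf "%.4f\n", 4 * inside / total
    }'
}

# Same shape as the Hadoop invocation: 10 "maps" of 10000 samples each.
estimate_pi 10 10000
```

Everything runs in one process here. The point of the real job is that each map does its share of the sampling and a reducer adds up the counts.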
And while you’re at it, pimp yourself a wordcount.
wget http://www.gutenberg.org/cache/epub/779/pg779.txt -O /tmp/faustus.txt
./bin/hadoop fs -mkdir /tmp
./bin/hadoop fs -copyFromLocal /tmp/faustus.txt /tmp/faustus.txt
./bin/hadoop jar ./hadoop-mapreduce-examples-0.23.0.jar wordcount /tmp/faustus.txt /tmp/faustus.out
./bin/hadoop fs -cat /tmp/faustus.out/part-r-00000
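The raw output is one word and its count per line, separated by a tab, ordered by word rather than by frequency. To see the heavy hitters, a small filter like this sketch can help. top_words is a hypothetical helper, and it assumes the tab-separated layout just described.

```shell
#!/bin/sh
# Read "word<TAB>count" lines on stdin and print the N most frequent words.
top_words() {
    n=${1:-10}                            # default to the top 10
    tab=$(printf '\t')
    # Sort by count (descending), then by word to keep ties stable.
    sort -t "$tab" -k2,2nr -k1,1 | head -n "$n"
}

# Typical use, run from /opt/hadoop once the job has finished:
#   ./bin/hadoop fs -cat /tmp/faustus.out/part-r-00000 | top_words 20
```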
Next time I’ll make a proper cluster, play with some of the more whizzbang features and maybe run some terasorts for fun.