At Outbrain, I’ve recently been tasked with setting up and testing Facebook’s Scribe log aggregation server for collecting clicks, impressions and other data for eventual loading into our data warehouse.
From the README:
Scribe is a server for aggregating log data that’s streamed in realtime from clients. It is designed to be scalable and reliable.
Facebook Scribe can be found on GitHub: https://github.com/facebook/scribe
Here is an ancient access_log from a Sumerian web server dating from the 26th century BC. If you read cuneiform, you’d clearly see the entries from when Enki thought it was a cool idea to release his code red nam shub on the world.

Obviously, if they had proper, scalable log aggregation and analytics they might have nipped that in the bud before it turned into the great pre-biblical DDoS.
From reading around the web, I gather that building Scribe is notoriously difficult. I found a few installation guides, but mostly for Linux distributions that are less package-conservative than CentOS. The steps I outline below for building and installation are what worked for me, and they assume you have the EPEL repository installed.
NOTE: Silas Sewell has posted some wonderful source RPMs for CentOS 5 / RHEL 5 on his blog, which I will elaborate on in a future post. My only problem with them is that thrift is built --without-java --without-perl --without-ruby --without-csharp for various reasons. I did steal his Boost 1.36 to 1.33 hack and init scripts for this build, though.
So, without further ado let’s get started.
# Create important directories:
mkdir -p ~/build/
cd ~/build/
mkdir /etc/scribed/
# Install whatever dependencies we can from yum.
yum -y install gcc-c++ boost boost-devel libevent libevent-devel \
automake autoconf m4 bison zlib zlib-devel bzip2 \
bzip2-devel flex pkgconfig python-devel ruby-devel
# Thrift requires a newer version of libtool to build, so remove CentOS’s and build anew:
rpm -qa | grep libtool | xargs -r rpm -e --nodeps
curl http://ftp.gnu.org/gnu/libtool/libtool-2.2.8.tar.gz | tar zxv
cd libtool-2.2.8/ && ./configure && make && make install && cd ..
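Since the distribution's libtool was just removed and a new one installed under /usr/local, it can be worth confirming which version the shell now picks up before bootstrapping thrift. What follows is a minimal sketch, assuming a POSIX shell and purely numeric version components. The version_ge and check_libtool helpers are hypothetical names of mine, and 2.2.8 is used as the threshold only because it is the version built above, not because thrift documents it as a minimum.

```shell
# version_ge A B: succeed if dotted version A is >= dotted version B.
# Plain POSIX sh, so it does not depend on `sort -V`.
# Note: uses global variables, since POSIX sh has no `local`.
version_ge() {
    vg_left=$1
    vg_right=$2
    while [ -n "$vg_left" ] || [ -n "$vg_right" ]; do
        # Take the leading component of each version; a missing one counts as 0.
        vg_l=${vg_left%%.*}
        vg_r=${vg_right%%.*}
        vg_l=${vg_l:-0}
        vg_r=${vg_r:-0}
        if [ "$vg_l" -gt "$vg_r" ]; then return 0; fi
        if [ "$vg_l" -lt "$vg_r" ]; then return 1; fi
        # Drop the component that was just compared.
        case $vg_left in *.*) vg_left=${vg_left#*.} ;; *) vg_left= ;; esac
        case $vg_right in *.*) vg_right=${vg_right#*.} ;; *) vg_right= ;; esac
    done
    return 0
}

# check_libtool: report whether the libtool on PATH is at least 2.2.8.
# Assumption: the first line of `libtool --version` looks like
# "libtool (GNU libtool) 2.2.8", so the version is the fourth field.
check_libtool() {
    found=$(libtool --version 2>/dev/null | head -n 1 | awk '{ print $4 }')
    if [ -n "$found" ] && version_ge "$found" 2.2.8; then
        echo "libtool $found looks new enough"
    else
        echo "libtool is missing or older than 2.2.8 (found: ${found:-none})" >&2
        return 1
    fi
}
```

Running check_libtool right after the make install step should print the version that thrift's bootstrap.sh will end up using. If it still reports the old version, the shell may have cached the old path, and hash -r clears that cache.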
# Get a modern version of Java (gotta get it from Oracle directly… that feels weird to type)
rpm -ivh /mnt/temp/rpms/jdk-6u20-linux-amd64.rpm
echo 'export JAVA_HOME="/usr/java/jdk1.6.0_20"' > /etc/profile.d/java.sh
echo 'export PATH=${JAVA_HOME}/bin:${PATH}' >> /etc/profile.d/java.sh
export JAVA_HOME=/usr/java/jdk1.6.0_20
export PATH=${JAVA_HOME}/bin:${PATH}
# Thrift requires a new version of ant
curl http://www.fightrice.com/mirrors/apache/ant/binaries/apache-ant-1.8.1-bin.tar.gz | tar zxv
mv apache-ant-1.8.1/ /opt/ant
echo 'export ANT_HOME=/opt/ant' > /etc/profile.d/ant.sh
echo 'export PATH=/opt/ant/bin:$PATH' >> /etc/profile.d/ant.sh
export ANT_HOME=/opt/ant
export PATH=/opt/ant/bin:$PATH
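Before moving on to thrift, a quick check that the PATH changes above actually took effect can save a confusing configure failure later. This is only a sketch: require_cmds is a hypothetical helper name, and the list of commands to check is my guess at what the thrift build will look for, not something taken from its documentation.

```shell
# require_cmds CMD...: print where each command resolves on PATH and
# return non-zero if any of them cannot be found.
require_cmds() {
    rc_missing=0
    for rc_cmd in "$@"; do
        if command -v "$rc_cmd" >/dev/null 2>&1; then
            echo "ok:      $rc_cmd -> $(command -v "$rc_cmd")"
        else
            echo "missing: $rc_cmd" >&2
            rc_missing=1
        fi
    done
    return "$rc_missing"
}

# Example (not run here, since the result depends on the machine):
#   require_cmds java ant libtool g++ bison flex
```

If java or ant resolve to a path outside /usr/java or /opt/ant, an older copy earlier on the PATH is probably shadowing the one just installed.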
# Build & install thrift 0.2.0
curl http://mirror.atlanticmetro.net/apache/incubator/thrift/0.2.0-incubating/thrift-0.2.0-incubating.tar.gz | tar zxv
cd thrift-0.2.0
cp /usr/share/aclocal/pkg.m4 aclocal/
./bootstrap.sh && ./configure --with-csharp=no --with-erlang=no --with-ruby=no
make && make install
# Build & install fb303 (Facebook Bassline), which ships in thrift's contrib directory
cd contrib/fb303/
./bootstrap.sh && ./configure
make && make install && cd ../../../
# Finally, build & install scribe after modifying it to use boost 1.33 (Thanks Silas!):
curl http://cloud.github.com/downloads/facebook/scribe/scribe-2.1.tar.gz | tar zxv
cd scribe-2.1
export LD_LIBRARY_PATH="/usr/local/lib"
sed -i 's/1.36/1.33/' configure.ac
sed -i 's/dir_iter->filename/dir_iter->leaf/' src/file.cpp
./bootstrap.sh && ./configure && make && make install
Depending on the version of thrift installed, you may need to run the following against the scribe_cat script. It changes the LogEntry call from a single dict argument to keyword arguments:
sed -i 's/log_entry = scribe.LogEntry(dict(category=category, message=sys.stdin.read()))/log_entry = scribe.LogEntry(category=category, message=sys.stdin.read())/' ./examples/scribe_cat
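Because sed -i rewrites the file in place, it can help to preview the change first. The sketch below runs the same substitution without -i and shows a diff, so you can tell whether the pattern matches your copy of scribe_cat at all. The SCRIBE_CAT_EXPR variable and the preview_patch function are hypothetical names introduced for illustration.

```shell
# The same expression as the sed -i command above, kept in one place.
SCRIBE_CAT_EXPR='s/log_entry = scribe.LogEntry(dict(category=category, message=sys.stdin.read()))/log_entry = scribe.LogEntry(category=category, message=sys.stdin.read())/'

# preview_patch FILE: print a diff of what the substitution would change,
# without modifying FILE. Prints nothing if the pattern does not match.
preview_patch() {
    sed "$SCRIBE_CAT_EXPR" "$1" | diff "$1" -
}

# Example (not run here, since it needs the scribe source tree):
#   preview_patch ./examples/scribe_cat
```

An empty diff means the script already uses keyword arguments, or uses some other form, and the sed -i step can be skipped.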
# Copy the example stuff out:
cp ./examples/scribe_cat /usr/local/bin
cp ./examples/scribe_ctrl /usr/local/bin
cp ./examples/example1.conf /etc/scribed/default.conf
# Here is the init script, /etc/init.d/scribed
#!/bin/sh
#
# scribed - this script starts and stops the scribed daemon
#
# chkconfig: - 84 16
# description: Scribe is a server for aggregating log data \
# streamed in real time from a large number of \
# servers.
# processname: scribed
# config: /etc/scribed/default.conf
# config: /etc/sysconfig/scribed
# pidfile: /var/run/scribed.pid
# Source function library
. /etc/rc.d/init.d/functions
run="/usr/local/bin/scribed"
run_ctrl="/usr/local/bin/scribe_ctrl"
prog=$(basename $run)
[ -e /etc/sysconfig/$prog ] && . /etc/sysconfig/$prog
port=$(egrep "^port=" $SCRIBED_CONFIG | awk -F"=" '{ print $2 }')
lockfile=/var/lock/subsys/scribed
start() {
    echo -n $"Starting $prog: "
    # scribed is launched in the background, so $? below only reports
    # that the launch was issued, not that the daemon stayed up.
    daemon nohup $run -c $SCRIBED_CONFIG > /dev/null 2>&1 &
    retval=$?
    echo
    [ $retval -eq 0 ] && touch $lockfile
    return $retval
}

stop() {
    echo -n $"Stopping $prog: "
    $run_ctrl stop $port
    retval=$?
    echo
    [ $retval -eq 0 ] && rm -f $lockfile
    return $retval
}

status() {
    $run_ctrl status $port
}

restart() {
    stop
    start
}

reload() {
    echo "Probably not implemented."
    $run_ctrl reload $port
}

case "$1" in
    start|stop|restart|status|reload)
        $1
        ;;
    *)
        echo $"Usage: $0 {start|stop|status|restart|reload}"
        exit 2
esac
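The init script finds the port by grepping the config file for a line that starts with port= and passing the value to scribe_ctrl. Here is a small standalone sketch of that lookup, written so it also tolerates whitespace around the equals sign in case your config has it. The scribe_port function name is hypothetical, and the sample config values are placeholders for the demonstration only.

```shell
# scribe_port CONF: print the value of the first line in CONF that
# starts with "port", stripped of spaces and tabs.
scribe_port() {
    awk -F= '/^port[ \t]*=/ { gsub(/[ \t\r]/, "", $2); print $2; exit }' "$1"
}

# Demonstration against a throwaway config file (sample values only).
sample=$(mktemp /tmp/scribed-conf.XXXXXX)
printf '%s\n' 'port=1463' 'check_interval=3' > "$sample"
scribe_port "$sample"    # prints 1463
rm -f "$sample"
```

If the function prints nothing for your real config, the init script's $port will be empty too, and scribe_ctrl will be called without a port argument.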
# Set it up:
chmod +x /etc/init.d/scribed
chkconfig --add scribed
# Create the /etc/sysconfig file
echo "SCRIBED_CONFIG=/etc/scribed/default.conf" >> /etc/sysconfig/scribed
I’ll post more about Silas’ RPMs, Digg’s scribe-log4j-appender, and actual server configuration, so stay tuned!