At Outbrain, I’ve recently been tasked with setting up and testing Facebook’s Scribe log aggregation server for collecting clicks, impressions and other data for eventual loading into our data warehouse.
From the README:
Scribe is a server for aggregating log data that’s streamed in realtime from clients. It is designed to be scalable and reliable.
Facebook Scribe can be found here.
Here is an ancient access_log from a Sumerian web server dating from the 26th century BC. If you read cuneiform, you’d clearly see the entries from when Enki thought it was a cool idea to release his code red nam shub on the world.
Obviously, if they had proper, scalable log aggregation and analytics they might have nipped that in the bud before it turned into the great pre-biblical DDoS.
From reading around the web I have gathered that building Scribe is notoriously difficult. I’ve found a few installation guides, but mostly for Linux distributions that are less package-conservative than CentOS. The steps I outline below for building and installation are what worked for me, and they assume you have the EPEL repository installed.
NOTE: Silas Sewell has posted some wonderful source RPMS for CentOS 5 / RHEL 5 on his blog which I will elaborate on in a future post. My only problem with them is that thrift is built --without-java --without-perl --without-ruby --without-csharp for various reasons. I did steal his Boost 1.36 to 1.33 hack and init scripts for this build though.
So, without further ado, let’s get started.