Brisk is DataStax’s upcoming Cassandra/Hadoop hybrid distribution. From their site:
DataStax’ Brisk is an enhanced open-source Apache Hadoop and Hive distribution that utilizes Apache Cassandra for many of its core services. Brisk provides integrated Hadoop MapReduce, Hive and job and task tracking capabilities, while providing an HDFS-compatible storage layer powered by the Cassandra DB.
They added Cassandra as an option for the Hadoop storage layer, allowing you to bypass HDFS; however, the implications of that go a whole lot further. You get the strengths of both systems and lose some of the problems of each.
I’m pretty jazzed about this and I hope to convince my co-workers to give it a go. I’d like to tell you why.
Reason the First: Removal of the Namenode.
Don’t get me wrong: the Namenode is a triumph and the culmination of years of engineering. However, it is a glaring single point of failure and has hard limits on how far it can scale. If Hadoop is in your critical path, the idea of a corrupt fsimage file gives you cold chills.
When it is down… so are you.
- It is not highly available out of the box.
- There is no hot stand-by Namenode.
- It has a small file/block size problem.
- It has other general scaling problems.
My previous post is a funny example of a worst case scenario.
Cassandra has no single point of failure and no master node coordinating the cluster. All nodes are equal.
Reason the Second: Speeding up analytics.
We’ve got scads of algorithms and a few classes of logs we collect from our API servers. Currently we generate four logs (of different types) per API server per hour that are of interest to us.
Some logs are ~400KB and some are ~100MB, but our average ‘in HDFS’ file size is 5MB. This is NOT awesome in terms of scaling in light of the small files problem. Depending on our retention needs and how many new API servers we add, we’re going to zap our Namenode one day.
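To put rough numbers on that, here is a back-of-the-envelope sketch using the oft-cited rule of thumb of roughly 150 bytes of Namenode heap per namespace object (file, directory, or block); the retention figures are hypothetical stand-ins for ours:

```python
# Rough Namenode heap math under the small files problem.
# ~150 bytes per namespace object is the commonly quoted rule of
# thumb; the workload numbers below are made up for illustration.
BYTES_PER_OBJECT = 150
files_per_hour = 4 * 100        # 4 log types x 100 API servers
hours_retained = 24 * 365       # say, a year of retention

files = files_per_hour * hours_retained
blocks = files                  # a ~5MB file fits in one 64MB block
heap = (files + blocks) * BYTES_PER_OBJECT

print("%d files -> ~%d MB of Namenode heap" % (files, heap / (1024 * 1024)))
# 3504000 files -> ~1002 MB of heap, before we add a single new API server
```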
Use Hadoop Archives, you say?
Well yeah, that might help the Namenode, but we also have issues in terms of query efficiency: a job still needs to traverse all those files within the HAR.
Querying one type of data means reading one type of log, and we collect one of those logs per API server per hour.
If R&D runs a Hive job on the last 30 days of a particular table, over data on 100 servers, we’re looking at 72000 maps (720 hours * 100 servers) at minimum and much more depending on the query’s complexity.
This is a gross oversimplification, but you get the idea.
More small files means more DFS blocks; more blocks means more maps; more maps means longer jobs; longer jobs means delay.
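Here is the same arithmetic as a quick sketch, assuming (as Hadoop does for files smaller than a block) at least one map task per file; the aggregated figure previews where we want to get to:

```python
# Lower bound on map tasks for a 30-day Hive query over one log type.
servers = 100
hours = 30 * 24                          # last 30 days

per_server_files = servers * hours       # one file per server per hour
print("at least %d maps" % per_server_files)   # 72000

# One aggregated file per hour across all servers instead:
aggregated_files = hours
print("down to %d maps" % aggregated_files)    # 720
```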
Right now, we’re hoping to improve this by using Cloudera’s Flume to stream all logs from all servers into a single aggregated hourly file. (That’s my next big project; in fact, stay tuned for Flume blog posts in the coming months).
We’re also kicking around the idea of migrating to HBase to increase query speed.
Our logs are, for all intents and purposes, structured data (TSVs) that could realistically be made to fit into a more structured, lower-latency system like HBase.
HBase and Cassandra share a lot of similarities in terms of data modeling… so if you’re thinking HBase, you could just as easily think Brisk.
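To make the “structured TSVs fit a column store” point concrete, here is a minimal sketch using pycassa; the keyspace, column family, field layout, and host names are all hypothetical:

```python
# Sketch: map one TSV log line onto a Cassandra column family.
import pycassa

pool = pycassa.ConnectionPool('Logs', server_list=['cass1:9160'])
logs = pycassa.ColumnFamily(pool, 'ApiLogs')

line = "2011-04-02T15:04:05\tserver42\tuser123\trec_served\titem987"
ts, host, user, event, item = line.rstrip('\n').split('\t')

# Row key: one row per server per hour; one column per event.
row_key = '%s:%s' % (host, ts[:13])          # e.g. "server42:2011-04-02T15"
logs.insert(row_key, {'%s:%s' % (ts, user): '%s\t%s' % (event, item)})
```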
Reason the Third: Potential removal of the memcached layer.
This one is iffy (and Cassandra-generic rather than Brisk-specific), and requires more research in terms of implementation on my part, but deserves serious consideration.
Edward Capriolo gave a talk the other night at the NYC Cassandra Meetup group (which I missed!) entitled Cassandra as Memcache, in which he makes a compelling case for obviating the use of memcached in front of Cassandra.
The slides are available here and I’m hoping a video of the talk surfaces soon.
He walks through four configurations utilizing different replication factors, consistency levels, and row and key caches, each with different tradeoffs.
You really should grab the slides. There is lots to think about there, and you’ve also got to respect a guy who posts his slides in an Open/LibreOffice format.
Reason the Fourth: Less shipping of data from the primary serving data store to the data warehouse.
Here is a VERY rough idea of what one of our processes looks like (in sweet pseudo-BASIC):
10 API servers serve up recommendations from Cassandra.
20 API servers generate logs of those transactions.
30 Logs are shipped to Hadoop.
40 Algorithms run across that data via Hive.
50 Algorithm results decide on good recs and are shipped to Cassandra.
60 GOTO 10
This also speaks to analytics speed. The feedback cycles for many of our algorithms traverse several layers of machines and some data needs to be transformed or processed before it can be fed to our API servers.
Some data can take a few hours to migrate up to the business folks because of this.
One VERY cool thing you can do with Brisk is partition your cluster (Cassandra is built to do this very simply and seamlessly for spanning multiple datacenters), so that replicas of cluster data are distributed evenly between the two halves of the cluster.
THEN you can use one side for serving and one side for analytics, allowing data to move back and forth IN REAL TIME. DataStax has a nice illustration of this in their whitepaper.
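A sketch of how that replica placement might be declared, using pycassa’s SystemManager; the datacenter names and replica counts here are assumptions, the point being that NetworkTopologyStrategy pins replicas to each half:

```python
# Sketch: pin keyspace replicas to each half of a partitioned cluster.
from pycassa.system_manager import SystemManager, NETWORK_TOPOLOGY_STRATEGY

sys_mgr = SystemManager('cass1:9160')
sys_mgr.create_keyspace(
    'Recommendations',
    replication_strategy=NETWORK_TOPOLOGY_STRATEGY,
    strategy_options={'serving': '2', 'analytics': '1'},  # hypothetical DC names
)
sys_mgr.close()
```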
So here is my dream of simplicity:
Serving Layer <-> Bunch of Cassandra servers running Brisk.
and that is it.
From an Operations perspective, this is fantastic since it reduces the classes of hardware I need for this part of our infrastructure to two. It also makes us more elastic: add a Cassandra node and a few Tomcat nodes, et voilà.
Workflow might be something like this:
- The Recommendations Keyspace has a very high replication factor across the cluster and a large row cache, so it can act like memcached.
- Data in the Recommendations Keyspace can also be given TTLs so stale recs expire on their own (see the sketch after this list).
- The Logs Keyspace has a replication factor of two (maybe more to increase map/reduce concurrency) so its data is replicated evenly onto the analytics side of the cluster.
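The TTL bit is just a write-time parameter; a minimal pycassa sketch (names hypothetical):

```python
# Sketch: write a rec with a TTL so Cassandra expires it on its own.
import pycassa

pool = pycassa.ConnectionPool('Recommendations', server_list=['cass1:9160'])
recs = pycassa.ColumnFamily(pool, 'UserRecs')

# Expire this rec after 7 days; no cleanup cron needed.
recs.insert('user123', {'item987': '0.87'}, ttl=7 * 24 * 3600)
```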
10 API servers serve up recs from the Recommendations Keyspace.
20 API servers pipe log data right into the Logs Keyspace in Cassandra.
30 Hive jobs run across the Logs Keyspace, and the algorithms update the Recommendations Keyspace accordingly.
40 GOTO 10
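Here is a hypothetical sketch of step 30’s write-back half: run a Hive query and push its output straight into the Recommendations Keyspace. The query, table, and column names are invented, and in Brisk the table would be backed by the Logs Keyspace rather than by HDFS files:

```python
# Hypothetical step 30: harvest Hive output and write recs back to
# Cassandra. All names are made up for illustration.
import subprocess
import pycassa

QUERY = """
SELECT user_id, item_id, score
FROM api_logs_scored
WHERE dt >= '2011-03-01'
"""

# 'hive -S -e' runs a query silently and emits tab-separated rows.
out = subprocess.check_output(['hive', '-S', '-e', QUERY]).decode('utf-8')

pool = pycassa.ConnectionPool('Recommendations', server_list=['cass1:9160'])
recs = pycassa.ColumnFamily(pool, 'UserRecs')

batch = recs.batch(queue_size=100)   # buffer writes into batch mutations
for line in out.splitlines():
    if not line:
        continue
    user_id, item_id, score = line.split('\t')
    batch.insert(user_id, {item_id: score})
batch.send()
```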
You can see the efficiency. Cassandra is FANTASTIC at writes, and reusing the same connections for steps 10 and 20 might cut network overhead too.
And perhaps we could use Flume to siphon log data off of nodes or off the cluster itself to an offline archive.
Less shuffling of data around. No clumsy scripts that rely on NFS or SCP and shared keys.
Also, business data can be queried as it comes in.
Will this work?
Heck if I know. Brisk isn’t out yet so it is untested (publicly). It could grind to a halt, but I’d love to give it a shot.
Soo… it’s definitely a union of technologies I am looking forward to.