Why I am very excited about DataStax’s Brisk.

DataStax (née Riptano) is to Cassandra as Cloudera is to Hadoop (or Redhat is to Linux).

Brisk is DataStax’s upcoming Cassandra/Hadoop hybrid distribution.  From thier site:

DataStax’ Brisk is an enhanced open-source Apache Hadoop and Hive distribution that utilizes Apache Cassandra for many of its core services. Brisk provides integrated Hadoop MapReduce, Hive and job and task tracking capabilities, while providing an HDFS-compatible storage layer powered by the Cassandra DB.

They added Cassandra as an option for the Hadoop storage layer, allowing you to bypass HDFS; however, the implications of that go a whole lot further.  You get the strengths of both systems here and lose some of the problems.

I’m pretty jazzed about this and I hope to convince my co-workers to give it a go. I’d like to tell you why.

Reason the First:  Removal of the Namenode.

The Hadoop Namenode is a scary, brittle thing.

Don’t get me wrong.  It is a trumph and the culmination of years of engineering. However, it is a glaring single point of failure and has hard limits as to how far it can scale. If Hadoop is in your critical path, the idea of a corrupt fsimage file gives you cold chills.

When it is down… so are you.

  • It is not highly available out of the box.
  • There is no hot stand-by Namenode.
  • It has a small file/block size problem.
  • It has other general scaling problems.

My previous post is a funny example of a worst case scenario.

Cassandra has no single point of failure, no coordinator node.  All nodes are equal.

Reason the Second: Speeding up analytics.

We’ve got scads of algorithims and a few classes of logs we collect from our API servers. Currently we generate 4 logs (different types) per API server per hour that are of interest to us.

Some logs are ~400KB and some are ~100MB, but our average ‘in HDFS’ file size is 5MB. This is NOT awesome in terms of scaling in light of the small files problem. Depending on our retention needs and how many new API servers we add, we’re going to zap our Namenode one day.

Use Hadoop Archives, you say?

Well yeah, that might help the Namenode, but we also have issues in terms of query efficiency.  We still need to traverse all those files within the HAR.

Querying one type of data means one type of log we collect, and we’re doing one of these logs per API server per hour.

If R&D runs a Hive job on the last 30 days of a particular table, over data on 100 servers, we’re looking at 72000 maps (720 hours * 100 servers) at minimum and much more depending on the query’s complexity.

This is a gross oversimplification, but you get the idea.

More small files means more DFS blocks; means more maps; more maps means longer jobs; longer jobs means delay.

Right now, we’re hoping to improve this by using Cloudera’s Flume to stream all logs from all servers into a single aggregated hourly file. (That’s my next big project; in fact, stay tuned for Flume blog posts in the coming months).

We’re also kicking around the idea of migrating to Hbase to increase query speed.

Our logs are, for all intents and purposes, structured data (TSV’s) that could realistically be made to fit into a more structured, lower latency system like Hbase.

Facebook does something like this, but uses Scribe instead of Flume.

Hbase and Cassandra share a lot of similarities in terms of data modeling….so if you’re thinking Hbase, you could just as easily think Brisk.

Reason the Third: Potential removal of the memcached layer.

This one is iffy (and Cassandra-generic rather than Brisk-specific), and requires more research in terms of implementation on my part, but deserves serious consideration.

Edward Capriolo gave a talk the other night at the NYC Cassandra Meetup group (which I missed!) entitled Cassandra as Memcache where he gives a compelling case for obviating the use of memcached in front of Cassandra.

The slides are availaible here and I’m hoping a video of the talk surfaces soon.

He gives four configurations utilizing different replication factors, consistency levels, row and key caches with different tradeoffs, like:

You really should grab the slides.  Lots to think about there, and you’ve also got to respect a guy who posts his slides in an Open/LibreOffice format :)

Reason the Fourth: Less shipping of data from the primary serving data store to the data warehouse.

Here is a VERY rough idea what one of our processes looks like (in sweet psuedo-BASIC):

10 API servers serve up recomendations from Cassandra.
20 API server generates logs of those transactions.
30 Logs are shipped to Hadoop.
40 Algo run across that data via Hive.
50 Algo results decide on good recs and are shipped to Cassandra.
60 go to 10

This also speaks to analytics speed.  The feedback cycles for many of our algorithms traverse several layers of machines and some data needs to be transformed or processed before it can be fed to our API servers.

Some data can take a few hours to migrate up to the business folks because of this.

One VERY cool thing you can do is with Brisk is partition your cluster (Cassandra is built to do this very simply and seamlessly for spanning multiple datacenters), causing replica placement of cluster data to be evenly distributed between the two cluster halves.

THEN you can use one side for serving and one side for analytics, allowing data to move back and forth IN REAL TIME.  DataStax puts a nice illustration of this in their whitepaper:

So here is my dream of simplicity:

Serving Layer <-> Bunch of Cassandra servers running Brisk.

and that is it.

From an Operations perspective, this is fantastic since it reduces the classes of hardware I need for this part of our infrastructure to two.  It also makes us more elastic:  add a Cassandra node and a few Tomcat nodes, et voilà.

Workflow might be something like this:

  • Recommendation Keyspace has a very high replication factor across the cluster and a large row cache to act like memcached.
  • Data in the Recommendations Keyspace can also be given TTLs so stale recs can expire.
  • Log Keyspace has a replication factor of two (maybe more to increase map/reduce concurrency) so it is split evenly into the analytics side of the cluster.

10 API server serves up recs from Recommendation Keyspace.
20 API server pipes log data right into a Logs Keyspace in Cassandra
30 Hive Job run across Logs Keyspace and algorithm updates the Recommendations Keyspace accordingly.
40 Go to 10.

You can see the efficiency. Cassandra is FANTASTIC at writes, and using the same TCP session from steps 10 & 20 might cause less network overhead too.

And perhaps we could use Flume to siphon log data off of nodes or off the cluster itself to an offline archive.

Less shuffling of data around.  No clumsy scripts that rely on NFS or SCP and shared keys.

Also, business data can be queried as it comes in.

Will this work?

Heck if I know. Brisk isn’t out yet so it is untested (publicly). It could grind to a halt, but I’d love to give it a shot.

Soo… it’s definelty a union of technologies I am looking forward to.

 

Share
  • zznate

    This Cassandra Flume plugin might be of interest: https://github.com/thobbs/flume-cassandra-plugin

    Cheers,
    -zznate

    • http://blog.milford.io Nathan Milford

      I’ve been thinking along those lines. I’ve still got to sell this to the boss :)

      BTW, I’m using your fantastic cassandra-stress project right now. I’m literally at that datacenter installing hardware for a new Cassandra cluster that I plan to unleash cassandra-stress on.

      Brilliant work there, thanks for it.

      • zznate

        Awesome! Stress.java has come a long way since I started cassandra-stress though. I mostly just use cassandra-stress now for testing connection failover situations and JMX plumbing in Hector. However, if any feature ideas come up, let me know.

  • Edward Capriolo

    This is very cool stuff. I can not wait to try out Brisk. Brisk removing all SPOF from hadoop and making the NN scalable would be amazing. Currently our NN has a huge heap ~13GB and we are stuck around 10,000,000 blocks. Being able to have a scalable NN would allows us to get our file system bigger WITHOUT sacrificing small blocks.

    • http://blog.milford.io Nathan Milford

      I’m right there with you. Just one less thing to keep me up at night :)

      Let me know when you got video of your talks. I’m anxious to view them.

      • Ben Holloway

        Do you still use brisk now that they’ve effectively switched to a closed source model?

    • http://otis.myopenid.com/ Otis Gospodnetic

      @Edward but does Brisk also eliminate the JT SPOF? Or just the NN SPOF?

      • http://blog.milford.io Nathan Milford

        @Otis I don’t see the JT as much of a problem as the NN being a SPOF. If you lose the NN you lose everything. If you lose the JT, you only lose the jobs currently running before you spawn the JT elsewhere and point DNS at it.

        I happened to be at a Datastax training session with Jake Luciani the day Brisk was announced and pinged him on that very subject and he said the JT was dynamic.

        From Datastax’s whitepaper:

        “On startup, all Brisk nodes automatically start a Hadoop task tracker, and one of the nodes is elected to be the job tracker. Brisk provides full data locality awareness for Hadoop task assignment.”

        • http://otis.myopenid.com/ Otis Gospodnetic

          Right, I noticed that “elected” bit, too, when I skimmed the paper the other day. It’s not clear from the paper what happens when that elected JT dies. Does a new one get transparently elected somehow?

          • http://blog.milford.io Nathan Milford

            I’m not sure yet. The docs don’t go that far. The impression that I got from Jake was that they did a lot of work on this and it is far more than a simple plugin of Cassandra, that the the binding of Cassandra to the MapRed layer was reasonably profound.

          • Jeremy Hanna1234

            Just checked – they had to do only so much for this first release, but there is a ticket to instantiate/auto-failover to another JT. If you think about how the original one is chosen, it seems like it’s pretty straightforward, but we’ll see how it comes along.

      • Edward Capriolo

        It does not look like it does. The JobTracker is not really a single point of failure. For example in a two-thousand-node cluster can have 2 JobTrackers. 1000 TaskTrackers nodes can be configured for one JobTracker, 1000 can be configured for the other.

  • http://blog.milford.io Nathan Milford

    I’m right there with you. Just one less thing to keep me up at night :)

    Let me know when you got video of your talks. I’m anxious to view them.

  • http://otis.myopenid.com/ Otis Gospodnetic

    @Nathan Thanks for the useful post! Some questions:

    Reason 1: Does it also get rid of JT SPOF? Sounds like only NN SPOF is eliminated, but JT SPOF remains?

    Reason 2: You mention speeding up analytics, but then talk about small files… is the connection your need to aggregate small files into bigger files in order to avoid having too many small files? And because you need to aggregate you need to introduce a delay during which new data is not visible to analytics? Can you solve that by simply storing your data in HBase, a row at a time, making each row immediately visible and letting HBase deal with HFiles and HDFS.

    Reason 3: No memcached. Couldn’t the same be done with HBase, which also provides fast key lookup?

    Reason 4: Would HBase cluster replication (not HDFS file replication, but the asynchronous, near real time HBase cluster replication) do the job here? Would that allow you to have a separate set of servers for analytics? For example, one could create external Hive tables and point them to HBase tables in this secondary cluster, thus separating ad-hoc Hive queries from data ingestion happening in the primary HBase cluster.

    Please don’t get me wrong – not saying Brisk is useless because HBase can do it all (it can’t avoid NN SPOF, for example), just trying to understand where Brisk introduces something really new and previously impossible.

    Thanks!

    • Edward Capriolo

      Reason 3: No. You can not do with HBase what I described in my slides. This is because in HBase a region can only be served by one server. With Cassandra multiple servers can actively serve the same data.

      Reason 4- HBase and it “near” real time asynchronous replication is not the same thing as Cassandra multiple different consistency levels. It cracks me up how Cassandra has been taking so much flak about “eventual consistency” but when the shoe is on the other foot it is suddenly a great feature.

      • http://otis.myopenid.com/ Otis Gospodnetic

        Re Reason 3: I see your slides now. Yeah, those different Cassandra configuration options sound quite versatile and powerful.

        Re Reason 4: Not the same, right, but I think the HBase cluster replication addresses Nathan’s case of not wanting to ship data around and into the Warehouse. With what I described above one doesn’t have to ship anything anywhere.

    • http://blog.milford.io Nathan Milford

      @Otis, thanks for stopping by. I was lucky enough to see your talk at Hadoop World in October :)

      1) Answered below.

      2) We’re considering Hbase and I’ve built it in my test environment but I have far less experience with it than I do with Cassandra.

      While we’re still using Hive on top of HDFS, small files are a problem for both HDFS and MapRed. In either case, if we migrated to HBase or Brisk thing would speed right up. My argument is that if you’re looking at how to model your data to fit into HBase, thinking about modeling for Cassandra/Brisk isn’t that that much of a leap.

      We use Cassandra in production, so if it fills a hole elsewhere then I’m all for it. All things being equal between Hbase and Brisk, for my environment, I’d opt for something my guys already know.

      3) I don’t know enough about the fine details of tuning Hbase, so you tell me. I’d certainly love to see a paper/presentation on it. If the logic of Edward’s Cassandra as Memcache presentation could be applied to HBase thats fantastic! I’m still skeptical if it will really work in production, at scale, but it looks like it should. I’d love to give it a shot in either case.

      4) Seems like a reasonable assumption that this could work in Hbase, but Brisk does (or the whitepaper says it will be able to do) this out of the box with little or no extraneous configuration. With options for corporate support for this specific configuration for nervous CTOs!

      In the end, Brisk isn’t a public product yet, so we just don’t know really what it is capable of. However, we do know what Cassandra and Hadoop/Hive are both capable of and can have reasonable expectations of what Brisk should be able to do.

      Hbase and other systems can probably do everything I’ve outlined here, but for me the removal of the Namenode is the real point.

  • http://blog.milford.io Nathan Milford

    @Otis, thanks for stopping by. I was lucky enough to see your talk at Hadoop World on October :)

    1) Answered below.

    2) We’re considering Hbase and I’ve built it in my test environment but I have far less experience with it than I do with Cassandra.

    While we’re still using Hive on top of HDFS, small files are a problem for both HDFS and MapRed. In either case, if we migrated to HBase or Brisk thing would speed right up. My argument is that if you’re looking at how to model your data to fit into HBase, thinking about modeling for Cassandra/Brisk isn’t that that much of a leap.

    We use Cassandra in production, so if it fills a hole elsewhere then I’m all for it. All things being equal between Hbase and Brisk, for my environment, I’d opt for something my guys already know.

    3) I don’t know enough about the fine details of tuning Hbase, so you tell me. I’d certainly love to see a paper/presentation on it. If the logic of Edward’s Cassandra as Memcache presentation could be applied to HBase thats fantastic! I’m still skeptical if it will really work in production, at scale, but it looks like it should. I’d love to give it a shot in either case.

    4) Seems like a reasonable assumption that this could work in Hbase, but Brisk does (or the whitepaper says it will be able to do) this out of the box with little or no extraneous configuration. With options for corporate support for this specific configuration for nervous CTOs!

    In the end, Brisk isn’t a public product yet, so we just don’t know really what it is capable of. However, we do know what Cassandra and Hadoop/Hive are both capable of and can have reasonable expectations of what Brisk should be able to do.

    Hbase and other systems can probably do everything I’ve outlined here, but for me the removal of the Namenode is the real point.

  • Pingback: 5 New Hadoop Products Launching Today: EMC Greenplum HD, DataStax Brisk and More | SEO College

  • Pingback: 5 New Hadoop Products Launching Today: EMC Greenplum HD, DataStax Brisk and More : Casa Semplice

  • Pingback: 5 New Hadoop Products Launching Today: EMC Greenplum HD, DataStax Brisk and More | Know All That!

  • Pingback: 5 New Hadoop Products Launching Today: EMC Greenplum HD, DataStax Brisk and More – developersarena.com

  • Pingback: 5 New Hadoop Products Launching Today: EMC Greenplum HD, DataStax Brisk and More

  • Pingback: 5 New Hadoop Products Launching Today: EMC Greenplum HD, DataStax Brisk and More | Scripting4U Blog

  • Pingback: 5 New Hadoop Products Launching Today: EMC Greenplum HD, DataStax Brisk and More | JetLib News

  • Jamie Moon

    @nmilford:disqus  Thanks for the interesting and useful post!  I stopped by here while i was looking for the information
    about integrating Hadoop and Cassandra. Frankly, I could not understand what you’re saying, but all the comments you
    gave was quite impressive and encouraged me!
    Hope another fresh comments or trying something about this kind of information.

    Thanks again.

    From Seoul, South Korea
    tearsend@gmail.com