I find myself running more and more Cassandra clusters, and back when we were on Chef 0.9.8 I was being lazy and just cloning my Cassandra cookbook per cluster. Not exactly a way to scale the manageability of your config.
Now I’ve refactored the cookbook to let me manage multiple clusters by extracting the initial_token from a databag. Once we start using the new Environments feature in Chef 0.10, I’ll be able to simplify this further.
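For illustration, here’s a minimal sketch of the databag approach; the databag name, item layout, and attribute names are my assumptions for this example, not the cookbook verbatim:

# Hypothetical recipe snippet: each cluster is a databag item that maps
# node FQDNs to pre-assigned initial_token values.
cluster = data_bag_item('cassandra', node['cassandra']['cluster_name'])

template '/etc/cassandra/cassandra.yaml' do
  source 'cassandra.yaml.erb'
  owner 'cassandra'
  group 'cassandra'
  mode '0644'
  variables(
    :cluster_name  => node['cassandra']['cluster_name'],
    :initial_token => cluster['tokens'][node['fqdn']]
  )
  notifies :restart, 'service[cassandra]'
end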
I’m debating having the cookbook auto-generate and assign tokens, and even re-generate them, run nodetool move, and re-balance when I’ve added another node to a cluster specified in the databag. That’s a big project, and for now I’m too much of a control freak to automate it, but I’m thinking on it.
I’ve also made it so the cookbook auto-generates the cassandra-topology.properties for the PropertyFileSnitch based on location info stored in the databag.
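A similar sketch, again with an assumed databag shape (here a 'location' hash of node IP to "DC:RACK" strings, which matches the format the PropertyFileSnitch expects):

# Hypothetical snippet: render cassandra-topology.properties from the
# same cluster databag item as above.
topology = cluster['location'].map { |ip, dc_rack| "#{ip}=#{dc_rack}" }

file '/etc/cassandra/cassandra-topology.properties' do
  owner 'cassandra'
  group 'cassandra'
  mode '0644'
  content topology.join("\n") + "\ndefault=DC1:RAC1\n"
  notifies :restart, 'service[cassandra]'
end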
Let’s talk about Cassandra maintenance.
Nothing crazy here… these are just some notes I jotted down for folks I work with, explaining a cron job I put into production, along with the simple script. I thought some other people might benefit.
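As a hedged sketch of what such a setup might look like (assuming the job in question is a periodic nodetool repair, the classic maintenance task that must complete at least once every GCGraceSeconds), the Chef side could be:

# Hypothetical cron resource; the schedule, wrapper path, and the
# per-node staggering attribute are made up for this example.
cron 'cassandra_maintenance' do
  minute '0'
  hour '3'
  weekday node['cassandra']['repair_weekday'].to_s # stagger nodes across the week
  user 'root'
  command '/usr/local/bin/cassandra-maintenance.sh >> /var/log/cassandra/maintenance.log 2>&1'
end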
DataStax posted my talk (see below)!
Heads up: I’m going to be giving my Cassandra for Sysadmins talk at Cassandra NYC on Tuesday, December 6th.
Come by and say hello!
From 0.7 on up you can do rolling upgrades of your cluster.
A few weeks back I went from 0.7 to 0.8, and the upgrade went as smooth as silk. It is sofa king awesome.
I’ll upgrade to 1.0 after the holidays so as to bask in the glory of Snappy compression, read performance gains, and leveled compaction.
Most of my process was semi-automated via Chef, but the steps below expand on what I did.
Before you start, please make sure to check for changes in cassandra.yaml. From 0.7 to 0.8, the seed strategy became pluggable, along with two or three other changes. I haven’t looked at 1.0 yet, but I presume there will be other changes related to the pluggable compaction and compression.
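To make the per-node loop concrete, here’s a rough sketch of what a rolling upgrade amounts to; the hostnames, init scripts, package commands, and timings are placeholders, not my actual environment:

#!/usr/bin/env ruby
# Rolling-upgrade sketch: drain, stop, upgrade, and restart one node at
# a time, letting each rejoin the ring before touching the next.
nodes = %w[cass01.example.com cass02.example.com cass03.example.com]

nodes.each do |host|
  # Flush memtables and stop accepting writes before shutting down.
  system("ssh #{host} 'nodetool -h localhost drain'") or abort "drain failed on #{host}"
  system("ssh #{host} 'sudo /etc/init.d/cassandra stop'")
  system("ssh #{host} 'sudo yum -y upgrade cassandra'") # or via your config management
  system("ssh #{host} 'sudo /etc/init.d/cassandra start'")
  sleep 120 # give gossip time to mark the node back up
  system("ssh #{host} 'nodetool -h localhost ring'") # eyeball the ring before moving on
end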
I spoke the other night with Jake Luciani at the NYC chapter of the League of Professional System Administrators (LOPSA) about Cassandra.
LOPSA is a fantastic organization that promotes Sysadmin issues and education. They have mentoring programs, conferences and meetups all around the world.
It’s a great place to meet new people, learn new technologies, and swap the always-fun sysadmin war stories. I am grateful to Matt Simmons for introducing me to the organization.
I spent a chunk of my vacation creating these slides, so you’d better like ’em.
Slides after the break…
So, I started playing with a beta of Brisk this weekend.
The DataStax guys are industrious, energetic, and very open to hearing from both the Cassandra and Hadoop communities. You should hit them up in #Datastax-Brisk on Freenode IRC.
I’ll post more on my benchmarks and tests later. I’m still getting comfortable with it, but it already feels very familiar to someone who is a Hadoop and Cassandra user.
I need to set up the OpsCenter stuff, which looks pretty cool, and put some real data in it.
So far, my favorite thing:
INFO 23:36:22,093 Chose seed 192.168.x.x as jobtracker
My current concern is how to deal with deletes in CFS (the Cassandra File System), since Hive (and Terasort, for that matter) kicks up a lot of ephemeral data. Cassandra doesn’t delete data instantly; deletes are written as tombstones that only get purged after GCGraceSeconds, so I imagine I’ll need to do some tweaking to find an optimal setting.
So, this is my quick five-minute setup to get going and run benchmarks.
More info about why you might need this calculator is available here.
I wrote about it here.
So, I was hoping to write a little snippet of code to embed on my blog to let people get the token ranges for load balancing their cluster.
In Cassandra, when using the RandomPartitioner, every key is given a token (essentially an MD5 of the key) between 0 and 2^127 (0 through 170141183460469231731687303715884105728, for non-nerds). That range is known as the ring.
Each member node of the Cassandra cluster owns a range of those keys on the ring in the same vein you’d divide up a pie.
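The math itself is simple: for an evenly balanced ring, node i just gets the token i * 2^127 / N. A quick sketch (Ruby’s bignums make the 128-bit arithmetic painless):

# Print evenly spaced RandomPartitioner initial_tokens for an N-node ring.
def initial_tokens(node_count)
  (0...node_count).map { |i| i * (2 ** 127) / node_count }
end

initial_tokens(4).each_with_index do |token, i|
  puts "node #{i}: initial_token = #{token}"
end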
DataStax (née Riptano) is to Cassandra as Cloudera is to Hadoop (or Red Hat is to Linux).
Brisk is DataStax’s upcoming Cassandra/Hadoop hybrid distribution. From their site:
DataStax’ Brisk is an enhanced open-source Apache Hadoop and Hive distribution that utilizes Apache Cassandra for many of its core services. Brisk provides integrated Hadoop MapReduce, Hive and job and task tracking capabilities, while providing an HDFS-compatible storage layer powered by the Cassandra DB.
They added Cassandra as an option for the Hadoop storage layer, allowing you to bypass HDFS; the implications of that, however, go a whole lot further. You get the strengths of both systems and lose some of the problems.
I’m pretty jazzed about this and I hope to convince my co-workers to give it a go. I’d like to tell you why.