Automating Some Cassandra Maintenance


Let’s talk about Cassandra maintenance.

Nothing crazy here… these are just some notes I jotted down for folks I work with, explaining a cron job I put into production and providing the simple script. Thought some other people might benefit.


Adjust your slab! Memcached 1.4.12 RPMs on CentOS 5.7.




So, memcached 1.4.11 lets you rebalance and reassign slab memory!

This is epic!

Info on why this is epic is here.

Info on the implementation is in the release notes.

From the release notes, please remember that the slab reassignment feature is in beta and is subject to some changes.

I just took the regular spec file I found for the project elsewhere and modified it a little. I disabled the SASL stuff in my spec file since we don’t use it and I didn’t want to mess with building it.

EDIT: Actually, this article has been revised for less yak shaving. With the help of Dormando and Justin Lintz, I was able to shed some unneeded dependencies.

So here you go:


Kicking the tires on Hadoop 0.23: Pseudo-Distributed mode.


Thought I’d play a little with Hadoop 0.23 (a.k.a. YARN, MR2, NextGen Hadoop) and dump my notes here.

Gotta keep my skillz sharp y’all so I don’t become irrelephant. (Yes, that just happened.)

Below I just set up a pseudo-distributed cluster and run some examples on it, nothing crazy.

I’m hoping to test and write more on how 0.23 differs from the mainline 0.20.x, 1.0 and CDH3 releases, as well as play with NameNode federation and try some other paradigms like MPI, Hama and Spark.


Building and Installing Python 2.7 RPMs on CentOS 5.7


I was asked today to install Python 2.7 on a CentOS-based node and I thought I’d take this opportunity to add a companion article to my Python 2.6 article.

We’re all well aware that CentOS is pretty backwards when it comes to having the latest and greatest software packages and is particularly finicky when it comes to Python, since so much of RHEL depends on it.

As a rule, I refuse to rush in and install anything in production that isn’t in a manageable package format such as RPM. I need to be able to predictably reproduce software installs across a large number of nodes.

The following steps will not clobber your default Python 2.4 install and will keep both CentOS and your developers happy.

So, here we go.


Rolling Upgrades for Cassandra

From 0.7 on up you can do rolling upgrades of your cluster.

A few weeks back I went from 0.7 to 0.8. Upgrade went as smooth as silk. It is sofa king awesome.

I will upgrade to 1.0 after the holidays so as to bask in the glory of Snappy compression, read performance gains and leveled compaction.

Most of my process was semi-automated via Chef, but the steps below expand on what I did.

Before you start, please make sure to check for changes in the cassandra.yaml. From 0.7 to 0.8, the seed strategy became pluggable and there were two or three other changes. I haven’t looked at 1.0 yet, but I presume there will be other changes related to the pluggable compaction and compression.


Building RPMs for and setting up StatsD and Graphite on CentOS.

A while back Etsy open-sourced a little node.js daemon called StatsD that makes it easy for you to ‘Measure All the Things.’

In my current environment, setting up graphs for the folks on the business team and on the dev team is difficult and time-consuming, as it all has to funnel through ops. We’re a bottleneck :(

I’m hoping to implement StatsD to make graphing a service that most anyone can directly interact with and remove me and my team as the bottleneck.
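To give a feel for why this takes ops out of the loop, here is a minimal sketch (not part of my setup notes) of how a developer could push a metric to StatsD from Python. It assumes the daemon listens for UDP on 127.0.0.1 port 8125, the usual StatsD default, and uses the plain `bucket:value|type` line format. The bucket names and helper functions are made up for illustration.

```python
import socket


def format_counter(bucket, value=1):
    """Build a StatsD counter line, e.g. 'signups:1|c'."""
    return f"{bucket}:{value}|c"


def format_timer(bucket, milliseconds):
    """Build a StatsD timer line, e.g. 'page_render:320|ms'."""
    return f"{bucket}:{milliseconds}|ms"


def send_metric(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget one metric line over UDP.

    host and port are assumptions; point them at wherever StatsD runs.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload.encode("ascii"), (host, port))
    finally:
        sock.close()


if __name__ == "__main__":
    # Hypothetical bucket names, purely for illustration.
    send_metric(format_counter("signups"))
    send_metric(format_timer("page_render", 320))
```

Since it is UDP, the sender never waits on StatsD, so instrumenting an app this way costs next to nothing even if the daemon is down.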

Below are my notes for setting it up.


In reference to Hadoop Appliances; or, how I’m an Open Source snob.

Let me preface this rant/post by stating that I come from the more scrappy, ‘build it out from OSS’ sort of shop, so I am highly biased toward the approach of:

  • Thinking first about, and then building, your infrastructure/solution to fit your needs.
  • Knowing the software inside and out.
  • Relying on the community for the rest.

Over the converse ‘Enterprise’ approach of:

  • Building your infrastructure based on someone’s white paper on how you should build an infrastructure to do X.
  • Getting your sysadmins a set of meaningless certifications.
  • Ultimately relying on commercial support as your last point of escalation.

Yes, yes, yes… this is very snobby and I am in danger of sounding as irreverent as Ted Dziuba. I am also wholly conscious that the OSS approach can be taken in the same extreme direction as the enterprise approach, insomuch that everyone blindly follows the same design choices that Twitter or Facebook make (albeit better than anyone else) or implements everything in node.js or Ruby on Rails because that is what the hot-as-shit hipster developers are doing.

For me it comes down to having operational responsibility for your infrastructure, rather than a support contract.  But still, I’m young and work at a hot startup; when I’m CTO of a bank maybe my view will change :P


Getting Brisk going on CentOS and rocking a Terasort.

So, I started playing with a beta of Brisk this weekend.

The Datastax guys are industrious, energetic and very open to hearing from both the Cassandra and Hadoop communities. You should hit them up in #Datastax-Brisk on Freenode IRC.

I’ll post more on my benchmarks and tests later. I’m still getting comfortable with it, but it already feels very familiar since I’m a Hadoop and Cassandra user.

I need to set up the OpsCenter stuff, which looks pretty cool, and put some real data in it.

So far, my favorite thing:

INFO 23:36:22,093 Chose seed 192.168.x.x as jobtracker

Magic!

My current concern is how to deal with deletes in CFS (CassandraFS) as Hive (and Terasort for that matter) kicks up a lot of ephemeral data.  Cassandra doesn’t delete stuff instantly, so I imagine I’ll need to do some tweaking with GCGraceSeconds to find an optimal setting.

So, this is my quick 5 minute setup to get going and running benchmarks.


In which I discourse on Java bloat and Cassandra Node Balancing.

So, I was hoping to write a little snippet of code to embed on my blog to allow people to get the token ranges for load balancing their cluster.

In Cassandra, when using the random partitioner, all keys are given a token (essentially an md5 of the Key) that is between 0 and 2^127 (0 through 170141183460469231731687303715884105728 for non-nerds). That range is known as the ring.

Each member node of the Cassandra cluster owns a range of those keys on the ring in the same vein you’d divide up a pie.
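As a rough sketch of what that pie-slicing looks like (an illustration, not the snippet I originally had in mind), evenly spaced tokens for N nodes under the random partitioner can be computed as i * 2^127 / N for node i. The function name below is made up for this example.

```python
RING_SIZE = 2 ** 127  # upper bound of the random partitioner's token space


def balanced_tokens(node_count):
    """Return one evenly spaced token per node, starting at 0."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    # Integer division keeps every token an exact whole number.
    return [i * RING_SIZE // node_count for i in range(node_count)]


if __name__ == "__main__":
    # Example: print the tokens for a four-node cluster.
    for index, token in enumerate(balanced_tokens(4)):
        print(f"node {index}: {token}")
```

Each node then takes its token as its position on the ring, which gives every node an equal slice.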


Apache Cassandra 0.7 CentOS Quick Install (with Cassandra-Stress, MX4J & JNA)

I’m such a sad bastard.

I got stuck fixing a production issue and had to miss the inaugural NYC Cassandra Meetup group :(

To atone, I figured I’d write a quickie Cassandra post.
