- Thinking first about, and then building, your infrastructure/solution to fit your needs.
- Knowing the software inside and out.
- Relying on the community for the rest.
Over the converse ‘Enterprise’ approach of:
- Building your infrastructure based on someone’s white paper on how you should build an infrastructure to do X.
- Getting your sysadmins a set of meaningless certifications.
- Ultimately relying on commercial support as your last point of escalation.
Yes, yes, yes… this is very snobby and I am in danger of sounding as irreverent as Ted Dziuba. I am also wholly conscious that the OSS approach can be taken in the same extreme direction as the enterprise approach, insomuch as everyone blindly follows the same design choices that Twitter or Facebook are making (albeit better than anyone else), or implements everything in node.js or Ruby on Rails because that is what the hot-as-shit hipster developers are doing.
For me it comes down to having operational responsibility for your infrastructure, rather than a support contract. But still, I’m young and work at a hot startup; when I’m CTO of a bank, maybe my view will change.
Outbrain’s Operations Culture is one in which members of Operations ARE the final point of escalation for the infrastructure our product lives on. (Naturally, developers are the final point of escalation beyond that for our custom internal software.) If we can’t fix it with the help of the community, we didn’t do our job in understanding things, or we built it wrong.
Ownership, responsibility, and most of all, enthusiasm for both the technology and the expansion of our technical knowledge are key DNA for our team.
So, it is not hard to imagine why I feel that all of these Hadoop hardware appliances (or non-standard hardware heavily marketed as a Hadoop solution, like this guy) miss the point of an OSS project that comes out of shops with similar operational DNA, and are simply CTO bait to exploit an excess of budget and a limited depth of understanding of the product’s capabilities and potential.
Hadoop is one of those things that can be a Swiss Army knife, and can be set up to do quite a lot of different things. Here is a pretty key slide from my perennial Hadoop talk:
I love Cloudera; I think they’re absolutely fantastic folks who, without question, know their shit. I also absolutely love my Dell servers. This is not a rant against them, Greenplum, Oracle, EMC or anyone specifically. However, they should all be treated as differently shaped LEGOs rather than a complete kit.
There are the type of people who buy a LEGO set, build according to the instructions, buy the next set, and build out the whole Harry Potter LEGO village. Then there are folks who mix and match pieces from the Star Wars set, the Pirates of the Caribbean set, and the Harry Potter set to make something new, different, and specific.
When I was a kid, I had lots of large plastic tubs of loose LEGOs and would use them to create whatever I wanted. Vendors and software are much the same in my opinion.
A lot of what you want to do with Hadoop depends on your hardware bottlenecks, the type of data, and what you’re doing with it. These appliances are usually configured into a sort of generic ‘optimum’ that appears to be sub-optimal for any specific workload at any scale.
It is tantamount to all of the car manufacturers in the world saying, ‘fuck you bitches, we’re only making SUVs’. Sure, they’re a good trade off for space and performance, but sometimes, you need more performance than space, or you need more fuel efficiency. I live in Brooklyn and I own a Mini specifically so that I can get into parking spots SUVs cannot.
A simple example to wrap your head around is the ratio of CPU cores to disks/storage per datanode/tasktracker.
- Tasks run on each node are split out as children of the TaskTracker process.
- The more cores you have, the higher you can bump up the number of concurrent tasks running.
- Usually, Hadoop admins eschew RAID in favor of JBOD, so each task’s IO can be more or less isolated to the individual disk holding the HDFS blocks you’re reading or writing (see the config sketch below).
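To make that concrete, here is a minimal sketch of how that hardware math turns into actual knobs. The property names are the real Hadoop 1.x ones; the sizing heuristics (roughly one task slot per core, one dfs.data.dir entry per JBOD disk, the map/reduce split) are just my rule-of-thumb assumptions, not gospel:

```python
# Illustrative sketch: derive Hadoop 1.x config values from node hardware.
# The property names are the real Hadoop 1.x knobs; the sizing heuristics
# below are assumptions, not recommendations for any specific workload.

def tasktracker_config(cores, jbod_mounts):
    # Assumption: split slots roughly between map and reduce; the right
    # split depends entirely on your jobs and available RAM.
    map_slots = max(1, cores // 2)
    reduce_slots = max(1, cores // 4)
    return {
        # mapred-site.xml: cap concurrent child tasks per TaskTracker
        "mapred.tasktracker.map.tasks.maximum": map_slots,
        "mapred.tasktracker.reduce.tasks.maximum": reduce_slots,
        # hdfs-site.xml: one data directory per physical disk (JBOD, no RAID),
        # so each task's IO stays roughly isolated to one spindle
        "dfs.data.dir": ",".join(f"{m}/hdfs/data" for m in jbod_mounts),
    }

# Hypothetical 24-core node with 12 JBOD disks mounted under /data/0../data/11
print(tasktracker_config(24, [f"/data/{i}" for i in range(12)]))
```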
So, with that little bit of knowledge:
- A setup with 12 x 2.5″ 300GB 15K SAS drives and 24 cores will increase your parallelism and IO speed.
- A setup with 8 x 2TB 3.5″ 7.2K SATA drives and 8 cores will give you way more storage capacity, but less cluster-wide concurrency and node-level IO.
Scale that from a cluster of 5 nodes to 50, to 500, and those differences aggregate into quite a meaningful difference in performance that can either speed your jobs up or drag them down. I know several companies that have separate clusters with the same data, but different hardware configs, for scheduled production jobs (with SLAs) and for research and ad-hoc jobs.
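Back-of-the-envelope, and assuming the crude heuristic of roughly one concurrent task per core, the gap between those two profiles looks something like this (illustrative numbers only):

```python
# Back-of-the-envelope comparison of the two node profiles above.
# Assumes roughly one concurrent task per core; real slot counts depend
# on RAM, the map/reduce split, and the workload itself.

profiles = {
    "12 x 300GB 15K SAS, 24 cores": {"cores": 24, "disks": 12, "disk_tb": 0.3},
    "8 x 2TB 7.2K SATA, 8 cores":   {"cores": 8,  "disks": 8,  "disk_tb": 2.0},
}

for nodes in (5, 50, 500):
    for name, p in profiles.items():
        slots = p["cores"] * nodes                   # cluster-wide concurrent tasks
        spindles = p["disks"] * nodes                # independent IO paths (JBOD)
        raw_tb = p["disks"] * p["disk_tb"] * nodes   # raw capacity, before replication
        print(f"{nodes:>3} nodes | {name}: "
              f"~{slots} task slots, {spindles} spindles, {raw_tb:.0f} TB raw")
```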
Ultimately, this is wildly generalistic, since we also need to dial knobs for networking, RAM, and architecture; but the point is that nothing, absolutely NOTHING, beats knowing your use case and usage patterns and building accordingly.
Know your usage patterns! If you don’t know, experiment before buying one of these pre-packaged Hadoop toys! And if you have existing EMC, Greenplum or Oracle infrastructure and need to glue them to Hadoop… well…those are just LEGOs…
Do you want to build your Hadoop cluster using DUPLO blocks to look like this?
Or do you want to build your Hadoop cluster to look like this?