Wednesday, 30 July 2014

A Beginner's Guide to Hadoop



The goal of this article is to provide a 10,000-foot view of Hadoop for those who know next to nothing about it. This article is not designed to get you ready for Hadoop development, but to provide a sound knowledge base for you to take the next steps in learning the technology.
Let's get down to it:
Hadoop is an Apache Software Foundation project that provides two important things:
  1. A distributed filesystem called HDFS (Hadoop Distributed File System)
  2. A framework and API for building and running MapReduce jobs
I will talk about these two things in turn.

HDFS

HDFS is structured similarly to a regular Unix filesystem, except that data storage is distributed across several machines. It is not intended as a replacement for a regular filesystem, but rather as a filesystem-like layer for large distributed systems to use. It has built-in mechanisms to handle machine outages, and it is optimized for throughput rather than latency.
There are two and a half types of machine in an HDFS cluster:
  • Datanode - where HDFS actually stores the data; there are usually quite a few of these.
  • Namenode - the ‘master’ machine. It controls all the metadata for the cluster, e.g. which blocks make up a file, and which datanodes those blocks are stored on (there is a small sketch of querying this after the diagram below).
  • Secondary Namenode - this is NOT a backup namenode, but a separate service that keeps a copy of both the edit log and the filesystem image, merging them periodically to keep their size reasonable.
    • this is soon being deprecated in favor of the Backup Node and the Checkpoint Node, but the functionality remains similar (if not the same)
[Diagram: an HDFS cluster - the namenode and its datanodes]
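To make the namenode's role a little more concrete, here is a minimal sketch, written in Scala against the standard Hadoop FileSystem Java API, that asks which datanodes hold the blocks of a file (the path is just the example path used below; any file on your cluster will do):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object BlockLocations {
  def main(args: Array[String]): Unit = {
    // Picks up cluster settings from core-site.xml / hdfs-site.xml on the classpath
    val fs = FileSystem.get(new Configuration())

    // Example path - substitute a file that exists on your cluster
    val path = new Path("/home/matthew/remotefile.txt")
    val status = fs.getFileStatus(path)

    // Ask the namenode which datanodes hold each block of this file
    val blocks = fs.getFileBlockLocations(status, 0, status.getLen)
    blocks.foreach { block =>
      println(s"offset=${block.getOffset} length=${block.getLength} hosts=${block.getHosts.mkString(", ")}")
    }
  }
}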
Data can be accessed using either the Java API (as in the sketch above) or the Hadoop command line client. Many operations are similar to their Unix counterparts. Check out the documentation page for the full list, but here are some simple examples:
list files in the root directory
hadoop fs -ls /
list files in my home directory
hadoop fs -ls ./
cat a file (decompressing if needed)
hadoop fs -text ./file.txt.gz
upload and retrieve a file
hadoop fs -put ./localfile.txt /home/matthew/remotefile.txt

hadoop fs -get /home/matthew/remotefile.txt ./local/file/path/file.txt
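The same operations are available programmatically. As a rough, hedged equivalent of the put/get commands above (reusing the same example paths), the FileSystem API can be driven from Scala like this:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object PutAndGet {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())

    // Equivalent of: hadoop fs -put ./localfile.txt /home/matthew/remotefile.txt
    fs.copyFromLocalFile(new Path("./localfile.txt"), new Path("/home/matthew/remotefile.txt"))

    // Equivalent of: hadoop fs -get /home/matthew/remotefile.txt ./local/file/path/file.txt
    fs.copyToLocalFile(new Path("/home/matthew/remotefile.txt"), new Path("./local/file/path/file.txt"))

    // Equivalent of: hadoop fs -ls /
    fs.listStatus(new Path("/")).foreach(status => println(status.getPath))
  }
}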
Note that HDFS is optimized differently from a regular filesystem. It is designed for non-realtime applications demanding high throughput, rather than online applications demanding low latency. For example, files cannot be modified once written, and the latency of reads and writes is poor by ordinary filesystem standards. On the flip side, throughput scales fairly linearly with the number of datanodes in a cluster, so it can handle workloads that no single machine could.
HDFS also has a bunch of unique features that make it ideal for distributed systems:
  • Failure tolerant - data can be duplicated across multiple datanodes to protect against machine failures. The industry standard seems to be a replication factor of 3, i.e. everything is stored on three machines (a small sketch of adjusting this per file follows this list).
  • Scalability - data transfers happen directly with the datanodes, so your read/write capacity scales fairly well with the number of datanodes.
  • Space - need more disk space? Just add more datanodes and re-balance.
  • Industry standard - lots of other distributed applications build on top of HDFS (HBase, MapReduce).
  • Pairs well with MapReduce - as we shall learn.
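The replication factor is normally set cluster-wide via the dfs.replication setting, but it can also be changed per file. As a small, hedged sketch using the same FileSystem API and example path as above:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object SetReplication {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())
    val path = new Path("/home/matthew/remotefile.txt")

    // Report the current replication factor for this file
    println(s"current replication: ${fs.getFileStatus(path).getReplication}")

    // Ask the namenode to keep three copies of every block of this file
    val accepted = fs.setReplication(path, 3.toShort)
    println(s"replication change accepted: $accepted")
  }
}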

HDFS Resources

For more information about the design of HDFS, you should read through the Apache HDFS documentation page.

MapReduce

The second fundamental part of Hadoop is the MapReduce layer. This is made up of two sub components:
  • An API for writing MapReduce workflows in Java.
  • A set of services for managing the execution of these workflows.

THE MAP AND REDUCE APIS

The basic premise is this:
  1. Map tasks perform a transformation.
  2. Reduce tasks perform an aggregation.
In Scala, a simplified version of a MapReduce job might look like this:
// Map: emit (word, 1) for every word in the input line.
// output(key, value) stands in for emitting a key/value pair.
def map(lineNumber: Long, sentence: String) = {
  val words = sentence.split(" ")
  words.foreach { word =>
    output(word, 1)
  }
}


// Reduce: sum all of the counts emitted for each word.
def reduce(word: String, counts: Iterable[Long]) = {
  var total = 0L
  counts.foreach { count =>
    total += count
  }
  output(word, total)
}
Notice that the output of a map and reduce task is always a KEY, VALUE pair. You always output exactly one key and one value. The input to a reduce is KEY, ITERABLE[VALUE]. Reduce is called exactly once for each key output by the map phase. The ITERABLE[VALUE] is the set of all values output by the map phase for that key.
So if you had map tasks that output
map1: key: foo, value: 1
map2: key: foo, value: 32
Your reducer would receive:
key: foo, values: [1, 32]
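For comparison, here is a rough sketch of the same word count written against the real Hadoop MapReduce Java API (in Scala, to match the example above; the class names are my own and this is only one way to structure it):
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Mapper, Reducer}
import scala.collection.JavaConverters._

// Map: emit (word, 1) for every word in each input line
class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()

  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
    value.toString.split(" ").foreach { token =>
      word.set(token)
      context.write(word, one) // always exactly one KEY and one VALUE
    }
  }
}

// Reduce: receive (word, [1, 1, 32, ...]) and emit (word, total)
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    val total = values.asScala.map(_.get).sum
    context.write(key, new IntWritable(total))
  }
}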
Counterintuitively, one of the most important parts of a MapReduce job is what happens between map and reduce: there are three other stages, called partitioning, sorting, and grouping. In the default configuration, the goal of these intermediate steps is to ensure that the values for each key are grouped together, ready for the reduce() function. APIs are also provided if you want to tweak how these stages work (for example, to perform a secondary sort); a small partitioning sketch follows.
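As a taste of those APIs, here is a hedged sketch of a custom partitioner (the class name and the routing rule are invented purely for illustration) that controls which reducer each key is sent to:
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.Partitioner

// Routes each (key, value) pair to one of numPartitions reducers.
// Made-up rule: keys starting with 'a'..'m' go to the first reducer,
// everything else goes to the last one.
class AlphabetPartitioner extends Partitioner[Text, IntWritable] {
  override def getPartition(key: Text, value: IntWritable, numPartitions: Int): Int = {
    val first = key.toString.headOption.getOrElse('z').toLower
    if (first <= 'm') 0 else numPartitions - 1
  }
}

// Wired into a job with: job.setPartitionerClass(classOf[AlphabetPartitioner])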
Here's a diagram of the full workflow to show how these pieces fit together; at this stage, though, it's more important to understand how map and reduce interact than to understand all the specifics of how that is implemented.
[Diagram: the full MapReduce workflow - map, partition, sort, group, reduce]
What's really powerful about this API is that there is no dependency between any two tasks of the same type. To do its job, a map() task does not need to know about any other map task, and similarly a single reduce() task has all the context it needs to aggregate any particular key; it does not share any state with other reduce tasks.
Taken as a whole, this design means that the stages of the pipeline can be easily distributed to an arbitrary number of machines. Workflows over massive datasets can be spread across hundreds of machines because there are no inherent dependencies requiring the tasks to run on the same machine.

MapReduce API Resources

If you want to learn more about MapReduce (generally, and within Hadoop) I recommend you read the Google MapReduce paper, the Apache MapReduce documentation, or maybe even the Hadoop book. Performing a web search for MapReduce tutorials also offers a lot of useful information.
To make things more interesting, many projects have been built on top of the MapReduce API to ease the development of MapReduce workflows. For example, Hive lets you write SQL to query data on HDFS instead of writing Java.

THE HADOOP SERVICES FOR EXECUTING MAPREDUCE JOBS

Hadoop MapReduce comes with two primary services for scheduling and running MapReduce jobs: the Job Tracker (JT) and the Task Tracker (TT). Broadly speaking, the JT is the master: it is in charge of allocating tasks to task trackers and scheduling those tasks globally. A TT is in charge of running the Map and Reduce tasks themselves.
When running, each TT registers itself with the JT and reports the number of ‘map’ and ‘reduce’ slots it has available. The JT keeps a central registry of these slots across all TTs and allocates them to jobs as required. When a task is completed, the TT re-registers that slot with the JT and the process repeats.
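From a developer's point of view, the hand-off to these services happens in a small driver program. Assuming the TokenMapper and SumReducer classes sketched earlier and a Hadoop 2.x style Job API, a hedged version of that driver might look like this:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

object WordCountDriver {
  def main(args: Array[String]): Unit = {
    // args(0) = HDFS input directory, args(1) = HDFS output directory
    val job = Job.getInstance(new Configuration(), "word count")
    job.setJarByClass(classOf[TokenMapper])

    job.setMapperClass(classOf[TokenMapper])
    job.setReducerClass(classOf[SumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])

    FileInputFormat.addInputPath(job, new Path(args(0)))
    FileOutputFormat.setOutputPath(job, new Path(args(1)))

    // Submit the job to the cluster and block until it finishes
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}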
Many things can go wrong in a big distributed system, so these services have some clever tricks to ensure that your job finishes successfully:
  • Automatic retries - if a task fails, it is retried N times (usually 3) on different task trackers.
  • Data locality optimizations - if you co-locate a TT with an HDFS Datanode (which you should), it will take advantage of data locality to make reading the data faster.
  • Blacklisting a bad TT - if the JT detects that a TT has too many failed tasks, it will blacklist it. No tasks will then be scheduled on this task tracker.
  • Speculative Execution - the JT can schedule the same task to run on several machines at the same time, just in case some machines are slower than others. When one version finishes, the others are killed.
Here's a simple diagram of a typical deployment with TTs deployed alongside datanodes.
[Diagram: Hadoop infrastructure - Task Trackers co-located with Datanodes]

MapReduce Service Resources

For more reading on the JobTracker and TaskTracker, check out Wikipedia or the Hadoop book. I find the Apache documentation pretty confusing when just trying to understand these things at a high level, so again, doing a web search can be pretty useful.

Wrap Up

I hope this introduction to Hadoop was useful. There is a lot of information online, but I didn't feel like anything described Hadoop at a high level for beginners.
The Hadoop project is a good deal more complex and deep than I have represented here, and it is changing rapidly. For example, an initiative called MapReduce 2.0 provides a more general-purpose job scheduling and resource management layer called YARN, and there is an ever-growing range of non-MapReduce applications that run on top of HDFS, such as Cloudera Impala.

 

Books for Hadoop & MapReduce


  • The Definitive Guide is in some ways the ‘Hadoop bible’, and can be an excellent reference when working on Hadoop, but do not expect it to provide a simple getting-started tutorial for writing a MapReduce job. This book is great for really understanding how everything works and how all the systems fit together.
  • This is the book if you need to know the ins and outs of prototyping, deploying, configuring, optimizing, and tweaking a production Hadoop system. Eric Sammer is a very knowledgeable engineer, so this book is chock full of goodies.
  • Design Patterns is a great resource to get some insight into how to do non-trivial things with Hadoop. This book goes into useful detail on how to design specific types of algorithms, outlines why they should be designed that way, and provides examples.
  • One of the few non-O’Reilly books in this list, Hadoop in Action is similar to The Definitive Guide in that it provides a good reference for what Hadoop is and how to use it. It seems like this book provides a more gentle introduction to Hadoop compared to the other books in this list.
  • A slightly more advanced guide to running Hadoop. It includes chapters that detail how to best move data around, how to think in MapReduce, and (importantly) how to debug and optimize your jobs.
  • This Apress book claims it will guide you through initial Hadoop setup while also helping you avoid many of the pitfalls that Hadoop novices usually encounter. Again, it is similar in content to Hadoop in Action and The Definitive Guide.
  • Another Hadoop intro book, Hadoop Essentials focuses on providing a more practical introduction to Hadoop, which seems ideal for a CS classroom setting.
  • A book which aims to provide real-world examples of common Hadoop problems. It also covers building integrated solutions using surrounding tools (Hive, Pig, Giraph, etc.).
  • The cookbook provides an introduction to installing / configuring Hadoop along with ‘more than 50 ready-to-use Hadoop MapReduce recipes’.
  • Released in July 2013, this book promises to guide readers through writing and testing Cascading-based workflows. This is one of the few books written about higher-level MapReduce frameworks, so I’m excited to give it a read.
  • A front to back guide to YARN, the next generation task management layer for Hadoop. This book is written (in part) by the YARN project founder, and the project lead.
  • This book is built around seven MapReduce ‘recipes’ to learn from. It aims to be a concise, practical guide to get you coding.

Bonus

  • Russell introduces his own version of an agile toolset for data analysis and exploration. The book covers both investigative tools (like Apache Pig) and visualization tools (like D3). His pitch is pretty compelling.