Wednesday, 30 July 2014

A Beginner's Guide to Hadoop



The goal of this article is to provide a 10,000-foot view of Hadoop for those who know next to nothing about it. This article is not designed to get you ready for Hadoop development, but to provide a sound knowledge base for you to take the next steps in learning the technology.
Let's get down to it:
Hadoop is an Apache Software Foundation project that provides two important things:
  1. A distributed filesystem called HDFS (Hadoop Distributed File System)
  2. A framework and API for building and running MapReduce jobs
I will talk about these two things in turn.

HDFS

HDFS is structured similarly to a regular Unix filesystem, except that data storage is distributed across several machines. It is not intended as a replacement for a regular filesystem, but rather as a filesystem-like layer for large distributed systems to use. It has built-in mechanisms to handle machine outages, and it is optimized for throughput rather than latency.
There are two and a half types of machine in an HDFS cluster:
  • Datanode - where HDFS actually stores the data; there are usually quite a few of these.
  • Namenode - the ‘master’ machine. It controls all the metadata for the cluster, e.g. which blocks make up a file, and which datanodes those blocks are stored on (there is a small sketch of querying this after the diagram below).
  • Secondary Namenode - this is NOT a backup namenode, but a separate service that keeps a copy of both the edit log and the filesystem image, merging them periodically to keep their size reasonable.
    • this is soon being deprecated in favor of the Backup Node and the Checkpoint Node, but the functionality remains similar (if not the same)
[Diagram: an HDFS cluster - the namenode and its datanodes]
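To make the namenode's role a little more concrete, here is a minimal sketch, written in Scala against the standard Hadoop FileSystem Java API, that asks which datanodes hold the blocks of a file (the path is just the example path used below; any file on your cluster will do):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object BlockLocations {
  def main(args: Array[String]): Unit = {
    // Picks up cluster settings from core-site.xml / hdfs-site.xml on the classpath
    val fs = FileSystem.get(new Configuration())

    // Example path - substitute a file that exists on your cluster
    val path = new Path("/home/matthew/remotefile.txt")
    val status = fs.getFileStatus(path)

    // Ask the namenode which datanodes hold each block of this file
    val blocks = fs.getFileBlockLocations(status, 0, status.getLen)
    blocks.foreach { block =>
      println(s"offset=${block.getOffset} length=${block.getLength} hosts=${block.getHosts.mkString(", ")}")
    }
  }
}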
Data can be accessed using either the Java API (as in the sketch above) or the Hadoop command line client. Many operations are similar to their Unix counterparts. Check out the documentation page for the full list, but here are some simple examples:
list files in the root directory
hadoop fs -ls /
list files in my home directory
hadoop fs -ls ./
cat a file (decompressing if needed)
hadoop fs -text ./file.txt.gz
upload and retrieve a file
hadoop fs -put ./localfile.txt /home/matthew/remotefile.txt

hadoop fs -get /home/matthew/remotefile.txt ./local/file/path/file.txt
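The same operations are available programmatically. As a rough, hedged equivalent of the put/get commands above (reusing the same example paths), the FileSystem API can be driven from Scala like this:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object PutAndGet {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())

    // Equivalent of: hadoop fs -put ./localfile.txt /home/matthew/remotefile.txt
    fs.copyFromLocalFile(new Path("./localfile.txt"), new Path("/home/matthew/remotefile.txt"))

    // Equivalent of: hadoop fs -get /home/matthew/remotefile.txt ./local/file/path/file.txt
    fs.copyToLocalFile(new Path("/home/matthew/remotefile.txt"), new Path("./local/file/path/file.txt"))

    // Equivalent of: hadoop fs -ls /
    fs.listStatus(new Path("/")).foreach(status => println(status.getPath))
  }
}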
Note that HDFS is optimized differently from a regular filesystem. It is designed for non-realtime applications demanding high throughput, rather than online applications demanding low latency. For example, files cannot be modified once written, and the latency of reads and writes is poor by ordinary filesystem standards. On the flip side, throughput scales fairly linearly with the number of datanodes in a cluster, so it can handle workloads that no single machine could.
HDFS also has a bunch of unique features that make it ideal for distributed systems:
  • Failure tolerant - data can be duplicated across multiple datanodes to protect against machine failures. The industry standard seems to be a replication factor of 3, i.e. everything is stored on three machines (a small sketch of adjusting this per file follows this list).
  • Scalability - data transfers happen directly with the datanodes, so your read/write capacity scales fairly well with the number of datanodes.
  • Space - need more disk space? Just add more datanodes and re-balance.
  • Industry standard - lots of other distributed applications build on top of HDFS (HBase, MapReduce).
  • Pairs well with MapReduce - as we shall learn.
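The replication factor is normally set cluster-wide via the dfs.replication setting, but it can also be changed per file. As a small, hedged sketch using the same FileSystem API and example path as above:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object SetReplication {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())
    val path = new Path("/home/matthew/remotefile.txt")

    // Report the current replication factor for this file
    println(s"current replication: ${fs.getFileStatus(path).getReplication}")

    // Ask the namenode to keep three copies of every block of this file
    val accepted = fs.setReplication(path, 3.toShort)
    println(s"replication change accepted: $accepted")
  }
}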

HDFS Resources

For more information about the design of HDFS, you should read through the Apache HDFS documentation page.

MapReduce

The second fundamental part of Hadoop is the MapReduce layer. This is made up of two sub components:
  • An API for writing MapReduce workflows in Java.
  • A set of services for managing the execution of these workflows.

THE MAP AND REDUCE APIS

The basic premise is this:
  1. Map tasks perform a transformation.
  2. Reduce tasks perform an aggregation.
In Scala, a simplified version of a MapReduce job might look like this:
// Map: emit (word, 1) for every word in the input line.
// output(key, value) stands in for emitting a key/value pair.
def map(lineNumber: Long, sentence: String) = {
  val words = sentence.split(" ")
  words.foreach { word =>
    output(word, 1)
  }
}


// Reduce: sum all of the counts emitted for each word.
def reduce(word: String, counts: Iterable[Long]) = {
  var total = 0L
  counts.foreach { count =>
    total += count
  }
  output(word, total)
}
Notice that the output of a map and reduce task is always a KEY, VALUE pair. You always output exactly one key and one value. The input to a reduce is KEY, ITERABLE[VALUE]. Reduce is called exactly once for each key output by the map phase. The ITERABLE[VALUE] is the set of all values output by the map phase for that key.
So if you had map tasks that output
map1: key: foo, value: 1
map2: key: foo, value: 32
Your reducer would receive:
key: foo, values: [1, 32]
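For comparison, here is a rough sketch of the same word count written against the real Hadoop MapReduce Java API (in Scala, to match the example above; the class names are my own and this is only one way to structure it):
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Mapper, Reducer}
import scala.collection.JavaConverters._

// Map: emit (word, 1) for every word in each input line
class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()

  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
    value.toString.split(" ").foreach { token =>
      word.set(token)
      context.write(word, one) // always exactly one KEY and one VALUE
    }
  }
}

// Reduce: receive (word, [1, 1, 32, ...]) and emit (word, total)
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    val total = values.asScala.map(_.get).sum
    context.write(key, new IntWritable(total))
  }
}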
Counterintuitively, one of the most important parts of a MapReduce job is what happens between map and reduce: there are three other stages, called partitioning, sorting, and grouping. In the default configuration, the goal of these intermediate steps is to ensure that the values for each key are grouped together, ready for the reduce() function. APIs are also provided if you want to tweak how these stages work (for example, to perform a secondary sort); a small partitioning sketch follows.
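As a taste of those APIs, here is a hedged sketch of a custom partitioner (the class name and the routing rule are invented purely for illustration) that controls which reducer each key is sent to:
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.Partitioner

// Routes each (key, value) pair to one of numPartitions reducers.
// Made-up rule: keys starting with 'a'..'m' go to the first reducer,
// everything else goes to the last one.
class AlphabetPartitioner extends Partitioner[Text, IntWritable] {
  override def getPartition(key: Text, value: IntWritable, numPartitions: Int): Int = {
    val first = key.toString.headOption.getOrElse('z').toLower
    if (first <= 'm') 0 else numPartitions - 1
  }
}

// Wired into a job with: job.setPartitionerClass(classOf[AlphabetPartitioner])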
Here's a diagram of the full workflow to show how these pieces fit together; at this stage, though, it's more important to understand how map and reduce interact than to understand all the specifics of how that is implemented.
[Diagram: the full MapReduce workflow - map, partition, sort, group, reduce]
What's really powerful about this API is that there is no dependency between any two tasks of the same type. To do its job, a map() task does not need to know about any other map task, and similarly a single reduce() task has all the context it needs to aggregate any particular key; it does not share any state with other reduce tasks.
Taken as a whole, this design means that the stages of the pipeline can be easily distributed to an arbitrary number of machines. Workflows over massive datasets can be spread across hundreds of machines because there are no inherent dependencies requiring the tasks to run on the same machine.

MapReduce API Resources

If you want to learn more about MapReduce (generally, and within Hadoop) I recommend you read the Google MapReduce paper, the Apache MapReduce documentation, or maybe even the Hadoop book. Performing a web search for MapReduce tutorials also offers a lot of useful information.
To make things more interesting, many projects have been built on top of the MapReduce API to ease the development of MapReduce workflows. For example, Hive lets you write SQL to query data on HDFS instead of writing Java.

THE HADOOP SERVICES FOR EXECUTING MAPREDUCE JOBS

Hadoop MapReduce comes with two primary services for scheduling and running MapReduce jobs: the Job Tracker (JT) and the Task Tracker (TT). Broadly speaking, the JT is the master: it is in charge of allocating tasks to task trackers and scheduling those tasks globally. A TT is in charge of running the Map and Reduce tasks themselves.
When running, each TT registers itself with the JT and reports the number of ‘map’ and ‘reduce’ slots it has available. The JT keeps a central registry of these slots across all TTs and allocates them to jobs as required. When a task is completed, the TT re-registers that slot with the JT and the process repeats.
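From a developer's point of view, the hand-off to these services happens in a small driver program. Assuming the TokenMapper and SumReducer classes sketched earlier and a Hadoop 2.x style Job API, a hedged version of that driver might look like this:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

object WordCountDriver {
  def main(args: Array[String]): Unit = {
    // args(0) = HDFS input directory, args(1) = HDFS output directory
    val job = Job.getInstance(new Configuration(), "word count")
    job.setJarByClass(classOf[TokenMapper])

    job.setMapperClass(classOf[TokenMapper])
    job.setReducerClass(classOf[SumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])

    FileInputFormat.addInputPath(job, new Path(args(0)))
    FileOutputFormat.setOutputPath(job, new Path(args(1)))

    // Submit the job to the cluster and block until it finishes
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}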
Many things can go wrong in a big distributed system, so these services have some clever tricks to ensure that your job finishes successfully:
  • Automatic retries - if a task fails, it is retried N times (usually 3) on different task trackers.
  • Data locality optimizations - if you co-locate a TT with an HDFS Datanode (which you should), it will take advantage of data locality to make reading the data faster.
  • Blacklisting a bad TT - if the JT detects that a TT has too many failed tasks, it will blacklist it. No tasks will then be scheduled on this task tracker.
  • Speculative Execution - the JT can schedule the same task to run on several machines at the same time, just in case some machines are slower than others. When one version finishes, the others are killed.
Here's a simple diagram of a typical deployment with TTs deployed alongside datanodes.
[Diagram: Hadoop infrastructure - Task Trackers co-located with Datanodes]

MapReduce Service Resources

For more reading on the JobTracker and TaskTracker, check out Wikipedia or the Hadoop book. I find the Apache documentation pretty confusing when just trying to understand these things at a high level, so again, doing a web search can be pretty useful.

Wrap Up

I hope this introduction to Hadoop was useful. There is a lot of information online, but I didn't feel like anything described Hadoop at a high level for beginners.
The Hadoop project is a good deal more complex and deep than I have represented here, and it is changing rapidly. For example, an initiative called MapReduce 2.0 provides a more general-purpose job scheduling and resource management layer called YARN, and there is an ever-growing range of non-MapReduce applications that run on top of HDFS, such as Cloudera Impala.

 

Books for Hadoop & MapReduce


  • The Definitive Guide is in some ways the ‘Hadoop bible’, and can be an excellent reference when working on Hadoop, but do not expect it to provide a simple getting-started tutorial for writing a MapReduce job. This book is great for really understanding how everything works and how all the systems fit together.
  • This is the book if you need to know the ins and outs of prototyping, deploying, configuring, optimizing, and tweaking a production Hadoop system. Eric Sammer is a very knowledgeable engineer, so this book is chock full of goodies.
  • Design Patterns is a great resource to get some insight into how to do non-trivial things with Hadoop. This book goes into useful detail on how to design specific types of algorithms, outlines why they should be designed that way, and provides examples.
  • One of the few non-O’Reilly books in this list, Hadoop in Action is similar to The Definitive Guide in that it provides a good reference for what Hadoop is and how to use it. It seems like this book provides a more gentle introduction to Hadoop compared to the other books in this list.
  • A slightly more advanced guide to running Hadoop. It includes chapters that detail how to best move data around, how to think in MapReduce, and (importantly) how to debug and optimize your jobs.
  • This Apress book claims it will guide you through initial Hadoop setup while also helping you avoid many of the pitfalls that Hadoop novices usually encounter. Again, it is similar in content to Hadoop in Action and The Definitive Guide.
  • Another Hadoop intro book, Hadoop Essentials focuses on providing a more practical introduction to Hadoop, which seems ideal for a CS classroom setting.
  • A book which aims to provide real-world examples of common Hadoop problems. It also covers building integrated solutions using surrounding tools (Hive, Pig, Giraph, etc.).
  • The cookbook provides an introduction to installing / configuring Hadoop along with ‘more than 50 ready-to-use Hadoop MapReduce recipes’.
  • Released in July 2013, this book promises to guide readers through writing and testing Cascading-based workflows. This is one of the few books written about higher-level MapReduce frameworks, so I’m excited to give it a read.
  • A front to back guide to YARN, the next generation task management layer for Hadoop. This book is written (in part) by the YARN project founder, and the project lead.
  • This book is built around seven MapReduce ‘recipes’ to learn from. It aims to be a concise, practical guide to get you coding.

Bonus

  • Russell introduces his own version of an agile toolset for data analysis and exploration. The book covers both investigative tools (like Apache Pig) and visualization tools (like D3). His pitch is pretty compelling.