Thursday 31 July 2014

Introduction to the Hadoop Ecosystem

When Hadoop 1.0.0 was released by Apache in 2011, comprising mainly HDFS and MapReduce, it soon became clear that Hadoop was not simply another application or service, but a platform around which an entire ecosystem of capabilities could be built. Since then, dozens of self-standing software projects have sprung into being around Hadoop, each addressing a variety of problem spaces and meeting different needs.
Many of these projects were begun by the same people or companies who were the major developers and early users of Hadoop; others were initiated by commercial Hadoop distributors. The majority of these projects now share a home with Hadoop at the Apache Software Foundation, which supports open-source software development and encourages the development of the communities surrounding these projects.
The following sections are meant to give the reader a brief introduction to the world of Hadoop and the core related software projects. There are countless commercial Hadoop-integrated products focused on making Hadoop more usable and accessible to non-specialists, but the projects covered here were chosen because they provide core functionality or improve performance within Hadoop.
The so-called "Hadoop ecosystem" is, as befits an ecosystem, complex, evolving, and not easily parcelled into neat categories. Simply keeping track of all the project names may seem like a task of its own, but this pales in comparison to the task of tracking the functional and architectural differences between projects. These projects are not meant to all be used together, as parts of a single organism; some may even be seeking to solve the same problem in different ways. What unites them is that they each seek to tap into the scalability and power of Hadoop, particularly the HDFS component of Hadoop.

HDFS

The Hadoop Distributed File System (HDFS) offers a way to store large files across multiple machines, rather than requiring a single machine with disk capacity equal to or greater than the total size of the files. HDFS is designed to be fault-tolerant through the replication and distribution of data: when a file is loaded into HDFS, it is broken up into "blocks" of data, which are replicated and stored across the cluster nodes designated for storage, a.k.a. DataNodes.


At the architectural level, HDFS requires a NameNode process to run on one node in the cluster and a DataNode service to run on each "slave" node that will store data. When data is loaded into HDFS, it is split into blocks that are replicated and distributed across the DataNodes. The NameNode is responsible for storing and managing the metadata, so that when MapReduce or another execution framework calls for the data, the NameNode tells it where the needed data resides.
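To make this concrete, here is a minimal sketch of a client writing a file through the HDFS Java API (FileSystem); the NameNode address, port, and file path are illustrative placeholders rather than values from this post.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The client asks the NameNode only for metadata; the file's blocks are
    // streamed to and read from the DataNodes the NameNode selects.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

    Path path = new Path("/data/example.txt");
    try (FSDataOutputStream out = fs.create(path)) {
      out.writeUTF("hello hdfs");  // written as one (small) block
    }

    // Each block of the file is replicated this many times across DataNodes.
    System.out.println("Replication factor: " + fs.getFileStatus(path).getReplication());
    fs.close();
  }
}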

HDFS Architecture
One significant drawback to HDFS is that it has a single point of failure (SPOF), which lies in the NameNode service. If the NameNode or the server hosting it goes down, HDFS is down for the entire cluster. The Secondary NameNode, which periodically checkpoints the NameNode's metadata, is not itself a backup NameNode.
Currently the most comprehensive solution to this problem comes from MapR, one of the major Hadoop distributors. MapR has developed a "distributed NameNode," where the HDFS metadata is distributed across the cluster in "Containers," which are tracked by the Container Location Database (CLDB).
Regular NameNode architecture vs. MapR's distributed NameNode architecture (via MapR)

The Apache community is also working to address this NameNode SPOF: Hadoop 2.0.2 will include an update to HDFS called HDFS High Availability (HA), which provides the user with "the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby. This allows a fast failover to a new NameNode in the case that a machine crashes, or a graceful administrator-initiated failover for the purpose of planned maintenance." The active NameNode logs all changes to a directory that is also accessible by the standby NameNode, which then uses the log information to update itself.


Architecture of HDFS High Availability framework

MapReduce

The MapReduce paradigm for parallel processing comprises two sequential steps: map and reduce.
In the map phase, the input is a set of key/value pairs and the desired function is executed over each key/value pair in order to generate a set of intermediate key/value pairs.
In the reduce phase, the intermediate key/value pairs are grouped by key and the values are combined together according to the reduce code provided by the user; for example, summing. It is also possible that no reduce phase is required, given the type of operation coded by the user.
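As a concrete illustration, below is a minimal sketch of the classic word-count job written against the Hadoop Java MapReduce API; the class and field names are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map phase: for each input line, emit an intermediate (word, 1) pair per word.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: the intermediate pairs are grouped by word; sum the counts.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}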

At the cluster level, the MapReduce processes are divided between two services, the JobTracker and the TaskTracker. The JobTracker runs on only one node of the cluster, while a TaskTracker runs on every slave node in the cluster. Each MapReduce job is split into a number of tasks, which are assigned to the various TaskTrackers depending on which data is stored on each node. The JobTracker is responsible for scheduling job runs and managing computational resources across the cluster, and it oversees the progress of each TaskTracker as it completes its individual tasks.


MapReduce Architecture

YARN

As Hadoop became more widely adopted and used on clusters with up to tens of thousands of nodes, it became obvious that MapReduce 1.0 had problems with scalability, memory usage, and synchronization, as well as its own SPOF. In response, YARN (Yet Another Resource Negotiator) was begun as a subproject in the Apache Hadoop Project, on par with other subprojects like HDFS, MapReduce, and Hadoop Common. YARN addresses problems with MapReduce 1.0's architecture, specifically with the JobTracker service.
Essentially, YARN "split[s] up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global Resource Manager (RM) and per-application ApplicationMaster (AM)." (source: Apache) Thus, rather than burdening a single node with handling scheduling and resource management for the entire cluster, YARN now distributes this responsibility across the cluster.


YARN Architecture

MapReduce 2.0

MapReduce 2.0, or MR2, contains the same execution framework as MapReduce 1.0, but it is built on the scheduling/resource management framework of YARN.
YARN, contrary to widespread misconceptions, is not the same as MapReduce 2.0 (MRv2). Rather, YARN is a general framework which can support multiple instances of distributed processing applications, of which MapReduce 2.0 is one.
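Because the MapReduce programming API is largely unchanged, code written for MR1 (such as the word-count sketch earlier) generally just recompiles and runs as a YARN application under MR2. A minimal job driver might look like the following sketch; the input and output paths are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setCombinerClass(WordCount.IntSumReducer.class);  // local pre-aggregation
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path("/data/input"));    // placeholder
    FileOutputFormat.setOutputPath(job, new Path("/data/output")); // placeholder
    // Under YARN, this submits the job to the ResourceManager, which launches
    // a per-job ApplicationMaster to negotiate containers for the tasks.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}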


Hadoop-Related Projects at Apache

With the exception of Chukwa, Drill, and HCatalog (incubator-level projects), all other Apache projects mentioned here are top-level projects.
This list is not meant to be all-inclusive, but it serves as an introduction to some of the most commonly used projects, and also illustrates the range of capabilities being developed around Hadoop. To name just a couple, Whirr and Crunch are other Hadoop-related Apache projects not described here.

Avro


Avro is a framework for performing remote procedure calls and data serialization. In the context of Hadoop, it can be used to pass data from one program or language to another, e.g. from C to Pig. It is particularly suited for use with scripting languages such as Pig, because data is always stored with its schema in Avro, and therefore the data is self-describing.
Avro can also handle changes in schema, a.k.a. "schema evolution," while still preserving access to the data. For example, different schemas could be used in serialization and deserialization of a given data set.
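As a small illustration, here is a sketch using Avro's Java "generic" API; the User schema is a made-up example.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
  public static void main(String[] args) {
    // In Avro the schema always travels with the serialized data, so any
    // consumer can read it back without out-of-band information.
    String schemaJson =
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"},"
      + "{\"name\":\"age\",\"type\":\"int\"}]}";

    Schema schema = new Schema.Parser().parse(schemaJson);

    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "alice");
    user.put("age", 30);
    System.out.println(user);  // prints the record as JSON-like text
  }
}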

BigTop


BigTop is a project for packaging and testing the Hadoop ecosystem. Much of BigTop's code was initially developed and released as part of Cloudera's CDH distribution, but has since become its own project at Apache.
The current BigTop release (0.5.0) supports a number of Linux distributions and packages Hadoop together with the following projects: ZooKeeper, Flume, HBase, Pig, Hive, Sqoop, Oozie, Whirr, Mahout, SolrCloud, Crunch, DataFu and Hue.

Chukwa


Chukwa, currently in incubation, is a data collection and analysis system built on top of HDFS and MapReduce. Tailored for collecting logs and other data from distributed monitoring systems, Chukwa provides a workflow that allows for incremental data collection, processing and storage in Hadoop. It is included in the Apache Hadoop distribution, but as an independent module.

Drill


Drill is an incubation-level project at Apache and is an open-source version of Google's Dremel. Drill is a distributed system for executing interactive analysis over large-scale datasets. Some explicit goals of the Drill project are to support real-time querying of nested data and to scale to clusters of 10,000 nodes or more.
Designed to support nested data, Drill also supports data with (e.g. Avro) or without (e.g. JSON) schemas. Its primary language is an SQL-like language, DrQL, though the Mongo Query Language can also be used.

Flume


Flume is a tool for harvesting, aggregating and moving large amounts of log data in and out of Hadoop. Flume "channels" data between "sources" and "sinks" and its data harvesting can either be scheduled or event-driven. Possible sources for Flume include Avro, files, and system logs, and possible sinks include HDFS and HBase. Flume itself has a query processing engine, so that there is the option to transform each new batch of data before it is shuttled to the intended sink.
Since July 2012, Flume has been released as Flume NG (New Generation), as it differs significantly from its original incarnation, a.k.a. Flume OG (Original Generation).

HBase


Based on Google's Bigtable, HBase "is an open-source, distributed, versioned, column-oriented store" that sits on top of HDFS. HBase is column-based rather than row-based, which enables high-speed execution of operations performed over similar values across massive data sets, e.g. read/write operations that involve all rows but only a small subset of all columns. HBase does not provide its own query or scripting language, but is accessible through Java, Thrift, and REST APIs.
HBase depends on ZooKeeper and runs a ZooKeeper instance by default.
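As a quick illustration of the Java API, here is a sketch of a single write and read using the classic HTable client; the "users" table and "info" column family are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    // Reads hbase-site.xml, which points at the ZooKeeper quorum HBase uses.
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "users");  // hypothetical table

    // Write one cell: row "row1", column family "info", qualifier "name".
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
    table.put(put);

    // Read the same cell back.
    Get get = new Get(Bytes.toBytes("row1"));
    Result result = table.get(get);
    System.out.println(
        Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));

    table.close();
  }
}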

HCatalog


An incubator-level project at Apache, HCatalog is a metadata and table storage management service for HDFS. HCatalog depends on the Hive metastore and exposes it to other services, such as MapReduce and Pig, through a common data model, with plans to expand to HBase. HCatalog's goal is to simplify the user's interaction with HDFS data and enable data sharing between tools and execution platforms.

Hive


Hive provides a warehouse structure and SQL-like access for data in HDFS and other Hadoop input sources (e.g. Amazon S3). Hive's query language, HiveQL, compiles to MapReduce. It also allows user-defined functions (UDFs). Hive is widely used, and has itself become a "sub-platform" in the Hadoop ecosystem.
Hive's data model provides a structure that is more familiar than raw HDFS to most users. It is based primarily on three related data structures: tables, partitions, and buckets, where tables correspond to HDFS directories and can be divided into partitions, which in turn can be divided into buckets.
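For example, a HiveQL query can be submitted from Java over JDBC through HiveServer2. This is a minimal sketch; the connection URL and the page_views table are illustrative assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC endpoint; host, port, and database are placeholders.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    Connection conn = DriverManager.getConnection(
        "jdbc:hive2://localhost:10000/default", "", "");
    Statement stmt = conn.createStatement();

    // HiveQL looks like SQL, but the query is compiled into MapReduce jobs.
    ResultSet rs = stmt.executeQuery(
        "SELECT country, COUNT(*) FROM page_views GROUP BY country");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
    }
    conn.close();
  }
}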

Mahout


Mahout is a scalable machine-learning and data mining library. There are currently four main groups of algorithms in Mahout:
  • recommendations, a.k.a. collaborative filtering
  • classification, a.k.a. categorization
  • clustering
  • frequent itemset mining, a.k.a. parallel frequent pattern mining
Mahout is not simply a collection of pre-existing algorithms: many machine learning algorithms are intrinsically non-scalable, meaning that, given the types of operations they perform, they cannot be executed as a set of parallel processes. The algorithms in the Mahout library belong to the subset that can be executed in a distributed fashion and have been written to be executable in MapReduce.


Oozie


Oozie is a job coordinator and workflow manager for jobs executed in Hadoop, which can include non-MapReduce jobs. It is integrated with the rest of the Apache Hadoop stack and, according to the Oozie site, it "support[s] several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system specific jobs (such as Java programs and shell scripts)."
An Oozie workflow is a collection of actions and Hadoop jobs arranged in a Directed Acyclic Graph (DAG), which is a common model for tasks that must run in sequence and are subject to certain constraints.


Pig


Pig is a framework consisting of a high-level scripting language (Pig Latin) and a run-time environment that allows users to execute MapReduce on a Hadoop cluster. Like HiveQL in Hive, Pig Latin is a higher-level language that compiles to MapReduce.
Pig is more flexible than Hive with respect to possible data format, due to its data model. Via the Pig Wiki: "Pig's data model is similar to the relational data model, except that tuples (a.k.a. records or rows) can be nested. For example, you can have a table of tuples, where the third field of each tuple contains a table. In Pig, tables are called bags. Pig also has a 'map' data type, which is useful in representing semistructured data, e.g. JSON or XML."


Sqoop


Sqoop ("SQL-to-Hadoop") is a tool which transfers data in both directions between relational systems and HDFS or other Hadoop data stores, e.g. Hive or HBase.
According to the Sqoop blog, "You can use Sqoop to import data from external structured datastores into Hadoop Distributed File System or related systems like Hive and HBase. Conversely, Sqoop can be used to extract data from Hadoop and export it to external structured datastores such as relational databases and enterprise data warehouses."


ZooKeeper


ZooKeeper is a service for maintaining configuration information and naming, as well as for providing distributed synchronization and group services. As the ZooKeeper wiki summarizes it, "ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical name space of data registers (we call these registers znodes), much like a file system." ZooKeeper itself is a distributed service with "master" and "slave" nodes, and it stores configuration information, etc. in memory on the ZooKeeper servers.
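As a small illustration of the znode model, here is a sketch using the ZooKeeper Java client; the ensemble address, znode path, and stored value are made up.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
  public static void main(String[] args) throws Exception {
    // Connect to a ZooKeeper ensemble (address is a placeholder) with a 3s session timeout.
    ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});

    // Create a persistent znode holding a small piece of configuration.
    String path = zk.create("/app-config", "batch.size=100".getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

    // Any process connected to the same ensemble reads the same value.
    byte[] data = zk.getData(path, false, null);
    System.out.println(new String(data));

    zk.close();
  }
}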


Hadoop-Related Projects Outside Apache

There are also projects outside of Apache that build on or parallel the major Hadoop projects at Apache. Several of interest are described here.

Spark (UC Berkeley)


Spark is a parallel computing framework which can operate over any Hadoop input source: HDFS, HBase, Amazon S3, Avro, etc. Spark is an open-source project at the U.C. Berkeley AMPLab, and in its own words, Spark "was initially developed for two applications where keeping data in memory helps: iterative algorithms, which are common in machine learning, and interactive data mining."
While often compared to MapReduce insofar as it also provides parallel processing over HDFS and other Hadoop input sources, Spark differs in two key ways:
  • Spark holds intermediate results in memory, rather than writing them to disk; this drastically reduces query return time
  • Spark supports more than just map and reduce functions, greatly expanding the set of possible analyses that can be executed over HDFS data
The first feature is the key to running iterative algorithms on Hadoop: rather than reading from HDFS, performing MapReduce, writing the results back to HDFS (i.e. to disk) and repeating for each cycle, Spark reads data from HDFS, performs the computation, and stores the intermediate results in memory as Resilient Distributed Datasets (RDDs). Spark can then run the next set of computations on the results cached in memory, thereby skipping the time-consuming steps of writing the nth round's results to HDFS and reading them back out for the (n+1)th round.
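A short sketch in Spark's Java API shows the caching idea; the HDFS path and the filter conditions are made up for illustration.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkCacheExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("log-analysis").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Read from HDFS once and keep the filtered RDD in memory.
    JavaRDD<String> lines = sc.textFile("hdfs:///logs/app.log");  // placeholder path
    JavaRDD<String> errors = lines.filter(line -> line.contains("ERROR")).cache();

    // Both actions below reuse the cached, in-memory RDD instead of re-reading HDFS.
    long total = errors.count();
    long timeouts = errors.filter(line -> line.contains("timeout")).count();

    System.out.println(total + " errors, " + timeouts + " timeouts");
    sc.stop();
  }
}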


Shark (UC Berkeley)


Shark is essentially "Hive running on Spark." It utilizes the Apache Hive infrastructure, including the Hive metastore and HDFS, but it gives users the benefits of Spark (increased processing speed, additional functions besides map and reduce). In this way, Shark users can execute queries in HiveQL over the same HDFS data sets, but receive results in near-real-time fashion.


Impala (Cloudera)


Released by Cloudera, Impala is an open-source project which, like Apache Drill, was inspired by Google's paper on Dremel; the purpose of both is to facilitate real-time querying of data in HDFS or HBase. Impala uses an SQL-like language that, though similar to HiveQL, is currently more limited than HiveQL. Because Impala relies on the Hive metastore, Hive must be installed on a cluster in order for Impala to work.
The secret behind Impala's speed is that it "circumvents MapReduce to directly access the data through a specialized distributed query engine that is very similar to those found in commercial parallel RDBMSs." (Source: Cloudera)
