Introduction to the Hadoop Ecosystem
When
Hadoop 1.0.0 was released by Apache in 2011, comprising mainly HDFS and
MapReduce, it soon became clear that Hadoop was not simply another
application or service, but a platform around which an entire ecosystem
of capabilities could be built. Since then, dozens of self-standing
software projects have sprung into being around Hadoop, addressing a variety of problem spaces and needs.
Many of these projects were begun by the same people or companies who
were the major developers and early users of Hadoop; others were
initiated by commercial Hadoop distributors. The majority of these
projects now share a home with Hadoop at the Apache Software Foundation,
which supports open-source software development and encourages the
development of the communities surrounding these projects.
The following sections are meant to give the reader a brief
introduction to the world of Hadoop and the core related software
projects. There are countless commercial Hadoop-integrated products focused on making Hadoop more usable and accessible to non-specialists, but the projects described here were chosen because they provide core functionality and performance within Hadoop itself.
The so-called "Hadoop ecosystem" is, as befits an ecosystem, complex,
evolving, and not easily parcelled into neat categories. Simply keeping
track of all the project names may seem like a task of its own, but
this pales in comparison to the task of tracking the functional and
architectural differences between projects. These projects are not meant
to all be used together, as parts of a single organism; some may even
be seeking to solve the same problem in different ways. What unites them
is that they each seek to tap into the scalability and power of Hadoop,
particularly the HDFS component of Hadoop.
Additional Links
- Cloudstory.com: 3-part series on Hadoop ecosystem
HDFS
The Hadoop Distributed File System (HDFS) offers a way to store large files across multiple machines, rather than requiring a single machine with disk capacity at least as large as the total size of the files. HDFS achieves fault tolerance by replicating and distributing data. When a file is loaded into HDFS, it is broken up into "blocks" of data, which are replicated and stored across the cluster nodes designated for storage, a.k.a. DataNodes.
At the architectural level, HDFS requires a NameNode process to run on one node in the cluster and a DataNode service to run on each "slave" node that will be storing data. The NameNode is responsible for the storage and management of metadata, so that when MapReduce or another execution framework calls for the data, the NameNode informs it where the needed data resides.

HDFS Architecture
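To make this concrete, here is a minimal sketch using the HDFS Java client API (org.apache.hadoop.fs.FileSystem). The file path is a made-up example; the client simply writes a byte stream, while the NameNode and DataNodes handle block placement and replication behind the scenes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteSketch {
      public static void main(String[] args) throws Exception {
        // Reads core-site.xml/hdfs-site.xml; fs.defaultFS points at the NameNode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/hello.txt");  // hypothetical path

        // The client writes a stream of bytes; HDFS splits it into blocks,
        // replicates each block, and spreads the replicas across DataNodes.
        try (FSDataOutputStream out = fs.create(file)) {
          out.writeUTF("hello, HDFS");
        }

        // The metadata printed here is what the NameNode tracks for the file.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("block size:  " + status.getBlockSize());
        System.out.println("replication: " + status.getReplication());
        fs.close();
      }
    }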
One significant drawback to HDFS is that it has a single point of
failure (SPOF), which lies in the NameNode service. If the NameNode or
the server hosting it goes down, HDFS is down for the entire cluster.
The Secondary NameNode, which periodically checkpoints the NameNode's metadata by merging its edit log into the filesystem image, is not itself a backup NameNode.
Currently the most comprehensive solution to this problem comes from
MapR, one of the major Hadoop distributors. MapR has developed a
"distributed NameNode," where the HDFS metadata is distributed across
the cluster in "Containers," which are tracked by the Container Location
Database (CLDB).
Regular NameNode architecture vs. MapR's distributed NameNode architecture
Via MapR
The Apache community is also working to address this NameNode SPOF:
Hadoop 2.0.2 will include an update to HDFS called HDFS High
Availability (HA), which provides the user with "the option of running
two redundant NameNodes in the same cluster in an Active/Passive
configuration with a hot standby. This allows a fast failover to a new
NameNode in the case that a machine crashes, or a graceful
administrator-initiated failover for the purpose of planned
maintenance." The active NameNode logs all changes to a directory that
is also accessible by the standby NameNode, which then uses the log
information to update itself.

Architecture of HDFS High Availability framework
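As a rough illustration of what the HA setup looks like from the client side, the sketch below sets the relevant HDFS properties in Java code; in a real deployment they would live in hdfs-site.xml. The nameservice name, host names, and shared edits directory are hypothetical.

    import org.apache.hadoop.conf.Configuration;

    public class HdfsHaConfigSketch {
      public static Configuration haConfiguration() {
        Configuration conf = new Configuration();
        // Clients address the logical nameservice, not a single NameNode host.
        conf.set("fs.defaultFS", "hdfs://mycluster");
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");  // the two redundant NameNodes
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "master1:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "master2:8020");
        // Directory for the shared edit log that both the active and standby NameNodes can reach.
        conf.set("dfs.namenode.shared.edits.dir", "file:///mnt/namenode-shared/edits");
        // Lets the client fail over to whichever NameNode is currently active.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
            "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        return conf;
      }
    }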
MapReduce
The MapReduce paradigm for parallel processing comprises two sequential steps: map and reduce.
In the map phase, the input is a set of key/value pairs, and the desired function is executed over each pair in order to generate a set of intermediate key/value pairs.
In the reduce phase, the intermediate key/value pairs are grouped by key and the values are combined according to the reduce code provided by the user; for example, summing. Depending on the operation the user has coded, a reduce phase may not be required at all.
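The canonical word-count job is a compact illustration of both phases in Hadoop's Java API: the mapper emits an intermediate (word, 1) pair for every word it sees, and the reducer receives each word grouped with all of its counts and sums them. The input and output paths are supplied as command-line arguments; the class and job names are only placeholders.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: emit (word, 1) for every word in the input line.
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce phase: the framework groups values by word; sum them.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }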
At the cluster level, the MapReduce processes are divided between two services, JobTracker and TaskTracker. JobTracker runs on a single node of the cluster, while TaskTracker runs on every slave node. Each MapReduce job is split into a number of tasks, which are assigned to the various TaskTrackers according to where the relevant data is stored. JobTracker is responsible for scheduling job runs and managing computational resources across the cluster, and it oversees the progress of each TaskTracker as it completes its individual tasks.

MapReduce Architecture
YARN
As Hadoop became more widely adopted and used on clusters with up to
tens of thousands of nodes, it became obvious that MapReduce 1.0 had problems with scalability, memory usage, and synchronization, and had its own SPOF. In response, YARN (Yet Another Resource Negotiator) was
begun as a subproject in the Apache Hadoop Project, on par with other
subprojects like HDFS, MapReduce, and Hadoop Common. YARN addresses
problems with MapReduce 1.0's architecture, specifically with the
JobTracker service.
Essentially, YARN "split[s] up the two major functionalities of the
JobTracker, resource management and job scheduling/monitoring, into
separate daemons. The idea is to have a global Resource Manager (RM) and
per-application ApplicationMaster (AM)." (source: Apache) Thus, rather
than burdening a single node with handling scheduling and resource
management for the entire cluster, YARN now distributes this
responsibility across the cluster.

YARN Architecture
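As a small illustration of the ResourceManager's role, the hedged sketch below uses the YarnClient Java API to ask the RM which applications it is currently tracking; each running application has its own ApplicationMaster somewhere on the cluster. The cluster address is assumed to come from yarn-site.xml on the classpath.

    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ListYarnApps {
      public static void main(String[] args) throws Exception {
        Configuration conf = new YarnConfiguration();   // picks up the ResourceManager address from yarn-site.xml
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Ask the global ResourceManager for the applications it is tracking;
        // each of these has (or had) its own per-application ApplicationMaster.
        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport app : apps) {
          System.out.println(app.getApplicationId() + "\t"
              + app.getName() + "\t"
              + app.getYarnApplicationState());
        }
        yarnClient.stop();
      }
    }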
MapReduce 2.0
MapReduce 2.0, or MR2, contains the same execution framework as
MapReduce 1.0, but it is built on the scheduling/resource management
framework of YARN.
YARN, contrary to widespread misconceptions, is not the same as
MapReduce 2.0 (MRv2). Rather, YARN is a general framework which can
support multiple instances of distributed processing applications, of
which MapReduce 2.0 is one.
Hadoop-Related Projects at Apache
With the exception of Chukwa, Drill, and HCatalog (incubator-level
projects), all other Apache projects mentioned here are top-level
projects.
This list is not meant to be all-inclusive, but it serves as an
introduction to some of the most commonly used projects, and also
illustrates the range of capabilities being developed around Hadoop. To
name just a couple,
Whirr and
Crunch are other Hadoop-related Apache projects not described here.

Avro is a framework for performing remote procedure calls and data
serialization. In the context of Hadoop, it can be used to pass data
from one program or language to another, e.g. from C to Pig. It is
particularly suited for use with scripting languages such as Pig,
because data is always stored with its schema in Avro, and therefore the
data is self-describing.
Avro can also handle changes in schema, a.k.a. "schema evolution,"
while still preserving access to the data. For example, different
schemas could be used in serialization and deserialization of a given
data set.
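The sketch below, using Avro's generic Java API, shows the schema-first style this implies: a hypothetical two-field "User" schema is parsed at run time and drives both serialization and deserialization. To handle schema evolution, GenericDatumReader can also be constructed with a separate, compatible reader schema alongside the writer schema.

    import java.io.ByteArrayOutputStream;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.DatumReader;
    import org.apache.avro.io.DatumWriter;
    import org.apache.avro.io.Decoder;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.io.Encoder;
    import org.apache.avro.io.EncoderFactory;

    public class AvroRoundTrip {
      public static void main(String[] args) throws Exception {
        // A made-up record schema; in Avro the schema always accompanies the data,
        // which is what makes the data self-describing.
        String schemaJson = "{\"type\":\"record\",\"name\":\"User\","
            + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"age\",\"type\":\"int\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Ada");
        user.put("age", 36);

        // Serialize the record to bytes.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
        Encoder encoder = EncoderFactory.get().binaryEncoder(bytes, null);
        writer.write(user, encoder);
        encoder.flush();

        // Deserialize; here the reader schema is the same, but a compatible,
        // evolved schema could be supplied instead.
        DatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
        Decoder decoder = DecoderFactory.get().binaryDecoder(bytes.toByteArray(), null);
        GenericRecord decoded = reader.read(null, decoder);
        System.out.println(decoded);
      }
    }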

BigTop is a project for packaging and testing the Hadoop ecosystem.
Much of BigTop's code was initially developed and released as part of
Cloudera's CDH distribution, but has since become its own project at
Apache.
The current BigTop release (0.5.0) supports a number of Linux
distributions and packages Hadoop together with the following projects:
ZooKeeper, Flume, HBase, Pig, Hive, Sqoop, Oozie, Whirr, Mahout,
SolrCloud, Crunch, DataFu and Hue.

Chukwa, currently in incubation, is a data collection and analysis
system built on top of HDFS and MapReduce. Tailored for collecting logs
and other data from distributed monitoring systems, Chukwa provides a
workflow that allows for incremental data collection, processing and
storage in Hadoop. It is included in the Apache Hadoop distribution,
but as an independent module.

Drill is an incubation-level project at Apache and is an open-source
version of Google's Dremel. Drill is a distributed system for executing
interactive analysis over large-scale datasets. Some explicit goals of
the Drill project are to support real-time querying of nested data and
to scale to clusters of 10,000 nodes or more.
Designed to support nested data, Drill also supports data with (e.g.
Avro) or without (e.g. JSON) schemas. Its primary language is an
SQL-like language, DrQL, though the Mongo Query Language can also be
used.

Flume is a tool for harvesting, aggregating and moving large amounts
of log data in and out of Hadoop. Flume "channels" data between
"sources" and "sinks" and its data harvesting can either be scheduled or
event-driven. Possible sources for Flume include Avro, files, and
system logs, and possible sinks include HDFS and HBase. Flume itself has
a query processing engine, so that there is the option to transform
each new batch of data before it is shuttled to the intended sink.
Since July 2012, Flume has been released as Flume NG (New Generation), as it differs significantly from its original incarnation, a.k.a. Flume OG (Original Generation).

Based on Google's Bigtable, HBase "is an open-source, distributed,
versioned, column-oriented store" that sits on top of HDFS. HBase is
column-based rather than row-based, which enables high-speed execution
of operations performed over similar values across massive data sets,
e.g. read/write operations that involve all rows but only a small subset
of all columns. HBase does not provide its own query or scripting
language, but is accessible through Java, Thrift, and REST APIs.
HBase depends on ZooKeeper and runs a ZooKeeper instance by default.
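The hedged sketch below uses the newer HBase Java client API (Connection/Table) to write and read a single cell; the table name, column family, and row key are hypothetical, and the table is assumed to already exist. Note how a read can name exactly the columns it needs.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseCellSketch {
      public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml, including the ZooKeeper quorum HBase relies on.
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("webtable"))) {  // hypothetical table

          // Write one cell: row key, column family, column qualifier, value.
          Put put = new Put(Bytes.toBytes("com.example/index.html"));
          put.addColumn(Bytes.toBytes("contents"), Bytes.toBytes("html"),
              Bytes.toBytes("<html>...</html>"));
          table.put(put);

          // Read back only the column we care about -- column-oriented access.
          Get get = new Get(Bytes.toBytes("com.example/index.html"));
          get.addColumn(Bytes.toBytes("contents"), Bytes.toBytes("html"));
          Result result = table.get(get);
          System.out.println(Bytes.toString(
              result.getValue(Bytes.toBytes("contents"), Bytes.toBytes("html"))));
        }
      }
    }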

An incubator-level project at Apache, HCatalog is a metadata and table storage management service for HDFS. HCatalog depends on the Hive metastore and exposes it, through a common data model, to other services such as MapReduce and Pig, with plans to expand to HBase. HCatalog's goal is to simplify the user's interaction with HDFS data and to enable data sharing between tools and execution platforms.

Hive provides a warehouse structure and SQL-like access for data in
HDFS and other Hadoop input sources (e.g. Amazon S3). Hive's query
language, HiveQL, compiles to MapReduce. It also allows user-defined
functions (UDFs). Hive is widely used, and has itself become a
"sub-platform" in the Hadoop ecosystem.
Hive's data model provides a structure that is more familiar than raw
HDFS to most users. It is based primarily on three related data
structures: tables, partitions, and buckets, where tables correspond to
HDFS directories and can be divided into partitions, which in turn can
be divided into buckets.
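The sketch below illustrates that structure through HiveServer2's JDBC driver: a partitioned, bucketed table is created and then queried with ordinary-looking HiveQL that Hive compiles into MapReduce jobs. The connection URL, table, and column names are hypothetical.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQlSketch {
      public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; the connection string is a placeholder.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement()) {

          // A partitioned, bucketed table: the table maps to an HDFS directory,
          // each partition to a subdirectory, and each bucket to a file within it.
          stmt.execute("CREATE TABLE IF NOT EXISTS page_views (user_id STRING, url STRING) "
              + "PARTITIONED BY (view_date STRING) "
              + "CLUSTERED BY (user_id) INTO 32 BUCKETS");

          // A HiveQL query that Hive compiles into one or more MapReduce jobs.
          ResultSet rs = stmt.executeQuery(
              "SELECT url, COUNT(*) AS hits FROM page_views "
              + "WHERE view_date = '2013-01-01' GROUP BY url");
          while (rs.next()) {
            System.out.println(rs.getString("url") + "\t" + rs.getLong("hits"));
          }
        }
      }
    }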

Mahout is a scalable machine-learning and data mining library. There are currently four main groups of algorithms in Mahout:
- recommendations, a.k.a. collaborative filtering
- classification, a.k.a. categorization
- clustering
- frequent itemset mining, a.k.a. parallel frequent pattern mining
Mahout is not simply a collection of pre-existing algorithms: many machine-learning algorithms are intrinsically non-scalable, meaning that, given the types of operations they perform, they cannot be executed as a set of parallel processes. Algorithms in the Mahout library belong to the subset that can be executed in a distributed fashion and have been written to be executable in MapReduce.

Oozie is a job coordinator and workflow manager for jobs executed in
Hadoop, which can include non-MapReduce jobs. It is integrated with the
rest of the Apache Hadoop stack and, according to the Oozie site, it
"support[s] several types of Hadoop jobs out of the box (such as Java
map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well
as system specific jobs (such as Java programs and shell scripts)."
An Oozie workflow is a collection of actions and Hadoop jobs arranged in a Directed Acyclic Graph (DAG), which is a common model for tasks that must run in sequence and are subject to certain constraints.

Pig is a framework consisting of a high-level scripting language (Pig
Latin) and a run-time environment that allows users to execute
MapReduce on a Hadoop cluster. Like HiveQL in Hive, Pig Latin is a
higher-level language that compiles to MapReduce.
Pig is more flexible than Hive with respect to the data formats it can handle, thanks to its data model. Via the Pig Wiki: "Pig's data model is similar
to the relational data model, except that tuples (a.k.a. records or
rows) can be nested. For example, you can have a table of tuples, where
the third field of each tuple contains a table. In Pig, tables are
called bags. Pig also has a 'map' data type, which is useful in
representing semistructured data, e.g. JSON or XML."
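The hedged sketch below embeds a few Pig Latin statements in Java via PigServer; the input path and field layout are invented. The GROUP statement produces exactly the kind of nested structure described above: each output tuple carries a bag of the grouped records.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigLatinSketch {
      public static void main(String[] args) throws Exception {
        // ExecType.MAPREDUCE compiles the script to MapReduce jobs on the cluster;
        // ExecType.LOCAL would run the same script on the local machine.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Pig Latin statements; the input path and schema are hypothetical.
        pig.registerQuery("logs = LOAD '/data/access_log' USING PigStorage('\\t') "
            + "AS (user:chararray, url:chararray, bytes:long);");
        // GROUP yields tuples whose second field is a bag (a nested table) of records.
        pig.registerQuery("by_user = GROUP logs BY user;");
        pig.registerQuery("totals = FOREACH by_user GENERATE group AS user, "
            + "SUM(logs.bytes) AS total_bytes;");
        // store() triggers execution and writes the result back to HDFS.
        pig.store("totals", "/data/bytes_per_user");
      }
    }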

Sqoop ("SQL-to-Hadoop") is a tool which transfers data in both
directions between relational systems and HDFS or other Hadoop data
stores, e.g. Hive or HBase.
According to the Sqoop blog, "You can use Sqoop to import data from
external structured datastores into Hadoop Distributed File System or
related systems like Hive and HBase. Conversely, Sqoop can be used to
extract data from Hadoop and export it to external structured datastores
such as relational databases and enterprise data warehouses."

ZooKeeper is a service for maintaining configuration information and naming, and for providing distributed synchronization and group services. As the ZooKeeper wiki summarizes it, "ZooKeeper allows
distributed processes to coordinate with each other through a shared
hierarchical name space of data registers (we call these registers
znodes), much like a file system." ZooKeeper itself is a distributed
service with "master" and "slave" nodes, and stores configuration
information, etc. in memory on ZooKeeper servers.
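The sketch below uses the ZooKeeper Java client to create and read back one znode in that shared namespace; the connection string and znode path are hypothetical.

    import java.util.concurrent.CountDownLatch;

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZnodeSketch {
      public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Hypothetical connection string; the watcher simply waits for the session event.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> connected.countDown());
        connected.await();

        // Create a persistent znode in the hierarchical namespace and read it back.
        // (Fails with NodeExistsException if the znode is already there.)
        String path = "/demo-config";
        zk.create(path, "max_workers=8".getBytes(),
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data));

        zk.close();
      }
    }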
Hadoop-Related Projects Outside Apache
There are also projects outside of Apache that build on or parallel
the major Hadoop projects at Apache. Several of interest are described
here.
Spark (UC Berkeley)

Spark is a parallel computing framework which can operate over any
Hadoop input source: HDFS, HBase, Amazon S3, Avro, etc. Spark is an
open-source project at the U.C. Berkeley AMPLab, and in its own words,
Spark "was initially developed for two applications where keeping data
in memory helps: iterative algorithms, which are common in machine
learning, and interactive data mining."
While often compared to MapReduce insofar as it also provides
parallel processing over HDFS and other Hadoop input sources, Spark
differs in two key ways:
- Spark holds intermediate results in memory, rather than writing them to disk; this drastically reduces query return time
- Spark supports more than just map and reduce functions, greatly
expanding the set of possible analyses that can be executed over HDFS
data
The first feature is the key to doing iterative algorithms on Hadoop:
rather than reading from HDFS, performing MapReduce, writing the
results back to HDFS (i.e. to disk) and repeating for each cycle, Spark
reads data from HDFS, performs the computation, and stores the
intermediate results in memory as Resilient Distributed Datasets (RDDs). Spark
can then run the next set of computations on the results cached in
memory, thereby skipping the time-consuming steps of writing the nth
round results to HDFS and reading them back out for the (n+1)th round.
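Spark itself is written in Scala, but as a rough illustration the sketch below uses its Java API (with Java 8 lambdas, so it assumes a reasonably recent Spark release): a data set is read from HDFS once, filtered, cached in memory as an RDD, and then queried twice without touching HDFS again. The input path is hypothetical.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SparkCacheSketch {
      public static void main(String[] args) {
        // The master URL is normally supplied by spark-submit rather than in code.
        SparkConf conf = new SparkConf().setAppName("log-exploration");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read from HDFS once, filter, and cache the result in memory as an RDD.
        JavaRDD<String> errors = sc.textFile("hdfs:///logs/app.log")  // hypothetical path
            .filter(line -> line.contains("ERROR"))
            .cache();

        // Both actions run against the cached, in-memory data set;
        // no second pass over HDFS is needed.
        long total = errors.count();
        long timeouts = errors.filter(line -> line.contains("timeout")).count();
        System.out.println(total + " errors, " + timeouts + " timeouts");

        sc.stop();
      }
    }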
Shark (UC Berkeley)

Shark is essentially "Hive running on Spark." It utilizes the Apache
Hive infrastructure, including the Hive metastore and HDFS, but it gives
users the benefits of Spark (increased processing speed, additional
functions besides map and reduce). This way, Shark users can execute queries in HiveQL over the same HDFS data sets but receive results in near real time.

Impala (Cloudera)

Released by Cloudera, Impala is an open-source project which, like Apache Drill, was inspired by Google's paper on Dremel; the purpose of both is to facilitate real-time querying of data in HDFS or HBase. Impala uses an SQL-like language that is similar to, though currently more limited than, HiveQL. Because Impala relies on the Hive metastore, Hive must be installed on the cluster in order for Impala to work.
The secret behind Impala's speed is that it "circumvents MapReduce to
directly access the data through a specialized distributed query engine
that is very similar to those found in commercial parallel RDBMSs."
(Source: Cloudera)