Thursday, 31 July 2014

Big Data Basics - Part 6 - Related Apache Projects in Hadoop Ecosystem

Problem

I have read the previous tips in the Big Data Basics series including the storage (HDFS) and computation (MapReduce) aspects. After reading through those tips, I understand that HDFS and MapReduce are the core components of Hadoop. Now, I want to know about other components that are part of the Hadoop Ecosystem.

Solution

In this tip we will take a look at some of the other popular Apache Projects that are part of the Hadoop Ecosystem.

Hadoop Ecosystem

As we learned in the previous tips, HDFS and MapReduce are the two core components of the Hadoop Ecosystem and are at the heart of the Hadoop framework. Now it's time to take a look at some of the other Apache projects that are built around the Hadoop framework and form part of the Hadoop Ecosystem. The following diagram shows some of the most popular Apache projects/frameworks in the Hadoop Ecosystem.
[Figure: Apache Hadoop Ecosystem]
Next let us get an overview of each of the projects represented in the above diagram.

Apache Pig

Apache Pig is a software framework that offers a run-time environment for executing MapReduce jobs on a Hadoop cluster via a high-level scripting language called Pig Latin. The following are a few highlights of this project:
  • Pig is an abstraction (high level programming language) on top of a Hadoop cluster.
  • Pig Latin queries/commands are compiled into one or more MapReduce jobs and then executed on a Hadoop cluster.
  • Just like a real pig can eat almost anything, Apache Pig can operate on almost any kind of data.
  • Pig offers an interactive shell, called the Grunt shell, for executing Pig commands.
  • DUMP and STORE are two of the most common commands in Pig. DUMP displays the results on screen, while STORE writes the results to HDFS.
  • Pig offers various built-in operators, functions and other constructs for performing many common operations.
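To make this a bit more concrete, below is a minimal, hypothetical sketch of a Pig Latin word-count script launched from Python. The input path, output path, and field names are made up for illustration, and it assumes the pig executable is installed on the machine running the script.

# A hedged sketch: write a Pig Latin word-count script to a temp file and run it
# in Pig's local mode from Python. Paths and field names are hypothetical.
import subprocess
import tempfile

PIG_SCRIPT = """
-- Load each line of the input file as a single chararray field.
lines   = LOAD 'input.txt' AS (line:chararray);
-- Split every line into words and flatten them into one record per word.
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Group identical words together and count each group.
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS occurrences;
-- STORE writes the results to the file system; DUMP would print them on screen instead.
STORE counts INTO 'wordcount_output';
"""

with tempfile.NamedTemporaryFile("w", suffix=".pig", delete=False) as f:
    f.write(PIG_SCRIPT)
    script_path = f.name

# '-x local' runs Pig against the local file system; '-x mapreduce' would run it on a Hadoop cluster.
subprocess.run(["pig", "-x", "local", script_path], check=True)

When run against a cluster, each of these Pig Latin statements is compiled into one or more MapReduce jobs, which is exactly the abstraction described above.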

Apache Hive

The Apache Hive data warehouse framework facilitates the querying and management of large datasets residing in a distributed store/file system like the Hadoop Distributed File System (HDFS). The following are a few highlights of this project:
  • Hive offers a technique to map a tabular structure onto data stored in distributed storage.
  • Hive supports most of the data types available in many popular relational database platforms.
  • Hive has various built-in functions, types, etc. for handling many commonly performed operations.
  • Hive allows querying of the data from distributed storage through the mapped tabular structure.
  • Hive offers various features, which are similar to relational databases, like partitioning, indexing, external tables, etc.
  • Hive manages its internal data (system catalog) like metadata about Hive Tables, Partitioning information, etc. in a separate database known as Hive Metastore.
  • Hive queries are written in a SQL-like language known as HiveQL.
  • Hive also allows plugging in custom mappers, custom reducers, custom user-defined functions, etc. to perform more sophisticated operations.
  • HiveQL queries are executed via MapReduce: when a HiveQL query is issued, it triggers one or more Map and/or Reduce jobs to perform the operation defined in the query.
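As an illustration of how the mapped tabular structure is queried, here is a minimal sketch that uses the third-party PyHive client (not part of Hive itself) to run a HiveQL query from Python. The host name, database, table, and columns are hypothetical.

# A hedged sketch using the third-party PyHive client (pip install pyhive).
# Host, port, database, and the 'page_views' table are made up for illustration.
from pyhive import hive

conn = hive.connect(host="hive-server.example.com", port=10000, database="default")
cursor = conn.cursor()

# HiveQL looks like SQL; under the covers Hive typically turns this into MapReduce job(s).
cursor.execute("""
    SELECT country, COUNT(*) AS views
    FROM page_views
    WHERE view_date >= '2014-07-01'
    GROUP BY country
""")

for country, views in cursor.fetchall():
    print(country, views)

cursor.close()
conn.close()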

Apache Mahout

Apache Mahout is a scalable machine learning and data mining library. The following are a few highlights of this project:
  • Mahout implements the machine learning and data mining algorithms using MapReduce.
  • Mahout has four major categories of algorithms: Collaborative Filtering, Classification, Clustering, and Dimensionality Reduction.
  • The Mahout library contains two types of algorithms: ones that run in local mode and others that run in a distributed fashion.
  • More information on Algorithms: Mahout Algorithms.

Apache HBase

Apache HBase is a distributed, versioned, column-oriented, scalable big data store that runs on top of Hadoop/HDFS. The following are a few highlights of this project:
  • HBase is based on Google's BigTable concept.
  • Runs on top of Hadoop and HDFS in a distributed fashion.
  • Supports billions of rows and millions of columns.
  • Runs on a cluster of commodity hardware and scales linearly.
  • Offers consistent reads and writes.
  • Offers easy to use Java APIs for client access.
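Besides the Java API mentioned above, HBase can also be reached through its Thrift gateway. The following is a minimal sketch using the third-party happybase Python client; the host, table name, and column family are hypothetical.

# A hedged sketch using the third-party happybase client (pip install happybase).
# It requires the HBase Thrift server to be running; the names below are made up.
import happybase

connection = happybase.Connection("hbase-thrift.example.com", port=9090)

# Create a table with one column family that keeps up to 3 versions of each cell.
connection.create_table("user_profiles", {"info": dict(max_versions=3)})
table = connection.table("user_profiles")

# Write a row keyed by 'row-0001' and read it back.
table.put(b"row-0001", {b"info:name": b"Alice", b"info:city": b"Seattle"})
row = table.row(b"row-0001")
print(row[b"info:name"])  # b'Alice'

connection.close()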

Apache Sqoop

Apache Sqoop is a tool designed for efficiently transferring data between Hadoop and relational databases (RDBMS). The following are a few highlights of this project:
  • Sqoop can efficiently transfer bulk data between HDFS and Relational Databases.
  • Sqoop allows importing the data into HDFS in an incremental fashion.
  • Sqoop can import and export data to and from HDFS, Hive, Relational Databases and Data Warehouses.
  • Sqoop uses MapReduce to import and export data, thereby effectively utilizing the parallelism and fault-tolerance features of Hadoop.
  • Sqoop offers a command line interface, commonly referred to as the Sqoop command line.
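Since Sqoop is driven from its command line, the simplest way to script it is to build the command and invoke it. Below is a minimal sketch that imports a relational table into HDFS using four parallel map tasks; the JDBC URL, credentials, table name, and target directory are hypothetical.

# A hedged sketch: invoke the Sqoop command line from Python.
# The connection details and paths below are made up for illustration.
import subprocess

sqoop_import = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://db.example.com:3306/sales",   # source RDBMS
    "--username", "etl_user",
    "--password-file", "/user/etl/.db_password",             # keeps the password off the command line
    "--table", "orders",                                     # table to import
    "--target-dir", "/data/raw/orders",                      # destination directory in HDFS
    "--num-mappers", "4",                                    # run the import as 4 parallel map tasks
]

subprocess.run(sqoop_import, check=True)

Because the import itself runs as map tasks, it automatically benefits from Hadoop's parallelism and fault tolerance, as noted above.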

Apache Oozie

Apache Oozie is a workflow scheduler and coordination manager for the jobs executed on Hadoop. The following are a few highlights of this project:
  • Oozie workflows can include both MapReduce and non-MapReduce jobs.
  • Oozie is integrated with Hadoop and is an integral part of the Hadoop Ecosystem.
  • Oozie supports various jobs out of the box including MapReduce, Pig, Hive, Sqoop, etc.
  • Oozie jobs can be scheduled / recurring jobs and are executed based on a scheduled frequency and the availability of data.
  • Oozie workflow jobs are organized / arranged as a Directed Acyclic Graph (DAG) of actions.
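Oozie workflows are usually defined in an XML file (workflow.xml) stored in HDFS and submitted through the Oozie command line client or its REST API. The following is a minimal, hypothetical sketch that submits and runs a workflow from Python using the oozie CLI; the server URL and the contents of job.properties are assumptions.

# A hedged sketch: submit and start an Oozie workflow via the 'oozie' command line client.
# The Oozie server URL is hypothetical, and job.properties is assumed to point at a
# workflow application that already exists in HDFS.
import subprocess

subprocess.run(
    [
        "oozie", "job",
        "-oozie", "http://oozie-server.example.com:11000/oozie",  # Oozie server URL
        "-config", "job.properties",                               # workflow parameters, incl. the app path
        "-run",                                                    # submit and start the workflow
    ],
    check=True,
)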

Apache ZooKeeper

Apache ZooKeeper is an open source coordination service for distributed applications. The following are a few highlights of this project:
  • ZooKeeper is designed to be a centralized service.
  • ZooKeeper is responsible for maintaining configuration information, offering coordination in a distributed fashion, and providing a host of other capabilities.
  • ZooKeeper offers necessary tools for writing distributed applications which can coordinate effectively.
  • ZooKeeper simplifies the development of distributed applications.
  • ZooKeeper is used by other Apache projects, such as HBase, to offer high availability and a high degree of coordination in a distributed environment.
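To show what this coordination looks like in practice, here is a minimal sketch that stores and reads a small piece of shared configuration using the third-party kazoo Python client. The ensemble hosts and znode paths are hypothetical.

# A hedged sketch using the third-party kazoo client (pip install kazoo).
# The ZooKeeper hosts and znode paths below are made up for illustration.
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1.example.com:2181,zk2.example.com:2181")
zk.start()

# Store a small piece of shared configuration in a znode...
zk.ensure_path("/myapp/config")
zk.create("/myapp/config/batch_size", b"500")

# ...and read it back from any client connected to the ensemble.
value, stat = zk.get("/myapp/config/batch_size")
print(value.decode(), stat.version)

zk.stop()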

Apache Ambari

Apache Ambari is an open source software framework for provisioning, managing, and monitoring Hadoop clusters. The following are a few highlights of this project:
  • Ambari is useful for installing Hadoop services across different nodes of the cluster and handling the configuration of Hadoop Services on the cluster.
  • Ambari offers centralized management of the cluster, including configuration and re-configuration of services, starting and stopping the cluster, and much more.
  • Ambari offers a dashboard for monitoring the overall health of the cluster.
  • Ambari offers alerting and email notifications to draw attention to issues when required.
  • Ambari offers REST APIs to developers for application integration.
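Because Ambari exposes REST APIs, simple integrations do not need any special SDK. Below is a minimal sketch that lists the clusters managed by an Ambari server using the third-party requests library; the host, credentials, and response fields referenced in the comments are assumptions about the Ambari v1 API.

# A hedged sketch: call the Ambari REST API with the third-party 'requests' library.
# The host and credentials are hypothetical.
import requests

AMBARI = "http://ambari.example.com:8080/api/v1"
auth = ("admin", "admin")
headers = {"X-Requested-By": "ambari"}  # header Ambari expects on modifying calls; harmless on GETs

# List the clusters managed by this Ambari server.
resp = requests.get(AMBARI + "/clusters", auth=auth, headers=headers)
resp.raise_for_status()

for cluster in resp.json()["items"]:
    print(cluster["Clusters"]["cluster_name"])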

Conclusion

These are some of the popular Apache projects. Apart from those, there are various other Apache projects that are built around the Hadoop framework and have become part of the Hadoop Ecosystem. Some of these projects include:
  • Apache Avro - An open source framework for remote procedure calls (RPC), data serialization, and data exchange
  • Apache Spark - A fast and general engine for large-scale data processing
  • Apache Cassandra - A distributed NoSQL big data database

Next Steps
  • Explore more about Big Data and Hadoop.
  • Explore more about various Apache Projects.

 

Big Data Basics - Part 5 - Introduction to MapReduce

Problem

I have read the previous tips in the Big Data Basics series including the storage aspects (HDFS). I am curious about the computation aspect of Hadoop and want to know what it is all about, how it works, and any other relevant information.

Solution

In this tip we will take a look at the second core component of the Hadoop framework, called MapReduce, which is responsible for computation / data processing.

Introduction

MapReduce is a programming model and software framework that allows us to process data in parallel across multiple computers in a cluster, often running on commodity hardware, in a reliable and fault-tolerant fashion.

Key Concepts

Here are some of the key concepts related to MapReduce.

Job

A Job in the context of Hadoop MapReduce is the unit of work to be performed as requested by the client / user. The information associated with the Job includes the data to be processed (input data), MapReduce logic / program / algorithm, and any other relevant configuration information necessary to execute the Job.

Task

Hadoop MapReduce divides a Job into multiple sub-jobs known as Tasks. These tasks can be run independently of each other on various nodes across the cluster. There are primarily two types of Tasks - Map Tasks and Reduce Tasks.

JobTracker

Just like the storage (HDFS), the computation (MapReduce) also works in a master-slave / master-worker fashion. A JobTracker node acts as the Master and is responsible for scheduling / executing Tasks on appropriate nodes, coordinating the execution of tasks, sending the information needed for the execution of tasks, getting the results back after each task completes, re-executing failed Tasks, and monitoring / maintaining the overall progress of the Job. Since a Job consists of multiple Tasks, a Job's progress depends on the status / progress of the Tasks associated with it. There is only one JobTracker node per Hadoop Cluster.

TaskTracker

A TaskTracker node acts as the Slave and is responsible for executing a Task assigned to it by the JobTracker. There is no restriction on the number of TaskTracker nodes that can exist in a Hadoop Cluster. A TaskTracker receives the information necessary for execution of a Task from the JobTracker, executes the Task, and sends the results back to the JobTracker.

Map()

A Map Task in MapReduce is performed using the Map() function. This part of MapReduce is responsible for processing one or more chunks of data and producing the output results.

Reduce()

The next part / stage of the MapReduce programming model is the Reduce() function. This part of MapReduce is responsible for consolidating the results produced by each of the Map() functions/tasks.

Data Locality

MapReduce tries to place the compute as close to the data as possible. First, it tries to run the compute on the same node where the data resides; if that cannot be done (for reasons such as the node being down or busy with other computations), it then tries to place the compute on the node nearest to the data node(s) containing the data to be processed. This feature of MapReduce is called "Data Locality".

How Map Reduce Works

The following diagram shows the logical flow of a MapReduce programming model.
[Figure: Hadoop - MapReduce - Logical Data Flow]
Let us understand each of the stages depicted in the above diagram.
  • Input: This is the input data / file to be processed.
  • Split: Hadoop splits the incoming data into smaller pieces called "splits".
  • Map: In this step, MapReduce processes each split according to the logic defined in the map() function. Each mapper works on one split at a time. Each mapper is treated as a task, and multiple tasks are executed across different TaskTrackers and coordinated by the JobTracker.
  • Combine: This is an optional step and is used to improve performance by reducing the amount of data transferred across the network. The combiner is essentially the same as the reduce step and is used for aggregating the output of the map() function before it is passed to the subsequent steps.
  • Shuffle & Sort: In this step, the outputs from all the mappers are shuffled, sorted to put them in order, and grouped before being sent to the next step.
  • Reduce: This step is used to aggregate the outputs of mappers using the reduce() function. Output of reducer is sent to the next and final step. Each reducer is treated as a task and multiple tasks are executed across different TaskTrackers and coordinated by the JobTracker.
  • Output: Finally, the output of the reduce step is written to a file in HDFS.
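To make these stages concrete, here is a small in-memory sketch that mimics Split, Map, Shuffle & Sort, and Reduce on a handful of made-up lines of text (the optional Combine step is omitted, and the lines are chosen so the counts line up with the word count example discussed next). It only illustrates the flow of (key, value) pairs; Hadoop, of course, performs these stages in parallel across the cluster.

# An in-memory illustration of the MapReduce flow: split -> map -> shuffle & sort -> reduce.
# The input lines are made up; this is not how Hadoop implements the stages internally.
from itertools import groupby
from operator import itemgetter

data = "SQL DW SQL\nBI SSRS SQL\nSQL DW\nSSRS SQL"

# Split: break the input into smaller pieces (one line per split here).
splits = data.split("\n")

# Map: emit a (key, value) pair, (word, 1), for every word in every split.
mapped = [(word, 1) for split in splits for word in split.split()]

# Shuffle & Sort: bring all pairs with the same key together, in key order.
mapped.sort(key=itemgetter(0))

# Reduce: aggregate the values for each key to produce the final counts.
counts = {key: sum(v for _, v in group) for key, group in groupby(mapped, key=itemgetter(0))}
print(counts)  # {'BI': 1, 'DW': 2, 'SQL': 5, 'SSRS': 2}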

MapReduce Word Count Example

For the purpose of understanding MapReduce, let us consider a simple example. Let us assume that we have a file which contains the following four lines of text.
[Figure: Hadoop - MapReduce - Word Count Example - Input File]
In this file, we need to count the number of occurrences of each word. For instance, DW appears twice, BI appears once, SSRS appears twice, and so on. Let us see how this counting operation is performed when this file is input to MapReduce.
Below is a simplified representation of the data flow for Word Count Example.
[Figure: Hadoop - MapReduce - Word Count Example - Data Flow]
  • Input: In this step, the sample file is input to MapReduce.
  • Split: In this step, Hadoop splits / divides our sample input file into four parts, each part made up of one line from the input file. Note that, for the purpose of this example, we are treating each line as a separate split. However, this is not necessarily how splitting is done in a real-world scenario.
  • Map: In this step, each split is fed to a mapper which is the map() function containing the logic on how to process the input data, which in our case is the line of text present in the split. For our scenario, the map() function would contain the logic to count the occurrence of each word and each occurrence is captured / arranged as a (key, value) pair, which in our case is like (SQL, 1), (DW, 1), (SQL, 1), and so on.
  • Combine: This is an optional step and is often used to improve performance by reducing the amount of data transferred across the network. This is essentially the same as the reducer (reduce() function) and acts on the output of each mapper. In our example, the key-value pairs from the first mapper "(SQL, 1), (DW, 1), (SQL, 1)" are combined and the output of the corresponding combiner becomes "(SQL, 2), (DW, 1)".
  • Shuffle and Sort: In this step, the output of all the mappers is collected, shuffled, sorted, and arranged to be sent to the reducer.
  • Reduce: In this step, the collective data from various mappers, after being shuffled and sorted, is combined / aggregated and the word counts are produced as (key, value) pairs like (BI, 1), (DW, 2), (SQL, 5), and so on.
  • Output: In this step, the output of the reducer is written to a file on HDFS. The following image is the output of our word count example.
[Figure: Hadoop - MapReduce - Word Count Example - Output File]
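If you want to try a word count like this without writing Java, a common approach is Hadoop Streaming, which lets any executable that reads from standard input and writes to standard output act as the mapper and reducer. Below is a minimal sketch of such a mapper and reducer in Python; the file names and the sample hadoop command at the end are assumptions, and the exact location of the streaming jar depends on your Hadoop distribution.

# mapper.py - reads lines from standard input and emits one "word<TAB>1" pair per word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

# reducer.py - Hadoop Streaming sorts the mapper output by key before it reaches the
# reducer, so all the counts for a given word arrive as consecutive lines.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)

if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

A job like this is typically launched with something along the lines of "hadoop jar hadoop-streaming.jar -input <input path> -output <output path> -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py", where hadoop-streaming.jar is the streaming jar shipped with your Hadoop distribution.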

Highlights of Hadoop MapReduce

Here are a few highlights of the MapReduce programming model in Hadoop:
  • MapReduce works in a master-slave / master-worker fashion. JobTracker acts as the master and TaskTrackers act as the slaves.
  • MapReduce has two major phases - A Map phase and a Reduce phase. Map phase processes parts of input data using mappers based on the logic defined in the map() function. The Reduce phase aggregates the data using a reducer based on the logic defined in the reduce() function.
  • Depending upon the problem at hand, we can have one Reduce Task, multiple Reduce Tasks, or no Reduce Tasks.
  • MapReduce has built-in fault tolerance and hence can run on commodity hardware.
  • MapReduce takes care of distributing the data across various nodes, assigning the tasks to each of the nodes, getting the results back from each node, re-running the task in case of any node failures, consolidation of results, etc.
  • MapReduce processes the data in the form of (Key, Value) pairs. Hence, we need to fit our business problem into this Key-Value arrangement.

Next Steps

  • Explore more about Big Data and Hadoop
  • In the next and subsequent tips, we will look at the other aspects of Hadoop and the Big Data world. So stay tuned!

 
