Thursday, 31 July 2014

Big Data Basics - Part 4 - Introduction to HDFS


Problem

I have read the previous tips in the Big Data Basics series and I would like to know more about the Hadoop Distributed File System (HDFS). What is HDFS, what are its key concepts, and how does it work? Check out this tip to learn more.

Solution

Before we take a look at the architecture of HDFS, let us first take a look at some of the key concepts.

Introduction

HDFS stands for Hadoop Distributed File System. HDFS is one of the core components of the Hadoop framework and is responsible for the storage aspect. Unlike the usual storage available on our computers, HDFS is a Distributed File System and parts of a single large file can be stored on different nodes across the cluster. HDFS is a distributed, reliable, and scalable file system.

Key Concepts

Here are some of the key concepts related to HDFS.

NameNode

HDFS works in a master-slave/master-worker fashion. All the metadata related to HDFS, including information about DataNodes, the files stored on HDFS, replication, etc., is stored and maintained on the NameNode. The NameNode serves as the master, and there is only one NameNode per cluster.

DataNode

DataNode is the slave/worker node and holds the user data in the form of Data Blocks. There can be any number of DataNodes in a Hadoop Cluster.

Data Block

A Data Block can be considered the standard unit of data/files stored on HDFS. Each incoming file is broken into blocks of 64 MB by default (a configurable parameter). All the blocks which make up a particular file are of the same size (64 MB), except for the last block, which may be smaller depending upon the size of the file.
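
To make the block concept concrete, here is a minimal Java sketch using the Hadoop FileSystem API (it assumes a reachable HDFS cluster, the standard org.apache.hadoop client libraries on the classpath, and a made-up path /tmp/blocksize-demo.dat). It creates a file with an explicit 64 MB block size and then reads back the block size that HDFS recorded for the file.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class BlockSizeDemo {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();          // picks up core-site.xml / hdfs-site.xml
          FileSystem fs = FileSystem.get(conf);
          Path file = new Path("/tmp/blocksize-demo.dat");   // hypothetical path for this example

          // Create the file with an explicit 64 MB block size.
          // Arguments: path, overwrite, buffer size, replication factor, block size.
          long blockSize = 64L * 1024 * 1024;
          FSDataOutputStream out = fs.create(file, true, 4096, (short) 3, blockSize);
          out.writeBytes("some sample content\n");
          out.close();

          // Ask the NameNode which block size it recorded for this file.
          FileStatus status = fs.getFileStatus(file);
          System.out.println("Block size in bytes: " + status.getBlockSize());

          fs.close();
      }
  }

With this block size, a 200 MB file would be stored as three 64 MB blocks plus a final block of 8 MB.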

Replication

Data blocks are replicated across different nodes in the cluster to ensure a high degree of fault tolerance. Replication enables the use of low-cost commodity hardware for storing data. The number of replicas to be made/maintained, known as the Replication Factor, is configurable at the cluster level as well as for each individual file. Based on the Replication Factor, each data block which forms a file is replicated across different nodes in the cluster.

Rack Awareness

Data is replicated across different nodes in the cluster to ensure reliability/fault tolerance. Replica placement takes the location (rack) of each DataNode into account, so as to ensure a high degree of fault tolerance. For instance, with the default replication factor of 3, one copy of a data block is placed on the node writing the data (or on a node in the same rack), and the remaining copies are placed on nodes in a different rack, so that the failure of an entire rack does not make the data unavailable.

HDFS Architecture

Below is a high-level architecture of HDFS.
[Image: HDFS Architecture (Image Source: http://hadoop.apache.org/docs/stable1/hdfs_design.html)]

Here are the highlights of this architecture.

Master-Slave/Master-Worker Architecture

HDFS works in a master-slave/master-worker fashion.
  • The NameNode serves as the master and each DataNode serves as a worker/slave.
  • NameNode and each DataNode have built-in web servers.
  • NameNode is the heart of HDFS and is responsible for various tasks: it holds the file system namespace, controls access to the file system by clients, keeps track of the DataNodes, and keeps track of the replication factor to ensure that it is always maintained.
  • User data is stored on the local file system of DataNodes. A DataNode is not aware of the files to which the blocks stored on it belong. As a result, if the NameNode goes down, the data in HDFS is unusable, as only the NameNode knows which blocks belong to which file, where each block is located, etc.
  • NameNode can talk to all the live DataNodes and the live DataNodes can talk to each other.
  • There is also a Secondary NameNode, which periodically merges the NameNode's namespace image with the edit log (checkpointing). It is not a hot standby, but its checkpoint can be used to help restore the file system state if the primary NameNode goes down. This recovery needs to be done manually and there is no automatic failover mechanism in place.
  • NameNode receives heartbeat signals and a Block Report periodically from each of the DataNodes.
    • Heartbeat signals from a DataNode indicate that the DataNode is alive and working fine. If a heartbeat signal is not received from a DataNode, that DataNode is marked as dead and no further I/O requests are sent to it.
    • Block Report from each DataNode contains a list of all the blocks that are stored on that DataNode.
    • Heartbeat signals and Block Reports received from each DataNode help the NameNode identify any potential loss of blocks/data and to re-replicate the blocks to other active DataNodes when one or more DataNodes holding those blocks go down.
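
As a small illustration of the block metadata the NameNode builds from these Block Reports, the following Java sketch (same assumptions as before: a reachable cluster, the standard client libraries, and a hypothetical path to an existing file) asks the NameNode which DataNodes currently hold each block of a file.

  import java.util.Arrays;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.BlockLocation;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class BlockLocationsDemo {
      public static void main(String[] args) throws Exception {
          FileSystem fs = FileSystem.get(new Configuration());
          Path file = new Path("/data/sample/large-file.dat");   // hypothetical existing file

          // The NameNode answers this from the metadata it maintains via Block Reports.
          FileStatus status = fs.getFileStatus(file);
          BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
          for (BlockLocation block : blocks) {
              System.out.println("offset=" + block.getOffset()
                      + ", length=" + block.getLength()
                      + ", hosts=" + Arrays.toString(block.getHosts()));
          }
          fs.close();
      }
  }

Each line of output corresponds to one block of the file and lists the DataNodes that hold its replicas.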

Data Replication

Data is replicated across different DataNodes to ensure a high degree of fault-tolerance.
  • Replication Factor can be configured at a cluster level (Default is set to 3) and also at a file level.
  • The need for data replication can arise in various scenarios, such as when the Replication Factor is changed, when a DataNode goes down, or when Data Blocks get corrupted.
  • When the replication factor is increased (At the cluster level or for a particular file), NameNode identifies the data nodes to which the data blocks need to be replicated and initiates the replication.
  • When the replication factor is reduced (At the cluster level or for a particular file), NameNode identifies the data nodes from which the data blocks need to be deleted and initiates the deletion.
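
For example, the replication factor of an individual file can be changed through the same FileSystem API (the cluster-wide default comes from the dfs.replication property in hdfs-site.xml). The sketch below assumes a reachable cluster and a hypothetical existing file; once the request is accepted, the NameNode schedules the additional copies or deletions described above.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ReplicationDemo {
      public static void main(String[] args) throws Exception {
          FileSystem fs = FileSystem.get(new Configuration());
          Path file = new Path("/data/sample/important-file.dat");   // hypothetical existing file

          // Current replication factor recorded by the NameNode for this file.
          short current = fs.getFileStatus(file).getReplication();
          System.out.println("Current replication factor: " + current);

          // Raise it to 5; the NameNode picks the target DataNodes and initiates replication.
          boolean accepted = fs.setReplication(file, (short) 5);
          System.out.println("Replication change accepted: " + accepted);

          fs.close();
      }
  }

The same change can be made from the command line with the FS Shell command hadoop fs -setrep 5 /data/sample/important-file.dat.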

General

Here are a few general highlights about HDFS.
  • HDFS is implemented in Java and any computer which can run Java can host a NameNode/DataNode on it.
  • Designed to be portable across all the major hardware and software platforms.
  • Basic file operations include reading and writing of files, creation, deletion, and replication of files, etc.
  • Provides the necessary interfaces to move the computation to the data, unlike traditional data processing systems where the data moves to the computation.
  • HDFS provides a command line interface called "FS Shell" used to interact with HDFS.
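
To make these basic file operations concrete, here is a minimal Java sketch using the FileSystem API (again assuming a reachable cluster and the standard client libraries; the paths are made up for the example): it creates a directory, writes a file, reads it back, and deletes it. The equivalent FS Shell commands are hadoop fs -mkdir, hadoop fs -put, hadoop fs -cat, and hadoop fs -rm.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IOUtils;

  public class BasicFileOpsDemo {
      public static void main(String[] args) throws Exception {
          FileSystem fs = FileSystem.get(new Configuration());
          Path dir = new Path("/tmp/hdfs-demo");                  // hypothetical paths
          Path file = new Path(dir, "hello.txt");

          fs.mkdirs(dir);                                         // create a directory

          FSDataOutputStream out = fs.create(file, true);         // write a file
          out.writeBytes("Hello, HDFS!\n");
          out.close();

          FSDataInputStream in = fs.open(file);                   // read the file back
          IOUtils.copyBytes(in, System.out, 4096, false);
          in.close();

          fs.delete(file, false);                                 // delete the file
          fs.delete(dir, true);                                   // delete the directory recursively
          fs.close();
      }
  }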

When to Use HDFS (HDFS Use Cases)

There are many use cases for HDFS including the following:
  • HDFS can be used for storing the Big Data sets for further processing.
  • HDFS can be used for storing archive data, since it is cheaper: HDFS allows storing the data on low-cost commodity hardware while ensuring a high degree of fault tolerance.

When Not to Use HDFS

There are certain scenarios in which HDFS may not be a good fit including the following:
  • HDFS is not suitable for storing data related to applications requiring low latency data access.
  • HDFS is not suitable for storing a large number of small files, as the metadata for each file needs to be stored on the NameNode and is held in memory (see the rough sizing example after this list).
  • HDFS is not suitable for scenarios requiring multiple/simultaneous writes to the same file.
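
As a rough sizing example for the small files limitation (assuming the commonly cited estimate of roughly 150 bytes of NameNode heap per file system object): 10 million small files, each occupying its own block, amount to about 20 million objects (files plus blocks) and therefore on the order of 3 GB of NameNode memory, regardless of how little data the files actually contain.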

Next Steps

  • Explore more about Big Data and Hadoop
    • Big Data Basics - Part 1 - Introduction to Big Data
    • Big Data Basics - Part 2 - Overview of Big Data Architecture
    • Big Data Basics - Part 3 - Overview of Hadoop
    • Compare Big Data Platforms vs SQL Server
  • In the next and subsequent tips, we will look at the other aspects of Hadoop and the Big Data world. So stay tuned!


 

Big Data Basics - Part 3 - Overview of Hadoop


Problem

I have read the previous tips on Introduction to Big Data and Architecture of Big Data and I would like to know more about Hadoop. What are the core components of the Big Data ecosystem? Check out this tip to learn more.

Solution

Before we look into the architecture of Hadoop, let us understand what Hadoop is and a brief history of Hadoop.

What is Hadoop?

Hadoop is an open source framework, from the Apache foundation, capable of processing large amounts of heterogeneous data sets in a distributed fashion across clusters of commodity computers and hardware using a simplified programming model. Hadoop provides a reliable shared storage and analysis system.

The Hadoop framework is based closely on the following principle:
In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers. ~Grace Hopper

History of Hadoop

Hadoop was created by Doug Cutting and Mike Cafarella. Hadoop originated from an open source web search engine called "Apache Nutch", which is part of another Apache project called "Apache Lucene", a widely used open source text search library.

The name Hadoop is a made-up name and is not an acronym. According to Hadoop's creator Doug Cutting, the name came about as follows.

"The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid's term."

Architecture of Hadoop

Below is a high-level architecture of a multi-node Hadoop Cluster.

[Image: Architecture of Multi-Node Hadoop Cluster]

Here are a few highlights of the Hadoop Architecture:
  • Hadoop works in a master-worker / master-slave fashion.
  • Hadoop has two core components: HDFS and MapReduce.
  • HDFS (Hadoop Distributed File System) offers highly reliable and distributed storage and ensures reliability, even on commodity hardware, by replicating the data across multiple nodes. Unlike a regular file system, when data is pushed to HDFS, it is automatically split into multiple blocks (a configurable parameter) and stored/replicated across various datanodes. This ensures high availability and fault tolerance.
  • MapReduce offers an analysis system which can perform complex computations on large datasets. This component is responsible for performing all the computations; it works by breaking down a large, complex computation into multiple tasks, assigning them to individual worker/slave nodes, and taking care of the coordination and consolidation of results (a minimal word count sketch follows this list).
  • The master contains the Namenode and Job Tracker components.
    • Namenode holds the information about all the other nodes in the Hadoop Cluster, files present in the cluster, constituent blocks of files and their locations in the cluster, and other information useful for the operation of the Hadoop Cluster.
    • Job Tracker keeps track of the individual tasks/jobs assigned to each of the nodes and coordinates the exchange of information and results.
  • Each Worker / Slave contains the Task Tracker and Datanode components.
    • Task Tracker is responsible for running the task / computation assigned to it.
    • Datanode is responsible for holding the data.
  • The computers in the cluster can be located anywhere; there is no dependency on the physical location of the servers.
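
As a concrete illustration of the MapReduce component described above, here is a minimal word count job written against the Hadoop Java API (a sketch, not production code: it assumes the Hadoop client libraries are on the classpath and takes hypothetical input and output paths as command line arguments). The mapper emits a (word, 1) pair for every word it sees, and the reducer sums the counts for each word; the Job Tracker and Task Trackers take care of distributing and coordinating these tasks across the cluster.

  import java.io.IOException;
  import java.util.StringTokenizer;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCount {

      // Map phase: for each line of input, emit (word, 1) for every word in the line.
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
          private final static IntWritable ONE = new IntWritable(1);
          private final Text word = new Text();

          public void map(Object key, Text value, Context context)
                  throws IOException, InterruptedException {
              StringTokenizer itr = new StringTokenizer(value.toString());
              while (itr.hasMoreTokens()) {
                  word.set(itr.nextToken());
                  context.write(word, ONE);
              }
          }
      }

      // Reduce phase: sum the counts emitted for each word.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
          private final IntWritable result = new IntWritable();

          public void reduce(Text key, Iterable<IntWritable> values, Context context)
                  throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable val : values) {
                  sum += val.get();
              }
              result.set(sum);
              context.write(key, result);
          }
      }

      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          Job job = new Job(conf, "word count");
          job.setJarByClass(WordCount.class);
          job.setMapperClass(TokenizerMapper.class);
          job.setCombinerClass(IntSumReducer.class);   // optional local aggregation on the map side
          job.setReducerClass(IntSumReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
  }

A job like this is typically packaged into a jar and submitted with a command along the lines of hadoop jar wordcount.jar WordCount /input/path /output/path.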

Characteristics of Hadoop

Here are the prominent characteristics of Hadoop:
  • Hadoop provides a reliable shared storage (HDFS) and analysis system (MapReduce).
  • Hadoop is highly scalable and, unlike relational databases, it scales linearly. Because of this linear scaling, a Hadoop Cluster can contain tens, hundreds, or even thousands of servers.
  • Hadoop is very cost effective as it can work with commodity hardware and does not require expensive high-end hardware.
  • Hadoop is highly flexible and can process both structured as well as unstructured data.
  • Hadoop has built-in fault tolerance. Data is replicated across multiple nodes (the replication factor is configurable) and if a node goes down, the required data can be read from another node which has a copy of that data. Hadoop also ensures that the replication factor is maintained, even if a node goes down, by replicating the data to other available nodes.
  • Hadoop works on the principle of write once and read multiple times.
  • Hadoop is optimized for large and very large data sets. For instance, a small amount of data such as 10 MB, when fed to Hadoop, generally takes more time to process than it would on traditional systems.

When to Use Hadoop (Hadoop Use Cases)

Hadoop can be used in various scenarios including some of the following:
  • Analytics
  • Search
  • Data Retention
  • Log file processing
  • Analysis of Text, Image, Audio, & Video content
  • Recommendation systems like in E-Commerce Websites

When Not to Use Hadoop

There are a few scenarios in which Hadoop is not the right fit. Here are some of them:
  • Low-latency or near real-time data access.
  • A large number of small files to be processed. This is due to the way Hadoop works: the Namenode holds the file system metadata in memory, and as the number of files increases, so does the amount of memory required to hold the metadata.
  • Scenarios requiring multiple writers to the same file, or arbitrary writes and modifications at offsets within existing files.

For more information on the Hadoop framework and the features of the latest Hadoop release, visit the Apache website: http://hadoop.apache.org.

There are a few other important projects in the Hadoop ecosystem; these projects help in operating/managing Hadoop, interacting with Hadoop, integrating Hadoop with other systems, and Hadoop development. We will take a look at these items in subsequent tips.

Next Steps
  • Explore more about Big Data and Hadoop
  • In the next and subsequent tips, we will look at HDFS, MapReduce, and other aspects of the Big Data world. So stay tuned!

 
