Thursday, 31 July 2014

Big Data Basics - Part 3 - Overview of Hadoop


Problem

I have read the previous tips on Introduction to Big Data and Architecture of Big Data, and I would like to know more about Hadoop. What are the core components of the Big Data ecosystem? Check out this tip to learn more.

Solution

Before we look into the architecture of Hadoop, let us understand what Hadoop is and review a brief history of it.

What is Hadoop?

Hadoop is an open source framework from the Apache Software Foundation, capable of processing large amounts of heterogeneous data sets in a distributed fashion across clusters of commodity hardware using a simplified programming model. Hadoop provides a reliable shared storage and analysis system.

The Hadoop framework is based closely on the following principle:
In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers. ~Grace Hopper

History of Hadoop

Hadoop was created by Doug Cutting and Mike Cafarella. Hadoop originated from an open source web search engine called "Apache Nutch", which is part of another Apache project called "Apache Lucene", a widely used open source text search library.

The name Hadoop is a made-up name and is not an acronym. According to Hadoop's creator, Doug Cutting, the name came about as follows:

"The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid's term."

Architecture of Hadoop

Below is a high-level architecture of a multi-node Hadoop Cluster.

Architecture of Multi-Node Hadoop Cluster

Here are a few highlights of the Hadoop Architecture:
  • Hadoop works in a master-worker / master-slave fashion.
  • Hadoop has two core components: HDFS and MapReduce.
  • HDFS (Hadoop Distributed File System) offers highly reliable, distributed storage and ensures reliability, even on commodity hardware, by replicating the data across multiple nodes. Unlike a regular file system, when data is pushed to HDFS, it is automatically split into multiple blocks (a configurable parameter) and the blocks are stored/replicated across various datanodes. This ensures high availability and fault tolerance.
  • MapReduce offers an analysis system which can perform complex computations on large datasets. This component is responsible for performing all the computations; it works by breaking down a large complex computation into multiple tasks, assigning those to individual worker/slave nodes, and taking care of coordination and consolidation of the results (see the word-count sketch after this list).
  • The master contains the Namenode and Job Tracker components.
    • Namenode holds information about all the other nodes in the Hadoop Cluster, the files present in the cluster, the constituent blocks of files and their locations in the cluster, and other information useful for the operation of the Hadoop Cluster.
    • Job Tracker keeps track of the individual tasks/jobs assigned to each of the nodes and coordinates the exchange of information and results.
  • Each worker/slave node contains the Task Tracker and Datanode components.
    • Task Tracker is responsible for running the task / computation assigned to it.
    • Datanode is responsible for holding the data.
  • The computers in the cluster can be located anywhere; there is no dependency on the location of any physical server.
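
To make the master-worker division of labor concrete, below is a minimal word-count sketch written against Hadoop's Java MapReduce API (org.apache.hadoop.mapreduce). The class and variable names are illustrative only, not taken from any particular distribution:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: each map task is fed one split of the input (typically one HDFS
// block) and emits intermediate (word, 1) pairs independently of other tasks.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // emit (word, 1)
        }
    }
}

// Reduce phase: the framework groups the intermediate pairs by word, and each
// reduce task consolidates the partial counts into one total per word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum)); // emit (word, total)
    }
}

On the cluster, the Job Tracker schedules these map and reduce tasks onto the Task Trackers, preferring nodes that already hold the relevant HDFS blocks, which keeps the computation close to the data.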

Characteristics of Hadoop

Here are the prominent characteristics of Hadoop:
  • Hadoop provides a reliable shared storage (HDFS) and analysis system (MapReduce).
  • Hadoop is highly scalable and, unlike relational databases, Hadoop scales linearly. Because it scales out rather than up, a Hadoop Cluster can contain tens, hundreds, or even thousands of servers.
  • Hadoop is very cost effective as it can work with commodity hardware and does not require expensive high-end hardware.
  • Hadoop is highly flexible and can process both structured as well as unstructured data.
  • Hadoop has built-in fault tolerance. Data is replicated across multiple nodes (the replication factor is configurable, as shown in the sketch after this list) and if a node goes down, the required data can be read from another node which has a copy of that data. Hadoop also ensures that the replication factor is maintained when a node goes down, by re-replicating the data to the other available nodes.
  • Hadoop works on the principle of write once and read multiple times.
  • Hadoop is optimized for large and very large data sets. For instance, a small amount of data like 10 MB generally takes more time to process in Hadoop than in traditional systems.
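
As a concrete illustration of the configurable replication factor and block size mentioned above, here is a minimal hdfs-site.xml sketch; the values shown are the Hadoop 1.x defaults:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>        <!-- each HDFS block is stored on 3 datanodes -->
  </property>
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value> <!-- files are split into 64 MB blocks -->
  </property>
</configuration>

With these settings, a 200 MB file pushed to HDFS would be split into four blocks (three of 64 MB and one of 8 MB), and each block would be stored on three different datanodes.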

When to Use Hadoop (Hadoop Use Cases)

Hadoop can be used in various scenarios including some of the following:
  • Analytics
  • Search
  • Data Retention
  • Log file processing
  • Analysis of Text, Image, Audio, & Video content
  • Recommendation systems, such as those on e-commerce websites

When Not to Use Hadoop

There are a few scenarios in which Hadoop is not the right fit. Here are some of them:
  • Low-latency or near real-time data access.
  • A large number of small files to be processed. This is due to the way Hadoop works: the Namenode holds the file system metadata in memory, so as the number of files increases, the amount of memory required to hold the metadata increases (see the example after this list).
  • Scenarios involving multiple writers, or requiring arbitrary writes or modifications at arbitrary points within files.
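
To put the small-files limitation in perspective, a commonly cited rule of thumb is that every file, directory, and block object consumes on the order of 150 bytes of Namenode memory. Under that assumption, 10 million small files (each fitting in a single block) amount to roughly 20 million objects, or about 3 GB of Namenode memory spent on metadata alone, before a single byte of data is processed.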
For more information on the Hadoop framework and the features of the latest Hadoop release, visit the Apache website: http://hadoop.apache.org.

There are a few other important projects in the Hadoop ecosystem; these projects help in operating/managing Hadoop, interacting with Hadoop, integrating Hadoop with other systems, and Hadoop development. We will take a look at these items in subsequent tips.
Next Steps
  • Explore more about Big Data and Hadoop
  • In the next and subsequent tips, we will look at HDFS, MapReduce, and other aspects of the Big Data world. So stay tuned!

 

Big Data Basics - Part 2 - Overview of Big Data Architecture


Problem

I read the tip on Introduction to Big Data and would like to know more about how Big Data architecture looks in an enterprise, the scenarios in which Big Data technologies are useful, and any other relevant information.

Solution

In this tip, let us take a look at the architecture of a modern data processing and management system involving a Big Data ecosystem, a few use cases of Big Data, and also some of the common reasons for the increasing adoption of Big Data technologies.

Architecture

Before we look into the architecture of Big Data, let us take a look at a high-level architecture of a traditional data processing and management system. It looks as shown below.

Traditional Data Processing and Management Architecture

As we can see in the above architecture, mostly structured data is involved and it is used for Reporting and Analytics purposes. Although there are one or more unstructured sources involved, those often contribute a very small portion of the overall data and hence are not represented in the above diagram for simplicity. In the case of Big Data architecture, however, there are various sources involved, each of which comes in at different intervals, in different formats, and in different volumes. Below is a high-level architecture of an enterprise data management system with a Big Data engine.

Big Data Processing and Management Architecture

Let us take a look at various components of this modern architecture.

Source Systems

As discussed in the previous tip, there are various sources of Big Data including Enterprise Data, Social Media Data, Activity Generated Data, Public Data, Data Archives/Archived Files, and other Structured or Unstructured sources.

Transactional Systems

In an enterprise, there are usually one or more Transactional/OLTP systems which act as the backend databases for the enterprise's mission critical applications. These constitute the transactional systems represented above.

Data Archive

The Data Archive is a collection of data which includes the data archived from the transactional systems in compliance with an organization's data retention and data governance policies, as well as aggregated data (which is less likely to be needed in the near future) from the Big Data engine.

ODS

The Operational Data Store (ODS) is a consolidated set of data from various transactional systems. It acts as a staging data hub and can be used by a Big Data engine as well as for feeding data into the Data Warehouse, Business Intelligence, and Analytical systems.

Big Data Engine

This is the heart of the modern (Next-Generation / Big Data) data processing and management system architecture. This engine is capable of processing large volumes of data, ranging from a few megabytes to hundreds of terabytes or even petabytes, of different varieties, structured or unstructured, coming in at different speeds and/or intervals. The engine consists primarily of the Hadoop framework, which allows distributed processing of large heterogeneous data sets across clusters of computers. This framework consists of two main components, namely HDFS and MapReduce. We will take a closer look at this framework and its components in the next and subsequent tips.

Big Data Use Cases

Big Data technologies can solve business problems in a wide range of industries. Below are a few use cases.
  • Banking and Financial Services
    • Fraud detection, to detect possible fraud or suspicious transactions in accounts, credit cards, debit cards, insurance, etc.
  • Retail
    • Targeting customers with different discounts, coupons, promotions, etc. based on demographic data like gender, age group, location, occupation, dietary habits, buying patterns, and other information which can be used to differentiate/categorize customers.
  • Marketing
    • Outbound marketing, specifically, can make use of customer demographic information like gender, age group, location, occupation, and dietary habits, along with customer interests/preferences usually expressed in the form of comments/feedback and on social media networks.
    • Customers' communication preferences can be identified from various sources like polls, reviews, comments/feedback, and social media, and can be used to target customers via different channels like SMS, email, online stores, mobile applications, and retail stores.
  • Sentiment Analysis
    • Organizations use the data from social media sites like Facebook and Twitter to understand what customers are saying about the company, its products, and services. This type of analysis is also performed to understand which companies, brands, services, or technologies people are talking about.
  • Customer Service
    • IT Services and BPO companies analyze call records/logs to gain insights into customer complaints and feedback and into a call center executive's ability to resolve tickets, and to improve the overall quality of service.
    • In the telecommunications industry, call center records/logs can be analyzed to optimize pricing as well as calling, messaging, and data plans.

Apart from these, Big Data technologies/solutions can solve business problems in other industries like Healthcare, Automotive, Aeronautics, Gaming, and Manufacturing.

Big Data Adoption

Data has always been there and is growing at a rapid pace. One question being asked quite often is: "Why are organizations taking an interest in the silos of data, which otherwise were not utilized effectively in the past, and embracing Big Data technologies today?" The adoption of Big Data technologies is driven by various factors, including the following:
  • Cost Factors
    • Availability of Commodity Hardware
    • Availability of Open Source Operating Systems
    • Availability of Cheaper Storage
    • Availability of Open Source Tools/Software
  • Business Factors
    • There is a lot of data being generated outside the enterprise, and organizations are compelled to consume that data to stay ahead of the competition. Often, organizations are interested in only a subset of this large volume of data.
    • The volume of structured and unstructured data being generated in the enterprise is very large and cannot be effectively handled using the traditional data management and processing tools.
Next Steps
  • Explore more Big Data use cases
  • Stay tuned for next tips in this series to learn more about Big Data ecosystem

 
