Hadoop Course Content
(Development and Administration)
Introduction to Big Data and Hadoop
v Big Data
n What is Big Data?
n Why are all industries talking about Big Data?
n What are the issues in Big Data?
§ Storage
§ What are the challenges for storing big data?
§ Processing
§ What are the challenges for processing big data?
n Which technologies support Big Data?
§ Hadoop
§ Databases
§ Traditional
§ NoSQL
v Hadoop
n What is Hadoop?
n History of Hadoop
n Why Hadoop?
n Hadoop Use cases
n Advantages and Disadvantages of Hadoop
v Importance of Different Ecosystems of Hadoop
v Importance of Integration with other BigData solutions
v Big Data real-time use cases
HDFS (Hadoop Distributed File System)
v HDFS architecture
o Name Node
§ Importance of the Name Node
§ Roles of the Name Node
§ Drawbacks of the Name Node
o Secondary Name Node
§ Importance of the Secondary Name Node
§ Roles of the Secondary Name Node
§ Drawbacks of the Secondary Name Node
o Data Node
§ Importance of the Data Node
§ Roles of the Data Node
§ Drawbacks of the Data Node
v Data Storage in HDFS
o How blocks are stored in Data Nodes
o How replication works across Data Nodes
o How files are written to HDFS
o How files are read from HDFS
v HDFS Block size
o Importance of the HDFS block size
o Why is the block size so large?
o How it relates to the MapReduce split size
v HDFS Replication factor
o Importance of the HDFS replication factor in a production environment
o Can we change the replication factor for a particular file or folder?
o Can we change the replication factor for all files and folders?
v Accessing HDFS
o CLI (Command Line Interface) using HDFS commands
o Java-based approach (see the sketch below)
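As a taste of the Java-based approach, here is a minimal sketch that reads a file from HDFS through the FileSystem API. The path /user/demo/input.txt is a hypothetical example, and the program assumes core-site.xml is available on the classpath.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();       // picks up core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);           // connects to the default file system (HDFS)
        Path file = new Path("/user/demo/input.txt");   // hypothetical path
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);               // print each line of the HDFS file
            }
        }
    }
}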
v HDFS Commands
o Importance of each command
o How to execute each command
o Explanation of HDFS admin-related commands
v Configurations
o Can we change the existing HDFS configurations?
o Importance of configurations
v How to overcome the Drawbacks in HDFS
o Name Node failures
o Secondary Name Node failures
o Data Node failures
v Where does HDFS fit, and where doesn't it?
v Exploring the Apache HDFS Web UI
v How to configure the Hadoop Cluster
o How to add new nodes (Commissioning)
o How to remove existing nodes (De-commissioning)
o How to identify dead nodes
o How to restart dead nodes
v Hadoop 2.x.x version features
o Introduction to Namenode federation
o Introduction to Namenode High Availability
v Difference between Hadoop 1.x.x and Hadoop 2.x.x versions
MapReduce
v Map Reduce architecture
o JobTracker
§ Importance of the JobTracker
§ Roles of the JobTracker
§ Drawbacks of the JobTracker
o TaskTracker
§ Importance of the TaskTracker
§ Roles of the TaskTracker
§ Drawbacks of the TaskTracker
o Map Reduce Job execution flow
v Data Types in Hadoop
o What are the data types in Map Reduce?
o Why are they important in Map Reduce?
o Can we write custom data types in Map Reduce?
v Input Formats in Map Reduce
o Text Input Format
o Key Value Text Input Format
o Sequence File Input Format
o NLine Input Format
o Importance of Input Formats in Map Reduce
o How to use Input Formats in Map Reduce
o How to write custom Input Formats and their Record Readers
v Output Formats in Map Reduce
o Text Output Format
o Sequence File Output Format
o Importance of Output Formats in Map Reduce
o How to use Output Formats in Map Reduce
o How to write custom Output Formats and their Record Writers
v Mapper
o What is a mapper in a Map Reduce job?
o Why do we need a mapper?
o Advantages and disadvantages of mappers
o Writing mapper programs
v Reducer
o What is a reducer in a Map Reduce job?
o Why do we need a reducer?
o Advantages and disadvantages of reducers
o Writing reducer programs (see the WordCount sketch after this list)
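To make the mapper and reducer concrete, here is a minimal WordCount sketch using the org.apache.hadoop.mapreduce API; the class and variable names are our own choices, not part of any official example.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Mapper: emits (word, 1) for every token in the input line.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}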
v Combiner
o What is a combiner in a Map Reduce job?
o Why do we need a combiner?
o Advantages and disadvantages of combiners
o Writing combiner programs
v Partitioner
o What is a Partitioner in a Map Reduce job?
o Why do we need a Partitioner?
o Advantages and disadvantages of Partitioners
o Writing Partitioner programs (see the sketch below)
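A minimal custom Partitioner sketch, assuming the WordCount key/value types sketched earlier and a job configured with two reducers; the a-to-m split is an arbitrary illustration.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Fall back to a single partition for empty keys or single-reducer jobs.
        if (key.getLength() == 0 || numPartitions < 2) {
            return 0;
        }
        char first = Character.toLowerCase(key.toString().charAt(0));
        return (first >= 'a' && first <= 'm') ? 0 : 1;   // split the key space in two
    }
}
// Wire it into a job with:
//   job.setPartitionerClass(AlphabetPartitioner.class);
//   job.setNumReduceTasks(2);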
v Distributed Cache
o What is the Distributed Cache in a Map Reduce job?
o Importance of the Distributed Cache in a Map Reduce job
o Advantages and disadvantages of the Distributed Cache
o Writing Distributed Cache programs (see the sketch below)
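A minimal Distributed Cache sketch under the Hadoop 2.x Job API: a small stop-word file is shipped to every task and loaded once in setup(). The file path and name are hypothetical.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class StopWordMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private final Set<String> stopWords = new HashSet<String>();

    @Override
    protected void setup(Context context) throws IOException {
        // The cached file is localized into the task's working directory
        // under the name given after the '#' in the cache URI.
        try (BufferedReader reader = new BufferedReader(new FileReader("stopwords.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                stopWords.add(line.trim());
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                context.write(new Text(token), NullWritable.get());  // keep only non-stop words
            }
        }
    }
}
// In the driver, before submission (hypothetical HDFS path):
//   job.addCacheFile(new java.net.URI("/user/demo/stopwords.txt#stopwords.txt"));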
v Counters
o What is a Counter in a Map Reduce job?
o Why do we need Counters in a production environment?
o How to write Counters in Map Reduce programs (see the sketch below)
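A minimal Counter sketch: the mapper below flags malformed CSV records through a user-defined counter enum (the enum name and field layout are our own illustration), and the total shows up in the job's counter report.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class QualityMapper extends Mapper<LongWritable, Text, Text, Text> {
    public enum Quality { MALFORMED_RECORDS }   // user-defined counter

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length < 3) {
            // Increments are aggregated across all tasks and shown in the
            // job client output and the Web UI.
            context.getCounter(Quality.MALFORMED_RECORDS).increment(1);
            return;   // skip the bad record
        }
        context.write(new Text(fields[0]), new Text(fields[1]));
    }
}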
v Importance of the Writable and WritableComparable APIs
o How to write custom Map Reduce values using Writable
o How to write custom Map Reduce keys using WritableComparable (see the sketch below)
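A minimal custom key sketch, built around a hypothetical (year, temperature) composite. Keys must implement WritableComparable because the framework sorts them; plain values only need Writable.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class YearTempPair implements WritableComparable<YearTempPair> {
    private int year;
    private int temperature;

    public YearTempPair() {}   // Hadoop needs the no-arg constructor for deserialization

    public YearTempPair(int year, int temperature) {
        this.year = year;
        this.temperature = temperature;
    }

    @Override
    public void write(DataOutput out) throws IOException {    // serialize fields in a fixed order
        out.writeInt(year);
        out.writeInt(temperature);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // deserialize in the same order
        year = in.readInt();
        temperature = in.readInt();
    }

    @Override
    public int compareTo(YearTempPair other) {                // sort by year, then temperature
        int cmp = Integer.compare(year, other.year);
        return (cmp != 0) ? cmp : Integer.compare(temperature, other.temperature);
    }

    @Override
    public int hashCode() {                                   // used by the default HashPartitioner
        return 31 * year + temperature;
    }

    @Override
    public boolean equals(Object o) {
        return (o instanceof YearTempPair) && compareTo((YearTempPair) o) == 0;
    }
}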
v Joins
o Map Side Join
§ Importance of the Map Side Join
§ Where it is used
o Reduce Side Join
§ Importance of the Reduce Side Join
§ Where it is used
o What is the difference between a Map Side Join and a Reduce Side Join?
v Compression techniques
o Importance of compression techniques in a production environment
o Compression types
§ NONE, RECORD and BLOCK
o Compression codecs
§ Default, Gzip, Bzip2, Snappy and LZO
o Enabling and disabling compression for all jobs
o Enabling and disabling compression for a particular job (see the sketch below)
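A minimal sketch of switching compression on for one particular job. The property names are the Hadoop 2.x ones; GzipCodec is chosen arbitrarily (Snappy and LZO additionally need native libraries installed on the cluster).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedJobSetup {
    public static Job build() throws Exception {
        Configuration conf = new Configuration();
        // Compress the intermediate map output (reduces shuffle traffic).
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                      GzipCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compressed-output");
        // Compress the final job output as well.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        return job;
    }
}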
v Map Reduce Schedulers
o FIFO Scheduler
o Capacity Scheduler
o Fair Scheduler
o Importance of schedulers in a production environment
o How to use schedulers in a production environment
v Map Reduce Programming Model
o How to write Map Reduce jobs in Java (see the driver sketch after this list)
o Running Map Reduce jobs in local mode
o Running Map Reduce jobs in pseudo mode
o Running Map Reduce jobs in cluster mode
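A minimal driver sketch wiring the WordCount mapper and reducer sketched earlier into a runnable job via ToolRunner; the same jar can then run in local, pseudo, or cluster mode depending on the configuration it picks up. This is an illustration, not the course's official solution.

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenMapper.class);
        // The reducer doubles as a combiner here because summing is associative.
        job.setCombinerClass(WordCount.SumReducer.class);
        job.setReducerClass(WordCount.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input dir from the command line
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output dir must not exist yet
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new WordCountDriver(), args));
    }
}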
v Debugging Map Reduce Jobs
o How to debug Map Reduce Jobs in Local Mode.
o How to debug Map Reduce Jobs in Remote Mode.
v YARN (Next Generation Map Reduce)
o What is YARN?
o What is the importance of YARN?
o Where the concept of YARN is used in real time
o What is the difference between YARN and Map Reduce?
v Data Locality
o What is Data Locality?
o Does Hadoop follow data locality?
v Speculative Execution
o What is Speculative Execution?
o Does Hadoop follow speculative execution?
v Map Reduce Commands
o Importance of each command
o How to execute each command
o Explanation of MapReduce admin-related commands
v Configurations
o Can we change the existing MapReduce configurations?
o Importance of configurations
v Writing Unit Tests for Map Reduce Jobs
v Configuring the Hadoop development environment using Eclipse
v Use of secondary sorting and how to implement it with MapReduce
v How to identify performance bottlenecks in MR jobs and tune them
v MapReduce Streaming and Pipes with examples
v Exploring the Apache MapReduce Web UI
Apache Pig
v Introduction to Apache Pig
v Map Reduce vs Apache Pig
v SQL vs Apache Pig
v Different data types in Pig
v Modes Of Execution in Pig
o Local Mode
o Map Reduce Mode
v Execution Mechanism
o Grunt Shell
o Script
o Embedded
v UDFs
o How to write UDFs in Pig (see the sketch below)
o How to use UDFs in Pig
o Importance of UDFs in Pig
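A minimal Pig EvalFunc UDF sketch that upper-cases its first argument; the class name and the myudfs.jar in the usage comment are hypothetical.

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class ToUpper extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Guard against empty tuples and null fields.
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return input.get(0).toString().toUpperCase();
    }
}
// In a Pig script (jar name hypothetical):
//   REGISTER myudfs.jar;
//   B = FOREACH A GENERATE ToUpper(name);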
v Filter Functions
o How to write Filter Functions in Pig
o How to use Filter Functions in Pig
o Importance of Filter Functions in Pig
v Load Functions
o How to write Load Functions in Pig
o How to use Load Functions in Pig
o Importance of Load Functions in Pig
v Store Functions
o How to use Store Functions in Pig
o Importance of Store Functions in Pig
v Transformations in Pig
v How to write complex Pig scripts
v How to integrate Pig and HBase
Apache Hive
v Hive Introduction
v Hive architecture
o Driver
o Compiler
o Semantic Analyzer
v Hive Integration with Hadoop
v Hive Query Language (HiveQL)
v SQL vs HiveQL
v Hive Installation and Configuration
v Hive, Map-Reduce and Local-Mode
v Hive DDL and DML operations
v Hive Services
o CLI
o HiveServer
o HWI (Hive Web Interface)
v Metastore
o Embedded metastore configuration
o External metastore configuration
v UDFs
o How to write UDFs in Hive (see the sketch below)
o How to use UDFs in Hive
o Importance of UDFs in Hive
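A minimal Hive UDF sketch using the classic one-row-in, one-value-out UDF API; the function and jar names in the usage comment are hypothetical.

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public final class Lower extends UDF {
    // Hive resolves evaluate() by reflection: one value in, one value out.
    public Text evaluate(Text input) {
        if (input == null) {
            return null;
        }
        return new Text(input.toString().toLowerCase());
    }
}
// In Hive (jar and function names hypothetical):
//   ADD JAR myudfs.jar;
//   CREATE TEMPORARY FUNCTION my_lower AS 'Lower';
//   SELECT my_lower(name) FROM employees;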
v UDAFs
o How to use UDAFs in Hive
o Importance of UDAFs in Hive
v UDTFs
o How to use UDTFs in Hive
o Importance of UDTFs in Hive
v How to write complex Hive queries
v What is Hive Data Model?
v Partitions
o Importance of Hive partitions in a production environment
o Limitations of Hive partitions
o How to create partitions
v Buckets
o Importance of Hive buckets in a production environment
o How to create buckets
v SerDe
o Importance of Hive SerDes in a production environment
o How to write SerDe programs
v How to integrate Hive and HBase
Apache ZooKeeper
v Introduction to ZooKeeper
v Pseudo mode installations
v ZooKeeper cluster installations
v Basic commands execution
Apache HBase
v HBase introduction
v HBase use cases
v HBase basics
o Column families
o Scans
v HBase installation
o Local mode
o Pseudo mode
o Cluster mode
v HBase Architecture
o Storage
o Write-Ahead Log
o Log-Structured Merge Trees
v MapReduce integration
o MapReduce over HBase
v HBase Usage
o Key design
o Bloom Filters
o Versioning
o Coprocessors
o Filters
v HBase Clients
o REST
o Thrift
o Hive
o Web Based UI
v HBase Admin
o Schema definition
o Basic CRUD operations (see the sketch below)
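A minimal sketch of basic CRUD from Java using the classic HTable client API; the users table and info column family are hypothetical and must already exist (for example, created via the HBase shell).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCrud {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        HTable table = new HTable(conf, "users");           // hypothetical table
        try {
            // Create/update: insert one cell.
            Put put = new Put(Bytes.toBytes("row1"));
            put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ravi"));
            table.put(put);

            // Read: fetch the cell back.
            Get get = new Get(Bytes.toBytes("row1"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));

            // Delete: remove the whole row.
            table.delete(new Delete(Bytes.toBytes("row1")));
        } finally {
            table.close();
        }
    }
}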
Apache Sqoop
v Introduction to Sqoop
v MySQL client and server installation
v Sqoop installation
v How to connect to a relational database using Sqoop
v Sqoop commands, with examples of the import and export commands
Apache Flume
v Introduction to Flume
v Flume installation
v Flume agent usage and example executions
Apache Oozie
v Introduction to Oozie
v Oozie installation
v Executing Oozie workflow jobs
v Monitoring Oozie workflow jobs
Apache Mahout
v Introduction to Mahout
v Mahout installation
v Mahout examples
Apache Cassandra
v Introduction to Cassandra
v Cassandra examples
Storm
v Introduction to Storm
v Storm examples
MongoDB
v Introduction to MongoDB
v MongoDB installation
v MongoDB examples
Apache Nutch
v Introduction to Nutch
v Nutch Installation
v Nutch Examples
Cloudera Distribution
v Introduction to Cloudera
v Cloudera Installation
v Cloudera Certification details
v How to use Cloudera Hadoop
v What are the main differences between Cloudera and Apache Hadoop?
Hortonworks Distribution
v Introduction to Hortonworks
v Hortonworks Installation
v Hortonworks Certification details
v How to use Hortonworks Hadoop
v What are the main differences between Hortonworks and Apache Hadoop?
Amazon EMR
v Introduction to Amazon EMR and Amazon EC2
v How to use Amazon EMR and Amazon EC2
v Why use Amazon EMR, and its importance
Architectural discussions of advanced and new technologies
v Mahout (Machine Learning Algorithms)
v Storm (Real time data streaming)
v Cassandra (NoSQL database)
v MongoDB (NoSQL database)
v Solr (Search engine)
v Nutch (Web Crawler)
v Lucene (Indexing data)
v Ganglia, Nagios (Monitoring tools)
v Cloudera, Hortonworks, MapR, Amazon EMR (Distributions)
v How to crack the Cloudera certification questions
Pre-Requisites for this Course
· Java basics such as OOP concepts, interfaces, classes, and abstract classes (free Java classes as part of the course)
· Basic SQL knowledge (free SQL classes as part of the course)
· Basic Linux commands (provided in our blog)
Administration topics:
· Hadoop Installations
o Local mode (hands-on installation on your laptop)
o Pseudo mode (hands-on installation on your laptop)
o Cluster mode (hands-on 20-node cluster setup in our lab)
o Node commissioning and de-commissioning in a Hadoop cluster
o Job monitoring in a Hadoop cluster
o Fair Scheduler (hands-on installation on your laptop)
o Capacity Scheduler (hands-on installation on your laptop)
· Hive Installations
o Local mode (hands-on installation on your laptop)
§ With internal Derby
o Cluster mode (hands-on installation on your laptop)
§ With external Derby
§ With external MySQL
o Hive Web Interface (HWI) mode (hands-on installation on your laptop)
o Hive Thrift Server mode (hands-on installation on your laptop)
o Derby installation (hands-on installation on your laptop)
o MySQL installation (hands-on installation on your laptop)
· Pig Installations
o Local mode (hands-on installation on your laptop)
o MapReduce mode (hands-on installation on your laptop)
· HBase Installations
o Local mode (hands-on installation on your laptop)
o Pseudo mode (hands-on installation on your laptop)
o Cluster mode (hands-on installation on your laptop)
§ With internal ZooKeeper
§ With external ZooKeeper
· ZooKeeper Installations
o Local mode (hands-on installation on your laptop)
o Cluster mode (hands-on installation on your laptop)
· Sqoop Installations
o Sqoop installation with MySQL (hands-on installation on your laptop)
o Sqoop with Hadoop integration (hands-on installation on your laptop)
o Sqoop with Hive integration (hands-on installation on your laptop)
· Flume Installation
o Pseudo mode (hands-on installation on your laptop)
· Oozie Installation
o Pseudo mode (hands-on installation on your laptop)
· Mahout Installation
o Local mode (hands-on installation on your laptop)
o Pseudo mode (hands-on installation on your laptop)
· MongoDB Installation
o Pseudo mode (hands-on installation on your laptop)
· Nutch Installation
o Pseudo mode (hands-on installation on your laptop)
· Cloudera Hadoop Distribution installation
o Hadoop
o Hive
o Pig
o HBase
o Hue
· Hortonworks Hadoop Distribution installation
o Hadoop
o Hive
o Pig
o HBase
o Hue
Hadoop ecosystem integrations:
o Hadoop and Hive integration
o Hadoop and Pig integration
o Hadoop and HBase integration
o Hadoop and Sqoop integration
o Hadoop and Oozie integration
o Hadoop and Flume integration
o Hive and Pig integration
o Hive and HBase integration
o Pig and HBase integration
o Sqoop and RDBMS integration
o Mahout and Hadoop integration
What we offer you:
· Hands-on MapReduce programming with 20+ programs; these will make you perfect in MapReduce, both conceptually and programmatically
· 5 hands-on POCs (these POCs will help you become perfect in Hadoop and its ecosystems)
· Hands-on 20-node cluster setup in our lab
· Hands-on installation of Hadoop and all its ecosystem components on your laptop
· Well-documented Hadoop material covering all the topics in the course
· A well-documented Hadoop blog containing frequently asked interview questions with answers, plus the latest updates on Big Data technology
· Real-time project explanations
· Mock interviews conducted on a one-to-one basis
· Hadoop interview questions discussed on a daily basis
· Resume preparation with POCs or projects based on your experience