Kalyan Hadoop Training in Hyderabad @ ORIEN IT, Ameerpet, 040 65142345 , 9703202345

Monday, 16 November 2015

How to Install Zookeeper in Ubuntu @ Kalyan Hadoop Training in Hyderabad

Sunday, 11 October 2015

Setting up Apache ZooKeeper Cluster

This tutorial provides step-by-step instructions to configure and start an Apache ZooKeeper 3.4.6 multi-node cluster.

Pre-requisites for zookeeper

The first thing we need in order to install Apache ZooKeeper is a set of machines.

In this tutorial, we will use the following virtual machines to install Apache ZooKeeper -

Parameter Name	Server 1	Server 2	Server 3
Host Name	node1	node2	node3
IP Address	192.168.0.1	192.168.0.2	192.168.0.3
Operating System	Ubuntu	Ubuntu	Ubuntu
No of CPU Cores	4	4	4
RAM	6 GB	6 GB	6 GB

Apart from the above machines, please ensure that the following pre-requisites are fulfilled so that you can follow this article without any issues -

JDK 6 or higher installed on all the virtual machines
JAVA_HOME variable set to the path where the JDK is installed
Root access on all the virtual machines, as all the steps should ideally be performed by the root user
Updated /etc/hosts file on all three servers with the details below

192.168.0.1 node1
192.168.0.2 node2
192.168.0.3 node3

Installing Apache Zookeeper

The first step in installing Apache ZooKeeper is to download its binaries on all three servers. In this article, we will install Apache ZooKeeper 3.4.6 to set up the cluster; it can be downloaded from the Apache ZooKeeper website.

Once the archive has been downloaded on the servers, extract it to the directory where you would like ZooKeeper to be installed. We will refer to this directory as $ZOOKEEPER_HOME throughout this tutorial.
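The extraction step can be sketched as a small shell function. This is only an illustration under assumptions that are not from the article: the archive is named zookeeper-3.4.6.tar.gz, it unpacks into a directory of the same name, and the install prefix (for example /opt) already exists. The function name extract_zookeeper is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: unpack a ZooKeeper release archive and print the resulting home directory.
# Assumption: the archive unpacks into a top-level directory named after the
# archive itself (zookeeper-3.4.6.tar.gz -> zookeeper-3.4.6/).

extract_zookeeper() {
    local archive="$1"   # path to the downloaded .tar.gz file
    local prefix="$2"    # existing directory to extract into (assumed, e.g. /opt)
    local name
    name=$(basename "$archive" .tar.gz)
    tar -xzf "$archive" -C "$prefix" || return 1
    # Print the directory that this tutorial calls $ZOOKEEPER_HOME.
    echo "$prefix/$name"
}
```

On each server, the result could then be exported, for example: export ZOOKEEPER_HOME=$(extract_zookeeper zookeeper-3.4.6.tar.gz /opt)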

Configuring Multi-node Cluster
Once Apache ZooKeeper has been extracted on all the servers, the next step is to configure them.

We don't need to mark any node as the Leader node during configuration, as the leader is chosen automatically by the ZooKeeper service. So, the configuration file will be the same for all the nodes.

The first part of the configuration involves creating or updating a configuration file called zoo.cfg in the $ZOOKEEPER_HOME/conf directory with the following contents:

ZooKeeper Configuration - $ZOOKEEPER_HOME/conf/zoo.cfg

tickTime=2000
initLimit=10
syncLimit=5
# Where you would like ZooKeeper to save its data
dataDir=$ZOOKEEPER_HOME/data
# Where you would like ZooKeeper to log
dataLogDir=$ZOOKEEPER_HOME/logs
clientPort=2181
server.1=192.168.0.1:2888:3888
server.2=192.168.0.2:2888:3888
server.3=192.168.0.3:2888:3888

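The first thing you need to do in the above zoo.cfg file is replace the values of dataDir and dataLogDir with the directories where you would like ZooKeeper to save its data and its transaction logs respectively. Use absolute paths: zoo.cfg is read as a plain properties file, so a shell variable such as $ZOOKEEPER_HOME is not expanded there. Now, let's talk about some of the important parts of the above configuration.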
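Before going through the individual entries, a note on the three timing values at the top of the file. tickTime is the basic time unit in milliseconds, while initLimit and syncLimit are counted in ticks: initLimit bounds how long followers may take to connect and sync to the leader, and syncLimit bounds how far a follower may lag behind it. A minimal sketch of the arithmetic, with hypothetical helper names cfg_value and limit_ms:

```shell
#!/usr/bin/env bash
# Sketch: convert a tick-based limit from zoo.cfg into milliseconds.
# The helper names are illustrative and not part of ZooKeeper.

# Print the value of a key=value entry from a config file (last match wins).
cfg_value() {
    local file="$1" key="$2"
    grep -E "^${key}=" "$file" | tail -n 1 | cut -d= -f2-
}

# Print a tick-based limit (initLimit or syncLimit) in milliseconds.
limit_ms() {
    local file="$1" key="$2"
    local tick ticks
    tick=$(cfg_value "$file" tickTime)
    ticks=$(cfg_value "$file" "$key")
    echo $(( tick * ticks ))
}
```

With the values used in this tutorial, limit_ms on initLimit gives 20000 (20 seconds) and on syncLimit gives 10000 (10 seconds).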

The clientPort property, as the name suggests, is the port on which clients connect to the ZooKeeper service.

Next, let's talk about the last three entries, which are in server.x=hostname:port1:port2 format.

Firstly, there are two port numbers, port1 (2888) and port2 (3888). Followers use the first port to connect to the leader, and the second port is used for leader election.

Secondly, x in server.x denotes the id of the node.

Each server.x row must have a unique id. Each server is assigned its id by creating a file named myid, one for each server, which resides in that server's data directory, as specified by the configuration file parameter dataDir.

The myid file consists of a single line containing only the text of that machine's id. So the myid file of server 1 would contain the text 1 and nothing else.

The id must be unique within the ensemble and should have a value between 1 and 255.
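Creating the myid file can be sketched as follows. The function name write_myid is hypothetical, and the data directory passed to it is assumed to be the same absolute path configured as dataDir in zoo.cfg.

```shell
#!/usr/bin/env bash
# Sketch: write the myid file for one server.
# write_myid is an illustrative helper, not a ZooKeeper tool.

write_myid() {
    local data_dir="$1"   # must match dataDir in zoo.cfg
    local id="$2"         # must match the x of this host's server.x entry

    # Reject anything that is not a plain non-negative integer.
    case "$id" in
        ''|*[!0-9]*) echo "id must be a number" >&2; return 1 ;;
    esac
    # Enforce the 1 to 255 range described above.
    if [ "$id" -lt 1 ] || [ "$id" -gt 255 ]; then
        echo "id must be between 1 and 255" >&2
        return 1
    fi

    mkdir -p "$data_dir" || return 1
    # The file holds the id on a single line and nothing else.
    echo "$id" > "$data_dir/myid"
}
```

With the configuration above, node1 (server.1) would call write_myid with id 1, node2 with id 2, and node3 with id 3.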

Starting Up Multi-node Cluster

Once you are all set up, the next step is to start the cluster.

On all the servers, go to the bin directory of Apache ZooKeeper and execute the following command -

$ZOOKEEPER_HOME/bin on all machines

./zkServer.sh start

You can execute the following command to check the status of Apache ZooKeeper -

$ZOOKEEPER_HOME/bin on all machines

./zkServer.sh status
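To see which role each server has taken, the status output can be filtered. The sketch below assumes the output contains a line of the form "Mode: leader" or "Mode: follower", which is what 3.4.x releases print for a running ensemble member; the helper name zk_mode is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: extract the server role from `zkServer.sh status` output read on stdin.
# Assumption: the output contains a line such as "Mode: follower".

zk_mode() {
    # Split on the colon and any following spaces, print the value of the Mode line.
    awk -F': *' '/^Mode:/ { print $2 }'
}
```

A typical call would be: ./zkServer.sh status 2>&1 | zk_mode. In a healthy three-node ensemble, one server should report leader and the other two follower.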

Stopping Multi-node Cluster

To stop Apache ZooKeeper, execute the following command on all the servers -

$ZOOKEEPER_HOME/bin on all machines

./zkServer.sh stop

Saturday, 10 October 2015

How ZooKeeper Works

ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical name space of data registers, known as znodes. Every znode is identified by a path, with path elements separated by a slash (“/”). Aside from the root, every znode has a parent, and a znode cannot be deleted if it has children.
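The path rule can be illustrated with a small sketch that derives the parent of a znode path. The helper name znode_parent and the example paths (/app/config/db) are hypothetical and only show how the hierarchy is read from the path itself.

```shell
#!/usr/bin/env bash
# Sketch: compute the parent path of a znode.
# The root "/" is the only znode without a parent.

znode_parent() {
    local path="$1"
    if [ "$path" = "/" ]; then
        return 1   # the root has no parent
    fi
    # Drop the last path element, e.g. /app/config/db -> /app/config
    local parent="${path%/*}"
    # A top-level znode such as /app strips to an empty string: its parent is the root.
    echo "${parent:-/}"
}
```

So /app/config/db has the parent /app/config, and /app cannot be deleted while /app/config still exists.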

This is much like a normal file system, but ZooKeeper provides superior reliability through redundant services. A service is replicated over a set of machines, and each machine maintains an in-memory image of the data tree and transaction logs. Clients connect to a single ZooKeeper server and maintain a TCP connection through which they send requests and receive responses.

This architecture allows ZooKeeper to provide high throughput and availability with low latency, but the size of the database that ZooKeeper can manage is limited by memory.

What ZooKeeper Does

ZooKeeper provides a very simple interface and services. ZooKeeper brings these key benefits:

Fast. ZooKeeper is especially fast with workloads where reads to the data are more common than writes. The ideal read/write ratio is about 10:1.
Reliable. ZooKeeper is replicated over a set of hosts (called an ensemble) and the servers are aware of each other. As long as a critical mass of servers is available, the ZooKeeper service will also be available. There is no single point of failure.
Simple. ZooKeeper maintains a standard hierarchical name space, similar to files and directories.
Ordered. The service maintains a record of all transactions, which can be used for higher-level abstractions, like synchronization primitives.

What is ZooKeeper?

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them, which makes them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.

Learn more about ZooKeeper on the ZooKeeper Wiki.

Introduction to Apache ZooKeeper

This article introduces you to Apache ZooKeeper by providing details of its technical architecture. It also talks about its benefits along with the use cases it could be utilized in.

Abstract

ZooKeeper, at its core, provides an API to let you manage your application state in a highly read-dominant concurrent and distributed environment. It is optimized for and performs well in the scenario where read operations greatly outnumber write operations.

As Apache defines it, ZooKeeper is a distributed, open-source coordination service for distributed applications. It exposes a simple set of primitives that distributed applications can build upon to implement higher level services for synchronization, configuration maintenance, and groups and naming.

It is implemented in Java and has bindings for both Java and C. It organizes its data in a tree structure, like that of a file system, which is shared among its nodes.

Technical Architecture

It is time to discuss the technical architecture of ZooKeeper. The following diagram depicts it.

The diagram shows the following two types of nodes -

Leader Node - The Leader Node is the only node responsible for processing write requests. All other nodes, called followers, simply forward client write calls to the Leader Node. We don't mark any node as leader while setting up an Apache ZooKeeper cluster. Instead, the leader is elected internally among all the nodes of the cluster. Apache ZooKeeper uses the concept of majority for this, i.e. the node that receives the votes of a majority of the nodes is elected as Leader.
This is the basis of the recommendation to have an odd number of nodes in a cluster for best failover and availability. For example, if we create a cluster of four nodes and two nodes go offline, Apache ZooKeeper will be down: only half of the nodes remain, so it is not possible to gain a majority for leader election. However, if we create a cluster of five nodes, Apache ZooKeeper will still be functional even if two nodes go offline, as three of the five nodes, a majority, are still in service.
Follower Nodes - All nodes other than the Leader are called Follower Nodes. A follower node is capable of servicing read requests on its own. For write requests, it goes through the Leader Node. Followers also play the important role of electing a new leader if the existing leader node goes down.
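The majority rule behind the odd-number recommendation can be checked with a little arithmetic. This is a sketch of the rule as described above, not ZooKeeper's own code; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: majority (quorum) arithmetic for an ensemble of n nodes.

# Smallest number of nodes that forms a strict majority of n.
quorum_size() {
    local n="$1"
    echo $(( n / 2 + 1 ))
}

# Number of nodes that can go offline while a majority remains.
tolerated_failures() {
    local n="$1"
    echo $(( n - (n / 2 + 1) ))
}
```

A four-node cluster needs 3 nodes for a majority and so tolerates only 1 failure, the same as a three-node cluster. A five-node cluster also needs 3 nodes but tolerates 2 failures, which is why adding a fourth node brings no extra fault tolerance.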

Here is a brief description of the node components shown in the architecture diagram. Please note that these are not the only components in the nodes.

Request Processor - This component is only active in the Leader Node and is responsible for processing write requests originating from clients or follower nodes. Once the request processor processes a write request, it broadcasts the changes to the follower nodes so that they can update their state accordingly.
Atomic Broadcast - This component is present in both the Leader Node and the Follower Nodes. It is responsible for broadcasting changes to other nodes (in the Leader Node) as well as receiving change notifications (in Follower Nodes).
In-memory Database (Replicated Database) - This in-memory, replicated database is responsible for storing the data in ZooKeeper. Every node contains its own database, which enables it to serve read requests. In addition to this, data is also written to the file system, providing recoverability in case of any problems with the cluster. For write requests, the in-memory database is updated only after the change has been successfully written to the file system.

Benefits

Apache ZooKeeper can help you reap the following benefits if your applications use it for the right cases (please see the next section on this) -

Simple Design
Fast Processing
Data Replication
Atomic and Ordered Updates
Reliability

Possible Use Cases

Apache ZooKeeper, being a coordination service, is suitable for, but not limited to, the following scenarios -

Synchronization primitives such as barriers and queues for a distributed environment
Multi-machine cluster management
Coordination and failure recovery service
Automatic leader selection

Below are some instances where Apache ZooKeeper is being utilized -

Apache Storm, being a real-time stateless processing/computing framework, manages its state in the ZooKeeper service
Apache Kafka uses it for choosing the leader node for topic partitions
Apache YARN relies on it for the automatic failover of the resource manager (master node)
Yahoo! utilizes it as the coordination and failure recovery service for Yahoo! Message Broker, which is a highly scalable publish-subscribe system managing thousands of topics for replication and data delivery. It is also used by the Fetching Service for the Yahoo! crawler, where it manages failure recovery as well.

References

Apache ZooKeeper

Thank you for reading through the tutorial. In case of any questions/concerns, you can communicate same to us through your comments and we shall get back to you as soon as possible.
