Introduction to Hadoop Zookeeperedureka!
Apache ZooKeeper Introduction
Hadoop’s distributed coordination service is called ZooKeeper.
what ZooKeeper does
do is give you a set of tools to build distributed applications that can safely handle
ZooKeeper also has the following characteristics:
ZooKeeper is simple
ZooKeeper is, at its core, a stripped-down filesystem that exposes a few simple
operations, and some extra abstractions such as ordering and notifications.
ZooKeeper is expressive
The ZooKeeper primitives are a rich set of building blocks that can be used to build
a large class of coordination data structures and protocols.
Group Membership in ZooKeeper
One way of understanding ZooKeeper is to think of it as providing a high-availability
filesystem. It doesn’t have files and directories, but a unified concept of a node, called
a znode, which acts both as a container of data (like a file) and a container of other
znodes (like a directory).
Joining a Group
The next part of the application is a program to register a member in a group. Each
member will run as a program and join a group. When the program exits, it should be
removed from the group, which we can do by creating an ephemeral znode that represents
it in the ZooKeeper namespace.
ZooKeeper command-line tools
ZooKeeper maintains a hierarchical tree of nodes called znodes. A znode stores data
and has an associated ACL. ZooKeeper is designed for coordination (which typically
uses small data files), not high-volume data storage, so there is a limit of 1 MB on the
amount of data that may be stored in any znode.
Building Applications with ZooKeeper
A Configuration Service
One of the most basic services that a distributed application needs is a configuration
service so that common pieces of configuration information can be shared by machines
in a cluster. At the simplest level, ZooKeeper can act as a highly available store for
configuration, allowing application participants to retrieve or update configuration
files. Using ZooKeeper watches, it is possible to create an active configuration service,
where interested clients are notified of changes in configuration.
Recoverable exceptions are those from which the application can
recover within the same ZooKeeper session. A recoverable exception is manifested by
KeeperException.ConnectionLossException, which means that the connection to
ZooKeeper has been lost. ZooKeeper will try to reconnect, and in most cases the reconnection
will succeed and ensure that the session is intact.
A Lock Service
A distributed lock is a mechanism for providing mutual exclusion between a collection
of processes. At any one time, only a single process may hold the lock. Distributed locks
can be used for leader election in a large distributed system, where the leader is the
process that holds the lock at any point in time.
More Distributed Data Structures and Protocols
There are many distributed data structures and protocols that can be built with Zoo-
Keeper, such as barriers, queues, and two-phase commit. One interesting thing to note
is that these are synchronous protocols, even though we use asynchronous ZooKeeper
primitives (such as notifications) to build them.
Resilience and Performance
ZooKeeper machines should be located to minimize the impact of machine and network
failure. In practice, this means that servers should be spread across racks, power supplies,
and switches, so that the failure of any one of these does not cause the ensemble
to lose a majority of its servers.
Excerpts from Hadoop: The Definitive Guide, Tom White, Pub by O'Reilly
Hadoop Notes and Video Lectures
What is Hadoop? Text and Video Lectures
What is MapReduce? Text and Video Lectures
The Hadoop Distributed Filesystem (HDFS)
Hadoop Input - Output System