Wednesday, May 11, 2016

Administering Hadoop - Introduction

HDFS - Persistent Data Structures

Administrators of Hadoop must have a basic understanding of how the components
of HDFS—the namenode, the secondary namenode, and the datanodes—
organize their persistent data on disk. Knowing which files are which can help in
diagnosing problems or spotting anomalies.

Audit Logging

HDFS can log all filesystem access requests, a feature that some organizations
require for auditing purposes. Audit logging is implemented using log4j logging
at the INFO level, and it is disabled in the default configuration.
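Audit logging can be enabled by raising the level of the namenode's audit logger in log4j.properties (a configuration sketch; the logger name below is the HDFS namenode audit logger):

```properties
# Enable HDFS audit logging (disabled by default; each access request
# then produces one line in the namenode's log)
log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=INFO
```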


dfsadmin

The dfsadmin tool is a multipurpose tool for finding information about the state of
HDFS, as well as for performing administration operations on HDFS.
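A few representative invocations, as a sketch (these assume a running cluster and the classic `hadoop dfsadmin` entry point; newer releases use `hdfs dfsadmin`):

```shell
# Show basic filesystem statistics and per-datanode status
hadoop dfsadmin -report

# Check whether the namenode is in safe mode
hadoop dfsadmin -safemode get

# Dump namenode data structures (e.g., blocks awaiting replication)
# to a file in the namenode's log directory
hadoop dfsadmin -metasave meta.txt
```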

Filesystem check (fsck)
Hadoop provides an fsck utility for checking the health of files in HDFS. The tool looks
for blocks that are missing from all datanodes, as well as for under- or over-replicated
blocks.
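For example (a sketch, assuming a running cluster; fsck is run against a path, with `/` covering the whole filesystem):

```shell
# Check the whole filesystem, listing files, their blocks, and rack placement
hadoop fsck / -files -blocks -racks

# For corrupt files: move them to /lost+found, or delete them outright
hadoop fsck / -move
hadoop fsck / -delete
```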


Monitoring

Monitoring is an important part of system administration.
The purpose of monitoring is to detect when the cluster is not providing the expected
level of service.

Logging

All Hadoop daemons produce logfiles that can be very useful for finding out what is
happening in the system.

Getting stack traces
Hadoop daemons expose a web page (/stacks in the web UI) that produces a thread
dump for all running threads in the daemon’s JVM.
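The page can be fetched with any HTTP client; for example (a sketch: the hostname is an assumption, and 50070 is the classic namenode HTTP port):

```shell
# Fetch a thread dump from the namenode's embedded web server
curl http://namenode:50070/stacks
```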

Metrics

The HDFS and MapReduce daemons collect information about events and measurements
that are collectively known as metrics. For example, datanodes collect metrics such as
the number of bytes written, the number of blocks replicated, and the number of read
requests from clients (both local and remote).
Metrics belong to a context; Hadoop currently uses “dfs”, “mapred”, “rpc”, and
“jvm” contexts. Hadoop daemons usually collect metrics under several contexts. For
example, datanodes collect metrics for the “dfs”, “rpc”, and “jvm” contexts.
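Contexts are wired to output sinks in hadoop-metrics.properties. A configuration sketch for the original metrics system, writing “dfs” context metrics to a local file (the file path is an assumption):

```properties
# Write "dfs" context metrics to a local file every 10 seconds
dfs.class=org.apache.hadoop.metrics.file.FileContext
dfs.period=10
dfs.fileName=/tmp/dfsmetrics.log
```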


Routine Administration Procedures
Metadata backups
If the namenode’s persistent metadata is lost or damaged, the entire filesystem is rendered
unusable, so it is critical that backups are made of these files. You should keep
multiple copies of different ages (one hour, one day, one week, and one month, say) to
protect against corruption, either in the copies themselves or in the live files running
on the namenode.
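One simple way to take such a backup is to fetch the fsimage from the namenode over HTTP and keep timestamped copies. This is a sketch only: the hostname, the classic 50070 port, the getimage servlet, and the backup directory are all assumptions that depend on your Hadoop version and setup.

```shell
# Fetch the latest fsimage from the namenode and keep a timestamped copy
BACKUP_DIR=/backup/namenode
mkdir -p "$BACKUP_DIR"
curl -o "$BACKUP_DIR/fsimage.$(date +%Y%m%d%H%M)" \
  "http://namenode:50070/getimage?getimage=1"
```

Run from cron at different frequencies (hourly, daily, weekly), this yields the multiple copies of different ages described above.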

Data backups
Although HDFS is designed to store data reliably, data loss can occur, just as in any
storage system, so a backup strategy is essential. The key is to prioritize the data to be
backed up. The highest priority is the data that cannot be regenerated and that is
critical to the business.

The distcp tool is ideal for making backups to other HDFS clusters or
other Hadoop filesystems (such as S3 or KFS), since it can copy files in parallel. Alternatively,
an entirely different storage system can be employed for backups, using one of
the methods for exporting data from HDFS.
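For example (a sketch; the hostnames and paths are assumptions):

```shell
# Parallel copy from the live cluster to a backup HDFS cluster
hadoop distcp hdfs://namenode1/data hdfs://namenode2/backup/data

# On subsequent runs, -update copies only files that have changed
hadoop distcp -update hdfs://namenode1/data hdfs://namenode2/backup/data
```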

Filesystem check (fsck)
It is advisable to run HDFS’s fsck tool regularly (for example, daily) on the whole filesystem
to proactively look for missing or corrupt blocks.

Filesystem balancer
Run the balancer tool regularly to keep the filesystem
datanodes evenly balanced.
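For example (a sketch, assuming a running cluster):

```shell
# Rebalance until no datanode's usage deviates from the cluster average
# by more than the threshold, expressed in percent (10 is the default)
start-balancer.sh -threshold 10
```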

Commissioning and Decommissioning Nodes

As an administrator of a Hadoop cluster, you will need to add or remove nodes from
time to time.
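Decommissioning, for example, is done through an exclude file rather than by simply shutting a datanode down, so that its blocks can be re-replicated first. A sketch (the exclude file path is an assumption; it must match the file named by the `dfs.hosts.exclude` property in hdfs-site.xml):

```shell
# Decommission a datanode: add it to the exclude file, then tell the
# namenode to re-read the file; the node's blocks are re-replicated
# before it is marked "Decommissioned" in the web UI
echo "datanode5.example.com" >> /etc/hadoop/conf/excludes
hadoop dfsadmin -refreshNodes
```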


Excerpts from Hadoop: The Definitive Guide by Tom White, published by O'Reilly.

More references

Hadoop Administration and Maintenance  (includes list of commands)

Building and Administering Hadoop Clusters

Jordan Boyd-Graber, 2011 presentation

Avoiding Common Hadoop Administration Issues

August 12, 2010 By Jeff Bean

Demystifying Hadoop 2.0 - Part 1 | Hadoop Administration Tutorial | Hadoop Admin Tutorial Beginners



