Computer Science and Engineering Knowledge Center: The Hadoop Distributed Filesystem (HDFS)

The Hadoop Distributed Filesystem

When a dataset outgrows the storage capacity of a single physical machine, it becomes necessary to partition it across a number of separate machines. Filesystems that manage the storage across a network of machines are called distributed filesystems. Since they are network-based, the complications of network programming have to be managed. Therefore, distributed file systems are more complex than regular disk file systems. One of the biggest challenges is making the filesystem tolerate node failure without suffering data loss.

Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed Filesystem.

HDFS Concepts

Blocks
A disk has a block size, which is the minimum amount of data that it can read or write. Filesystems for a single disk build on this by dealing with data in blocks, which are an integral multiple of the disk block size. Filesystem blocks are typically a few kilobytesin size, while disk blocks are normally 512 bytes. HDFS, too, has the concept of a block, but it is a much larger unit—64 MB by default.
Like in a filesystem for a single disk, files in HDFS are broken into block-sized chunks,which are stored as independent units.

Namenodes and Datanodes
An HDFS cluster has two types of node operating in a master-worker pattern: a namenode(the master) and a number of datanodes (workers). The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree.

Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of blocks that they are storing.

HDFS Federation

HDFS Federation, introduced in the 0.23 release series, allows a cluster to scale by adding namenodes, each of which manages a portion of the filesystem namespace.

HDFS High-Availability

The 0.23 release series of Hadoop provides support for HDFS high-availability (HA). In this implementation there is a pair of namenodes in an activestandby configuration. In the event of the failure of the active namenode, the standby takes over its duties to continue servicing client requests without a significant interruption.

The Command-Line Interface

Basic Filesystem Operations

Interfaces
Hadoop is written in Java, and all Hadoop filesystem interactions are mediated through the Java API. The filesystem shell, for example, is a Java application that uses the Java FileSystem class to provide filesystem operations.

HTTP

There are two ways of accessing HDFS over HTTP: directly, where the HDFS daemons serve HTTP requests to clients; and via a proxy (or proxies), which accesses HDFS onthe client’s behalf using the usual DistributedFileSystem API.

FUSE
Filesystem in Userspace (FUSE) allows filesystems that are implemented in user space to be integrated as a Unix filesystem.

Writing Data
The FileSystem class has a number of methods for creating a file. The simplest is the method that takes a Path object for the file to be created and returns an output stream

Directories
FileSystem provides a method to create a directory:

Data Flow
Anatomy of a File Read

Step 1
The client opens the file it wishes to read by calling open() on the FileSystem object,
which for HDFS is an instance of DistributedFileSystem .

Step 2
DistributedFileSystem calls the namenode, using RPC, to determine the locations of
the blocks for the first few blocks in the file . For each block, the namenode
returns the addresses of the datanodes that have a copy of that block. Furthermore, the
datanodes are sorted according to their proximity to the client (according to the topology
of the cluster’s network;). If
the client is itself a datanode (in the case of a MapReduce task, for instance), then it
will read from the local datanode, if it hosts a copy of the block.
The DistributedFileSystem returns an FSDataInputStream (an input stream that supports
file seeks) to the client for it to read data from. FSDataInputStream in turn wraps
a DFSInputStream, which manages the datanode and namenode I/O.

Step 3
The client then calls read() on the stream. DFSInputStream, which has stored
the datanode addresses for the first few blocks in the file, then connects to the first
(closest) datanode for the first block in the file.

Step 4
Data is streamed from the datanode back to the client, which calls read() repeatedly on the stream .

Step 5
When the endof the block is reached, DFSInputStream will close the connection to the datanode, then
find the best datanode for the next block. Blocks are read in order with the DFSInputStream opening new connections to datanodes as the client reads through the stream. It will also call the namenode to retrieve the datanode locations for the next batch of blocks as needed.

Step 6
When the client has finished reading, it calls close() on the FSDataInputStream.

Anatomy of a File Write

Step 1
The client creates the file by calling create() on DistributedFileSystem.

Step 2
DistributedFileSystem makes an RPC call to the namenode to create a new
file in the filesystem’s namespace, with no blocks associated with it. The namenode
performs various checks to make sure the file doesn’t already exist, and that the
client has the right permissions to create the file. If these checks pass, the namenode
makes a record of the new file; otherwise, file creation fails and the client is thrown an
IOException. The DistributedFileSystem returns an FSDataOutputStream for the client
to start writing data to. Just as in the read case, FSDataOutputStream wraps a DFSOutput
Stream, which handles communication with the datanodes and namenode.

Step 3
As the client writes data, DFSOutputStream splits it into packets, which it writes
to an internal queue, called the data queue. The data queue is consumed by the Data
Streamer, whose responsibility it is to ask the namenode to allocate new blocks by
picking a list of suitable datanodes to store the replicas. The list of datanodes forms a
pipeline—we’ll assume the replication level is three, so there are three nodes in the
pipeline. The DataStreamer streams the packets to the first datanode in the pipeline,
which stores the packet and forwards it to the second datanode in the pipeline.

Step 4
Similarly, the second datanode stores the packet and forwards it to the third (and last)
datanode in the pipeline.

Step 5.
DFSOutputStream also maintains an internal queue of packets that are waiting to be
acknowledged by datanodes, called the ack queue. A packet is removed from the ack
queue only when it has been acknowledged by all the datanodes in the pipeline.

Step 6
When the client has finished writing data, it calls close() on the stream.

Step 7
This action flushes all the remaining packets to the datanode pipeline and waits for acknowledgments
before contacting the namenode to signal that the file is complete. The namenode already knows which blocks the file is made up of (via Data Streamer asking for block allocations), so it only has to wait for blocks to be minimally replicated before returning successfully.

Coherency Model
A coherency model for a filesystem describes the data visibility of reads and writes for a file. HDFS trades off some POSIX requirements for performance, so some operations may behave differently than you expect them to.

Parallel Copying with distcp

Hadoop comes with a useful program called distcp for copying large amounts of data to and from Hadoop filesystems in parallel.

Keeping an HDFS Cluster Balanced
When copying data into HDFS, it’s important to consider cluster balance. HDFS works best when the file blocks are evenly spread across the cluster, so you want to ensure that distcp doesn’t disrupt this

Hadoop Archives
Hadoop Archives, or HAR files, are a file archiving facility that packs files into HDFS blocks more efficiently, thereby reducing namenode memory usage while still allowing transparent access to files. In particular, Hadoop Archives can be used as input to MapReduce

Excertps from Hadoop: The Definitive Guide, Tom White, Pub by O'Reilly