Tuesday, May 10, 2016

MapReduce Types and Formats

MapReduce Types and Formats

MapReduce has a simple model of data processing: inputs and outputs for the map and
reduce functions are key-value pairs. This chapter looks at the MapReduce model in
detail and, in particular, how data in various formats, from simple text to structured
binary objects, can be used with this model.
MapReduce Types
The map and reduce functions in Hadoop MapReduce have the following general form:
map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
In general, the map input key and value types (K1 and V1) are different from the map
output types (K2 and V2). However, the reduce input must have the same types as the
map output, although the reduce output types may be different again (K3 and V3).

Input Formats
Hadoop can process many different types of data formats, from flat text files to databases.

Input Splits and Records
As we saw in Chapter 2, an input split is a chunk of the input that is processed by a
single map. Each map processes a single split. Each split is divided into records, and
the map processes each record—a key-value pair—in turn. Splits and records are logical:
there is nothing that requires them to be tied to files, for example, although in their
most common incarnations, they are. In a database context, a split might correspond
to a range of rows from a table and a record to a row in that range

FileInputFormat is the base class for all implementations of InputFormat that use files
as their data source. It provides two things: a place to define which files
are included as the input to a job, and an implementation for generating splits for the
input files. The job of dividing splits into records is performed by subclasses.

Small files and CombineFileInputFormat
Hadoop works better with a small number of large files than a large number of small
files. One reason for this is that FileInputFormat generates splits in such a way that each
split is all or part of a single file. If the file is very small (“small” means significantly
smaller than an HDFS block) and there are a lot of them, then each map task will process
very little input, and there will be a lot of them (one per file), each of which imposes
extra bookkeeping overhead.

Text Input
Hadoop excels at processing unstructured text. Different InputFormats are provided to process text in Hadoop.

TextInputFormat is the default InputFormat. Each record is a line of input. The key, a
LongWritable, is the byte offset within the file of the beginning of the line. The value is
the contents of the line, excluding any line terminators (newline, carriage return), and
is packaged as a Text object.

If you want your mappers to receive a fixed number of lines of input, then
NLineInputFormat is the InputFormat to use. Like TextInputFormat, the keys are the byte
offsets within the file and the values are the lines themselves. N refers to the number of lines of input that each mapper receives. With N set to
one (the default), each mapper receives exactly one line of input. The mapre
duce.input.lineinputformat.linespermap property (mapred.line.input.format.line
spermap in the old API) controls the value of N.

Binary Input
Hadoop MapReduce is not just restricted to processing textual data—it has support
for binary formats, too.
Hadoop’s sequence file format stores sequences of binary key-value pairs. Sequence
files are well suited as a format for MapReduce data since they are splittable (they have
sync points so that readers can synchronize with record boundaries from an arbitrary
point in the file, such as the start of a split), they support compression as a part of the
format, and they can store arbitrary types using a variety of serialization frameworks.

Database Input (and Output)
DBInputFormat is an input format for reading data from a relational database, using
JDBC. Because it doesn’t have any sharding capabilities, you need to be careful not to
overwhelm the database you are reading from by running too many mappers. For this
reason, it is best used for loading relatively small datasets, perhaps for joining with
larger datasets from HDFS, using MultipleInputs. The corresponding output format is
DBOutputFormat, which is useful for dumping job outputs (of modest size) into a

Output Formats
Hadoop has output data formats that correspond to the input formats

Excerpts from  Hadoop: The Definitive Guide, Tom White, Pub by O'Reilly

MapReduce Types and Formats in hadoop


Hadoop Notes and Video Lectures

What is Hadoop? Text and Video Lectures

What is MapReduce? Text and Video Lectures

The Hadoop Distributed Filesystem (HDFS)

Hadoop Input - Output System

No comments:

Post a Comment