Tuesday, May 10, 2016

MapReduce Advanced Features

MapReduce Advanced Features

There are often things you would like to know about the data you are analyzing but
that are peripheral to the analysis you are performing.Counters are a useful channel for gathering statistics about the job: for quality control
or for application level-statistics. They are also useful for problem diagnosis.

Built-in Counters
Hadoop maintains some built-in counters for every job, which report various metrics
for your job. For example, there are counters for the number of bytes and records
processed, which allows you to confirm that the expected amount of input was consumed
and the expected amount of output was produced.

Task counters
Task counters gather information about tasks over the course of their execution, and
the results are aggregated over all the tasks in a job. For example, the
MAP_INPUT_RECORDS counter counts the input records read by each map task and aggregates
over all map tasks in a job, so that the final figure is the total number of input
records for the whole job.

Job counters
Job counters  are maintained by the jobtracker (or application master in
YARN), so they don’t need to be sent across the network, unlike all other counters,
including user-defined ones. They measure job-level statistics, not values that change
while a task is running. For example, TOTAL_LAUNCHED_MAPS counts the number of map
tasks that were launched over the course of a job (including ones that failed).

User-Defined Java Counters
MapReduce allows user code to define a set of counters, which are then incremented
as desired in the mapper or reducer. Counters are defined by a Java enum, which serves
to group related counters. A job may define an arbitrary number of enums, each with
an arbitrary number of fields. The name of the enum is the group name, and the enum’s
fields are the counter names. Counters are global: the MapReduce framework aggregates
them across all maps and reduces to produce a grand total at the end of the job.

Hadoop Map Reduce Development - Counters - Introduction


The ability to sort data is at the heart of MapReduce. Even if your application isn’t
concerned with sorting per se, it may be able to use the sorting stage that MapReduce
provides to organize its data.

Partial Sort

Total Sort
It is possible to produce a set of sorted files that, if concatenated, would form
a globally sorted file. The secret to doing this is to use a partitioner that respects the
total order of the output.

Secondary Sort

To do a secondary sort in Streaming, we can take advantage of a couple of library classes
that Hadoop provides.

MapReduce can perform joins between large datasets, but writing the code to do joins
from scratch is fairly involved. Rather than writing MapReduce programs, you might
consider using a higher-level framework such as Pig, Hive, or Cascading, in which join
operations are a core part of the implementation.

Map-Side Joins
A map-side join between large inputs works by performing the join before the data
reaches the map function. For this to work, though, the inputs to each map must be
partitioned and sorted in a particular way. Each input dataset must be divided into the
same number of partitions, and it must be sorted by the same key (the join key) in each
source. All the records for a particular key must reside
in the same partition.

Reduce-Side Joins
A reduce-side join is more general than a map-side join, in that the input datasets don’t
have to be structured in any particular way, but it is less efficient as both datasets have
to go through the MapReduce shuffle. The basic idea is that the mapper tags each record
with its source and uses the join key as the map output key, so that the records with
the same key are brought together in the reducer.

Joins in Hadoop Mapreduce | Mapside Joins | Reduce Side Joins | Hadoop Mapreduce Tutorial


Side Data Distribution
Side data can be defined as extra read-only data needed by a job to process the main
dataset. The challenge is to make side data available to all the map or reduce tasks
(which are spread across the cluster) in a convenient and efficient fashion.

Distributed Cache
Rather than serializing side data in the job configuration, it is preferable to distribute
datasets using Hadoop’s distributed cache mechanism. This provides a service for
copying files and archives to the task nodes in time for the tasks to use them when they
run. To save network bandwidth, files are normally copied to any particular node once
per job.

MapReduce Library Classes
Hadoop comes with a library of mappers and reducers for commonly used functions.

Excerpts from  Hadoop: The Definitive Guide, Tom White, Pub by O'Reilly

Hadoop Notes and Video Lectures

What is Hadoop? Text and Video Lectures

What is MapReduce? Text and Video Lectures

The Hadoop Distributed Filesystem (HDFS)

Hadoop Input - Output System

No comments:

Post a Comment