Tuesday, May 10, 2016

What is MapReduce? Text and Video Lectures

MapReduce

MapReduce is a programming model for data processing. Hadoop can run MapReduce programs written in various languages; like Java, Ruby, Python, and C++.  MapReduce programs are inherently
parallel, thus putting very large-scale data analysis into the hands of anyone with enough machines at their disposal. MapReduce comes into its own for large datasets.

Map and Reduce
MapReduce works by breaking the processing into two phases: the map phase and the reduce phase. Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer. The programmer also specifies two functions: the map function and the reduce function.



To scale out, we need to store the data in a distributed filesystem, typically HDFS , to allow Hadoop to move the MapReduce computation to each machine hosting a part of the data.


Data Flow
A MapReduce job is a unit of work that the client wants to be performed: it consists of the input data, the MapReduce program, and configuration information. Hadoop runs the job by dividing it into tasks, of which there are two types: map tasks and reduce tasks.



Hadoop does its best to run the map task on a node where the input data resides in
HDFS. This is called the data locality optimization since it doesn’t use valuable cluster
bandwidth.

What is MapReduce?
IBM Analytics
_________________

_________________

Introduction to MapReduce
Hortonworks
_________________

__________________



Hadoop Notes and Video Lectures


What is Hadoop? Text and Video Lectures

What is MapReduce? Text and Video Lectures

The Hadoop Distributed Filesystem (HDFS)

Hadoop Input - Output System




No comments:

Post a Comment