Tuesday, May 10, 2016

What is Hadoop? Text and Video Lectures

Hadoop provides: a reliable shared storage and analysis system. The storage is provided by HDFS and analysis by MapReduce.

MapReduce works on the premise that the entire dataset, or at least a good portion of it, is processed for each query. MapReduce is a batch query processor: it can run an ad hoc query against your whole dataset and return the results in a reasonable time.

MapReduce is a linearly scalable programming model. The programmer writes two
functions—a map function and a reduce function—each of which defines a mapping
from one set of key-value pairs to another.
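The model's semantics can be sketched in plain Python (rather than actual Hadoop Java code) with the classic word-count example; the `run_mapreduce` helper and the names `map_fn` and `reduce_fn` are illustrative stand-ins for what the Hadoop framework itself does, including the shuffle-and-sort step between the two phases:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(_, line):
    # Map: turn one input record (offset, line) into
    # intermediate (word, 1) key-value pairs.
    for word in line.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    # Reduce: collapse all counts for one word into a total.
    yield (word, sum(counts))

def run_mapreduce(records, map_fn, reduce_fn):
    # Shuffle/sort: group intermediate pairs by key, mimicking
    # what the framework does between the map and reduce phases.
    intermediate = sorted(
        pair for key, value in records for pair in map_fn(key, value)
    )
    results = {}
    for word, group in groupby(intermediate, key=itemgetter(0)):
        for k, v in reduce_fn(word, (count for _, count in group)):
            results[k] = v
    return results

lines = [(0, "the quick brown fox"), (1, "the lazy dog")]
print(run_mapreduce(lines, map_fn, reduce_fn))
# "the" appears twice; every other word once
```

Because both functions only map key-value pairs to key-value pairs, the framework is free to run many map tasks and many reduce tasks in parallel across a cluster, which is what makes the model linearly scalable.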

MapReduce was invented by engineers at Google as a system for building production search indexes, because they found themselves solving the same problem over and over again. (MapReduce was inspired by older ideas from the functional programming, distributed computing, and database communities.) It has since been used for many other applications in many other industries. It is pleasantly surprising to see the range of algorithms that can be expressed in MapReduce, from image analysis, to graph-based problems, to machine learning algorithms. It can't solve every problem, of course, but it is a general data-processing tool.

MapReduce is designed to run jobs that last minutes or hours on trusted, dedicated hardware running in a single data center with very high aggregate bandwidth interconnects.

Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project.

Developments that helped shape Hadoop:

In 2003, Google published a paper describing the architecture of its distributed filesystem, called GFS, which was being used in production at Google.

In 2004, Google published the paper that introduced MapReduce to the world.

In February 2008, Yahoo! announced that its production search index was being generated by a 10,000-core Hadoop cluster.

In April 2008, Hadoop broke a world record to become the fastest system to sort a terabyte of data. Running on a 910-node cluster, Hadoop sorted one terabyte in 209 seconds (just under 3½ minutes), beating the previous year’s winner of 297 seconds. In November of the same year, Google reported that its MapReduce implementation sorted one terabyte in 68 seconds. In May 2009, it was announced that a team at Yahoo! used Hadoop to sort one terabyte in 62 seconds.

Introducing Apache Hadoop: The Modern Data Operating System


Apache Hadoop & Big Data 101: The Basics
Cloudera, Inc.


Hadoop Notes and Video Lectures

What is Hadoop? Text and Video Lectures

What is MapReduce? Text and Video Lectures

The Hadoop Distributed Filesystem (HDFS)

Hadoop Input - Output System

Developing a Simple MapReduce Application
