Developing a MapReduce Application - Beginner's Steps
The Configuration API
Components in Hadoop are configured using Hadoop’s own configuration API. An
instance of the Configuration class (found in the org.apache.hadoop.conf package)
represents a collection of configuration properties and their values. Each property is
named by a String, and a value may be one of several types, including Java primitives
such as boolean, int, long, and float, as well as other useful types such as String,
Class, java.io.File, and collections of Strings.
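To make this concrete, here is a minimal sketch of the API in use (the property names color, size, and weight are made up for the example):

import org.apache.hadoop.conf.Configuration;

public class ConfigurationExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Properties can be set programmatically; in practice they usually
    // come from XML resource files added with conf.addResource().
    conf.set("color", "yellow");
    conf.setInt("size", 10);
    // Typed getters convert the stored String to the requested type and
    // fall back to the supplied default when the property is unset.
    String color = conf.get("color");        // "yellow"
    int size = conf.getInt("size", 0);       // 10
    int weight = conf.getInt("weight", 100); // 100 (default; never set)
    System.out.printf("color=%s size=%d weight=%d%n", color, size, weight);
  }
}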
Configuring the Development Environment
The first step is to download the version of Hadoop that you plan to use and unpack
it on your development machine. Then, in your favorite IDE, create a new project and add all the JAR files from the top level of the unpacked distribution and from the lib directory to the classpath. You will then be able to compile Java Hadoop programs and run them in local (standalone) mode within the IDE.
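A quick way to verify the setup is to compile and run a one-line program against the Hadoop JARs (the class name here is invented for the example):

import org.apache.hadoop.util.VersionInfo;

public class HadoopClasspathCheck {
  public static void main(String[] args) {
    // If the classpath is set up correctly, this prints the Hadoop
    // version the IDE is compiling against.
    System.out.println("Hadoop " + VersionInfo.getVersion());
  }
}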
Managing Configuration
When developing Hadoop applications, it is common to switch between running the
application locally and running it on a cluster. You may also have a pseudo-distributed
cluster that you like to test on (one whose daemons all run on the local machine), so
you need a convenient way to select the configuration for each environment you target.
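One common approach is to keep a separate resource file for each environment and load the appropriate one at runtime with Configuration.addResource(). A sketch, in which the resource filenames are assumptions made for the example:

import org.apache.hadoop.conf.Configuration;

public class Environments {
  // hadoop-local.xml (an assumed name) would contain the settings for
  // standalone runs: the local filesystem and the local job runner.
  public static Configuration local() {
    Configuration conf = new Configuration();
    conf.addResource("hadoop-local.xml");
    return conf;
  }

  // hadoop-localhost.xml (an assumed name) would point the filesystem
  // and job submission at daemons running on localhost.
  public static Configuration pseudoDistributed() {
    Configuration conf = new Configuration();
    conf.addResource("hadoop-localhost.xml");
    return conf;
  }
}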
GenericOptionsParser, Tool, and ToolRunner
Hadoop comes with a few helper classes for making it easier to run jobs from the
command line. GenericOptionsParser is a class that interprets common Hadoop
command-line options and sets them on a Configuration object for your application to
use as desired. You don’t usually use GenericOptionsParser directly, as it’s more
convenient to implement the Tool interface and run your application with the
ToolRunner, which uses GenericOptionsParser internally.
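As a sketch of the pattern, here is a small Tool implementation, ConfigurationPrinter, that simply prints out its configuration (a handy way to see exactly what GenericOptionsParser has set):

import java.util.Map.Entry;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ConfigurationPrinter extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    // getConf() returns the Configuration that ToolRunner populated
    // from any generic options passed on the command line.
    Configuration conf = getConf();
    for (Entry<String, String> entry : conf) {
      System.out.printf("%s=%s%n", entry.getKey(), entry.getValue());
    }
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner runs GenericOptionsParser over args, sets the resulting
    // Configuration on the Tool, and then calls its run() method.
    int exitCode = ToolRunner.run(new ConfigurationPrinter(), args);
    System.exit(exitCode);
  }
}

Running it with a generic option such as -D color=yellow would then show that property in the output.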
Writing a Unit Test
The map and reduce functions in MapReduce are easy to test in isolation, which is a
consequence of their functional style. For known inputs, they produce known outputs.
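For example, using the MRUnit library (one common choice for this; the simple word-count mapper below is written just for the example), a map function can be driven with known input and its output checked in the same test:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class WordCountMapperTest {

  // A tiny illustrative mapper: emits (word, 1) for each token in the line.
  static class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        word.set(token);
        context.write(word, ONE);
      }
    }
  }

  @Test
  public void mapperEmitsOneCountPerToken() throws IOException {
    new MapDriver<LongWritable, Text, Text, IntWritable>()
        .withMapper(new WordCountMapper())
        .withInput(new LongWritable(0), new Text("hello hello world"))
        .withOutput(new Text("hello"), new IntWritable(1))
        .withOutput(new Text("hello"), new IntWritable(1))
        .withOutput(new Text("world"), new IntWritable(1))
        .runTest();
  }
}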
Testing the Driver
Apart from the flexible configuration options it offers, implementing Tool also makes
your application more testable because it allows you to inject an arbitrary
Configuration. You can take advantage of this to write a test that uses a local job
runner to run a job against known input data and checks that the output is as expected.
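A sketch of such a test, assuming a hypothetical WordCountDriver class that implements Tool:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.ToolRunner;
import org.junit.Test;
import static org.junit.Assert.assertEquals;

public class WordCountDriverTest {
  @Test
  public void runsAgainstLocalFilesystem() throws Exception {
    Configuration conf = new Configuration();
    // Force the local filesystem and the local (in-process) job runner.
    // These property names are from Hadoop 2; older releases used
    // fs.default.name and mapred.job.tracker instead.
    conf.set("fs.defaultFS", "file:///");
    conf.set("mapreduce.framework.name", "local");

    Path input = new Path("input/sample.txt"); // known test data
    Path output = new Path("output");

    FileSystem fs = FileSystem.getLocal(conf);
    fs.delete(output, true); // clean up output from any previous run

    // WordCountDriver is a hypothetical Tool implementation (your job's driver).
    int exitCode = ToolRunner.run(conf, new WordCountDriver(),
        new String[] { input.toString(), output.toString() });
    assertEquals(0, exitCode);
    // A real test would now read the part files under `output`
    // and compare them with the expected results.
  }
}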
Running on a Cluster
Once you are happy with how the program runs on a small test dataset, you are ready
to try it on the full dataset on a Hadoop cluster.
The MapReduce Web UI
Hadoop comes with a web UI for viewing information about your jobs. It is useful for
following a job’s progress while it is running, as well as finding job statistics and logs
after the job has completed.
For more complex problems, it is worth considering a higher-level language than
MapReduce, such as Pig, Hive, Cascading, Cascalog, or Crunch. One immediate benefit is
that it frees you up from having to do the translation into MapReduce jobs, allowing
you to concentrate on the analysis you are performing.
Video Lectures

O'Reilly Webcast: An Introduction to Hadoop (O'Reilly)
Developing Word Count Map Reduce Example (nataraz Java)
Hadoop Map Reduce Development - Map Reduce API introduction (itversity)
Hadoop Notes and Video Lectures
What is Hadoop? Text and Video Lectures
What is MapReduce? Text and Video Lectures
The Hadoop Distributed Filesystem (HDFS)
Hadoop Input - Output System