Pig is a scripting language for exploring large datasets. One criticism of MapReduce is
that the development cycle is very long. Writing the mappers and reducers, compiling
and packaging the code, submitting the job(s), and retrieving the results is a time-consuming
business, and even with Streaming, which removes the compile and package
step, the experience is still involved. Pig’s sweet spot is its ability to process terabytes
of data simply by issuing a half-dozen lines of Pig Latin from the console. Writing queries in
Pig Latin will save you time.
Installing and Running Pig
Pig runs as a client-side application. Even if you want to run Pig on a Hadoop cluster,
there is nothing extra to install on the cluster: Pig launches jobs and interacts with
HDFS (or other Hadoop filesystems) from your workstation.
Installation is straightforward. Java 6 is a prerequisite (and on Windows, you will need
Cygwin). Download a stable release from http://pig.apache.org/releases.html, and unpack
the tarball in a suitable place on your workstation.
In MapReduce mode, Pig translates queries into MapReduce jobs and runs them on a
Hadoop cluster. The cluster may be a pseudo- or fully distributed cluster. MapReduce
mode (with a fully distributed cluster) is what you use when you want to run Pig on
large datasets. To use MapReduce mode, you need to point Pig at the cluster’s namenode and jobtracker. If the installation
of Hadoop at HADOOP_HOME is already configured for this, then there is nothing more to
do. Otherwise, you can set HADOOP_CONF_DIR to a directory containing the Hadoop site
file (or files) that define fs.default.name and mapred.job.tracker.
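As a rough illustration of the configuration check described above, the following Python sketch scans a directory for Hadoop site files and reports which of the two properties are not defined. It assumes the site files are named with a -site.xml suffix and use the usual configuration/property/name/value layout; the function names read_site_properties and missing_properties are hypothetical helpers, not part of Pig or Hadoop.

```python
import os
import xml.etree.ElementTree as ET

# The two properties the text says Pig needs in order to find the cluster.
REQUIRED_PROPERTIES = ("fs.default.name", "mapred.job.tracker")


def read_site_properties(conf_dir):
    """Collect name -> value pairs from every *-site.xml file in conf_dir.

    Assumption: site files end in "-site.xml" and contain
    <property><name>...</name><value>...</value></property> elements.
    """
    properties = {}
    for filename in sorted(os.listdir(conf_dir)):
        if not filename.endswith("-site.xml"):
            continue
        root = ET.parse(os.path.join(conf_dir, filename)).getroot()
        for prop in root.iter("property"):
            name = prop.findtext("name")
            value = prop.findtext("value")
            if name is not None and value is not None:
                properties[name.strip()] = value.strip()
    return properties


def missing_properties(conf_dir):
    """Return the required property names that no site file defines."""
    found = read_site_properties(conf_dir)
    return [name for name in REQUIRED_PROPERTIES if name not in found]
```

Calling missing_properties on the directory you intend to use as HADOOP_CONF_DIR returns an empty list when both properties are defined.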
Once you have configured Pig to connect to a Hadoop cluster, you can launch Pig in MapReduce mode.
Running Pig Programs
Pig can run a script file that contains Pig commands. For example, pig
script.pig runs the commands in the local file script.pig. Alternatively, for very
short scripts, you can use the -e option to run a script specified as a string on the command line.
Grunt is an interactive shell for running Pig commands. Grunt is started when no
file is specified for Pig to run, and the -e option is not used. It is also possible to
run Pig scripts from within Grunt using run and exec.
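The ways of invoking Pig from the command line described so far differ only in their arguments. A minimal Python sketch of that choice follows; build_pig_command is a hypothetical helper written for illustration, and it assumes the pig launcher is on your PATH. It only builds the argument list and does not run anything.

```python
def build_pig_command(script_file=None, script_text=None):
    """Build the argument list for one of the three invocation styles.

    - script_file given: run the commands in that local file
    - script_text given: run a short script passed as a string with -e
    - neither given:     start the interactive Grunt shell
    """
    if script_file is not None and script_text is not None:
        raise ValueError("give a script file or a script string, not both")
    if script_file is not None:
        return ["pig", script_file]
    if script_text is not None:
        return ["pig", "-e", script_text]
    return ["pig"]
```

The resulting list could be passed to subprocess.run on a workstation where Pig is installed.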
You can run Pig programs from Java using the PigServer class, much like you can
use JDBC to run SQL programs from Java. For programmatic access to Grunt, use
Pig Latin Editors
PigPen is an Eclipse plug-in that provides an environment for developing Pig programs.
It includes a Pig script text editor, an example generator (equivalent to the ILLUSTRATE
command), and a button for running the script on a Hadoop cluster.
Pig Latin is a data flow programming language,
whereas SQL is a declarative programming language. In other words, a Pig Latin program
is a step-by-step set of operations on an input relation, in which each step is a
single transformation. By contrast, SQL statements are a set of constraints that, taken
together, define the output.
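To make the data flow idea concrete, here is a small Python sketch in which each function is one transformation and the output of one step is the input to the next, loosely mirroring how a Pig Latin program chains steps such as FILTER, GROUP, and FOREACH. The sample records, field layout, and function names are hypothetical and chosen only for illustration; this is not how Pig itself executes a program.

```python
from collections import defaultdict

# Hypothetical input relation: (year, temperature, quality) tuples.
# 9999 stands in for a missing temperature reading.
records = [
    (1950, 0, 1),
    (1950, 22, 1),
    (1950, 9999, 1),
    (1950, -11, 1),
    (1949, 111, 1),
    (1949, 78, 1),
]


def filter_records(rows):
    """Step 1: drop rows with a missing reading (akin to a FILTER step)."""
    return [row for row in rows if row[1] != 9999]


def group_by_year(rows):
    """Step 2: collect temperatures per year (akin to a GROUP step)."""
    groups = defaultdict(list)
    for year, temperature, _quality in rows:
        groups[year].append(temperature)
    return dict(groups)


def max_per_year(groups):
    """Step 3: reduce each group to its maximum (akin to FOREACH ... GENERATE)."""
    return {year: max(temps) for year, temps in groups.items()}


# The program is the sequence of steps, each a single transformation.
filtered = filter_records(records)
grouped = group_by_year(filtered)
max_temp = max_per_year(grouped)
```

A SQL query for the same result would instead state the desired output (the maximum temperature per year, excluding missing readings) in one statement and leave the order of operations to the database.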
Pig Latin Reference Manual 2
Data Types and More
Relations, Bags, Tuples, Fields
Arithmetic Operators and More
Casting Relations to Scalars
Bag and Tuple Functions
Excerpts from Hadoop: The Definitive Guide by Tom White, published by O'Reilly.