Computer Science and Engineering Knowledge Center: Introduction to Pig Latin

Pig Latin

Pig is a scripting language for exploring large datasets. One criticism of MapReduce is
that the development cycle is very long. Writing the mappers and reducers, compiling
and packaging the code, submitting the job(s), and retrieving the results is a time-consuming
business, and even with Streaming, which removes the compile and package
step, the experience is still involved. Pig’s sweet spot is its ability to process terabytes
of data simply by issuing a half-dozen lines of Pig Latin from the console. Writing queries in
Pig Latin will save you time.

Installing and Running Pig
Pig runs as a client-side application. Even if you want to run Pig on a Hadoop cluster,
there is nothing extra to install on the cluster: Pig launches jobs and interacts with
HDFS (or other Hadoop filesystems) from your workstation.

Installation is straightforward. Java 6 is a prerequisite (and on Windows, you will need
Cygwin). Download a stable release from http://pig.apache.org/releases.html, and unpack
the tarball in a suitable place on your workstation.

In MapReduce mode, Pig translates queries into MapReduce jobs and runs them on a
Hadoop cluster. The cluster may be a pseudo- or fully distributed cluster. MapReduce
mode (with a fully distributed cluster) is what you use when you want to run Pig on
large datasets.

To use MapReduce mode, you need to point Pig at the cluster’s namenode and jobtracker. If the installation
of Hadoop at HADOOP_HOME is already configured for this, then there is nothing more to
do. Otherwise, you can set HADOOP_CONF_DIR to a directory containing the Hadoop site
file (or files) that define fs.default.name and mapred.job.tracker.
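
As a minimal sketch of the HADOOP_CONF_DIR option, the commands below create a configuration directory whose site files define the two properties, then point Pig at it. The directory path, host names, and port are placeholders, not values from the text. Splitting the properties across core-site.xml and mapred-site.xml follows a common Hadoop layout; a single site file defining both would serve equally well.

```shell
# Hypothetical directory holding client-side Hadoop configuration.
conf_dir="$PWD/pig-hadoop-conf"
mkdir -p "$conf_dir"

# fs.default.name: URI of the cluster's namenode (placeholder host).
cat > "$conf_dir/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.com/</value>
  </property>
</configuration>
EOF

# mapred.job.tracker: host:port of the jobtracker (placeholder values).
cat > "$conf_dir/mapred-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.com:8021</value>
  </property>
</configuration>
EOF

# Pig reads the site files from this directory when it starts.
export HADOOP_CONF_DIR="$conf_dir"
echo "HADOOP_CONF_DIR=$HADOOP_CONF_DIR"
```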

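Once you have configured Pig to connect to a Hadoop cluster, you can launch Pig with the pig
command. MapReduce mode is the default, and it can also be selected explicitly with -x mapreduce.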

Running Pig Programs
There are three ways to run Pig programs:

Script
Pig can run a script file that contains Pig commands. For example, pig
script.pig runs the commands in the local file script.pig. Alternatively, for very
short scripts, you can use the -e option to run a script specified as a string on the
command line.
Grunt
Grunt is an interactive shell for running Pig commands. Grunt is started when no
file is specified for Pig to run, and the -e option is not used. It is also possible to
run Pig scripts from within Grunt using run and exec.
Embedded
You can run Pig programs from Java using the PigServer class, much like you can
use JDBC to run SQL programs from Java. For programmatic access to Grunt, use
PigRunner.
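
The script and -e invocations described above might look like the following sketch. The input file, its contents, and the one-line script are made up for illustration, and the pig commands are guarded so the snippet only calls Pig when it is installed. Local mode (-x local) is assumed here so that no cluster is needed; omit the option to run in MapReduce mode.

```shell
# Sample input: one word per line (hypothetical data).
printf 'pig\nlatin\npig\n' > words.txt

# A short script file containing Pig commands.
cat > script.pig <<'EOF'
words = LOAD 'words.txt' AS (word:chararray);
DUMP words;
EOF

if command -v pig >/dev/null 2>&1; then
  # Script: run the commands in the local file script.pig.
  pig -x local script.pig
  # -e: run a very short script given as a string on the command line.
  pig -x local -e "words = LOAD 'words.txt' AS (word:chararray); DUMP words;"
else
  echo "pig not found on PATH; wrote script.pig and words.txt only"
fi
```

From inside Grunt, the same file could be run with run script.pig or exec script.pig.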

Pig Latin Editors
PigPen is an Eclipse plug-in that provides an environment for developing Pig programs.
It includes a Pig script text editor, an example generator (equivalent to the ILLUSTRATE
command), and a button for running the script on a Hadoop cluster.

Pig Latin is a data flow programming language,
whereas SQL is a declarative programming language. In other words, a Pig Latin program
is a step-by-step set of operations on an input relation, in which each step is a
single transformation. By contrast, SQL statements are a set of constraints that, taken
together, define the output.
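
To illustrate the step-by-step style, here is a hedged sketch of a small data flow, written to a file from the shell so that it can be run with the pig command. The input file, field names, and relation names are invented for this example. Each statement names the result of one transformation, and later statements build on earlier ones. The comment at the end of the script shows roughly how the same result would be described in a single SQL statement.

```shell
# Tab-separated sample input: year, temperature (hypothetical data).
printf '1950\t0\n1950\t22\n1949\t111\n1949\t78\n' > temps.txt

cat > max_temp.pig <<'EOF'
-- Step 1: load the input relation and name its fields.
records = LOAD 'temps.txt' AS (year:chararray, temperature:int);
-- Step 2: keep only the rows that have a temperature.
valid = FILTER records BY temperature IS NOT NULL;
-- Step 3: group the remaining rows by year.
by_year = GROUP valid BY year;
-- Step 4: produce one output row per group.
max_temp = FOREACH by_year GENERATE group, MAX(valid.temperature);
DUMP max_temp;

-- Roughly equivalent SQL (the table name "records" is hypothetical):
--   SELECT year, MAX(temperature) FROM records
--   WHERE temperature IS NOT NULL GROUP BY year;
EOF

# Run the script in local mode only if Pig is installed.
if command -v pig >/dev/null 2>&1; then
  pig -x local max_temp.pig
else
  echo "pig not found on PATH; wrote max_temp.pig and temps.txt only"
fi
```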

Pig Latin Reference Manual 2
Overview
Conventions
Reserved Keywords
Data Types and More
Relations, Bags, Tuples, Fields
Data Types
Nulls
Constants
Expressions
Schemas
Parameter Substitution
Arithmetic Operators and More
Arithmetic Operators
Comparison Operators
Null Operators
Boolean Operators
Dereference Operators
Sign Operators
Flatten Operator
Cast Operators
Casting Relations to Scalars
Relational Operators
COGROUP
CROSS
DISTINCT
FILTER
FOREACH
GROUP
JOIN (inner)
JOIN (outer)
LIMIT
LOAD
MAPREDUCE
ORDER BY
SAMPLE
SPLIT
STORE
STREAM
UNION
Diagnostic Operators
DESCRIBE
DUMP
EXPLAIN
ILLUSTRATE
UDF Statements
DEFINE
REGISTER
Eval Functions
AVG
CONCAT
COUNT
COUNT_STAR
DIFF
IsEmpty
MAX
MIN
SIZE
SUM
TOKENIZE
Load/Store Functions
Handling Compression
BinStorage
PigStorage
PigDump
TextLoader
Math Functions
ABS
ACOS
ASIN
ATAN
CBRT
CEIL
COSH
COS
EXP
FLOOR
LOG
LOG10
RANDOM
ROUND
SIN
SINH
SQRT
TAN
TANH
String Functions
INDEXOF
LAST_INDEX_OF
LCFIRST
LOWER
REGEX_EXTRACT
REGEX_EXTRACT_ALL
REPLACE
STRSPLIT
SUBSTRING
TRIM
UCFIRST
UPPER
Bag and Tuple Functions
TOBAG
TOP
TOTUPLE
File Commands
cat
cd
copyFromLocal
copyToLocal
cp
ls
mkdir
mv
pwd
rm
rmf
Shell Commands
fs
sh
Utility Commands
exec
help
kill
quit
run
set

https://pig.apache.org/docs/r0.8.1/piglatin_ref2.html

Excerpts from Hadoop: The Definitive Guide, by Tom White, published by O'Reilly.