Array has dimensions. A vector is an array with only one dimension. An array with two dimensions is a matrix.
Anything with more than two dimensions is simply called an array.
Technically, a vector has no dimensions at all in R. If you use the functions dim(), nrow(), or ncol(), with a vector as argument, R returns NULL as a result.
Creating matrix
use the matrix() function.
The matrix() function has arguments to specify the matrix.
data is a vector of values you want in the matrix.
ncol takes a single number that tells R how many columns you want.
nrow takes a single number that tells R how many rows you want.
byrow takes a logical value that tells R whether you want to fill the matrix rowwise
(TRUE) or column-wise (FALSE). Column-wise is the default.
You don’t have to specify both ncol and nrow. If you specify one, R will know automatically what the other needs to be.
You can look at the structure of an object using the str() function.
If you want the number of rows and columns without looking at the structure, you can use the
dim() function.
You can find the total number of values in a matrix exactly the same way as
you do with a vector, using the length() function:
To see all the attributes of an object, you can use the attributes() function
You can combine both vectors as two rows of a matrix with the rbind() function,
The cbind() function does something similar. It binds the vectors as columns of a matrix,
you have the functions rownames() and colnames(). Both functions work much like the names() function you use when naming vector values.
Calculating with Matrices
You add a scalar to a matrix simply by using the addition operator, +,
With the addition operator, you can add both matrices together, if the
dimensions of both matrices are not the same, R will complain and refuse to carry
out the operation
By default, R fills matrices column-wise. Whenever R reads a matrix, it also
reads it column-wise.
Transposing a matrix
The t() function (which stands for transpose) will do the work
To
invert a matrix, you use the solve() function
The multiplication operator (*) works element-wise on matrices. To calculate
the inner product of two matrices, use the special operator %*%,
R has a range of functions that allow you to work with dates and times. The
easiest way of creating a date is to use the as.Date( ) function.
The default format for dates in as.Date( ) is YYYY-MM-DD
— four digits for year, and two digits for month and day, separated by a
hyphen.
To find out what day of the week this is, use weekdays():
You can add or subtract numbers from dates to create new dates.
use the seq( ) function to create sequences of dates in a far more
flexible way. As with numeric vectors, you have to specify at least three of the
arguments (from, to, by, and length.out).
In addition to weekdays( ), you also can get R to report on months() and
quarters( ):
Functions with Dates
Function
Description
as.Date()
Converts character string to Date
weekdays()
Full weekday name in the current locale (for example, Sunday, Monday, Tuesday)
months()
Full month name in the current locale (for example, January, February, March)
quarters()
Quarter numbers (Q1, Q2, Q3, or Q4)
seq()
Generates dates sequences if you pass it a Date object as its first argument
Objects that represent time series data.
ts: In R, you use the ts() function to create time series objects. These
are vector or matrix objects that contain information about the observations,
together with information about the start, frequency, and end of each
observation period. With ts class data you can use powerful R functions to do
modeling and forecasting — for example, arima() is a general model for time
series data.
abs(x) Takes the absolute value of x
log(x,base=y) Takes the logarithm of x with base y; if base is not specified, returns the natural logarithm
exp(x) Returns the exponential of x
sqrt(x) Returns the square root of x
factorial(x) Returns the factorial of x (x!)
choose(x,y)
Returns the number of possible combinations when drawing y elements at a time from x
possibilities
round(x,digits=2)
signif(x,digits=4)
cos(120)
R always works with angles in radians, not in degrees. Pay attention to this fact
So correct way to cos of 120 degrees is to write cos(120*pi/180)
The str() function gives you the type and structure of the object.
If you want to know only how long a vector is, you can simply use the
length() function,
Creating vectors
seq(from = 4.5, to = 2.5, by = -0.5)
The c() function stands for concatenate. It doesn’t create vectors — it combines them.
To repeat the vector c(0, 0, 7) three times, use this code: rep(c(0, 0, 7), times = 3)
You also can repeat every value by specifying the argument each in place of times.
rep(1:3,length.out=7) says repeat the vector till length is 7. The last repetition may be incomplete.
The bracket [] represents a function that you can use to extract a value from that vector. You can get the fifth value of the vector numbers bygiving the command.
# is for comments
R ignores everything that appears after the hash (#).
The assignment operator is symbol for less than followed by hash. (The blogger is creating a problem by giving strange interpretation of the symbol.)
#read files with labels in first row
read.table(filename,header=TRUE) #read a tab or space delimited file
read.table(filename,header=TRUE,sep=',') #read csv files
mat[4,2] #display the 4th row and the 2nd column
mat[3,] #display the 3rd row
mat[,2] #display the 2nd column
subset(dataset,logical) #those objects meeting a logical criterion
subset(data.df,select=variables,logical) #get those objects from a data frame that meet a criterion
data.df[data.df=logical] #yet another way to get a subset
x[order(x$B),] #sort a dataframe by the order of the elements in B
x[rev(order(x$B)),] #sort the dataframe in reverse order
browse.workspace #a Mac menu command that creates a window with information about all variables in the workspace
->->->->->->->->
Details
Input and display
#read files with labels in first row
read.table(filename,header=TRUE) #read a tab or space delimited file
read.table(filename,header=TRUE,sep=',') #read csv files
mat <- 2="" a="" and="" cbind="" combine="" into="" matrix="" n="" nbsp="" p="" x="" y="">mat[4,2] #display the 4th row and the 2nd column
mat[3,] #display the 3rd row
mat[,2] #display the 2nd column
subset(dataset,logical) #those objects meeting a logical criterion
subset(data.df,select=variables,logical) #get those objects from a data frame that meet a criterion
data.df[data.df=logical] #yet another way to get a subset
x[order(x$B),] #sort a dataframe by the order of the elements in B
x[rev(order(x$B)),] #sort the dataframe in reverse order
browse.workspace #a Mac menu command that creates a window with information about all variables in the workspace
Rules of Names of Variables, Vectors and Matrices in R
Names must start with a letter or a dot. If you start a name with a dot, the
second character can’t be a digit.
Names should contain only letters, numbers, underscore characters (_),
and dots (.). Although you can force R to accept other characters in names, you
shouldn’t, because these characters often have a special meaning in R.
You can’t use the following special keywords as names:
• break
• else
• FALSE
• for
• function
• if
• Inf
• NA
• NaN
• next
• repeat
• return
• TRUE
• while->->->->->->->->->->
Input and display
#read files with labels in first row
read.table(filename,header=TRUE) #read a tab or space delimited file
read.table(filename,header=TRUE,sep=',') #read csv files
x <- a="" c="" create="" data="" elements="" nbsp="" p="" specified="" vector="" with="">y <- 1="" a="" c="" create="" data="" elements="" nbsp="" p="" to10="" vector="" with="">n <- 10="" p="">x1 <- a="" c="" create="" deviates="" item="" n="" nbsp="" normal="" of="" p="" random="" rnorm="" vector="">y1 <- a="" added="" c="" create="" distribution="" each="" has="" item="" n="" nbsp="" p="" random="" runif="" that="" to="" uniform="" vector="">z <- binomial="" create="" from="" n="" nbsp="" of="" p="" prob="" probability="" rbinom="" samples="" size="" the="" with="">vect <- and="" c="" combine="" into="" length="" nbsp="" of="" one="" p="" vector="" vectors="" x="" y="">mat <- 2="" a="" and="" cbind="" combine="" into="" matrix="" n="" nbsp="" p="" x="" y="">mat[4,2] #display the 4th row and the 2nd column
mat[3,] #display the 3rd row
mat[,2] #display the 2nd column
subset(dataset,logical) #those objects meeting a logical criterion
subset(data.df,select=variables,logical) #get those objects from a data frame that meet a criterion
data.df[data.df=logical] #yet another way to get a subset
x[order(x$B),] #sort a dataframe by the order of the elements in B
x[rev(order(x$B)),] #sort the dataframe in reverse order
browse.workspace #a Mac menu command that creates a window with information about all variables in the workspace
Moving around
ls() #list the variables in the workspace
rm(x) #remove x from the workspace
rm(list=ls()) #remove all the variables from the workspace
attach(mat) #make the names of the variables in the matrix or data frame available in the workspace
detach(mat) #releases the names (remember to do this each time you attach something)
with(mat, .... ) #a preferred alternative to attach ... detach
new <- column="" drop="" n="" nbsp="" nth="" old="" p="" the="">new <- drop="" n="" nbsp="" nth="" old="" p="" row="" the="">new <- and="" c="" column="" drop="" i="" ith="" j="" jth="" nbsp="" old="" p="" the="">new <- cases="" condition="" logical="" meet="" nbsp="" old="" p="" select="" subset="" that="" the="" those="">complete <- cases="" complete.cases="" data.df="" find="" missing="" nbsp="" no="" p="" subset="" those="" values="" with="">new <- n1:n2="" n1="" n2="" n3:n4="" n3="" n4="" nbsp="" of="" old="" p="" rows="" select="" the="" through="" variables="">
replace(x, list, values) #remember to assign this to some object i.e., x <- replace="" x="=-9,NA) </p"> #similar to the operation x[x==-9] <- na="" p="">scrub(x, where, min, max, isvalue,newvalue) #a convenient way to change particular values (in psych package)
x.df <- ...="" a="" combine="" data.frame="" data="" different="" frame="" into="" kinds="" nbsp="" of="" p="" x1="" x2="" x3=""> as.data.frame() is.data.frame()
x <- as.matrix="" p="">scale() #converts a data frame to standardized scores
round(x,n) #rounds the values of x to n decimal places
ceiling(x) #vector x of smallest integers > x
floor(x) #vector x of largest interger < x
as.integer(x) #truncates real x to integers (compare to round(x,0)
as.integer(x < cutpoint) #vector x of 0 if less than cutpoint, 1 if greater than cutpoint)
factor(ifelse(a < cutpoint, "Neg", "Pos")) #is another way to dichotomize and to make a factor for analysis
transform(data.df,variable names = some operation) #can be part of a set up for a data set
x%in%y #tests each element of x for membership in y
y%in%x #tests each element of y for membership in x
all(x%in%y) #true if x is a proper subset of y
all(x) # for a vector of logical values, are they all true?
any(x) #for a vector of logical values, is at least one true?
Statistics and transformations
max(x, na.rm=TRUE) #Find the maximum value in the vector x, exclude missing values
min(x, na.rm=TRUE)
mean(x, na.rm=TRUE)
median(x, na.rm=TRUE)
sum(x, na.rm=TRUE)
var(x, na.rm=TRUE) #produces the variance covariance matrix
sd(x, na.rm=TRUE) #standard deviation
mad(x, na.rm=TRUE) #(median absolute deviation)
fivenum(x, na.rm=TRUE) #Tukey fivenumbers min, lowerhinge, median, upper hinge, max
table(x) #frequency counts of entries, ideally the entries are factors(although it works with integers or even reals)
scale(data,scale=FALSE) #centers around the mean but does not scale by the sd)
cumsum(x,na=rm=TRUE) #cumulative sum, etc.
cumprod(x)
cummax(x)
cummin(x)
rev(x) #reverse the order of values in x
cor(x,y,use="pair") #correlation matrix for pairwise complete data, use="complete" for complete cases
aov(x~y,data=datafile) #where x and y can be matrices
aov.ex1 = aov(DV~IV,data=data.ex1) #do the analysis of variance or
aov.ex2 = aov(DV~IV1*IV21,data=data.ex2) #do a two way analysis of variance
summary(aov.ex1) #show the summary table
print(model.tables(aov.ex1,"means"),digits=3) #report the means and the number of subjects/cell
boxplot(DV~IV,data=data.ex1) #graphical summary appears in graphics window
lm(x~y,data=dataset) #basic linear model where x and y can be matrices (see plot.lm for plotting options)
t.test(x,g)
pairwise.t.test(x,g)
power.anova.test(groups = NULL, n = NULL, between.var = NULL,
within.var = NULL, sig.level = 0.05, power = NULL)
power.t.test(n = NULL, delta = NULL, sd = 1, sig.level = 0.05,
power = NULL, type = c("two.sample", "one.sample", "paired"),
alternative = c("two.sided", "one.sided"),strict = FALSE)
Regression, the linear model, factor analysis and principal components analysis (PCA)
matrices
t(X) #transpose of X
X %*% Y #matrix multiply X by Y
solve(A) #inverse of A
solve(A,B) #inverse of A * B (may be used for linear regression)
data frames are needed for regression
lm(Y~X1+X2)
lm(Y~X|W)
factanal() (see also fa in the psych package)
princomp() (see principal in the psych package)
Useful additional commands
colSums (x, na.rm = FALSE, dims = 1)
rowSums (x, na.rm = FALSE, dims = 1)
colMeans(x, na.rm = FALSE, dims = 1)
rowMeans(x, na.rm = FALSE, dims = 1)
rowsum(x, group, reorder = TRUE, ...) #finds row sums for each level of a grouping variable
apply(X, MARGIN, FUN, ...) #applies the function (FUN) to either rows (1) or columns (2) on object X
apply(x,1,min) #finds the minimum for each row
apply(x,2,max) #finds the maximum for each column
col.max(x) #another way to find which column has the maximum value for each row
which.min(x)
which.max(x)
z=apply(x,1,which.min) #tells the row with the minimum value for every column
Graphics
par(mfrow=c(nrow,mcol)) #number of rows and columns to graph
par(ask=TRUE) #ask for user input before drawing a new graph
par(omi=c(0,0,1,0) ) #set the size of the outer margins
mtext("some global title",3,outer=TRUE,line=1,cex=1.5) #note that we seem to need to add the global title last
#cex = character expansion factor
boxplot(x,main="title") #boxplot (box and whiskers)
title( "some title") #add a title to the first graph
hist() #histogram
plot() plot(x,y,xlim=range(-1,1),ylim=range(-1,1),main=title) par(mfrow=c(1,1)) #change the graph window back to one figure symb=c(19,25,3,23) colors=c("black","red","green","blue") charact=c("S","T","N","H") plot(PA,NAF,pch=symb[group],col=colors[group],bg=colors[condit],cex=1.5,main="Postive vs. Negative Affect by Film condition") points(mPA,mNA,pch=symb[condit],cex=4.5,col=colors[condit],bg=colors[condit])
curve()
abline(a,b) abline(a, b, untf = FALSE, ...)
abline(h=, untf = FALSE, ...)
abline(v=, untf = FALSE, ...)
abline(coef=, untf = FALSE, ...)
abline(reg=, untf = FALSE, ...)
identify() plot(eatar,eanta,xlim=range(-1,1),ylim=range(-1,1),main=title) identify(eatar,eanta,labels=labels(energysR[,1]) ) #dynamically puts names on the plots
locate()
legend()
pairs() #SPLOM (scatter plot Matrix)
pairs.panels () #SPLOM on lower off diagonal, histograms on diagonal, correlations on diagonal
#not standard R, but in the psych package
matplot ()
biplot ())
plot(table(x)) #plot the frequencies of levels in x
x= recordPlot() #save the current plot device output in the object x
replayPlot(x) #replot object x
dev.control #various control functions for printing/saving graphic files
pdf(height=6, width=6) #create a pdf file for output
dev.of() #close the pdf file created with pdf
layout(mat) #specify where multiple graphs go on the page
#experiment with the magic code from Paul Murrell to do fancy graphic location
layout(rbind(c(1, 1, 2, 2, 3, 3),
c(0, 4, 4, 5, 5, 0)))
for (i in 1:5) {
plot(i, type="n")
text(1, i, paste("Plot", i), cex=4)
}
Distributions
To generate random samples from a variety of distributions
rnorm(n,mean,sd)
rbinom(n,size,p)
sample(x, size, replace = FALSE, prob = NULL) #samples with or without replacement
Working with Dates
date <-strptime a="" as.character="" change="" d="" date="" field="" for="" form="" internal="" m="" nbsp="" p="" the="" time="" to="" y=""> #see ?formats and ?POSIXlt
as.Date
month= months(date) #see also weekdays, Julian
APACHE MAHOUT
An algorithm library for scalable machine learning on Hadoop
Apache™ Mahout is a library of scalable machine-learning algorithms, implemented on top of Apache Hadoop® and using the MapReduce paradigm.
Mahout provides the data science tools to automatically find meaningful patterns in big data sets stored on the Hadoop Distributed File System (HDFS)
WHAT MAHOUT DOES
Mahout supports four main data mining use cases:
Collaborative filtering – Based on user behavior, makes product recommendations (e.g. YouTube recommended movies)
Clustering – takes items in a particular class (such as web pages or newspaper articles) and organizes them into groups or clusters, such that items belonging to the same group are similar to each other
Classification – learns from existing categorizations and then assigns unclassified items to the best category
Frequent itemset mining – analyzes items in a group (e.g. items in a shopping cart) and then identifies which items typically appear together
Administrators of Hadoop must have a basic understanding of how the components
of HDFS—the namenode, the secondary namenode, and the datanodes—
organize their persistent data on disk. Knowing which files are which can help in
diagnosing problems or spotting the problem instances.
Audit Logging
HDFS has the ability to log all filesystem access requests, a feature that some organizations
require for auditing purposes. Audit logging is implemented using log4j logging
at the INFO level,
Tools
dfsadmin
The dfsadmin tool is a multipurpose tool for finding information about the state of
HDFS, as well as performing administration operations on HDFS.
Filesystem check (fsck)
Hadoop provides an fsck utility for checking the health of files in HDFS. The tool looks
for blocks that are missing from all datanodes, as well as under- or over-replicated
blocks.
Monitoring
Monitoring is an important part of system administration.
The purpose of monitoring is to detect when the cluster is not providing the expected
level of service.
Logging
All Hadoop daemons produce logfiles that can be very useful for finding out what is
happening in the system.
Getting stack traces
Hadoop daemons expose a web page (/stacks in the web UI) that produces a thread
dump for all running threads in the daemon’s JVM.
Metrics
The HDFS and MapReduce daemons collect information about events and measurements
that are collectively known as metrics. Some metrics for example are the metrics collected by datanodes:
the number of bytes written, the number of blocks
replicated, and the number of read requests from clients (both local and remote).
Metrics belong to a context, and Hadoop currently uses “dfs”, “mapred”, “rpc”, and
“jvm” contexts. Hadoop daemons usually collect metrics under several contexts. For
example, datanodes collect metrics for the “dfs”, “rpc”, and “jvm” contexts.
Maintenance
Routine Administration Procedures
Metadata backups
If the namenode’s persistent metadata is lost or damaged, the entire filesystem is rendered
unusable, so it is critical that backups are made of these files. You should keep
multiple copies of different ages (one hour, one day, one week, and one month, say) to
protect against corruption, either in the copies themselves or in the live files running
on the namenode.
Data backups
Although HDFS is designed to store data reliably, data loss can occur, just like in any
storage system, and thus a backup strategy is essential. The key is to prioritize data to be backed up. The highest priority is the data that cannot
be regenerated and that is critical to the business.
The distcp tool is ideal for making backups to other HDFS clusters or
other Hadoop filesystems (such as S3 or KFS), since it can copy files in parallel. Alternatively,
an entirely different storage system can be employed for backups, using one of
the ways to export data from HDFS.
Filesystem check (fsck)
It is advisable to run HDFS’s fsck tool regularly (for example, daily) on the whole filesystem
to proactively look for missing or corrupt blocks.
Filesystem balancer
Run the balancer tool (see “balancer” on page 304) regularly to keep the filesystem
datanodes evenly balanced.
Commissioning and Decommissioning Nodes
As an administrator of a Hadoop cluster, you will need to add or remove nodes from
time to time.
Upgrades
Excerpts from Hadoop: The Definitive Guide, Tom White, Pub by O'Reilly
More references
Hadoop Administration and Maintenance (includes list of commands)
Sqoop is an open-source tool that allows
users to extract data from a relational database into Hadoop for further processing.
This processing can be done with MapReduce programs or other higher-level tools such
as Hive. (It’s even possible to use Sqoop to move data from a relational database into
HBase.) When the final results of an analytic pipeline are available, Sqoop can export
these results back to the database for consumption by other clients.
Getting Sqoop
Sqoop is available in a few places. The primary home of the project is http://incubator
.apache.org/sqoop/. This repository contains all the Sqoop source code and documentation.
Official releases are available at this site, as well as the source code for the version
currently under development. The repository itself contains instructions for compiling
the project.
After you install Sqoop, you can use it to import data to Hadoop.
Sqoop imports from databases. The list of databases that it has been tested with includes
MySQL, PostgreSQL, Oracle, SQL Server and DB2.
By default, Sqoop will generate comma-delimited text files for our imported data. Delimiters
can be explicitly specified, as well as field enclosing and escape characters to
allow the presence of delimiters in the field contents. The command-line arguments
that specify delimiter characters, file formats, compression, and more fine-grained
control of the import process are described in the Sqoop User Guide distributed with
Sqoop
Controlling the Import
Sqoop does not need to import an entire table at a time. For example, a subset of the
table’s columns can be specified for import. Users can also specify a WHERE clause to
include in queries, which bound the rows of the table to import
Working with Imported Data
Once data has been imported to HDFS, it is now ready for processing by custom Map-
Reduce programs. Text-based imports can be easily used in scripts run with Hadoop
Streaming or in MapReduce jobs run with the default TextInputFormat.
Imported Data and Hive
Using a system like Hive to handle
relational operations can dramatically ease the development of the analytic pipeline.
Especially for data originally from a relational data source, using Hive makes a lot of
sense. Hive and Sqoop together form a powerful toolchain for performing analysis.
Excerpts from Hadoop: The Definitive Guide, Tom White, Pub by O'Reilly
____________________
Hadoop’s distributed coordination service is called ZooKeeper.
what ZooKeeper does
do is give you a set of tools to build distributed applications that can safely handle
partial failures.
ZooKeeper also has the following characteristics:
ZooKeeper is simple
ZooKeeper is, at its core, a stripped-down filesystem that exposes a few simple
operations, and some extra abstractions such as ordering and notifications.
ZooKeeper is expressive
The ZooKeeper primitives are a rich set of building blocks that can be used to build
a large class of coordination data structures and protocols.
Group Membership in ZooKeeper
One way of understanding ZooKeeper is to think of it as providing a high-availability
filesystem. It doesn’t have files and directories, but a unified concept of a node, called
a znode, which acts both as a container of data (like a file) and a container of other
znodes (like a directory).
Joining a Group
The next part of the application is a program to register a member in a group. Each
member will run as a program and join a group. When the program exits, it should be
removed from the group, which we can do by creating an ephemeral znode that represents
it in the ZooKeeper namespace.
ZooKeeper command-line tools
Data Model
ZooKeeper maintains a hierarchical tree of nodes called znodes. A znode stores data
and has an associated ACL. ZooKeeper is designed for coordination (which typically
uses small data files), not high-volume data storage, so there is a limit of 1 MB on the
amount of data that may be stored in any znode.
Building Applications with ZooKeeper
A Configuration Service
One of the most basic services that a distributed application needs is a configuration
service so that common pieces of configuration information can be shared by machines
in a cluster. At the simplest level, ZooKeeper can act as a highly available store for
configuration, allowing application participants to retrieve or update configuration
files. Using ZooKeeper watches, it is possible to create an active configuration service,
where interested clients are notified of changes in configuration.
Recoverable exceptions are those from which the application can
recover within the same ZooKeeper session. A recoverable exception is manifested by
KeeperException.ConnectionLossException, which means that the connection to
ZooKeeper has been lost. ZooKeeper will try to reconnect, and in most cases the reconnection
will succeed and ensure that the session is intact.
A Lock Service
A distributed lock is a mechanism for providing mutual exclusion between a collection
of processes. At any one time, only a single process may hold the lock. Distributed locks
can be used for leader election in a large distributed system, where the leader is the
process that holds the lock at any point in time.
More Distributed Data Structures and Protocols
There are many distributed data structures and protocols that can be built with Zoo-
Keeper, such as barriers, queues, and two-phase commit. One interesting thing to note
is that these are synchronous protocols, even though we use asynchronous ZooKeeper
primitives (such as notifications) to build them.
Resilience and Performance
ZooKeeper machines should be located to minimize the impact of machine and network
failure. In practice, this means that servers should be spread across racks, power supplies,
and switches, so that the failure of any one of these does not cause the ensemble
to lose a majority of its servers.
Excerpts from Hadoop: The Definitive Guide, Tom White, Pub by O'Reilly
HBase is a distributed column-oriented database built on top of HDFS. HBase is the
Hadoop application to use when you require real-time read/write random-access to
very large datasets.
The canonical HBase use case is the webtable, a table of crawled web pages and their
attributes (such as language and MIME type) keyed by the web page URL. The webtable
is large, with row counts that run into the billions. Batch analytic and parsing
MapReduce jobs are continuously run against the webtable deriving statistics and
adding new columns of verified MIME type and parsed text content for later indexing
by a search engine.
Excerpts from Hadoop: The Definitive Guide, Tom White, Pub by O'Reilly
Concepts
Data Model
Applications store data into labeled tables. Tables are made of rows and columns. Table
cells—the intersection of row and column coordinates—are versioned. By default, their
version is a timestamp auto-assigned by HBase at the time of cell insertion. A cell’s
content is an uninterpreted array of bytes.
Regions
Tables are automatically partitioned horizontally by HBase into regions. Each region
comprises a subset of a table’s rows. A region is denoted by the table it belongs to, its
first row, inclusive, and last row, exclusive.
Locking
Row updates are atomic, no matter how many row columns constitute the row-level
HBase modeled with an HBase master node orchestrating a cluster of one or more
regionserver slaves. The HBase master is responsible for bootstrapping
a virgin install, for assigning regions to registered regionservers, and for recovering
regionserver failures. The master node is lightly loaded.HBase depends on ZooKeeper and by default it manages a ZooKeeper
instance as the authority on cluster state.
transaction. This keeps the locking model simple.
Installation
Download a stable release from an Apache Download Mirror and unpack it on your
local filesystem.
Java
HBase, like Hadoop, is written in Java.
MapReduce
HBase classes and utilities in the org.apache.hadoop.hbase.mapreduce package facilitate
using HBase as a source and/or sink in MapReduce jobs.
Avro, REST, and Thrift
HBase ships with Avro, REST, and Thrift interfaces. These are useful when the interacting
application is written in a language other than Java. In all cases, a Java server
hosts an instance of the HBase client brokering application Avro, REST, and Thrift
requests in and out of the HBase cluster.
Hive was created to make it possible for analysts with strong SQL skills (but meager
Java programming skills) to run queries on the huge volumes of data that Facebook
stored in HDFS. Today, Hive is a successful Apache project used by many organizations
as a general-purpose, scalable data processing platform.
SQL is the lingua franca in business intelligence tools (ODBC is a common bridge, for
example), so Hive is well placed to integrate with these products.
Installing Hive
In normal use, Hive runs on your workstation and converts your SQL query into a series
of MapReduce jobs for execution on a Hadoop cluster. Hive organizes data into tables,
which provide a means for attaching structure to data stored in HDFS. Metadata—
such as table schemas—is stored in a database called the metastore.
Installation of Hive is straightforward. Java 6 is a prerequisite; and on Windows, you
will need Cygwin, too. You also need to have the same version of Hadoop installed
locally that your cluster is running.
Download a release at http://hive.apache.org/releases.html, and unpack the tarball in a
suitable place on your workstation:
The Hive Shell
The shell is the primary way that we will interact with Hive, by issuing commands in
HiveQL. HiveQL is Hive’s query language, a dialect of SQL. It is heavily influenced by
MySQL, so if you are familiar with MySQL you should feel at home using Hive.
Configuring Hive
Hive is configured using an XML configuration file like Hadoop’s. The file is called
hive-site.xml and is located in Hive’s conf directory. This file is where you can set properties
that you want to set every time you run Hive. The same directory contains hivedefault.
xml, which documents the properties that Hive exposes and their default values.
The Metastore
The metastore is the central repository of Hive metadata. The metastore is divided into
two pieces: a service and the backing store for the data. By default, the metastore service
runs in the same JVM as the Hive service and contains an embedded Derby database
instance backed by the local disk. This is called the embedded metastore configuration.
Data Types
Hive supports both primitive and complex data types. Primitives include numeric,
boolean, string, and timestamp types. The complex data types include arrays, maps,
and structs.
Operators and Functions
The usual set of SQL operators is provided by Hive: relational operators (such as x =
'a' for testing equality, x IS NULL for testing nullity, x LIKE 'a%' for pattern matching),
arithmetic operators (such as x + 1 for addition), and logical operators (such as x OR y
for logical OR).
Tables
A Hive table is logically made up of the data being stored and the associated metadata
describing the layout of the data in the table. The data typically resides in HDFS, although
it may reside in any Hadoop filesystem, including the local filesystem or S3.
Hive stores the metadata in a relational database—and not in HDFS
Managed Tables and External Tables
When you create a table in Hive, by default Hive will manage the data, which means
that Hive moves the data into its warehouse directory. Alternatively, you may create
an external table, which tells Hive to refer to the data that is at an existing location
outside the warehouse directory.
Partitions and Buckets
Hive organizes tables into partitions, a way of dividing a table into coarse-grained parts
based on the value of a partition column, such as date. Using partitions can make it
faster to do queries on slices of the data.
Tables or partitions may further be subdivided into buckets, to give extra structure to
the data that may be used for more efficient queries.
Storage Formats
There are two dimensions that govern table storage in Hive: the row format and the
file format. The row format dictates how rows, and the fields in a particular row, are
stored. In Hive parlance, the row format is defined by a SerDe, a portmanteau word
for a Serializer-Deserializer.
The file format dictates the container format for fields in a row. The simplest format is
a plain text
Importing Data
The LOAD DATA operation can be used to import data into a Hive table
(or partition) by copying or moving files to the table’s directory. One can also populate
a table with data from another Hive table using an INSERT statement, or at creation time
using the CTAS construct, which is an abbreviation used to refer to CREATE TABLE...AS
SELECT.
Altering Tables
Since Hive uses the schema on read approach, it’s flexible in permitting a table’s definition
to change after the table has been created. The general caveat, however, is that
it is up to you, in many cases, to ensure that the data is changed to reflect the new
structure.
Dropping Tables
The DROP TABLE statement deletes the data and metadata for a table. In the case of
external tables, only the metadata is deleted—the data is left untouched.
Querying Data
MapReduce Scripts
Using an approach like Hadoop Streaming, the TRANSFORM, MAP, and REDUCE clauses make
it possible to invoke an external script or program from Hive.
Joins
One of the nice things about using Hive, rather than raw MapReduce, is that it makes
performing commonly used operations very simple. Join operations are a case in point,
given how involved they are to implement in MapReduce
Inner joins
The simplest kind of join is the inner join, where each match in the input tables results
in a row in the output.
Outer joins
Outer joins allow you to find nonmatches in the tables being joined.
Map joins
If one table is small enough to fit in memory, then Hive can load the smaller table into
memory to perform the join in each of the mappers.
Subqueries
A subquery is a SELECT statement that is embedded in another SQL statement. Hive has
limited support for subqueries, only permitting a subquery in the FROM clause of a
SELECT statement.
Views
A view is a sort of “virtual table” that is defined by a SELECT statement. Views can be
used to present data to users in a different way to the way it is actually stored on disk.
User-Defined Functions
Sometimes the query you want to write can’t be expressed easily (or at all) using the
built-in functions that Hive provides. By writing a user-defined function (UDF), Hive
makes it easy to plug in your own processing code and invoke it from a Hive query.
Excerpts from Hadoop: The Definitive Guide, Tom White, Pub by O'Reilly
Big Data - An Introduction to Hive and HQL
IBM Analytics
_________________
_________________
Understanding Hive In Depth | Hive Tutorial for Beginners | Apache Hive Explained With Hive Commands
Pig is a scripting language for exploring large datasets. One criticism of MapReduce is
that the development cycle is very long. Writing the mappers and reducers, compiling
and packaging the code, submitting the job(s), and retrieving the results is a timeconsuming
business, and even with Streaming, which removes the compile and package
step, the experience is still involved. Pig’s sweet spot is its ability to process terabytes
of data simply by issuing a half-dozen lines of Pig Latin from the console. Writing queries in
Pig Latin will save you time.
Installing and Running Pig
Pig runs as a client-side application. Even if you want to run Pig on a Hadoop cluster,
there is nothing extra to install on the cluster: Pig launches jobs and interacts with
HDFS (or other Hadoop filesystems) from your workstation.
Installation is straightforward. Java 6 is a prerequisite (and on Windows, you will need
Cygwin). Download a stable release from http://pig.apache.org/releases.html, and unpack
the tarball in a suitable place on your workstation:
In MapReduce mode, Pig translates queries into MapReduce jobs and runs them on a
Hadoop cluster. The cluster may be a pseudo- or fully distributed cluster. MapReduce
mode (with a fully distributed cluster) is what you use when you want to run Pig on
large datasets.
you need to point Pig at the cluster’s namenode and jobtracker. If the installation
of Hadoop at HADOOP_HOME is already configured for this, then there is nothing more to
do. Otherwise, you can set HADOOP_CONF_DIR to a directory containing the Hadoop site
file (or files) that define fs.default.name and mapred.job.tracker.
Once you have configured Pig to connect to a Hadoop cluster, you can launch Pig,
Running Pig Programs
Script
Pig can run a script file that contains Pig commands. For example, pig
script.pig runs the commands in the local file script.pig. Alternatively, for very
short scripts, you can use the -e option to run a script specified as a string on the
command line.
Grunt
Grunt is an interactive shell for running Pig commands. Grunt is started when no
file is specified for Pig to run, and the -e option is not used. It is also possible to
run Pig scripts from within Grunt using run and exec.
Embedded
You can run Pig programs from Java using the PigServer class, much like you can
use JDBC to run SQL programs from Java. For programmatic access to Grunt, use
PigRunner.
Pig Latin Editors
PigPen is an Eclipse plug-in that provides an environment for developing Pig programs.
It includes a Pig script text editor, an example generator (equivalent to the ILLUSTRATE
command), and a button for running the script on a Hadoop cluster.
Pig Latin is a data flow programming language,
whereas SQL is a declarative programming language. In other words, a Pig Latin program
is a step-by-step set of operations on an input relation, in which each step is a
single transformation. By contrast, SQL statements are a set of constraints that, taken
together, define the output.
Pig Latin Reference Manual 2
Overview
Conventions
Reserved Keywords
Data Types and More
Relations, Bags, Tuples, Fields
Data Types
Nulls
Constants
Expressions
Schemas
Parameter Substitution
Arithmetic Operators and More
Arithmetic Operators
Comparison Operators
Null Operators
Boolean Operators
Dereference Operators
Sign Operators
Flatten Operator
Cast Operators
Casting Relations to Scalars
Relational Operators
COGROUP
CROSS
DISTINCT
FILTER
FOREACH
GROUP
JOIN (inner)
JOIN (outer)
LIMIT
LOAD
MAPREDUCE
ORDER BY
SAMPLE
SPLIT
STORE
STREAM
UNION
Diagnostic Operators
DESCRIBE
DUMP
EXPLAIN
ILLUSTRATE
UDF Statements
DEFINE
REGISTER
Eval Functions
AVG
CONCAT
Example
COUNT
COUNT_STAR
DIFF
IsEmpty
MAX
MIN
SIZE
SUM
TOKENIZE
Load/Store Functions
Handling Compression
BinStorage
PigStorage
PigDump
TextLoader
Math Functions
ABS
ACOS
ASIN
ATAN
CBRT
CEIL
COSH
COS
EXP
FLOOR
LOG
LOG10
RANDOM
ROUND
SIN
SINH
SQRT
TAN
TANH
String Functions
INDEXOF
LAST_INDEX_OF
LCFIRST
LOWER
REGEX_EXTRACT
REGEX_EXTRACT_ALL
REPLACE
STRSPLIT
SUBSTRING
TRIM
UCFIRST
UPPER
Bag and Tuple Functions
TOBAG
TOP
TOTUPLE
File Commands
cat
cd
copyFromLocal
copyToLocal
cp
ls
mkdir
mv
pwd
rm
rmf
Shell Commands
fs
sh
Utility Commands
exec
help
kill
quit
run
set
Setting Up a Hadoop Cluster
Important Points of Important Issues
Apache Hadoop Installing & configuring multi node hadoop cluster
Comware Labs
_________________
_________________
How to setup Hadoop Cluster and configure Size?
Hadoop BigData Online Training
_________________
_________________
Running
To do useful work, Hadoop needs to run on multiple nodes.
Cluster Specification
Hadoop is designed to run on commodity hardware. That means one can choose standardized,
commonly available hardware from any of a large range of vendors to build a cluster.
Mid-2010 specifications for the processors (from Tom White)
The bulk of Hadoop is written in Java, and can therefore run on any platform with a JVM, although there are enough parts that harbor Unix assumptions (the control scripts, for example) to make it unwise to run on a non-Unix platform in production.
For a small cluster (on the order of 10 nodes), it is usually acceptable to run the namenode and the jobtracker on a single master machine (as long as at least one copy of the namenode’s metadata is stored on a remote filesystem). As the cluster and the number of files stored in HDFS grow, the namenode needs more memory, so the namenode and jobtracker should be moved onto separate machines.
A common Hadoop cluster architecture consists of a two-level network topology. Typically there are 30 to 40 servers per rack, with a 1 GB switch for the rack, and an uplink to a core switch or
router (which is normally 1 GB or better). The salient point is that the aggregate bandwidth
between nodes on the same rack is much greater than that between nodes on
different racks.
Network locations such as nodes and racks are represented in a tree, which reflects the
network “distance” between locations. The namenode uses the network location when
determining where to place block replicas; the MapReduce scheduler uses network location to determine where the closest replica is as input to a map task.
To ease the burden of installing and maintaining the same software on each node, it is
normal to use an automated installation method like Red Hat Linux’s Kickstart or
Debian’s Fully Automatic Installation. These tools allow you to automate the operating
system installation by recording the answers to questions that are asked during the
installation process (such as the disk partition layout), as well as which packages to
install.
Installing Java
Java 6 or later is required to run Hadoop. The latest stable Sun JDK is the preferred
option, although Java distributions from other vendors may work, too.
Creating a Hadoop User
It’s good practice to create a dedicated Hadoop user account to separate the Hadoop
installation from other services running on the same machine.
Installing Hadoop
Download Hadoop from the Apache Hadoop releases page (http://hadoop.apache.org/
core/releases.html), and unpack the contents of the distribution in a sensible location,
such as /usr/local (/opt is another standard choice). Note that Hadoop is not installed
in the hadoop user’s home directory, as that may be an NFS-mounted directory:
SSH Configuration
The Hadoop control scripts (but not the daemons) rely on SSH to perform cluster-wide
operations. For example, there is a script for stopping and starting all the daemons in
the cluster. Note that the control scripts are optional—cluster-wide operations can be
performed by other mechanisms, too (such as a distributed shell).
To work seamlessly, SSH needs to be set up to allow password-less login for the
hadoop user from machines in the cluster. The simplest way to achieve this is to generate
a public/private key pair, and place it in an NFS location that is shared across the cluster.
Configuration Management
Hadoop does not have a single, global location for configuration information. Instead,
each Hadoop node in the cluster has its own set of configuration files, and it is up to
administrators to ensure that they are kept in sync across the system. Hadoop provides
a rudimentary facility for synchronizing configuration using rsync;
alternatively, there are parallel shell tools that can help do this, like dsh or
pdsh.
Control scripts
Hadoop comes with scripts for running commands, and starting and stopping daemons
across the whole cluster. To use these scripts (which can be found in the bin directory),
you need to tell Hadoop which machines are in the cluster. There are two files for this
purpose, called masters and slaves, each of which contains a list of the machine hostnames
or IP addresses, one per line.
Master node scenarios
Depending on the size of the cluster, there are various configurations for running the
master daemons: the namenode, secondary namenode, and jobtracker.
Environment Settings
Memory
By default, Hadoop allocates 1,000 MB (1 GB) of memory to each daemon it runs. This
is controlled by the HADOOP_HEAPSIZE setting in hadoop-env.sh. In addition, the task
tracker launches separate child JVMs to run map and reduce tasks.
Java
The location of the Java implementation to use is determined by the JAVA_HOME setting
in hadoop-env.sh or from the JAVA_HOME shell environment variable, if not set in hadoopenv.
sh.
System logfiles
System logfiles produced by Hadoop are stored in $HADOOP_INSTALL/logs by default.
This can be changed using the HADOOP_LOG_DIR setting in hadoop-env.sh.
SSH settings
The control scripts allow you to run commands on (remote) worker nodes from the
master node using SSH. It can be useful to customize the SSH settings,
Important Hadoop Daemon Properties
Hadoop has a bewildering number of configuration properties. You need to define some and have to understand why the default is
appropriate for any real-world working cluster.
HDFS
To run HDFS, you need to designate one machine as a namenode. In this case, the
property fs.default.name is an HDFS filesystem URI, whose host is the namenode’s
hostname or IP address, and port is the port that the namenode will listen on for RPCs.
If no port is specified, the default of 8020 is used.
MapReduce
To run MapReduce, you need to designate one machine as a jobtracker, which on small
clusters may be the same machine as the namenode. To do this, set the
mapred.job.tracker property to the hostname or IP address and port that the jobtracker
will listen on. Note that this property is not a URI, but a host-port pair, separated by
a colon. The port number 8021 is a common choice.
Hadoop Daemon Addresses and Ports
Hadoop daemons generally run both an RPC server for communication
between daemons and an HTTP server to provide web pages for human consumption
. Each server is configured by setting the network address and port number
to listen on. By specifying the network address as 0.0.0.0, Hadoop will bind to all
addresses on the machine. Alternatively, you can specify a single address to bind to. A
port number of 0 instructs the server to start on a free port: this is generally discouraged,
since it is incompatible with setting cluster-wide firewall policies.
Other Hadoop Properties
Some other properties that you might consider setting.
Cluster membership
To aid the addition and removal of nodes in the future, you can specify a file containing
a list of authorized machines that may join the cluster as datanodes or tasktrackers.
Buffer size
Hadoop uses a buffer size of 4 KB (4,096 bytes) for its I/O operations. This is a conservative
setting, and with modern hardware and operating systems, you will likely see
performance benefits by increasing it; 128 KB (131,072 bytes) is a common choice. Set
this using the io.file.buffer.size property in core-site.xml.
HDFS block size
The HDFS block size is 64 MB by default, but many clusters use 128 MB (134,217,728
bytes) or even 256 MB (268,435,456 bytes) to ease memory pressure on the namenode
and to give mappers more data to work on. Set this using the dfs.block.size property
in hdfs-site.xml.
Reserved storage space
By default, datanodes will try to use all of the space available in their storage directories.
If you want to reserve some space on the storage volumes for non-HDFS use, then you
can set dfs.datanode.du.reserved to the amount, in bytes, of space to reserve.
Trash
Hadoop filesystems have a trash facility, in which deleted files are not actually deleted,
but rather are moved to a trash folder, where they remain for a minimum period before
being permanently deleted by the system. The minimum period in minutes that a file
will remain in the trash is set using the fs.trash.interval configuration property in
core-site.xml.
Job scheduler
Particularly in a multiuser MapReduce setting, consider changing the default FIFO job
scheduler to one of the more fully featured alternatives.
Reduce slow start
By default, schedulers wait until 5% of the map tasks in a job have completed before
scheduling reduce tasks for the same job. For large jobs this can cause problems with
cluster utilization, since they take up reduce slots while waiting for the map tasks to
complete. Setting mapred.reduce.slowstart.completed.maps to a higher value, such as
0.80 (80%), can help improve throughput.
Task memory limits
Hadoop provides two mechanisms for this. The simplest is via the Linux
ulimit command, which can be done at the operating system level (in the limits.conf
file, typically found in /etc/security), or by setting mapred.child.ulimit in the Hadoop
configuration. The value is specified in kilobytes, and should be comfortably larger
than the memory of the JVM set by mapred.child.java.opts; otherwise, the child JVM
might not start.
User Account Creation
Once you have a Hadoop cluster up and running, you need to give users access to it.
This involves creating a home directory for each user and setting ownership permissions
on it:
YARN Configuration
The YARN start-all.sh script (in the bin directory) starts the YARN daemons in the
cluster. This script will start a resource manager (on the machine the script is run on),
and a node manager on each machine listed in the slaves file.
Security Enhancements
Security has been tightened throughout HDFS and MapReduce to protect against unauthorized
access to resources.
Benchmarking a Hadoop Cluster
Is the cluster set up correctly? The best way to answer this question is empirically: run
some jobs and confirm that you get the expected results. Benchmarks make good tests,
as you also get numbers that you can compare with other clusters as a sanity check on
whether your new cluster is performing roughly as expected.
Hadoop Benchmarks
Hadoop comes with several benchmarks that you can run very easily with minimal
setup cost. Benchmarks are packaged in the test JAR file, and you can get a list of them,
with descriptions, by invoking the JAR file with no arguments:
Hadoop in the Cloud
Cloudera offers tools for running Hadoop in a public or private cloud, and Amazon
has a Hadoop cloud service called Elastic MapReduce.
The Apache Whirr project (http://whirr.apache.org/) provides a Java API and a set of
scripts that make it easy to run Hadoop on EC2 and other cloud providers.The scripts
allow you to perform such operations as launching or terminating a cluster, or listing
the running instances in a cluster.
Excerpts from Hadoop: The Definitive Guide, Tom White, Pub by O'Reilly
Counters
There are often things you would like to know about the data you are analyzing but
that are peripheral to the analysis you are performing.Counters are a useful channel for gathering statistics about the job: for quality control
or for application level-statistics. They are also useful for problem diagnosis.
Built-in Counters
Hadoop maintains some built-in counters for every job, which report various metrics
for your job. For example, there are counters for the number of bytes and records
processed, which allows you to confirm that the expected amount of input was consumed
and the expected amount of output was produced.
Task counters
Task counters gather information about tasks over the course of their execution, and
the results are aggregated over all the tasks in a job. For example, the
MAP_INPUT_RECORDS counter counts the input records read by each map task and aggregates
over all map tasks in a job, so that the final figure is the total number of input
records for the whole job.
Job counters
Job counters are maintained by the jobtracker (or application master in
YARN), so they don’t need to be sent across the network, unlike all other counters,
including user-defined ones. They measure job-level statistics, not values that change
while a task is running. For example, TOTAL_LAUNCHED_MAPS counts the number of map
tasks that were launched over the course of a job (including ones that failed).
User-Defined Java Counters
MapReduce allows user code to define a set of counters, which are then incremented
as desired in the mapper or reducer. Counters are defined by a Java enum, which serves
to group related counters. A job may define an arbitrary number of enums, each with
an arbitrary number of fields. The name of the enum is the group name, and the enum’s
fields are the counter names. Counters are global: the MapReduce framework aggregates
them across all maps and reduces to produce a grand total at the end of the job.
Hadoop Map Reduce Development - Counters - Introduction
itversity
_______________
_______________
Sorting
The ability to sort data is at the heart of MapReduce. Even if your application isn’t
concerned with sorting per se, it may be able to use the sorting stage that MapReduce
provides to organize its data.
Partial Sort
Total Sort
It is possible to produce a set of sorted files that, if concatenated, would form
a globally sorted file. The secret to doing this is to use a partitioner that respects the
total order of the output.
Secondary Sort
Streaming
To do a secondary sort in Streaming, we can take advantage of a couple of library classes
that Hadoop provides.
Joins
MapReduce can perform joins between large datasets, but writing the code to do joins
from scratch is fairly involved. Rather than writing MapReduce programs, you might
consider using a higher-level framework such as Pig, Hive, or Cascading, in which join
operations are a core part of the implementation.
Map-Side Joins
A map-side join between large inputs works by performing the join before the data
reaches the map function. For this to work, though, the inputs to each map must be
partitioned and sorted in a particular way. Each input dataset must be divided into the
same number of partitions, and it must be sorted by the same key (the join key) in each
source. All the records for a particular key must reside
in the same partition.
Reduce-Side Joins
A reduce-side join is more general than a map-side join, in that the input datasets don’t
have to be structured in any particular way, but it is less efficient as both datasets have
to go through the MapReduce shuffle. The basic idea is that the mapper tags each record
with its source and uses the join key as the map output key, so that the records with
the same key are brought together in the reducer.
Joins in Hadoop Mapreduce | Mapside Joins | Reduce Side Joins | Hadoop Mapreduce Tutorial
edureka!
_________________
_________________
Side Data Distribution
Side data can be defined as extra read-only data needed by a job to process the main
dataset. The challenge is to make side data available to all the map or reduce tasks
(which are spread across the cluster) in a convenient and efficient fashion.
Distributed Cache
Rather than serializing side data in the job configuration, it is preferable to distribute
datasets using Hadoop’s distributed cache mechanism. This provides a service for
copying files and archives to the task nodes in time for the tasks to use them when they
run. To save network bandwidth, files are normally copied to any particular node once
per job.
MapReduce Library Classes
Hadoop comes with a library of mappers and reducers for commonly used functions.
Excerpts from Hadoop: The Definitive Guide, Tom White, Pub by O'Reilly
MapReduce has a simple model of data processing: inputs and outputs for the map and
reduce functions are key-value pairs. This chapter looks at the MapReduce model in
detail and, in particular, how data in various formats, from simple text to structured
binary objects, can be used with this model.
MapReduce Types
The map and reduce functions in Hadoop MapReduce have the following general form:
map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
In general, the map input key and value types (K1 and V1) are different from the map
output types (K2 and V2). However, the reduce input must have the same types as the
map output, although the reduce output types may be different again (K3 and V3).
Input Formats
Hadoop can process many different types of data formats, from flat text files to databases.
Input Splits and Records
As we saw in Chapter 2, an input split is a chunk of the input that is processed by a
single map. Each map processes a single split. Each split is divided into records, and
the map processes each record—a key-value pair—in turn. Splits and records are logical:
there is nothing that requires them to be tied to files, for example, although in their
most common incarnations, they are. In a database context, a split might correspond
to a range of rows from a table and a record to a row in that range
FileInputFormat
FileInputFormat is the base class for all implementations of InputFormat that use files
as their data source. It provides two things: a place to define which files
are included as the input to a job, and an implementation for generating splits for the
input files. The job of dividing splits into records is performed by subclasses.
Small files and CombineFileInputFormat
Hadoop works better with a small number of large files than a large number of small
files. One reason for this is that FileInputFormat generates splits in such a way that each
split is all or part of a single file. If the file is very small (“small” means significantly
smaller than an HDFS block) and there are a lot of them, then each map task will process
very little input, and there will be a lot of them (one per file), each of which imposes
extra bookkeeping overhead.
Text Input
Hadoop excels at processing unstructured text. Different InputFormats are provided to process text in Hadoop.
TextInputFormat
TextInputFormat is the default InputFormat. Each record is a line of input. The key, a
LongWritable, is the byte offset within the file of the beginning of the line. The value is
the contents of the line, excluding any line terminators (newline, carriage return), and
is packaged as a Text object.
If you want your mappers to receive a fixed number of lines of input, then
NLineInputFormat is the InputFormat to use. Like TextInputFormat, the keys are the byte
offsets within the file and the values are the lines themselves. N refers to the number of lines of input that each mapper receives. With N set to
one (the default), each mapper receives exactly one line of input. The mapre
duce.input.lineinputformat.linespermap property (mapred.line.input.format.line
spermap in the old API) controls the value of N.
Binary Input
Hadoop MapReduce is not just restricted to processing textual data—it has support
for binary formats, too.
SequenceFileInputFormat
Hadoop’s sequence file format stores sequences of binary key-value pairs. Sequence
files are well suited as a format for MapReduce data since they are splittable (they have
sync points so that readers can synchronize with record boundaries from an arbitrary
point in the file, such as the start of a split), they support compression as a part of the
format, and they can store arbitrary types using a variety of serialization frameworks.
Database Input (and Output)
DBInputFormat is an input format for reading data from a relational database, using
JDBC. Because it doesn’t have any sharding capabilities, you need to be careful not to
overwhelm the database you are reading from by running too many mappers. For this
reason, it is best used for loading relatively small datasets, perhaps for joining with
larger datasets from HDFS, using MultipleInputs. The corresponding output format is
DBOutputFormat, which is useful for dumping job outputs (of modest size) into a
database.
Output Formats
Hadoop has output data formats that correspond to the input formats
Excerpts from Hadoop: The Definitive Guide, Tom White, Pub by O'Reilly
MapReduce Types and Formats in hadoop
videoonlinelearning
________________