Tuesday, May 10, 2016

Introduction to Hive


Hive was created to make it possible for analysts with strong SQL skills (but meager
Java programming skills) to run queries on the huge volumes of data that Facebook
stored in HDFS. Today, Hive is a successful Apache project used by many organizations
as a general-purpose, scalable data processing platform.

SQL is the lingua franca in business intelligence tools (ODBC is a common bridge, for
example), so Hive is well placed to integrate with these products.

Installing Hive
In normal use, Hive runs on your workstation and converts your SQL query into a series
of MapReduce jobs for execution on a Hadoop cluster. Hive organizes data into tables,
which provide a means for attaching structure to data stored in HDFS. Metadata—
such as table schemas—is stored in a database called the metastore.

Installation of Hive is straightforward. Java 6 is a prerequisite, and on Windows you
will need Cygwin, too. You also need to have the same version of Hadoop installed
locally that your cluster is running.

Download a release at http://hive.apache.org/releases.html, and unpack the tarball in a
suitable place on your workstation:
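
For example, assuming the downloaded release is named hive-x.y.z.tar.gz (substitute
the actual version you downloaded), the steps look something like this:

    % tar xzf hive-x.y.z.tar.gz
    % export HIVE_HOME=~/sw/hive-x.y.z
    % export PATH=$PATH:$HIVE_HOME/bin

It is handy to put Hive's bin directory on your path, as the last line does, so that
you can launch the hive shell from anywhere.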

The Hive Shell
The shell is the primary way that we will interact with Hive, by issuing commands in
HiveQL. HiveQL is Hive’s query language, a dialect of SQL. It is heavily influenced by
MySQL, so if you are familiar with MySQL you should feel at home using Hive.
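
You start the shell from the command line and issue HiveQL statements at its prompt,
each terminated by a semicolon. A minimal session (output omitted, since it varies by
version) might look like this:

    % hive
    hive> SHOW TABLES;
    hive> quit;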

Configuring Hive
Hive is configured using an XML configuration file like Hadoop’s. The file is called
hive-site.xml and is located in Hive’s conf directory. This is where you set properties
that you want applied every time you run Hive. The same directory contains
hive-default.xml, which documents the properties that Hive exposes and their default values.
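
As a minimal sketch, a hive-site.xml entry setting the warehouse directory (the HDFS
location where Hive keeps managed table data; /user/hive/warehouse is the usual
default) would look like this:

    <configuration>
      <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
      </property>
    </configuration>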

The Metastore
The metastore is the central repository of Hive metadata. The metastore is divided into
two pieces: a service and the backing store for the data. By default, the metastore service
runs in the same JVM as the Hive service and contains an embedded Derby database
instance backed by the local disk. This is called the embedded metastore configuration.
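
In the embedded configuration, the metastore database connection is defined by the
javax.jdo.option.ConnectionURL property; the Derby-backed default is along these lines:

    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
    </property>

Pointing this URL (and the accompanying JDBC driver properties) at a standalone
database such as MySQL is what turns the embedded setup into a shared metastore.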



Data Types
Hive supports both primitive and complex data types. Primitives include numeric,
boolean, string, and timestamp types. The complex data types include arrays, maps,
and structs.
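
For instance, a table declaration exercising the complex types (the table and column
names here are invented for illustration) might read:

    CREATE TABLE complex (
      scores  ARRAY<INT>,
      grades  MAP<STRING, INT>,
      address STRUCT<street:STRING, city:STRING>
    );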

Operators and Functions
The usual set of SQL operators is provided by Hive: relational operators (such as x =
'a' for testing equality, x IS NULL for testing nullity, x LIKE 'a%' for pattern matching),
arithmetic operators (such as x + 1 for addition), and logical operators (such as x OR y
for logical OR).
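
Put together in a query (against a hypothetical table items with a STRING column name
and an INT column price), these look much as they would in most SQL dialects:

    SELECT name, price + 1
    FROM items
    WHERE name LIKE 'a%'
      AND (price IS NULL OR price > 10);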

Tables
A Hive table is logically made up of the data being stored and the associated metadata
describing the layout of the data in the table. The data typically resides in HDFS, although
it may reside in any Hadoop filesystem, including the local filesystem or S3.
Hive stores the metadata in a relational database, not in HDFS.

Managed Tables and External Tables
When you create a table in Hive, by default Hive will manage the data, which means
that Hive moves the data into its warehouse directory. Alternatively, you may create
an external table, which tells Hive to refer to the data that is at an existing location
outside the warehouse directory.
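
A sketch of the two forms, using made-up names and paths:

    -- Managed: Hive moves the data under its warehouse directory,
    -- and DROP TABLE removes both data and metadata.
    CREATE TABLE managed_records (year STRING, temperature INT);

    -- External: the data stays at the given location;
    -- Hive records only the metadata.
    CREATE EXTERNAL TABLE external_records (year STRING, temperature INT)
      LOCATION '/data/records';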

Partitions and Buckets
Hive organizes tables into partitions, a way of dividing a table into coarse-grained parts
based on the value of a partition column, such as date. Using partitions can make it
faster to do queries on slices of the data.
Tables or partitions may further be subdivided into buckets, to give extra structure to
the data that may be used for more efficient queries.
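
As an illustration (names invented), a table can be both partitioned and bucketed in
its declaration:

    CREATE TABLE logs (ts BIGINT, line STRING)
      PARTITIONED BY (dt STRING, country STRING)
      CLUSTERED BY (ts) INTO 4 BUCKETS;

Each distinct (dt, country) pair becomes its own directory under the table’s directory,
and within each partition the rows are hashed on ts into four bucket files.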

Storage Formats
There are two dimensions that govern table storage in Hive: the row format and the
file format. The row format dictates how rows, and the fields in a particular row, are
stored. In Hive parlance, the row format is defined by a SerDe, a portmanteau word
for a Serializer-Deserializer.

The file format dictates the container format for fields in a row. The simplest format is
a plain text file, but row-oriented and column-oriented binary formats are available, too.
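
A sketch of a delimited text table (the delimiter is chosen here purely for illustration):

    CREATE TABLE delimited_records (id INT, name STRING)
      ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '\t'
      STORED AS TEXTFILE;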


Importing Data
The LOAD DATA operation can be used to import data into a Hive table
(or partition) by copying or moving files to the table’s directory. One can also populate
a table with data from another Hive table using an INSERT statement, or at creation time
using the CTAS construct, short for CREATE TABLE...AS SELECT.
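
Sketches of the three approaches, with invented table and file names:

    -- LOAD DATA: copy (or move) files into the table's directory
    LOAD DATA LOCAL INPATH 'input/records.txt'
      OVERWRITE INTO TABLE records;

    -- INSERT: populate from another table
    INSERT OVERWRITE TABLE records_copy
      SELECT * FROM records;

    -- CTAS: create and populate in one statement
    CREATE TABLE records_subset
      AS SELECT year, temperature FROM records;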

Altering Tables
Since Hive uses the schema-on-read approach, it’s flexible in permitting a table’s definition
to change after the table has been created. The general caveat, however, is that
it is up to you, in many cases, to ensure that the data is changed to reflect the new
structure.
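
For example (hypothetical names):

    ALTER TABLE source RENAME TO target;
    ALTER TABLE target ADD COLUMNS (col3 STRING);

ADD COLUMNS changes only the metadata; the existing data files are not rewritten, so
the new column reads as NULL until the underlying data is updated, which is exactly
the caveat above.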

Dropping Tables
The DROP TABLE statement deletes the data and metadata for a table. In the case of
external tables, only the metadata is deleted—the data is left untouched.
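
For instance:

    DROP TABLE managed_records;    -- managed: removes data and metadata
    DROP TABLE external_records;   -- external: removes metadata only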

Querying Data

MapReduce Scripts
Using an approach like Hadoop Streaming, the TRANSFORM, MAP, and REDUCE clauses make
it possible to invoke an external script or program from Hive.
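
A sketch using TRANSFORM, where is_good_quality.py stands in for whatever filter
script you supply:

    ADD FILE /path/to/is_good_quality.py;

    FROM records
    SELECT TRANSFORM(year, temperature, quality)
      USING 'is_good_quality.py'
      AS year, temperature;

Hive streams each input row to the script on standard input and reads tab-separated
output rows back on standard output.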

Joins
One of the nice things about using Hive, rather than raw MapReduce, is that it makes
performing commonly used operations very simple. Join operations are a case in point,
given how involved they are to implement in MapReduce.

Inner joins
The simplest kind of join is the inner join, where each match in the input tables results
in a row in the output.
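
For example, joining two hypothetical tables sales and things on an id column:

    SELECT sales.*, things.*
    FROM sales JOIN things ON (sales.id = things.id);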

Outer joins
Outer joins allow you to find nonmatches in the tables being joined.
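
For instance, a left outer join returns every row of the left table, with NULLs where
the right table has no match:

    SELECT sales.*, things.*
    FROM sales LEFT OUTER JOIN things ON (sales.id = things.id);

RIGHT OUTER JOIN and FULL OUTER JOIN work analogously.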

Map joins
If one table is small enough to fit in memory, then Hive can load the smaller table into
memory to perform the join in each of the mappers.
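
In the Hive versions of this era, you could ask for this with a query hint naming the
small table:

    SELECT /*+ MAPJOIN(things) */ sales.*, things.*
    FROM sales JOIN things ON (sales.id = things.id);

Later versions can also convert joins to map joins automatically when
hive.auto.convert.join is enabled.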

Subqueries
A subquery is a SELECT statement that is embedded in another SQL statement. Hive has
limited support for subqueries, only permitting a subquery in the FROM clause of a
SELECT statement.
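
A sketch, finding a per-station yearly maximum and then averaging it (the table and
columns are invented):

    SELECT station, AVG(max_temperature)
    FROM (
      SELECT station, year, MAX(temperature) AS max_temperature
      FROM records
      GROUP BY station, year
    ) mt
    GROUP BY station;

Note that Hive requires an alias (mt here) for the subquery in the FROM clause.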

Views
A view is a sort of “virtual table” that is defined by a SELECT statement. Views can be
used to present data to users in a different way from how it is actually stored on disk.
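
For instance, a view can hide a filtering convention behind a simple name (9999
standing in for a made-up sentinel value):

    CREATE VIEW valid_records
    AS
    SELECT * FROM records
    WHERE temperature != 9999;

Queries against valid_records are rewritten against the underlying table at query
time; the view itself stores no data.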

User-Defined Functions
Sometimes the query you want to write can’t be expressed easily (or at all) using the
built-in functions that Hive provides. In these cases, Hive lets you plug in your own
processing code by writing a user-defined function (UDF) and invoking it from a Hive query.
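
The UDF itself is written in Java; on the HiveQL side, registering and calling it looks
like this (the jar path and class name are placeholders):

    ADD JAR /path/to/my-udfs.jar;
    CREATE TEMPORARY FUNCTION strip AS 'com.example.Strip';
    SELECT strip('  bee  ') FROM dummy;

Here dummy is assumed to be a one-row table used purely to drive the single evaluation.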



Excerpts from Hadoop: The Definitive Guide by Tom White, published by O’Reilly


Big Data - An Introduction to Hive and HQL (IBM Analytics) [embedded video]


Understanding Hive In Depth | Hive Tutorial for Beginners | Apache Hive Explained With Hive Commands (edureka!) [embedded video]


Hadoop Notes and Video Lectures


What is Hadoop? Text and Video Lectures

What is MapReduce? Text and Video Lectures

The Hadoop Distributed Filesystem (HDFS)

Hadoop Input - Output System


