Tuesday, May 10, 2016

Introduction to Sqoop



Sqoop is an open-source tool that allows
users to extract data from a relational database into Hadoop for further processing.
This processing can be done with MapReduce programs or other higher-level tools such
as Hive. (It’s even possible to use Sqoop to move data from a relational database into
HBase.) When the final results of an analytic pipeline are available, Sqoop can export
these results back to the database for consumption by other clients.

Getting Sqoop
Sqoop is available in a few places. The primary home of the project is
http://incubator.apache.org/sqoop/. This repository contains all the Sqoop source code and documentation.
Official releases are available at this site, as well as the source code for the version
currently under development. The repository itself contains instructions for compiling
the project.

After you install Sqoop, you can use it to import data into Hadoop.
Sqoop imports from relational databases. It has been tested with
MySQL, PostgreSQL, Oracle, SQL Server, and DB2.
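
As a rough illustration of a basic import, the sketch below assembles a `sqoop import` command line in Python without running it. The JDBC URL and the table name `widgets` are placeholders, not values from this post. `--connect`, `--table`, and `-m` (the number of map tasks) are standard Sqoop import options.

```python
import shlex


def sqoop_import_command(jdbc_url, table, num_mappers=1):
    """Build the argument list for a basic `sqoop import` run."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,      # JDBC connection string for the source database
        "--table", table,           # table to copy into HDFS
        "-m", str(num_mappers),     # number of parallel map tasks
    ]


# Placeholder connection details, for illustration only.
cmd = sqoop_import_command("jdbc:mysql://localhost/hadoopguide", "widgets")
print(shlex.join(cmd))
```

The printed line can be pasted into a shell on a machine where Sqoop is installed, or the list can be passed to `subprocess.run`.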

By default, Sqoop will generate comma-delimited text files for the imported data. Delimiters
can be explicitly specified, as can field-enclosing and escape characters that
allow delimiters to appear in the field contents. The command-line arguments
that specify delimiter characters, file formats, compression, and more fine-grained
control of the import process are described in the Sqoop User Guide distributed with Sqoop.
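
To see why enclosing characters matter, here is a small sketch that parses one imported line in which a field contains a comma. The sample line is made up. The option values in `delimiter_args` show one plausible way to ask Sqoop for this layout using its `--fields-terminated-by`, `--optionally-enclosed-by`, and `--escaped-by` options. The parsing uses Python's standard `csv` module and is an assumption about how a downstream script might read the file, not part of Sqoop itself.

```python
import csv

# Sqoop output-formatting options that would produce the layout parsed below
# (illustrative values; check the Sqoop User Guide for your version).
delimiter_args = [
    "--fields-terminated-by", ",",
    "--optionally-enclosed-by", '"',
    "--escaped-by", "\\",
]

# A hypothetical imported line whose second field contains the delimiter.
line = '2,"gadget, deluxe",99.99'


def parse_line(text):
    """Split one imported line, honoring the enclosing and escape characters."""
    reader = csv.reader([text], delimiter=",", quotechar='"', escapechar="\\")
    return next(reader)


fields = parse_line(line)
print(fields)                 # three fields, comma preserved inside the second
print(len(line.split(",")))   # a naive split sees four pieces instead
```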

Controlling the Import
Sqoop does not need to import an entire table at a time. For example, a subset of the
table’s columns can be specified for import. Users can also specify a WHERE clause to
include in queries, which bounds the rows of the table to import.
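
A minimal sketch of a restricted import, again only building the command rather than executing it. The table, column names, and WHERE condition are hypothetical. `--columns` and `--where` are the Sqoop import options for choosing columns and filtering rows.

```python
import shlex


def partial_import_args(jdbc_url, table, columns=None, where=None):
    """Build `sqoop import` arguments that limit the columns and rows copied."""
    args = ["sqoop", "import", "--connect", jdbc_url, "--table", table]
    if columns:
        # Sqoop expects a single comma-separated list of column names.
        args += ["--columns", ",".join(columns)]
    if where:
        # The condition is passed through to the queries Sqoop generates.
        args += ["--where", where]
    return args


# Hypothetical table and filter, for illustration only.
args = partial_import_args(
    "jdbc:mysql://localhost/hadoopguide",
    "widgets",
    columns=["id", "widget_name", "price"],
    where="price > 10",
)
print(shlex.join(args))
```

`shlex.join` quotes the WHERE condition so that the shell passes it to Sqoop as a single argument.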

Working with Imported Data
Once data has been imported to HDFS, it is ready for processing by custom MapReduce
programs. Text-based imports can be easily used in scripts run with Hadoop
Streaming or in MapReduce jobs run with the default TextInputFormat.
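
For example, a Hadoop Streaming mapper over a comma-delimited import could look roughly like the sketch below. It emits one `key<TAB>1` pair per row so that a reducer could count rows per key. The column layout and the sample rows are made up, and the simple `split` assumes fields contain no embedded commas.

```python
import io


def map_line(line, key_index=1, delimiter=","):
    """Turn one imported text line into a (key, 1) pair for counting."""
    fields = line.rstrip("\n").split(delimiter)
    return fields[key_index], 1


def run_mapper(lines, out):
    """Read delimited lines and write tab-separated key/value pairs."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        key, count = map_line(line)
        out.write(f"{key}\t{count}\n")


# Hypothetical rows standing in for an imported file.
sample = io.StringIO("1,sprocket,0.25\n2,gizmo,4.00\n3,sprocket,0.30\n")
result = io.StringIO()
run_mapper(sample, result)
print(result.getvalue(), end="")
```

In an actual Streaming job the same function would be called as `run_mapper(sys.stdin, sys.stdout)`, since Streaming feeds input lines on standard input and reads key/value pairs from standard output.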

Imported Data and Hive
Using a system like Hive to handle
relational operations can dramatically ease the development of the analytic pipeline.
Especially for data originally from a relational data source, using Hive makes a lot of
sense. Hive and Sqoop together form a powerful toolchain for performing analysis.

Excerpts from Hadoop: The Definitive Guide by Tom White, published by O'Reilly.

