Tuesday, May 10, 2016

Hadoop Input - Output System

Hadoop I/O

Hadoop comes with a set of primitives for data I/O.

Data Integrity
Users of Hadoop rightly expect that no data will be lost or corrupted during storage or processing. For this purpose, a commonly used error-detecting code is CRC-32 (cyclic redundancy check), which computes a 32-bit integer checksum for input of any size.

Data Integrity in HDFS
HDFS transparently checksums all data written to it and by default verifies checksums when reading data. A separate checksum is created for every io.bytes.per.checksum bytes of data.

Datanodes are responsible for verifying the data they receive before storing the data
and its checksum. This applies to data that they receive from clients and from other
datanodes during replication. When clients read data from datanodes, they verify checksums as well, comparing them with the ones stored at the datanode. Since HDFS stores replicas of blocks, it can “heal” corrupted blocks by copying one of the good replicas to produce a new, uncorrupt replica.

The Hadoop LocalFileSystem performs client-side checksumming. This means that
when you write a file called filename, the filesystem client transparently creates a hidden
file, .filename.crc, in the same directory containing the checksums for each chunk of
the file.

LocalFileSystem uses ChecksumFileSystem to do its work, and this class makes it easy to add checksumming to other (nonchecksummed) filesystems, as ChecksumFileSystem is just a wrapper around FileSystem.
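
As a rough sketch of this in action (the class name ChecksumDemo and the file name demo.txt are only illustrative), FileSystem.getLocal() returns the checksummed LocalFileSystem, and the raw, non-checksummed filesystem underneath it can be used to confirm that the hidden .crc file was created:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // FileSystem.getLocal() returns the checksummed LocalFileSystem
    LocalFileSystem fs = FileSystem.getLocal(conf);

    // Writing through it transparently creates a hidden .demo.txt.crc file
    Path file = new Path("demo.txt");
    fs.create(file).close();

    // The raw, non-checksummed filesystem underneath is still reachable
    FileSystem rawFs = fs.getRawFileSystem();
    System.out.println("checksum file present: "
        + rawFs.exists(fs.getChecksumFile(file)));
  }
}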

Compression
File compression brings two major benefits: it reduces the space needed to store files, and it speeds up data transfer across the network, or to or from disk.

A codec is the implementation of a compression-decompression algorithm. In Hadoop,
a codec is represented by an implementation of the CompressionCodec interface.

Compressing and decompressing streams with CompressionCodec

CompressionCodec has two methods that allow you to easily compress or decompress data. To compress data being written to an output stream, use the createOutputStream(OutputStream out) method to create a CompressionOutputStream to which you write your uncompressed data to have it written in compressed form to the underlying stream. Conversely, to decompress data being read from an input stream, call createInputStream(InputStream in) to obtain a CompressionInputStream, which allows you to read uncompressed data from the underlying stream.
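
For example, the following sketch (a variant of the common StreamCompressor example; the class name is illustrative) reads uncompressed data from standard input and writes a gzip-compressed copy to standard output:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class StreamCompressor {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Instantiate the codec reflectively, so any CompressionCodec class
    // could be substituted for GzipCodec here
    CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

    // Wrap System.out so that everything written to it is compressed
    CompressionOutputStream out = codec.createOutputStream(System.out);
    IOUtils.copyBytes(System.in, out, 4096, false);

    // finish() flushes the compressed data without closing the underlying stream
    out.finish();
  }
}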

Inferring CompressionCodecs using CompressionCodecFactory

If you are reading a compressed file, you can normally infer the codec to use by looking at its filename extension. A file ending in .gz can be read with GzipCodec, and so on. CompressionCodecFactory provides a way of mapping a filename extension to a CompressionCodec via its getCodec() method, which takes a Path for the file in question.
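
A minimal sketch of this, assuming a file URI is passed as the first argument (the class name FileDecompressor is illustrative):

import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class FileDecompressor {
  public static void main(String[] args) throws Exception {
    String uri = args[0]; // e.g. a path ending in .gz
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    Path inputPath = new Path(uri);

    // Map the filename extension to a codec (null if none matches)
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(inputPath);
    if (codec == null) {
      System.err.println("No codec found for " + uri);
      System.exit(1);
    }

    // Drop the compression suffix (.gz, etc.) to name the output file
    String outputUri =
        CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());

    InputStream in = null;
    OutputStream out = null;
    try {
      in = codec.createInputStream(fs.open(inputPath));
      out = fs.create(new Path(outputUri));
      IOUtils.copyBytes(in, out, conf);
    } finally {
      IOUtils.closeStream(in);
      IOUtils.closeStream(out);
    }
  }
}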

Compression and Input Splits
When considering how to compress data that will be processed by MapReduce, it is important to understand whether the compression format supports splitting. For example, gzip does not support splitting, so a large gzip file must be processed by a single mapper, whereas a bzip2 file can be split and processed by several mappers in parallel.

Using Compression in MapReduce
If your input files are compressed, they will be automatically
decompressed as they are read by MapReduce, using the filename extension to determine
the codec to use.

To compress the output of a MapReduce job, in the job configuration, set the
mapred.output.compress property to true and the mapred.output.compression.codec
property to the classname of the compression codec you want to use.
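
A small sketch of both styles, using the old mapred API that these property names belong to (GzipCodec is just one possible choice of codec):

import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

public class CompressedOutputJob {
  public static void main(String[] args) {
    JobConf conf = new JobConf(CompressedOutputJob.class);

    // Set the properties directly...
    conf.setBoolean("mapred.output.compress", true);
    conf.setClass("mapred.output.compression.codec",
                  GzipCodec.class, CompressionCodec.class);

    // ...or use the equivalent convenience methods
    FileOutputFormat.setCompressOutput(conf, true);
    FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
  }
}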

Serialization
Serialization is the process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage. Deserialization is the reverse process of turning a byte stream back into a series of structured objects. Serialization appears in two quite distinct areas of distributed data processing: for interprocess communication and for persistent storage.

The Writable Interface
The Writable interface defines two methods: one for writing its state to a DataOutput
binary stream, and one for reading its state from a DataInput binary stream:
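
The interface itself is tiny; the two methods are write() and readFields():

package org.apache.hadoop.io;

import java.io.DataOutput;
import java.io.DataInput;
import java.io.IOException;

public interface Writable {
  // Serialize this object's fields to the binary output stream
  void write(DataOutput out) throws IOException;

  // Populate this object's fields from the binary input stream
  void readFields(DataInput in) throws IOException;
}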

Writable Classes
Hadoop comes with a large selection of Writable classes in the org.apache.hadoop.io package.

Text is a Writable for UTF-8 sequences. It can be thought of as the Writable equivalent
of java.lang.String.
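
One difference worth noting is that indexing in Text works in terms of bytes of the UTF-8 encoding rather than Java chars. A small sketch (class name is illustrative):

import org.apache.hadoop.io.Text;

public class TextDemo {
  public static void main(String[] args) {
    Text t = new Text("hadoop");
    System.out.println(t.getLength());  // 6: length in UTF-8 bytes
    System.out.println(t.charAt(2));    // 100: the int code point for 'd'

    t.set("hadoop I/O");                // Text objects are mutable and reusable
    System.out.println(t.toString());   // convert back to a java.lang.String
  }
}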

Writable collections
There are six Writable collection types in the org.apache.hadoop.io package: ArrayWritable, ArrayPrimitiveWritable, TwoDArrayWritable, MapWritable, SortedMapWritable, and EnumSetWritable.
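
As a quick illustration, MapWritable is a general-purpose map of Writable keys to Writable values; the sketch below uses Text keys and IntWritable values:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class MapWritableDemo {
  public static void main(String[] args) {
    // MapWritable implements java.util.Map<Writable, Writable>
    MapWritable map = new MapWritable();
    map.put(new Text("count"), new IntWritable(42));

    Writable value = map.get(new Text("count"));
    System.out.println(((IntWritable) value).get());  // 42
  }
}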

Avro
Apache Avro is a language-neutral data serialization system. The project was created by Doug Cutting (the creator of Hadoop) to address the major downside of Hadoop Writables: lack of language portability.

File-Based Data Structures
For some applications, you need a specialized data structure to hold your data. For doing MapReduce-based processing, putting each blob of binary data into its own file doesn’t scale, so Hadoop developed a number of higher-level containers for these situations, such as SequenceFile.
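
SequenceFile is a flat file of binary key-value pairs. A minimal sketch of writing one follows (the path numbers.seq and the record contents are purely illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("numbers.seq");

    SequenceFile.Writer writer = null;
    try {
      // Key and value classes are fixed when the file is created
      writer = SequenceFile.createWriter(fs, conf, path,
                                         IntWritable.class, Text.class);
      IntWritable key = new IntWritable();
      Text value = new Text();
      for (int i = 0; i < 100; i++) {
        key.set(i);
        value.set("record-" + i);
        writer.append(key, value);  // append one key/value record
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}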

[Embedded video: Hadoop Input and Output format, from www.HadoopExam.com]

[Embedded video: What is Hadoop SequenceFile?]
