Setting Up a Hadoop Cluster
Important Points and Issues
Embedded videos:
Apache Hadoop Installing & configuring multi node hadoop cluster (Comware Labs)
How to setup Hadoop Cluster and configure Size? (Hadoop BigData Online Training)
Running
To do useful work, Hadoop needs to run on multiple nodes.
Cluster Specification
Hadoop is designed to run on commodity hardware. That means you can choose standardized, commonly available hardware from any of a large range of vendors to build a cluster.
Typical mid-2010 hardware specifications for a commodity machine (from Tom White):
Processor: 2 quad-core 2-2.5 GHz CPUs
Memory: 16-24 GB ECC RAM
Storage: 4 × 1 TB SATA disks
Network: Gigabit Ethernet
The bulk of Hadoop is written in Java, and can therefore run on any platform with a JVM, although there are enough parts that harbor Unix assumptions (the control scripts, for example) to make it unwise to run on a non-Unix platform in production.
For a small cluster (on the order of 10 nodes), it is usually acceptable to run the namenode and the jobtracker on a single master machine (as long as at least one copy of the namenode’s metadata is stored on a remote filesystem). As the cluster and the number of files stored in HDFS grow, the namenode needs more memory, so the namenode and jobtracker should be moved onto separate machines.
A common Hadoop cluster architecture consists of a two-level network topology. Typically there are 30 to 40 servers per rack, with a 1 Gb switch for the rack, and an uplink to a core switch or
router (which is normally 1 Gb or better). The salient point is that the aggregate bandwidth
between nodes on the same rack is much greater than that between nodes on
different racks.
Network locations such as nodes and racks are represented in a tree, which reflects the
network “distance” between locations. The namenode uses the network location when
determining where to place block replicas; the MapReduce scheduler uses network location to determine where the closest replica is as input to a map task.
To ease the burden of installing and maintaining the same software on each node, it is
normal to use an automated installation method like Red Hat Linux’s Kickstart or
Debian’s Fully Automatic Installation. These tools allow you to automate the operating
system installation by recording the answers to questions that are asked during the
installation process (such as the disk partition layout), as well as which packages to
install.
Installing Java
Java 6 or later is required to run Hadoop. The latest stable Sun JDK is the preferred option, although Java distributions from other vendors may work, too.
Creating a Hadoop User
It’s good practice to create a dedicated Hadoop user account to separate the Hadoop
installation from other services running on the same machine.
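On most Linux systems the account can be created with something like the following (a sketch; the hadoop username is just a convention):
  % sudo useradd -m hadoop    # create the user with a home directory
  % sudo passwd hadoop        # set a password for the new account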
Installing Hadoop
Download Hadoop from the Apache Hadoop releases page (http://hadoop.apache.org/core/releases.html), and unpack the contents of the distribution in a sensible location,
such as /usr/local (/opt is another standard choice). Note that Hadoop is not installed
in the hadoop user’s home directory, as that may be an NFS-mounted directory:
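For example, where x.y.z stands for the Hadoop version you downloaded:
  % cd /usr/local
  % sudo tar xzf hadoop-x.y.z.tar.gz
  % sudo chown -R hadoop:hadoop hadoop-x.y.z    # give ownership to the hadoop user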
SSH Configuration
The Hadoop control scripts (but not the daemons) rely on SSH to perform cluster-wide
operations. For example, there is a script for stopping and starting all the daemons in
the cluster. Note that the control scripts are optional—cluster-wide operations can be
performed by other mechanisms, too (such as a distributed shell).
To work seamlessly, SSH needs to be set up to allow password-less login for the
hadoop user from machines in the cluster. The simplest way to achieve this is to generate
a public/private key pair, and place it in an NFS location that is shared across the cluster.
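A minimal sketch, run as the hadoop user (press Enter for an empty passphrase so logins stay password-less; the second command assumes the home directory is shared across the cluster, for example over NFS):
  % ssh-keygen -t rsa -f ~/.ssh/id_rsa
  % cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys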
Configuration Management
Hadoop does not have a single, global location for configuration information. Instead,
each Hadoop node in the cluster has its own set of configuration files, and it is up to
administrators to ensure that they are kept in sync across the system. Hadoop provides
a rudimentary facility for synchronizing configuration using rsync;
alternatively, there are parallel shell tools that can help do this, like dsh or
pdsh.
Control scripts
Hadoop comes with scripts for running commands, and starting and stopping daemons
across the whole cluster. To use these scripts (which can be found in the bin directory),
you need to tell Hadoop which machines are in the cluster. There are two files for this
purpose, called masters and slaves, each of which contains a list of the machine hostnames
or IP addresses, one per line.
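For example, a slaves file for a three-node cluster might look like this (the hostnames are hypothetical):
  worker1.example.com
  worker2.example.com
  worker3.example.com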
Master node scenarios
Depending on the size of the cluster, there are various configurations for running the
master daemons: the namenode, secondary namenode, and jobtracker.
Environment Settings
Memory
By default, Hadoop allocates 1,000 MB (1 GB) of memory to each daemon it runs. This
is controlled by the HADOOP_HEAPSIZE setting in hadoop-env.sh. In addition, the task
tracker launches separate child JVMs to run map and reduce tasks.
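For example, to raise the daemon heap to 2,000 MB (the value is illustrative):
  # in conf/hadoop-env.sh: heap size, in MB, for each Hadoop daemon
  export HADOOP_HEAPSIZE=2000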
Java
The location of the Java implementation to use is determined by the JAVA_HOME setting in hadoop-env.sh, or from the JAVA_HOME shell environment variable if it is not set in hadoop-env.sh.
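For example (the path is illustrative and depends on where your JDK is installed):
  # in conf/hadoop-env.sh
  export JAVA_HOME=/usr/lib/jvm/java-6-sun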
System logfiles
System logfiles produced by Hadoop are stored in $HADOOP_INSTALL/logs by default.
This can be changed using the HADOOP_LOG_DIR setting in hadoop-env.sh.
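For example, to keep logs outside the installation directory (the path is illustrative):
  # in conf/hadoop-env.sh
  export HADOOP_LOG_DIR=/var/log/hadoop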
SSH settings
The control scripts allow you to run commands on (remote) worker nodes from the master node using SSH. It can be useful to customize the SSH settings, for example to reduce the connection timeout so that the scripts do not hang waiting for a dead node to respond; extra options can be passed to SSH via the HADOOP_SSH_OPTS setting in hadoop-env.sh.
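A sketch (the option value is illustrative):
  # in conf/hadoop-env.sh: fail fast rather than hang on dead nodes
  export HADOOP_SSH_OPTS="-o ConnectTimeout=1"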
Important Hadoop Daemon Properties
Hadoop has a bewildering number of configuration properties. The properties covered here are the ones you need to define, or at least understand why the default is appropriate, for any real-world working cluster.
HDFS
To run HDFS, you need to designate one machine as a namenode. In this case, the
property fs.default.name is an HDFS filesystem URI, whose host is the namenode’s
hostname or IP address, and port is the port that the namenode will listen on for RPCs.
If no port is specified, the default of 8020 is used.
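A minimal sketch of the corresponding entry in core-site.xml (the hostname is hypothetical):
  <!-- in core-site.xml, inside the configuration element -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.com:8020/</value>
  </property>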
MapReduce
To run MapReduce, you need to designate one machine as a jobtracker, which on small
clusters may be the same machine as the namenode. To do this, set the
mapred.job.tracker property to the hostname or IP address and port that the jobtracker
will listen on. Note that this property is not a URI, but a host-port pair, separated by
a colon. The port number 8021 is a common choice.
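For example (the hostname is hypothetical):
  <!-- in mapred-site.xml, inside the configuration element -->
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.com:8021</value>
  </property>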
Hadoop Daemon Addresses and Ports
Hadoop daemons generally run both an RPC server for communication between daemons and an HTTP server that provides web pages for human consumption. Each server is configured by setting the network address and port number
to listen on. By specifying the network address as 0.0.0.0, Hadoop will bind to all
addresses on the machine. Alternatively, you can specify a single address to bind to. A
port number of 0 instructs the server to start on a free port: this is generally discouraged,
since it is incompatible with setting cluster-wide firewall policies.
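For example, the datanode's HTTP server can be bound explicitly in hdfs-site.xml; the property name and the default value shown are those of Hadoop 1.x, so verify them against your version:
  <!-- in hdfs-site.xml, inside the configuration element -->
  <property>
    <name>dfs.datanode.http.address</name>
    <value>0.0.0.0:50075</value>
  </property>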
Other Hadoop Properties
This section lists some other properties that you might consider setting.
Cluster membership
To aid the addition and removal of nodes in the future, you can specify a file containing
a list of authorized machines that may join the cluster as datanodes or tasktrackers.
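In Hadoop 1.x the include files are named by the dfs.hosts (datanodes) and mapred.hosts (tasktrackers) properties; a sketch with a hypothetical file path:
  <!-- dfs.hosts goes in hdfs-site.xml; mapred.hosts goes in mapred-site.xml -->
  <property>
    <name>dfs.hosts</name>
    <value>/etc/hadoop/conf/include</value>
  </property>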
Buffer size
Hadoop uses a buffer size of 4 KB (4,096 bytes) for its I/O operations. This is a conservative
setting, and with modern hardware and operating systems, you will likely see
performance benefits by increasing it; 128 KB (131,072 bytes) is a common choice. Set
this using the io.file.buffer.size property in core-site.xml.
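For example:
  <!-- in core-site.xml, inside the configuration element -->
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>  <!-- 128 KB -->
  </property>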
HDFS block size
The HDFS block size is 64 MB by default, but many clusters use 128 MB (134,217,728
bytes) or even 256 MB (268,435,456 bytes) to ease memory pressure on the namenode
and to give mappers more data to work on. Set this using the dfs.block.size property
in hdfs-site.xml.
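For example:
  <!-- in hdfs-site.xml, inside the configuration element -->
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>  <!-- 128 MB -->
  </property>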
Reserved storage space
By default, datanodes will try to use all of the space available in their storage directories.
If you want to reserve some space on the storage volumes for non-HDFS use, then you
can set dfs.datanode.du.reserved to the amount, in bytes, of space to reserve.
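For example, to reserve 10 GB per storage volume for non-HDFS use:
  <!-- in hdfs-site.xml, inside the configuration element -->
  <property>
    <name>dfs.datanode.du.reserved</name>
    <value>10737418240</value>  <!-- 10 GB, in bytes -->
  </property>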
Trash
Hadoop filesystems have a trash facility, in which deleted files are not actually deleted,
but rather are moved to a trash folder, where they remain for a minimum period before
being permanently deleted by the system. The minimum period in minutes that a file
will remain in the trash is set using the fs.trash.interval configuration property in
core-site.xml.
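For example, to keep deleted files for at least 24 hours (the default of 0 disables the trash):
  <!-- in core-site.xml, inside the configuration element -->
  <property>
    <name>fs.trash.interval</name>
    <value>1440</value>  <!-- minutes -->
  </property>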
Job scheduler
Particularly in a multiuser MapReduce setting, consider changing the default FIFO job
scheduler to one of the more fully featured alternatives.
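A sketch of switching to the Fair Scheduler in Hadoop 1.x (the property and class names are version-specific, and the scheduler JAR must be on the jobtracker's classpath; verify both against your distribution):
  <!-- in mapred-site.xml, inside the configuration element -->
  <property>
    <name>mapred.jobtracker.taskScheduler</name>
    <value>org.apache.hadoop.mapred.FairScheduler</value>
  </property>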
Reduce slow start
By default, schedulers wait until 5% of the map tasks in a job have completed before
scheduling reduce tasks for the same job. For large jobs this can cause problems with
cluster utilization, since they take up reduce slots while waiting for the map tasks to
complete. Setting mapred.reduce.slowstart.completed.maps to a higher value, such as
0.80 (80%), can help improve throughput.
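For example:
  <!-- in mapred-site.xml, inside the configuration element -->
  <property>
    <name>mapred.reduce.slowstart.completed.maps</name>
    <value>0.80</value>
  </property>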
Task memory limits
On a shared cluster it is prudent to limit the memory available to map and reduce tasks, so that a misbehaving task cannot exhaust a node. Hadoop provides two mechanisms for this. The simplest is via the Linux
ulimit command, which can be done at the operating system level (in the limits.conf
file, typically found in /etc/security), or by setting mapred.child.ulimit in the Hadoop
configuration. The value is specified in kilobytes, and should be comfortably larger
than the memory of the JVM set by mapred.child.java.opts; otherwise, the child JVM
might not start.
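A sketch of setting the limit through the Hadoop configuration (the values are illustrative; remember that mapred.child.ulimit is in kilobytes and must comfortably exceed the JVM heap given by -Xmx):
  <!-- in mapred-site.xml, inside the configuration element -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx400m</value>
  </property>
  <property>
    <name>mapred.child.ulimit</name>
    <value>1572864</value>  <!-- 1.5 GB, in KB -->
  </property>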
User Account Creation
Once you have a Hadoop cluster up and running, you need to give users access to it.
This involves creating a home directory for each user and setting ownership permissions
on it:
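For example, for a hypothetical account called username (the space quota step is optional):
  % hadoop fs -mkdir /user/username
  % hadoop fs -chown username:username /user/username
  % hadoop dfsadmin -setSpaceQuota 1t /user/username    # cap the directory at 1 TB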
YARN Configuration
The YARN start-all.sh script (in the bin directory) starts the YARN daemons in the
cluster. This script will start a resource manager (on the machine the script is run on),
and a node manager on each machine listed in the slaves file.
Security Enhancements
Security has been tightened throughout HDFS and MapReduce to protect against unauthorized
access to resources.
Benchmarking a Hadoop Cluster
Is the cluster set up correctly? The best way to answer this question is empirically: run
some jobs and confirm that you get the expected results. Benchmarks make good tests,
as you also get numbers that you can compare with other clusters as a sanity check on
whether your new cluster is performing roughly as expected.
Hadoop Benchmarks
Hadoop comes with several benchmarks that you can run very easily with minimal
setup cost. Benchmarks are packaged in the test JAR file, and you can get a list of them,
with descriptions, by invoking the JAR file with no arguments:
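For example, with the Hadoop 1.x directory layout (the JAR name varies by version):
  % hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar
  # then run a specific benchmark, such as a TestDFSIO write test:
  % hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000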
Hadoop in the Cloud
Cloudera offers tools for running Hadoop in a public or private cloud, and Amazon
has a Hadoop cloud service called Elastic MapReduce.
The Apache Whirr project (http://whirr.apache.org/) provides a Java API and a set of
scripts that make it easy to run Hadoop on EC2 and other cloud providers. The scripts
allow you to perform such operations as launching or terminating a cluster, or listing
the running instances in a cluster.
Excerpts from Hadoop: The Definitive Guide by Tom White, published by O'Reilly.
More references
Design Considerations in Building a Hadoop Cluster, Madhumoy Dube, Jan 28, 2016. https://www.linkedin.com/pulse/design-considerations-building-hadoop-cluster-madhumoy-dube
Spinning a Free Hadoop Cluster on Amazon Cloud. http://insightdataengineering.com/blog/hadoopdevops/
Building and Administering Hadoop Clusters, Jordan Boyd-Graber, 2011 presentation. http://www.umiacs.umd.edu/~jbg/teaching/INFM_718_2011/lecture_10.pdf
Hadoop Notes and Video Lectures
What is Hadoop? Text and Video Lectures
What is MapReduce? Text and Video Lectures
The Hadoop Distributed Filesystem (HDFS)
Hadoop Input - Output System