Friday, April 25, 2014

Chapter 2 Preparing for Hadoop Installation

Generally, for a small to medium-sized cluster with up to a hundred slave nodes, the NameNode, JobTracker, and SecondaryNameNode daemons can be put on the same master machine. When the cluster grows to hundreds or even thousands of slave nodes, it becomes advisable to put these daemons on different machines.

For safety reasons, the NameNode and SecondaryNameNode should not be placed on the same host.

Empirically, the following configurations are recommended for a small to medium-sized Hadoop cluster.

Node type    Component     Recommended specification
Master node  CPU           2 x Quad Core, 2.0 GHz
             RAM           16 GB
             Hard drive    2 x 1 TB SATA II 7200 RPM HDD or SSD
             Network card  1 Gbps Ethernet
Slave node   CPU           2 x Quad Core
             RAM           16 GB
             Hard drive    4 x 1 TB HDD
             Network card  1 Gbps Ethernet

On top of this baseline, adjust according to whether the main workload is compute-heavy or storage-heavy. In fact, with the master CPU clocked at only 2.0 GHz, the specification listed here is well below expectations.


In the default configuration, the master node is a single point of failure. High-end computing hardware and secondary power supplies are suggested.


In Hadoop, each slave node simultaneously executes a number of map or reduce tasks. The maximum number of parallel map/reduce tasks is known as the number of map/reduce slots, which is configurable by a Hadoop administrator. Each slot is a computing unit consisting of CPU, memory, and disk I/O resources. When a slave node is assigned a task by the JobTracker, its TaskTracker forks a JVM for that task, allocating a preconfigured amount of computing resources. Each forked JVM also incurs a certain amount of memory overhead. Empirically, a Hadoop job can consume 1 GB to 2 GB of memory per CPU core, and higher data-throughput requirements can incur heavy I/O operations for the majority of Hadoop jobs. That is why higher-end and parallel hard drives can help boost cluster performance. To maximize parallelism, it is advisable to assign two slots per CPU core. For example, if a slave node has two quad-core CPUs, we can assign 2 x 4 x 2 = 16 slots (map only, reduce only, or both) in total on this node. (The first 2 stands for the number of CPUs on the slave node, the 4 represents the number of cores per CPU, and the other 2 is the number of slots per CPU core.)
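The slot arithmetic above can be sketched as a small shell computation; the values are the ones from the example, not read from real hardware:

```shell
# Two CPUs, four cores per CPU, two slots per core -- the example configuration above.
CPUS=2
CORES_PER_CPU=4
SLOTS_PER_CORE=2
TOTAL_SLOTS=$((CPUS * CORES_PER_CPU * SLOTS_PER_CORE))
echo "$TOTAL_SLOTS"   # prints 16
```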

Notes:

  1. In MR1, the mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum properties dictated how many map and reduce slots each TaskTracker had. These properties no longer exist in YARN. Instead, YARN uses yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores, which control the amount of memory and CPU on each node, both available to both maps and reduces. Essentially, YARN has no TaskTrackers, just generic NodeManagers, so there is no longer a separation between map slots and reduce slots. Everything depends on the amount of memory in use/demanded.
  2. YARN configuration reference: http://stackoverflow.com/questions/22069904/controling-and-monitorying-number-of-simultaneous-map-reduce-tasks-in-yarn
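As a sketch of the YARN side, the two properties named in note 1 go into yarn-site.xml; the memory and vcore values below are illustrative, not recommendations:

```xml
<!-- yarn-site.xml (illustrative values only) -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>16384</value> <!-- total memory the NodeManager may hand out to containers -->
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>8</value>     <!-- total virtual cores available for containers -->
</property>
```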

Designing the cluster network

Cluster nodes should be located on the same network segment; connecting them through a VPN or routing is not recommended.

Nodes on the same rack can be interconnected with a 1 Gbps Ethernet switch. Cluster-level switches then connect the rack switches with faster links, such as 10 Gbps optical fiber, or other networks such as InfiniBand. The cluster-level switches may also interconnect with other cluster-level switches or even uplink to another, higher level of switching infrastructure. As the cluster grows, the network will at the same time become larger and more complex. Connection redundancy for network high availability also adds to that complexity.

Configuring the cluster administrator machine (the commands below are executed on CentOS 6.3)

  1. Log in as hddmin and change the hostname
    sudo sed -i 's/^HOSTNAME.*$/HOSTNAME=hadoop.admin/' /etc/sysconfig/network
  2. mkdir -v ~/mnt ~/isoimages ~/repo (~/mnt as the mount point for ISO images. ~/isoimages will be used to contain the original image files. ~/repo will be used as the repository folder for network installation)
  3. sudo yum -y install dhcp
    sudo yum -y install vsftpd (install the FTP server)
  4. rsync rsync://mirror.its.dal.ca/centos/6.3/isos/x86_64/CentOS-6.3-x86_64-netinstall.iso ~/isoimages
    or wget http://mirror.its.dal.ca/centos/6.3/isos/x86_64/CentOS-6.3-x86_64-netinstall.iso -P ~/isoimages
  5. sudo mount -o loop ~/isoimages/CentOS-6.3-x86_64-netinstall.iso ~/mnt
  6. cp -r ~/mnt/* ~/repo
  7. sudo umount ~/mnt
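The sed edit in step 1 can be rehearsed safely on a mock copy of the file first; the sketch below assumes nothing beyond a POSIX shell and GNU sed:

```shell
# Build a mock /etc/sysconfig/network and apply the same substitution as step 1.
MOCK=$(mktemp)
printf 'NETWORKING=yes\nHOSTNAME=localhost.localdomain\n' > "$MOCK"
sed -i 's/^HOSTNAME.*$/HOSTNAME=hadoop.admin/' "$MOCK"
RESULT=$(grep '^HOSTNAME' "$MOCK")
echo "$RESULT"   # prints HOSTNAME=hadoop.admin
rm -f "$MOCK"
```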
Configure the DHCP Server
  1. /etc/dhcp/dhcpd.conf
    # Domain name
    option domain-name  "hadoop.cluster";
    
    # DNS hostname  or IP address
    option domain-name-servers dlp.server.world;
    
    # Default lease time
    default-lease-time 600;
    
    # Maximum lease time
    max-lease-time  7200;
    
    # Declare the DHCP server to be valid.
    authoritative;
    
    # Network address and subnet mask
    subnet 10.0.0.0 netmask 255.255.255.0 {
    
    # Range of lease IP address, should be based 
        # on the size of the network
        range dynamic-bootp 10.0.0.200 10.0.0.254;
    
        # Broadcast address
        option broadcast-address 10.0.0.255;
    
        # Default gateway
        option routers 10.0.0.1;
    }
        
  2. sudo service dhcpd start
  3. sudo chkconfig --level 3 dhcpd on (make the DHCP server survive a system reboot)
Configure the FTP server
  1. /etc/vsftpd/vsftpd.conf
    # The FTP server will run in standalone mode.
    listen=YES
    
    # Use Anonymous user.
    anonymous_enable=YES
    
    # Disable change root for local users.
    chroot_local_user=NO
    
    # Disable uploading and changing files.
    write_enable=NO
    
    # Enable logging of uploads and downloads.
    xferlog_enable=YES
    
    # Enable port 20 data transfer
    connect_from_port_20=YES
    
    # Specify the directory hosting the Linux installation packages.
    anon_root=~/repo
        
  2. sudo service vsftpd start
  3. ftp hadoop.admin

Creating the kickstart file and boot media

  1. check the filesystem type of USB flash drive
    blkid
  2. If the TYPE attribute is other than vfat, use the following command to clear the first few blocks of the drive:
    dd if=/dev/zero of=/dev/sdb1 bs=1M count=100
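The TYPE check in step 1 can be scripted; the blkid output line below is a hypothetical sample for /dev/sdb1, not read from a real device:

```shell
# Parse a blkid-style line and decide whether the drive can be used as-is.
LINE='/dev/sdb1: UUID="1234-ABCD" TYPE="vfat"'   # hypothetical sample output
TYPE=$(printf '%s\n' "$LINE" | sed -n 's/.*TYPE="\([^"]*\)".*/\1/p')
if [ "$TYPE" = "vfat" ]; then
  echo "vfat: usable as boot media"
else
  echo "not vfat: clear and reformat the drive"
fi
```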
Create a kickstart file
  1. sudo yum install system-config-kickstart
  2. ks.cfg
    # Kickstart file for CentOS 6.3 for a Hadoop cluster.
    
    #Install system on the machine
    install
    
    # Use ftp as the package repository
    url --url ftp://hadoop.admin/repo
    
    # Use the text installation interface
    text
    
    # Use UTF-8 encoded USA English as the language.
    lang en_US.UTF-8
    
    # Configure time zone.
    timezone America/New_York
    
    # Use USA keyboard
    keyboard us
    
    # Set bootloader location
    bootloader --location=mbr --driveorder=sda --append="rhgb quiet"
    
    # Set root password
    rootpw hadoop
    
    #############################
    # Partition the hard disk
    #############################
    # Clear the master boot record on the hard drive.
    zerombr yes
    
    # Clear existing partitions
    clearpart --all --initlabel
    
    # Clear /boot partition, size is in MB.
    part /boot --fstype ext3 --size 128
    
    # Create the / (root) partition.
    part / --fstype ext3 --size 4096 --grow --maxsize 8192
    
    # Create /var partition.
    part /var --fstype ext3 --size 4096 --grow --maxsize 8192
    
    # Create Hadoop data storage directory
    part /hadoop --fstype ext3 --grow
    
    # Create swap partition, 16GB, double size of the main memory.
    # Change size according to your hardware memory configuration.
    part swap --size 16384
    
    
    #############################
    # Configure Network device
    #############################
    
    # Use DHCP and disable IPv6
    network --onboot yes --device eth0 --bootproto dhcp --noipv6
    
    # Disable firewall.
    firewall --disabled
    
    # Put Selinux in permissive mode.
    selinux --permissive
    
    
    #############################
    # Specify packages to install
    #############################
    
    # Automatically resolve package dependencies,
    # exclude installation of documents, and ignore missing packages.
    %packages --resolvedeps --excludedocs --ignoremissing
    
    # Install core packages.
    @Base
    
    # Don't install OpenJDK.
    -java
    
    # Install wget
    wget
    
    # Install the vim text editor
    vim
    
    # Install the Emacs text editor
    emacs
    
    # Install rsync
    rsync
    
    # Install nmap network mapper.
    nmap
    
    %end
    
    
    ###################################
    # Post installation configuration.
    ###################################
    
    # Enable post process logging
    %post --log=~/install-post.log
    
    # Create Hadoop user hduser (note: useradd -p expects an
    # already-encrypted password, not the plain text "hduser").
    useradd -m -p hduser hduser
    
    # Create group Hadoop.
    groupadd hadoop
    
    # Change user hduser's current group to hadoop
    usermod -g hadoop hduser
    
    # Tell the nodes hostname and ip address of the admin machine.
    echo "10.0.0.1 hadoop.admin" >> /etc/hosts
    
    # Configure administrative privilege to hadoop group.
    
    # Check the kernel settings (ulimit -u only displays the
    # max user processes limit; it does not change it).
    ulimit -u
    
    
    #############################
    # Startup services.
    #############################
    
    service sshd start
    chkconfig sshd on
    
    %end
    
    # Reboot after installation.
    reboot
    
    # Disable first boot configuration.
    firstboot --disable
        
  3. Put the kickstart file into the root directory of the FTP server with the command:
    cp ks.cfg ~/repo

Create a USB boot media

  1. ~/isolinux/grub.conf, add the following content:
    default=0
    splashimage=@SPLASHPATH@
    timeout 0
    hiddenmenu
    title @PRODUCT@ @VERSION@
        kernel @KERNELPATH@ ks=ftp://hadoop.admin/ks.cfg
        initrd @INITRDPATH@
        
  2. Make an ISO file from the isolinux directory using the following commands
    (the -b and -c paths are relative to the source directory ~/repo):
    mkisofs -o CentOS6.3-x86_64-boot.iso \
    -b isolinux/isolinux.bin \
    -c isolinux/boot.cat \
    -no-emul-boot \
    -boot-load-size 4 \
    ~/repo
        
  3. Write the bootable ISO image into the USB flash drive:
    dd if=~/CentOS6.3-x86_64-boot.iso of=/dev/sdb
