Friday, April 25, 2014

Chapter 2 Preparing for Hadoop Installation

Generally, for a small to medium-sized cluster with up to a hundred slave nodes, the NameNode, JobTracker, and SecondaryNameNode daemons can be put on the same master machine. When the cluster grows to hundreds or even thousands of slave nodes, it becomes advisable to put these daemons on different machines.

For safety reasons, the NameNode and SecondaryNameNode should not be placed on the same host.

Empirically, the following configurations are recommended for a small to medium-sized Hadoop cluster.

Node type    Component     Recommended specification
Master node  CPU           2 x Quad Core, 2.0 GHz
             RAM           16 GB
             Hard drive    2 x 1 TB SATA II 7200 RPM HDD or SSD
             Network card  1 Gbps Ethernet
Slave node   CPU           2 x Quad Core
             RAM           16 GB
             Hard drive    4 x 1 TB HDD
             Network card  1 Gbps Ethernet

On top of this baseline, adjust according to whether the main workload is compute-heavy or storage-heavy. In fact, with the master CPU clocked at only 2.0 GHz, the specification listed here is well below expectations.


In the default configuration, the master node is a single point of failure. High-end computing hardware and secondary power supplies are suggested.


In Hadoop, each slave node simultaneously executes a number of map or reduce tasks. The maximum number of parallel map/reduce tasks is known as the number of map/reduce slots, which is configurable by a Hadoop administrator. Each slot is a computing unit consisting of CPU, memory, and disk I/O resources. When a slave node is assigned a task by the JobTracker, its TaskTracker forks a JVM for that task, allocating a preconfigured amount of computing resources. Each forked JVM also incurs a certain amount of memory overhead. Empirically, a Hadoop job can consume 1 GB to 2 GB of memory per CPU core, and higher data-throughput requirements can incur heavy I/O operations for the majority of Hadoop jobs. That is why higher-end and parallel hard drives can help boost cluster performance. To maximize parallelism, it is advisable to assign two slots per CPU core. For example, if a slave node has two quad-core CPUs, we can assign 2 x 4 x 2 = 16 slots (map only, reduce only, or both) in total on this node. (The first 2 stands for the number of CPUs on the slave node, the 4 represents the number of cores per CPU, and the other 2 is the number of slots per CPU core.)
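The slot arithmetic above can be sketched as a small shell computation; the values are the ones from the example, not read from real hardware:

```shell
# Two CPUs, four cores per CPU, two slots per core -- the example configuration above.
CPUS=2
CORES_PER_CPU=4
SLOTS_PER_CORE=2
TOTAL_SLOTS=$((CPUS * CORES_PER_CPU * SLOTS_PER_CORE))
echo "$TOTAL_SLOTS"   # prints 16
```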

Notes:

  1. In MR1, the mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum properties dictated how many map and reduce slots each TaskTracker had. These properties no longer exist in YARN. Instead, YARN uses yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores, which control the amount of memory and CPU on each node, both available to both maps and reduces. Essentially, YARN has no TaskTrackers, just generic NodeManagers, so there is no longer a separation between map slots and reduce slots. Everything depends on the amount of memory in use/demanded.
  2. YARN configuration reference: http://stackoverflow.com/questions/22069904/controling-and-monitorying-number-of-simultaneous-map-reduce-tasks-in-yarn
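As a sketch of the YARN side, the two properties named in note 1 go into yarn-site.xml; the memory and vcore values below are illustrative, not recommendations:

```xml
<!-- yarn-site.xml (illustrative values only) -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>16384</value> <!-- total memory the NodeManager may hand out to containers -->
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>8</value>     <!-- total virtual cores available for containers -->
</property>
```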

Designing the cluster network

Cluster nodes should be located on the same network segment; connecting them through a VPN or routing is not recommended.

Nodes on the same rack can be interconnected with a 1 Gbps Ethernet switch. Cluster-level switches then connect the rack switches with faster links, such as 10 Gbps optical fiber, or other networks such as InfiniBand. The cluster-level switches may also interconnect with other cluster-level switches or even uplink to another, higher level of switching infrastructure. As the cluster grows, the network will at the same time become larger and more complex. Connection redundancy for network high availability also adds to that complexity.

Configuring the cluster administrator machine (the commands below are executed on CentOS 6.3)

  1. Log in as hddmin and change the hostname
    sudo sed -i 's/^HOSTNAME.*$/HOSTNAME=hadoop.admin/' /etc/sysconfig/network
  2. mkdir -v ~/mnt ~/isoimages ~/repo (~/mnt as the mount point for ISO images. ~/isoimages will be used to contain the original image files. ~/repo will be used as the repository folder for network installation)
  3. sudo yum -y install dhcp
    sudo yum -y install vsftpd (install the FTP server)
  4. rsync rsync://mirror.its.dal.ca/centos/6.3/isos/x86_64/CentOS-6.3-x86_64-netinstall.iso ~/isoimages
    or wget http://mirror.its.dal.ca/centos/6.3/isos/x86_64/CentOS-6.3-x86_64-netinstall.iso -P ~/isoimages
  5. sudo mount -o loop ~/isoimages/CentOS-6.3-x86_64-netinstall.iso ~/mnt
  6. cp -r ~/mnt/* ~/repo
  7. sudo umount ~/mnt
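The sed edit in step 1 can be rehearsed safely on a mock copy of the file first; the sketch below assumes nothing beyond a POSIX shell and GNU sed:

```shell
# Build a mock /etc/sysconfig/network and apply the same substitution as step 1.
MOCK=$(mktemp)
printf 'NETWORKING=yes\nHOSTNAME=localhost.localdomain\n' > "$MOCK"
sed -i 's/^HOSTNAME.*$/HOSTNAME=hadoop.admin/' "$MOCK"
RESULT=$(grep '^HOSTNAME' "$MOCK")
echo "$RESULT"   # prints HOSTNAME=hadoop.admin
rm -f "$MOCK"
```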
Configure the DHCP Server
  1. /etc/dhcp/dhcpd.conf
    # Domain name
    option domain-name  "hadoop.cluster";
    
    # DNS hostname  or IP address
    option domain-name-servers dlp.server.world;
    
    # Default lease time
    default-lease-time 600;
    
    # Maximum lease time
    max-lease-time  7200;
    
    # Declare the DHCP server to be valid.
    authoritative;
    
    # Network address and subnet mask
    subnet 10.0.0.0 netmask 255.255.255.0 {
    
    # Range of lease IP address, should be based 
        # on the size of the network
        range dynamic-bootp 10.0.0.200 10.0.0.254;
    
        # Broadcast address
        option broadcast-address 10.0.0.255;
    
        # Default gateway
        option routers 10.0.0.1;
    }
        
  2. sudo service dhcpd start
  3. sudo chkconfig --level 3 dhcpd on (make the DHCP server survive a system reboot)
Configure the FTP server
  1. /etc/vsftpd/vsftpd.conf
    # The FTP server will run in standalone mode.
    listen=YES
    
    # Use Anonymous user.
    anonymous_enable=YES
    
    # Disable change root for local users.
    chroot_local_user=NO
    
    # Disable uploading and changing files.
    write_enable=NO
    
    # Enable logging of uploads and downloads.
    xferlog_enable=YES
    
    # Enable port 20 data transfer
    connect_from_port_20=YES
    
    # Specify the directory hosting the Linux installation packages.
    anon_root=~/repo
        
  2. sudo service vsftpd start
  3. ftp hadoop.admin

Creating the kickstart file and boot media

  1. check the filesystem type of USB flash drive
    blkid
  2. If the TYPE attribute is other than vfat, use the following command to clear the first few blocks of the drive:
    dd if=/dev/zero of=/dev/sdb1 bs=1M count=100
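The TYPE check in step 1 can be scripted; the blkid output line below is a hypothetical sample for /dev/sdb1, not read from a real device:

```shell
# Parse a blkid-style line and decide whether the drive can be used as-is.
LINE='/dev/sdb1: UUID="1234-ABCD" TYPE="vfat"'   # hypothetical sample output
TYPE=$(printf '%s\n' "$LINE" | sed -n 's/.*TYPE="\([^"]*\)".*/\1/p')
if [ "$TYPE" = "vfat" ]; then
  echo "vfat: usable as boot media"
else
  echo "not vfat: clear and reformat the drive"
fi
```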
Create a kickstart file
  1. sudo yum install system-config-kickstart
  2. ks.cfg
    # Kickstart file for CentOS 6.3 for a Hadoop cluster.
    
    #Install system on the machine
    install
    
    # Use ftp as the package repository
    url --url ftp://hadoop.admin/repo
    
    # Use the text installation interface
    text
    
    # Use UTF-8 encoded USA English as the language.
    lang en_US.UTF-8
    
    # Configure time zone.
    timezone America/New_York
    
    # Use USA keyboard
    keyboard us
    
    # Set bootloader location
    bootloader --location=mbr --driveorder=sda --append="rhgb quiet"
    
    # Set root password
    rootpw hadoop
    
    #############################
    # Partition the hard disk
    #############################
    # Clear the master boot record on the hard drive.
    zerombr yes
    
    # Clear existing partitions
    clearpart --all --initlabel
    
    # Clear /boot partition, size is in MB.
    part /boot --fstype ext3 --size 128
    
    # Create the / (root) partition.
    part / --fstype ext3 --size 4096 --grow --maxsize 8192
    
    # Create /var partition.
    part /var --fstype ext3 --size 4096 --grow --maxsize 8192
    
    # Create Hadoop data storage directory
    part /hadoop --fstype ext3 --grow
    
    # Create swap partition, 16GB, double size of the main memory.
    # Change size according to your hardware memory configuration.
    part swap --size 16384
    
    
    #############################
    # Configure Network device
    #############################
    
    # Use DHCP and disable IPv6
    network --onboot yes --device eth0 --bootproto dhcp --noipv6
    
    # Disable firewall.
    firewall --disabled
    
    # Put Selinux in permissive mode.
    selinux --permissive
    
    
    #############################
    # Specify packages to install
    #############################
    
    # Automatically resolve package dependencies,
    # exclude installation of documents, and ignore missing packages.
    %packages --resolvedeps --excludedocs --ignoremissing
    
    # Install core packages.
    @Base
    
    # Don't install OpenJDK.
    -java
    
    # Install wget
    wget
    
    # Install the vim text editor
    vim
    
    # Install the Emacs text editor
    emacs
    
    # Install rsync
    rsync
    
    # Install nmap network mapper.
    nmap
    
    %end
    
    
    ###################################
    # Post installation configuration.
    ###################################
    
    # Enable post process logging
    %post --log=~/install-post.log
    
    # Create Hadoop user hduser (note: useradd -p expects an
    # already-encrypted password, not the plain text "hduser").
    useradd -m -p hduser hduser
    
    # Create group Hadoop.
    groupadd hadoop
    
    # Change user hduser's current group to hadoop
    usermod -g hadoop hduser
    
    # Tell the nodes hostname and ip address of the admin machine.
    echo "10.0.0.1 hadoop.admin" >> /etc/hosts
    
    # Configure administrative privilege to hadoop group.
    
    # Check the kernel settings (ulimit -u only displays the
    # max user processes limit; it does not change it).
    ulimit -u
    
    
    #############################
    # Startup services.
    #############################
    
    service sshd start
    chkconfig sshd on
    
    %end
    
    # Reboot after installation.
    reboot
    
    # Disable first boot configuration.
    firstboot --disable
        
  3. Put the kickstart file into the root directory of the FTP server with the command:
    cp ks.cfg ~/repo

Create a USB boot media

  1. ~/isolinux/grub.conf, add the following content:
    default=0
    splashimage=@SPLASHPATH@
    timeout 0
    hiddenmenu
    title @PRODUCT@ @VERSION@
        kernel @KERNELPATH@ ks=ftp://hadoop.admin/ks.cfg
        initrd @INITRDPATH@
        
  2. Make an ISO file from the isolinux directory using the following commands
    (the -b and -c paths are relative to the source directory ~/repo):
    mkisofs -o CentOS6.3-x86_64-boot.iso \
    -b isolinux/isolinux.bin \
    -c isolinux/boot.cat \
    -no-emul-boot \
    -boot-load-size 4 \
    ~/repo
        
  3. Write the bootable ISO image into the USB flash drive:
    dd if=~/CentOS6.3-x86_64-boot.iso of=/dev/sdb
