Use the following commands to install Hadoop on all nodes:

for host in master slave1 slave2 slave3 slave4 slave5; do
  echo "Installing Hadoop on node: $host"
  ssh hduser@$host "sudo rpm -ivh ftp://hadoop.admin/repo/hadoop-1.1.2-1.x86_64.rpm"
done
Run a sample MapReduce job with the following command:

# Option 20 is the number of map tasks to run, and 100000 specifies the number of samples computed by each task.
hadoop jar $HADOOP_HOME/hadoop-examples*.jar pi 20 100000
Run an example teragen job to generate 10 GB of data on the HDFS with the following command:

# teragen writes 100-byte rows, so $((1024 * 1024 * 1024 * 10 / 100)) is the number of rows needed for a total data size of 10 GB.
hadoop jar $HADOOP_HOME/hadoop-examples*.jar teragen $((1024 * 1024 * 1024 * 10 / 100)) teraout
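The row count passed to teragen can be checked with shell arithmetic, since each teragen row is 100 bytes:

```shell
# 10 GB divided by the 100-byte row size gives the number of rows teragen
# must write; this reproduces the value used in the command above.
rows=$((1024 * 1024 * 1024 * 10 / 100))
echo "$rows"   # 107374182 rows of 100 bytes each, roughly 10 GB in total
```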
Common Problems
Can't start HDFS daemons
The NameNode on the master node may not have been formatted. Format it with:

hadoop namenode -format

Then check that HDFS has been properly configured and that the daemons are running. If the output of the jps command does not contain the NameNode and SecondaryNameNode daemons, we need to check the configuration of HDFS.
jps

Open a new terminal and monitor the NameNode logfile on the master node. Alternatively, the hadoop jobtracker command will give the same error.

tail -f $HADOOP_HOME/logs/hadoop-hduser-namenode-master.log
Cluster is missing slave nodes
Most likely, this problem is caused by hostname resolution. To confirm, check the content of the /etc/hosts file.
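For reference, a working /etc/hosts should map every cluster hostname to its address. The entries below are a sketch: the 10.0.0.1 address for master matches the retry log shown later in this section, but the remaining addresses are assumptions to be replaced with your own.

```
10.0.0.1   master
10.0.0.2   slave1
10.0.0.3   slave2
10.0.0.4   slave3
10.0.0.5   slave4
10.0.0.6   slave5
```

Every node in the cluster should carry the same entries, so that each host resolves every other host's name identically.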
MapReduce daemons can't be started
The following two reasons can cause this problem:
- The HDFS daemons are not running, in which case the MapReduce daemons keep trying to contact the NameNode daemon at a regular interval, as illustrated by the following log output:
13/02/16 11:32:19 INFO ipc.Client: Retrying connect to server: master/10.0.0.1:54310. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
13/02/16 11:32:20 INFO ipc.Client: Retrying connect to server: master/10.0.0.1:54310. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
13/02/16 11:32:21 INFO ipc.Client: Retrying connect to server: master/10.0.0.1:54310. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
13/02/16 11:32:22 INFO ipc.Client: Retrying connect to server: master/10.0.0.1:54310. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
13/02/16 11:32:23 INFO ipc.Client: Retrying connect to server: master/10.0.0.1:54310. Already tried 4 time(s); retry policy is RetryUpToMaximumCount
- Configuration problems in MapReduce can also prevent the MapReduce daemons from starting. Before starting a cluster, we need to make sure that the total amount of configured memory is smaller than the total amount of system memory.
For example, suppose a slave host has 4 GB of memory, and we have configured 6 map slots and 6 reduce slots with 512 MB of memory for each slot. We can compute the total configured task memory with the following formula:
6 * 512 MB + 6 * 512 MB = 6144 MB = 6 GB
As 6 GB is larger than the system memory of 4 GB, the daemons will not start. To fix this problem, we can decrease the number of map slots and reduce slots from 6 to 3.
Configuring ZooKeeper
Log in to the master node as hduser:
- sudo wget ftp://hadoop.admin/repo/zookeeper-3.4.5.tar.gz -P /usr/local
- cd /usr/local
  sudo tar xvf zookeeper-3.4.5.tar.gz
- sudo ln -s /usr/local/zookeeper-3.4.5 /usr/local/zookeeper
- Open the ~/.bashrc file and add the following lines:

export ZK_HOME=/usr/local/zookeeper
export PATH=$ZK_HOME/bin:$PATH
- . ~/.bashrc
- sudo mkdir -pv /hadoop/zookeeper/{data,log}
- Create the Java configuration file $ZK_HOME/conf/java.env:

JAVA_HOME=/usr/java/latest
export PATH=$JAVA_HOME/bin:$PATH
- Create the $ZK_HOME/conf/zoo.cfg file (zoo.cfg is the configuration filename zkServer.sh reads by default):

tickTime=2000
clientPort=2181
initLimit=5
syncLimit=2
server.1=master:2888:3888
server.2=slave1:2888:3888
server.3=slave2:2888:3888
server.4=slave3:2888:3888
server.5=slave4:2888:3888
server.6=slave5:2888:3888
dataDir=/hadoop/zookeeper/data
dataLogDir=/hadoop/zookeeper/log
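ZooKeeper needs a majority (quorum) of the servers listed above to be running: with 6 servers, 4 must be up, so this ensemble tolerates the same two failures a 5-server ensemble would. The ensemble size and quorum can be checked from the server.N entries; the sketch below reproduces them in a temporary file for illustration:

```shell
# Count the server.N entries and derive the quorum (majority) size.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
server.1=master:2888:3888
server.2=slave1:2888:3888
server.3=slave2:2888:3888
server.4=slave3:2888:3888
server.5=slave4:2888:3888
server.6=slave5:2888:3888
EOF
servers=$(grep -c '^server\.' "$cfg")
quorum=$(( servers / 2 + 1 ))
echo "Ensemble size: $servers, quorum: $quorum"
rm -f "$cfg"
```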
- Configure ZooKeeper on all slave nodes with the following commands:

for host in $(cat $HADOOP_HOME/conf/slaves); do
  echo "Configuring ZooKeeper on $host"
  scp ~/.bashrc hduser@$host:~/
  sudo scp -r /usr/local/zookeeper-3.4.5 hduser@$host:/usr/local/
  echo "Making symbolic link for ZooKeeper home directory on $host"
  ssh hduser@$host -C "sudo ln -s /usr/local/zookeeper-3.4.5 /usr/local/zookeeper"
done

- zkServer.sh start
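One detail the server.N configuration relies on: each ZooKeeper server identifies itself through a myid file in its data directory, whose content must match that server's number in the configuration. The sketch below assigns the ids locally, assuming the hosts are taken in the same order as the server.N entries; the commented ssh line is how each id would be written on the real cluster.

```shell
# Assumption: master is server.1, slave1 is server.2, and so on, matching
# the order of the server.N entries in the configuration file.
id=1
for host in master slave1 slave2 slave3 slave4 slave5; do
  echo "$host -> myid $id"
  # On the real cluster, each id would be written remotely, e.g.:
  #   ssh hduser@$host "echo $id | sudo tee /hadoop/zookeeper/data/myid"
  id=$((id + 1))
done
```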
- zkCli.sh -server master:2181
- zkServer.sh stop
Installing Mahout
Log in to the master node as hduser:
- sudo wget ftp://hadoop.admin/repo/mahout-distribution-0.7.tar.gz -P /usr/local
- cd /usr/local
sudo tar xvf mahout-distribution-0.7.tar.gz
- sudo ln -s /usr/local/mahout-distribution-0.7 /usr/local/mahout
- Open the ~/.bashrc file and add the following lines:

export MAHOUT_HOME=/usr/local/mahout
export PATH=$MAHOUT_HOME/bin:$PATH
- sudo yum install maven
- cd $MAHOUT_HOME
sudo mvn compile
sudo mvn install
The install command will run all the tests by default; the sudo mvn -DskipTests install command will skip them.
- cd examples
sudo mvn compile
- wget http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data -P ~/
- start-dfs.sh
  start-mapred.sh
- hadoop fs -mkdir testdata
  hadoop fs -put ~/synthetic_control.data testdata
- mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job