Use the following commands to install Hadoop on all nodes:

for host in master slave1 slave2 slave3 slave4 slave5; do
  echo "Installing Hadoop on node: $host"
  ssh hduser@$host "sudo rpm -ivh ftp://hadoop.admin/repo/hadoop-1.1.2-1.x86_64.rpm"
done
Run a sample MapReduce job with the following command:

# Option 20 is the number of map tasks to run, and 100000 specifies the number of samples computed by each task.
hadoop jar $HADOOP_HOME/hadoop-examples*.jar pi 20 100000
Run an example teragen job to generate 10 GB of data on the HDFS with the following command:

# teragen writes 100-byte rows, so $((1024 * 1024 * 1024 * 10 / 100)) is the number of rows needed for a total data size of 10 GB.
hadoop jar $HADOOP_HOME/hadoop-examples*.jar teragen $((1024 * 1024 * 1024 * 10 / 100)) teraout
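The row count passed to teragen can be checked with shell arithmetic, since each teragen row is 100 bytes:

```shell
# 10 GB divided by the 100-byte row size gives the number of rows teragen
# must write; this reproduces the value used in the command above.
rows=$((1024 * 1024 * 1024 * 10 / 100))
echo "$rows"   # 107374182 rows of 100 bytes each, roughly 10 GB in total
```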
Common Problems
Can't start HDFS daemons
The NameNode on the master node may not have been formatted. Format it with:

hadoop namenode -format

Then check that HDFS has been properly configured and that the daemons are running. If the output of the jps command does not contain the NameNode and SecondaryNameNode daemons, we need to check the configuration of HDFS.
jps

Open a new terminal and monitor the NameNode logfile on the master node. Alternatively, the hadoop jobtracker command will give the same error.

tail -f $HADOOP_HOME/logs/hadoop-hduser-namenode-master.log
Cluster is missing slave nodes
Most likely, this problem is caused by hostname resolution. To confirm, check the content of the /etc/hosts file.
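For reference, a working /etc/hosts should map every cluster hostname to its address. The entries below are a sketch: the 10.0.0.1 address for master matches the retry log shown later in this section, but the remaining addresses are assumptions to be replaced with your own.

```
10.0.0.1   master
10.0.0.2   slave1
10.0.0.3   slave2
10.0.0.4   slave3
10.0.0.5   slave4
10.0.0.6   slave5
```

Every node in the cluster should carry the same entries, so that each host resolves every other host's name identically.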
MapReduce daemons can't be started
The following two reasons can cause this problem:
- The HDFS daemons are not running, in which case the MapReduce daemons keep trying to contact the NameNode daemon at a regular interval, as illustrated by the following log output:
13/02/16 11:32:19 INFO ipc.Client: Retrying connect to server: master/10.0.0.1:54310. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
13/02/16 11:32:20 INFO ipc.Client: Retrying connect to server: master/10.0.0.1:54310. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
13/02/16 11:32:21 INFO ipc.Client: Retrying connect to server: master/10.0.0.1:54310. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
13/02/16 11:32:22 INFO ipc.Client: Retrying connect to server: master/10.0.0.1:54310. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
13/02/16 11:32:23 INFO ipc.Client: Retrying connect to server: master/10.0.0.1:54310. Already tried 4 time(s); retry policy is RetryUpToMaximumCount
- Configuration problems in MapReduce can also prevent the MapReduce daemons from starting. Before starting a cluster, we need to make sure that the total amount of configured memory is smaller than the total amount of system memory.
For example, suppose a slave host has 4 GB of memory, and we have configured 6 map slots and 6 reduce slots with 512 MB of memory for each slot. We can compute the total configured task memory with the following formula:
6 * 512 MB + 6 * 512 MB = 6144 MB = 6 GB
As 6 GB is larger than the system memory of 4 GB, the daemons will not start. To fix this problem, we can decrease the number of map slots and reduce slots from 6 to 3.
Configuring ZooKeeper
Log in to the master node as hduser:
- sudo wget ftp://hadoop.admin/repo/zookeeper-3.4.5.tar.gz -P /usr/local
- cd /usr/local
  sudo tar xvf zookeeper-3.4.5.tar.gz
- sudo ln -s /usr/local/zookeeper-3.4.5 /usr/local/zookeeper
- Open the ~/.bashrc file and add the following lines:

export ZK_HOME=/usr/local/zookeeper
export PATH=$ZK_HOME/bin:$PATH
- . ~/.bashrc
- sudo mkdir -pv /hadoop/zookeeper/{data,log}
- Create the Java configuration file $ZK_HOME/conf/java.env:

JAVA_HOME=/usr/java/latest
export PATH=$JAVA_HOME/bin:$PATH
- Create the $ZK_HOME/conf/zoo.cfg file (zoo.cfg is the configuration filename zkServer.sh reads by default):

tickTime=2000
clientPort=2181
initLimit=5
syncLimit=2
server.1=master:2888:3888
server.2=slave1:2888:3888
server.3=slave2:2888:3888
server.4=slave3:2888:3888
server.5=slave4:2888:3888
server.6=slave5:2888:3888
dataDir=/hadoop/zookeeper/data
dataLogDir=/hadoop/zookeeper/log
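ZooKeeper needs a majority (quorum) of the servers listed above to be running: with 6 servers, 4 must be up, so this ensemble tolerates the same two failures a 5-server ensemble would. The ensemble size and quorum can be checked from the server.N entries; the sketch below reproduces them in a temporary file for illustration:

```shell
# Count the server.N entries and derive the quorum (majority) size.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
server.1=master:2888:3888
server.2=slave1:2888:3888
server.3=slave2:2888:3888
server.4=slave3:2888:3888
server.5=slave4:2888:3888
server.6=slave5:2888:3888
EOF
servers=$(grep -c '^server\.' "$cfg")
quorum=$(( servers / 2 + 1 ))
echo "Ensemble size: $servers, quorum: $quorum"
rm -f "$cfg"
```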
- Configure ZooKeeper on all slave nodes with the following commands:

for host in $(cat $HADOOP_HOME/conf/slaves); do
  echo "Configuring ZooKeeper on $host"
  scp ~/.bashrc hduser@$host:~/
  sudo scp -r /usr/local/zookeeper-3.4.5 hduser@$host:/usr/local/
  echo "Making symbolic link for ZooKeeper home directory on $host"
  ssh hduser@$host -C "sudo ln -s /usr/local/zookeeper-3.4.5 /usr/local/zookeeper"
done

- zkServer.sh start
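One detail the server.N configuration relies on: each ZooKeeper server identifies itself through a myid file in its data directory, whose content must match that server's number in the configuration. The sketch below assigns the ids locally, assuming the hosts are taken in the same order as the server.N entries; the commented ssh line is how each id would be written on the real cluster.

```shell
# Assumption: master is server.1, slave1 is server.2, and so on, matching
# the order of the server.N entries in the configuration file.
id=1
for host in master slave1 slave2 slave3 slave4 slave5; do
  echo "$host -> myid $id"
  # On the real cluster, each id would be written remotely, e.g.:
  #   ssh hduser@$host "echo $id | sudo tee /hadoop/zookeeper/data/myid"
  id=$((id + 1))
done
```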
- zkCli.sh -server master:2181
- zkServer.sh stop
Installing Mahout
Log in to the master node as hduser:
- sudo wget ftp://hadoop.admin/repo/mahout-distribution-0.7.tar.gz -P /usr/local
- cd /usr/local
sudo tar xvf mahout-distribution-0.7.tar.gz
- sudo ln -s /usr/local/mahout-distribution-0.7 /usr/local/mahout
- Open the ~/.bashrc file and add the following lines:

export MAHOUT_HOME=/usr/local/mahout
export PATH=$MAHOUT_HOME/bin:$PATH
- sudo yum install maven
- cd $MAHOUT_HOME
sudo mvn compile
sudo mvn install
The install command will run all the tests by default; the sudo mvn -DskipTests install command will skip them.
- cd examples
sudo mvn compile
- wget http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data -P ~/
- start-dfs.sh
  start-mapred.sh
- hadoop fs -mkdir testdata
  hadoop fs -put ~/synthetic_control.data testdata
- mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job