Friday, April 25, 2014

Chapter 4 Managing a Hadoop Cluster

Managing the HDFS Cluster

Use the following steps to check the status of an HDFS cluster with hadoop fsck:
  1. Check the status of the root filesystem with the following command:
    hadoop fsck /
  2. Check the status of all the files on HDFS with the following command:
    hadoop fsck / -files
  3. Check the locations of file blocks with the following command:
    hadoop fsck / -files -blocks -locations
  4. Check the locations of file blocks containing rack information with the following command:
    hadoop fsck / -files -blocks -racks
  5. Delete corrupted files with the following command:
    hadoop fsck / -delete
  6. Move corrupted files to /lost+found with the following command:
    hadoop fsck / -move
Use the following steps to check the status of an HDFS cluster with hadoop dfsadmin:
  1. Report the status of each slave node with the following command:
    hadoop dfsadmin -report
  2. Refresh all the DataNodes using the following command:
    hadoop dfsadmin -refreshNodes
  3. Check the safe mode status using the following command:
    hadoop dfsadmin -safemode get
  4. Manually put the NameNode into safemode using the following command:
    hadoop dfsadmin -safemode enter
  5. Make the NameNode leave safe mode using the following command:
    hadoop dfsadmin -safemode leave
  6. Wait until the NameNode leaves safe mode using the following command:
    hadoop dfsadmin -safemode wait
    This command is useful when we want to wait until HDFS finishes data block replication or until a newly commissioned DataNode is ready for service.
  7. Save the metadata of the HDFS filesystem with the following command:
    hadoop dfsadmin -metasave meta.log
    The meta.log file will be created under the directory $HADOOP_HOME/logs .

The HDFS filesystem is write protected while the NameNode is in safe mode. When an HDFS cluster is started, it enters safe mode first. The NameNode checks the replication factor of each data block. If the replica count of a data block is smaller than the configured value, which is 3 by default, the data block is marked as under-replicated. Finally, an under-replication factor, which is the percentage of under-replicated data blocks, is calculated. If this percentage is larger than the threshold value, the NameNode stays in safe mode until enough new replicas are created for the under-replicated data blocks to bring the under-replication factor below the threshold.
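In Hadoop 1.x, this threshold is governed by the dfs.safemode.threshold.pct property (0.999 by default, that is, 99.9 percent of blocks must satisfy the minimal replication defined by dfs.replication.min before the NameNode leaves safe mode automatically). A minimal sketch of tuning it in $HADOOP_HOME/conf/hdfs-site.xml:

<property>
    <name>dfs.safemode.threshold.pct</name>
    <value>0.999</value>
</property>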

Configuring SecondaryNameNode

The Hadoop NameNode is a single point of failure. By configuring SecondaryNameNode, the filesystem image and edit log can be backed up periodically. In case of a NameNode failure, the backup files can be used to recover the NameNode.
Configure SecondaryNameNode:
  1. stop-all.sh
  2. Add or change the following property in the file $HADOOP_HOME/conf/hdfs-site.xml:
    <property>
        <name>fs.checkpoint.dir</name>
        <value>/hadoop/dfs/namesecondary</value>
    </property>
    
    If this property is not set explicitly, the default checkpoint directory will be ${hadoop.tmp.dir}/dfs/namesecondary
  3. start-all.sh

To increase redundancy, we can configure the NameNode to write filesystem metadata to multiple locations. For example, we can add an NFS shared directory for backup by changing the following property in the file $HADOOP_HOME/conf/hdfs-site.xml:


<property>
    <name>dfs.name.dir</name>
    <value>/hadoop/dfs/name,/nfs/name</value>
</property>

Managing the MapReduce cluster

  1. List all the active TaskTrackers:
    hadoop job -list-active-trackers
  2. Check the status of the JobTracker safe mode:
    hadoop mradmin -safemode get
    The output tells us that the JobTracker is not in safe mode. We can submit jobs to the cluster.
  3. Manually let the JobTracker enter safe mode:
    hadoop mradmin -safemode enter
  4. Let the JobTracker leave safe mode:
    hadoop mradmin -safemode leave
  5. Wait for safe mode to exit:
    hadoop mradmin -safemode wait
  6. Reload the MapReduce queue configuration:
    hadoop mradmin -refreshQueues
  7. Reload active TaskTrackers:
    hadoop mradmin -refreshNodes
The meaning of the hadoop mradmin options is listed in the following table:
Option Description
-refreshServiceAcl Force JobTracker to reload ACL.
-refreshQueues Force JobTracker to reload queue configurations.
-refreshUserToGroupsMappings Force JobTracker to reload user group mappings.
-refreshSuperUserGroupsConfiguration Force JobTracker to reload superuser group mappings.
-refreshNodes Force JobTracker to refresh the TaskTracker hosts.

Managing TaskTracker

Hadoop maintains three lists for TaskTrackers: black list, gray list and excluded list.

TaskTracker blacklisting is a function that can blacklist a TaskTracker if it is in an unstable state or its performance has been degraded. For example, when the ratio of failed tasks for a specific job reaches a certain threshold, the TaskTracker will be blacklisted for this job. Similarly, Hadoop maintains a gray list of nodes by identifying potential problem nodes.

Sometimes, excluding certain TaskTrackers from the cluster is desirable. For example, when we debug or upgrade a slave node, we want to separate this node from the cluster in case it affects the cluster. Hadoop supports the live decommissioning of a TaskTracker from a running cluster.

List the active trackers with the following command on the master node:

hadoop job -list-active-trackers
Configure the heartbeat interval
  1. stop-mapred.sh
  2. Add or change the following property in $HADOOP_HOME/conf/mapred-site.xml:
    <property>
        <name>mapred.tasktracker.expiry.interval</name>
        <value>600000</value>
    </property>
    
  3. Copy the configuration into the slave nodes:
    for host in `cat $HADOOP_HOME/conf/slaves`; do
        echo 'Copying mapred-site.xml to slave node ' $host
        sudo scp $HADOOP_HOME/conf/mapred-site.xml hduser@$host:$HADOOP_HOME/conf
    done
  4. start-mapred.sh
Configure TaskTracker blacklisting
  1. stop-mapred.sh
  2. Set the number of task failures for a job that will blacklist a TaskTracker by adding or changing the following property in the file $HADOOP_HOME/conf/mapred-site.xml:
    <property>
        <name>mapred.max.tracker.failures</name>
        <value>10</value>
    </property>
    
  3. Set the maximum number of job-level blacklistings a TaskTracker can accumulate before it is blacklisted for the whole cluster:
    <property>
        <name>mapred.max.tracker.blacklists</name>
        <value>5</value>
    </property>
    
  4. Copy the configuration to the slave nodes:
    for host in `cat $HADOOP_HOME/conf/slaves`; do
        echo 'Copying mapred-site.xml to slave node ' $host
        sudo scp $HADOOP_HOME/conf/mapred-site.xml hduser@$host:$HADOOP_HOME/conf
    done
    
  5. start-mapred.sh
  6. List blacklisted TaskTrackers:
    hadoop job -list-blacklisted-trackers
Decommission TaskTrackers
  1. Set the TaskTracker exclude file in $HADOOP_HOME/conf/mapred-site.xml:
    <property>
        <name>mapred.hosts.exclude</name>
        <value>$HADOOP_HOME/conf/mapred-exclude</value>
    </property>
    
  2. Force the JobTracker to reload the TaskTracker list:
    hadoop mradmin -refreshNodes
  3. List all the active TaskTrackers again:
    hadoop job -list-active-trackers

TaskTrackers on the slave nodes contact the JobTracker on the master node periodically. The interval between two consecutive contacts is called the heartbeat interval. More frequent heartbeats incur a higher load on the cluster, so the value of the heartbeat property should be chosen based on the size of the cluster.

The JobTracker uses TaskTracker blacklisting to remove unstable TaskTrackers. If a TaskTracker is blacklisted, all the tasks currently running on it can still finish, and the TaskTracker keeps communicating with the JobTracker through the heartbeat mechanism, but it will not be scheduled for future tasks. If a blacklisted TaskTracker is restarted, it will be removed from the blacklist. (The number of blacklisted TaskTrackers should not exceed 50 percent of the total number of TaskTrackers in the cluster.)

Decommissioning DataNode

  1. Create the file $HADOOP_HOME/conf/dfs-exclude:
    The dfs-exclude file contains the hostnames of the DataNodes to be decommissioned, one per line (see the example after this list).
  2. Add the following property to $HADOOP_HOME/conf/hdfs-site.xml:
    <property>
        <name>dfs.hosts.exclude</name>
        <value>$HADOOP_HOME/conf/dfs-exclude</value>
    </property>
    
  3. hadoop dfsadmin -refreshNodes
  4. hadoop dfsadmin -report
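For reference, the dfs-exclude file created in step 1 simply lists one hostname per line; for example, with two hypothetical hosts to be decommissioned:

    slave4
    slave5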

Replacing a Slave Node

  1. Decommission the TaskTracker on the slave node.
  2. Decommission the DataNode on the slave node.
  3. Power off the slave node and replace it with the new hardware.
  4. Install and configure the Linux operating system on the new node: install Java and other tools, configure SSH, and prepare for the Hadoop installation.
  5. Install Hadoop on the new node.
  6. Log in to the new node and start the DataNode and TaskTracker (assuming the new node is slave2):
    ssh hduser@slave2 -C "hadoop datanode &"
    ssh hduser@slave2 -C "hadoop tasktracker &"
  7. Refresh the DataNodes:
    hadoop dfsadmin -refreshNodes
  8. Refresh the TaskTrackers:
    hadoop mradmin -refreshNodes
  9. hadoop dfsadmin -report
  10. hadoop job -list-active-trackers

Managing MapReduce jobs

Check the status of Hadoop jobs
  1. List all the running jobs:
    hadoop job -list
  2. List all the submitted jobs since the start of the cluster:
    hadoop job -list all
  3. Check the status of the default queue:
    hadoop queue -list
    Hadoop manages jobs using queues. By default, there is only one queue named default. The output of the command shows that the cluster has only the default queue, which is in the running state with no scheduling information.
  4. Check the ACLs of the queues:
    hadoop queue -showacls
  5. Show all the jobs in the default queue:
    hadoop queue -info default -showJobs
  6. Check the status of job:
    hadoop job -status JOB-ID
Changing the status of a job
  1. Set the job JOB-ID to high priority:
    hadoop job -set-priority JOB-ID HIGH
    Available priorities, in descending order, are: VERY_HIGH, HIGH, NORMAL, LOW, and VERY_LOW.
  2. Kill the job:
    hadoop job -kill JOB-ID
Submit a MapReduce job
  1. Create the job configuration file, job.xml:
    <?xml version="1.0"?>
    <configuration>
        <property>
            <name>mapred.input.dir</name>
            <value>randtext</value>
        </property>
        <property>
            <name>mapred.output.dir</name>
            <value>output</value>
        </property>
        <property>
            <name>mapred.job.name</name>
            <value>wordcount</value>
        </property>
        <property>
            <name>mapred.mapper.class</name>
            <value>org.apache.hadoop.mapred.WordCount$Map</value>
        </property>
        <property>
            <name>mapred.combiner.class</name>
            <value>org.apache.hadoop.mapred.WordCount$Reduce</value>
        </property>
        <property>
            <name>mapred.reducer.class</name>
            <value>org.apache.hadoop.mapred.WordCount$Reduce</value>
        </property>
        <property>
            <name>mapred.input.format.class</name>
            <value>org.apache.hadoop.mapred.TextInputFormat</value>
        </property>
        <property>
            <name>mapred.output.format.class</name>
            <value>org.apache.hadoop.mapred.TextOutputFormat</value>
        </property>
    </configuration>
    Make sure $HADOOP_HOME/hadoop-examples*.jar is available in CLASSPATH.
    
  2. Submit the job:
    hadoop job -submit job.xml

The queue command is a wrapper command for the JobQueueClient class, and the job command is a wrapper command for the JobClient class. (page 107)
More job management commands
  1. Get the value of a counter:
    hadoop job -counter <job-id> <group-name> <counter-name>
    For example, to get the counter HDFS_BYTES_WRITTEN of the counter group FileSystemCounter for the job job_201302281451_0002:
    hadoop job -counter job_201302281451_0002 FileSystemCounter HDFS_BYTES_WRITTEN
  2. Query events of a MapReduce job:
    hadoop job -events <job-id> <from-event-#> <#-of-events>
    For example, to query the first 10 events of the job job_201302281451_0002:
    hadoop job -events job_201302281451_0002 0 10
  3. Get the job history, including job details and failed and killed task attempts:
    hadoop job -history <job-output-dir>
Managing tasks
  1. Kill a task:
    hadoop job -kill-task <task-attempt-id>
    After the task is killed, the JobTracker will restart the task on a different node.
    Hadoop JobTracker can automatically kill tasks in the following situations:
    • A task does not report progress within the configured timeout.
    • Speculative execution runs the same task on multiple nodes; once one attempt succeeds, the other attempts of that task are killed because their results are no longer needed.
    • Job/task schedulers, such as the Fair Scheduler and the Capacity Scheduler, need empty slots for other pools or queues.
  2. Set a task to fail:
    hadoop job -fail-task <task-attempt-id>
  3. List task attempts:
    hadoop job -list-attempt-ids <job-id> <task-type> <task-state>
    Available task types are map, reduce, setup, and clean; available task states are running and completed.

Checking job history from the web UI

The web UI can be automatically updated every five seconds; this interval can be modified by changing the mapreduce.client.completion.pollinterval property in the $HADOOP_HOME/conf/mapred-site.xml file.


<property>
    <name>mapreduce.client.completion.pollinterval</name>
    <value>5000</value>
</property>

Importing Data to HDFS

Use distcp (distributed copy) to copy large data to HDFS:
hadoop distcp file:///data/datafile /user/hduser/data
This command will initiate a MapReduce job with a number of mappers to run the copy task in parallel.
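The number of map tasks can be capped with the -m option of distcp; for example (hypothetical source and destination paths), copying a directory between HDFS locations with at most 20 mappers:

hadoop distcp -m 20 hdfs://master:54310/user/hduser/data hdfs://master:54310/user/hduser/data-copy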

Manipulating files on HDFS

  1. Check the content of a file:
    hadoop fs -cat file1
    This command is handy for checking the contents of small files, but it is not recommended for large files. Instead, we can use the command hadoop fs -tail file1 to check the last few lines.
    Alternatively, the hadoop fs -text file1 command shows the content of file1 in text format.
  2. Test whether file1 exists (-e), has zero length (-z), or is a directory (-d):
    hadoop fs -test -e file1
    hadoop fs -test -z file1
    hadoop fs -test -d file1
  3. Check the status of file1:
    hadoop fs -stat file1
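The -stat command also accepts a format string; as a small sketch, the following prints the name, replication factor, and modification time of file1 using the %n, %r, and %y format options:

    hadoop fs -stat "%n %r %y" file1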

Perform the following steps to manipulate files and directories on HDFS:

  1. Empty the trash:
    hadoop fs -expunge
  2. Merge the files in a directory dir and download them as one big file:
    hadoop fs -getmerge dir file1
    This command is similar to the cat command in Linux. It is very useful when we want to get the MapReduce output as one file rather than several small partitioned files.
  3. Delete file1 under the current directory:
    hadoop fs -rm file1
  4. Download file1 from HDFS:
    hadoop fs -get file1
  5. Change the group membership of a regular file:
    hadoop fs -chgrp hadoop file1
  6. Change the ownership of a regular file:
    hadoop fs -chown hduser file1
  7. Change the mode of a file:
    hadoop fs -chmod 600 file1
  8. Set the replication factor of file1 to be 3:
    hadoop fs -setrep -w 3 file1
  9. Create an empty file:
    hadoop fs -touchz 0file

Configuring the HDFS quota

Manage an HDFS quota:

  1. Set the name quota on the home directory:
    hadoop dfsadmin -setQuota 20 /user/hduser
    This command will set the name quota on the home directory to 20, which means at most 20 files, including directories, can be created under the home directory. If we reach the quota, an error message will be given.
  2. Set the space quota of the current user's home directory to be 100000000:
    hadoop dfsadmin -setSpaceQuota 100000000 /user/hduser
  3. Check the quota status:
    hadoop fs -count -q /user/hduser
    The output columns are QUOTA, REMAINING_QUOTA, SPACE_QUOTA, REMAINING_SPACE_QUOTA, DIR_COUNT, FILE_COUNT, CONTENT_SIZE, and FILE_NAME. (Without the -q option, only DIR_COUNT, FILE_COUNT, CONTENT_SIZE, and FILE_NAME are shown.)
  4. Clear the name quota:
    hadoop dfsadmin -clrQuota /user/hduser
  5. Clear the space quota:
    hadoop dfsadmin -clrSpaceQuota /user/hduser

Configuring CapacityScheduler

Hadoop CapacityScheduler is a pluggable MapReduce job scheduler. Its goal is to maximize Hadoop cluster utilization by sharing the cluster among multiple users. CapacityScheduler uses queues to guarantee the minimum share of each user. It has the features of being secure, elastic, and operable, and of supporting job priority.
Configure CapacityScheduler:
  1. Configure Hadoop to use CapacityScheduler in $HADOOP_HOME/conf/mapred-site.xml:
    <property>
        <name>mapred.jobtracker.taskScheduler</name>
        <value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
    </property>
    
  2. Define a new queue, hdqueue:
    <property>
        <name>mapred.queue.names</name>
        <value>default,hdqueue</value>
    </property>
    
  3. Configure CapacityScheduler queues in $HADOOP_HOME/conf/capacity-scheduler.xml:
    <property>
        <name>mapred.capacity-scheduler.queue.hdqueue.capacity</name>
        <value>20</value>
    </property>
    <property>
        <name>mapred.capacity-scheduler.queue.default.capacity</name>
        <value>80</value>
    </property>
    <property>
        <name>mapred.capacity-scheduler.queue.hdqueue.minimum-user-limit-percent</name>
        <value>20</value>
    </property>
    <property>
        <name>mapred.capacity-scheduler.maximum-system-jobs</name>
        <value>10</value>
        <description>Maximum number of jobs in the system which can be initialized, concurrently, by the CapacityScheduler.</description>
    </property>
    <property>
        <name>mapred.capacity-scheduler.queue.hdqueue.maximum-initialized-active-tasks</name>
        <value>500</value>
        <description>The maximum number of tasks, across all jobs in the queue, which can be initialized concurrently. Once the queue's jobs exceed this limit they will be queued on disk.</description>
    </property>
    <property>
        <name>mapred.capacity-scheduler.queue.hdqueue.maximum-initialized-active-tasks-per-user</name>
        <value>100</value>
    </property>
    <property>
        <name>mapred.capacity-scheduler.queue.hdqueue.supports-priority</name>
        <value>true</value>
    </property>
    
  4. stop-mapred.sh
    start-mapred.sh
  5. Get the scheduling details of each queue by opening the JobTracker web UI URL.
  6. Test the queue configuration by submitting example wordcount job to the queue hdqueue:
    hadoop jar $HADOOP_HOME/hadoop-examples-*.jar wordcount -Dmapred.job.queue.name=hdqueue randtext wordcount.out

Hadoop supports access control on the queues using queue ACLs. Queue ACLs control the authorization of MapReduce job submission to a queue.
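The queue ACLs themselves are defined in the $HADOOP_HOME/conf/mapred-queue-acls.xml file and take effect when mapred.acls.enabled is set to true. A minimal sketch that allows only the user hduser to submit jobs to, and administer, the hdqueue queue (the user name is just an example):

<property>
    <name>mapred.queue.hdqueue.acl-submit-job</name>
    <value>hduser</value>
</property>
<property>
    <name>mapred.queue.hdqueue.acl-administer-jobs</name>
    <value>hduser</value>
</property>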

Configuring Fair Scheduler

Similar to CapacityScheduler, the Fair Scheduler was designed to enforce fair shares of cluster resources in a multiuser environment.
Configure Hadoop Fair Scheduler:
  1. Enable fair scheduling in $HADOOP_HOME/conf/mapred-site.xml:
    <property>
        <name>mapred.jobtracker.taskScheduler</name>
        <value>org.apache.hadoop.mapred.FairScheduler</value>
    </property>
    
  2. Create the Fair Scheduler configuration file, $HADOOP_HOME/conf/fair-scheduler.xml:
    <?xml version="1.0"?>
    <allocations>
        <!-- The pool and user names here are examples. -->
        <pool name="hduser">
            <minMaps>5</minMaps>
            <minReduces>5</minReduces>
            <maxRunningJobs>90</maxRunningJobs>
            <minSharePreemptionTimeout>20</minSharePreemptionTimeout>
            <weight>2.0</weight>
        </pool>
        <user name="hduser">
            <maxRunningJobs>1</maxRunningJobs>
        </user>
        <poolMaxJobsDefault>3</poolMaxJobsDefault>
    </allocations>
    
  3. stop-mapred.sh
  4. start-mapred.sh

The Hadoop Fair Scheduler schedules jobs in such a way that all jobs get an equal share of computing resources. Jobs are organized into scheduling pools. A pool can be configured for each Hadoop user. If the pool for a user is not configured, the default pool will be used. A pool specifies the amount of resources a user can share on the cluster, for example the number of map slots, reduce slots, the total number of running jobs, and so on.
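By default, the Fair Scheduler maps a job to the pool named after the submitting user (controlled by mapred.fairscheduler.poolnameproperty). A job can also be directed to a pool explicitly; for example, submitting the wordcount example to the hypothetical pool hduser configured above:

hadoop jar $HADOOP_HOME/hadoop-examples-*.jar wordcount -Dmapred.fairscheduler.pool=hduser randtext wordcount.out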

minMaps and minReduces are used to ensure the minimum share of computing slots on the cluster for a pool. The minimum share guarantee can be useful when the required number of computing slots is larger than the number of configured slots. If the minimum share of a pool is not met, the JobTracker will kill tasks in other pools and assign the slots to the starving pool. In such cases, the JobTracker will restart the killed tasks on other nodes, and thus those jobs will take longer to finish. (I don't really follow this: if the job ends up taking longer to finish, how is this still useful? Or is the 'in case' explanation unrelated to the previous sentence?)

Besides computing slots, the Fair Scheduler can limit the number of concurrently running jobs and tasks in a pool. So, if a user submits more jobs than the configured limit, some jobs have to wait in the queue until other jobs finish. In such a case, higher priority jobs will be scheduled by the Fair Scheduler to run earlier than lower priority jobs. If all jobs in the waiting queue have the same priority, the Fair Scheduler can be configured to schedule these jobs with either fair scheduling or FIFO scheduling.

Properties supported by Fair Scheduler:
Property Value Description
minMaps integer Minimum map slots of a pool
minReduces integer Minimum reduce slots of a pool
maxMaps integer Maximum map slots of a pool
maxReduces integer Maximum reduce slots of a pool
schedulingMode Fair/FIFO Pool internal scheduling mode
maxRunningJobs integer Maximum number of concurrently running jobs for a pool. Default value is unlimited.
weight float Value to control a non-proportional share of cluster resources. The default value is 1.0.
minSharePreemptionTimeout integer Seconds to wait before killing other pools' tasks if a pool's share is under the minimum share.
poolMaxJobsDefault integer Default maximum number of concurrently running jobs for a pool.
userMaxJobsDefault integer Default maximum number of concurrently running jobs for a user.
defaultMinSharePreemptionTimeout integer Default seconds to wait before killing other pools' tasks when a pool's share is under its minimum share.
fairSharePreemptionTimeout integer Seconds to wait before preemption when a job's resources are below half of its fair share.
defaultPoolSchedulingMode Fair/FIFO Default in-pool scheduling mode

Configuring Hadoop daemon logging

  1. Check the current logging level of JobTracker:
    hadoop daemonlog -getlevel master:50030 org.apache.hadoop.mapred.JobTracker
  2. Tell Hadoop to only log error events for JobTracker:
    hadoop daemonlog -setlevel master:50030 org.apache.hadoop.mapred.JobTracker ERROR
  3. Get the log levels for the NameNode, TaskTracker, and DataNode daemons (the NameNode web UI runs on master:50070; the TaskTracker and DataNode web UIs run on the slave nodes, on ports 50060 and 50075 respectively):
    hadoop daemonlog -getlevel master:50070 org.apache.hadoop.hdfs.server.namenode.NameNode
    hadoop daemonlog -getlevel slave1:50060 org.apache.hadoop.mapred.TaskTracker
    hadoop daemonlog -getlevel slave1:50075 org.apache.hadoop.hdfs.server.datanode.DataNode

By default, Hadoop sends log messages to Log4j, which is configured in the file $HADOOP_HOME/conf/log4j.properties. This file defines both what to log and where to log it. For applications, the default root logger is "INFO,console", which logs all messages at level INFO and above to the console's stderr. Log files are named $HADOOP_LOG_DIR/hadoop-$HADOOP_IDENT_STRING-<daemon-name>-<hostname>.log
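As a small sketch, the default root logger is set near the top of $HADOOP_HOME/conf/log4j.properties and can be switched to a different level or appender (DRFA is the daily rolling file appender defined later in the same file):

# Default: log INFO and above to the console (stderr).
hadoop.root.logger=INFO,console
# Example: log DEBUG and above to a daily rolling file instead.
# hadoop.root.logger=DEBUG,DRFA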

Hadoop supports a number of log levels for different purposes. The log level should be tuned based on the purpose of the logging.

Logging levels provided by Log4j:
Log level Description
ALL The lowest logging level; all logging will be turned on.
DEBUG Logging events useful for debugging applications.
ERROR Logging error events, but applications can continue to run.
FATAL Logging very severe error events that will abort the application.
INFO Logging informational messages that indicate the progress of applications.
OFF Logging will be turned off.
TRACE Logging finer-grained events for application debugging.
TRACE_INT The integer value of the TRACE level.
WARN Logging potentially harmful events.
Configuring Hadoop logging with hadoop-env.sh
Open the file $HADOOP_HOME/conf/hadoop-env.sh and set the log directory, for example:
export HADOOP_LOG_DIR=/var/log/hadoop
Other environment variables:
Variable name Description
HADOOP_LOG_DIR Directory for log files.
HADOOP_PID_DIR Directory to store the daemon PID files.
HADOOP_ROOT_LOGGER Logging configuration for hadoop.root.logger; the default is "INFO,console".
HADOOP_SECURITY_LOGGER Logging configuration for hadoop.security.logger; the default is "INFO,NullAppender".
HADOOP_AUDIT_LOGGER Logging configuration for hdfs.audit.logger; the default is "INFO,NullAppender".
Configuring Hadoop security logging
Security logging can help Hadoop cluster administrators identify security problems. It is enabled by default. The security logging configuration is located in the file $HADOOP_HOME/conf/log4j.properties. By default, the security logging information is appended to the same file as the NameNode logging.
grep security $HADOOP_HOME/logs/hadoop-hduser-namenode-master.log

Configuring Hadoop audit logging

Audit logging might be required for data processing systems such as Hadoop. In Hadoop, audit logging has been implemented using the Log4j Java logging library at the INFO logging level. By default, Hadoop audit logging is disabled.
Configure Hadoop audit logging:
  1. Enable audit logging in the $HADOOP_HOME/conf/log4j.properties file:
    log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=INFO
    
  2. Try making a directory on HDFS:
    hadoop fs -mkdir audittest
  3. Check the audit log messages in the NameNode log file:
    grep org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit $HADOOP_HOME/logs/hadoop-hduser-namenode-master.log
    The Hadoop NameNode is responsible for managing audit logging messages, which are forwarded to the NameNode logging facility.
  4. Separate the audit logging messages from the NameNode logging messages by configuring the file $HADOOP_HOME/conf/log4j.properties:
    # Log at INFO level, SYSLOG appenders
    log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=INFO
    
    # Disable forwarding the audit logging message to the NameNode logger.
    log4j.additivity.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=false
    
    ###########################################
    # Configure logging appender
    ###########################################
    #
    # Daily Rolling File Appender (DRFA)
    log4j.appender.DRFAAUDIT=org.apache.log4j.DailyRollingFileAppender
    log4j.appender.DRFAAUDIT.File=$HADOOP_HOME/log/audit.log
    log4j.appender.DRFAAUDIT.DatePattern=.yyyy-MM-dd
    log4j.appender.DRFAAUDIT.layout=org.apache.log4j.PatternLayout
    log4j.appender.DRFAAUDIT.layout.ConversionPattern=%d{ISO8601} %p %c:%m%n
    

Hadoop logs auditing messages for operations, such as creating, changing, or deleting files, to a configured log file. By default, the audit logger is set to WARN, which disables audit logging. To enable it, the logging level needs to be changed to INFO.

When a Hadoop cluster has many jobs to run, the log file can become large very quickly. Log file rotation is a function that periodically rotates a log file to a different name.

Upgrading Hadoop

In the process of upgrading a Hadoop cluster, we want to minimize damage to the data stored on HDFS, and this is the cause of most upgrade problems. Data damage can be caused either by human operation or by software and hardware failures. So, a backup of the data might be necessary, but the sheer size of the data on HDFS can make a full backup impractical.

A more practical way is to back up only the HDFS filesystem metadata on the master node while leaving the data blocks intact. If some data blocks are lost after the upgrade, Hadoop can automatically recover them from the remaining replicas.

  1. stop-all.sh
  2. Back up block locations of the data on HDFS:
    hadoop fsck / -files -blocks -locations > dfs.block.locations.fsck.backup
  3. Save the list of all files on the HDFS filesystem:
    hadoop dfs -lsr / > dfs.namespace.lsr.backup
  4. Save the description of each DataNode in the HDFS cluster:
    hadoop dfsadmin -report > dfs.datanodes.report.backup
  5. Copy the checkpoint files to a backup directory:
    sudo cp dfs.name.dir/edits /backup
    sudo cp dfs.name.dir/image/fsimage /backup
  6. Verify that no DataNode daemon is running:
    for node in `cat $HADOOP_HOME/conf/slaves`; do
        echo 'Checking node ' $node
        ssh $node -C "jps"
    done
  7. If any DataNode process is still running, kill it:
    ssh $node -C "jps | grep 'DataNode' | cut -d ' ' -f 1 | xargs kill -9"
  8. Decompress the new version of the Hadoop archive file.
  9. Copy the configuration files from the old configuration directory to the new one:
    sudo cp $HADOOP_HOME/conf/* /usr/local/hadoop-1.2.0/conf/
  10. Update the Hadoop symbolic link to point to the new Hadoop version:
    sudo rm -rf /usr/local/hadoop
    sudo ln -s /usr/local/hadoop-1.2.0 /usr/local/hadoop
  11. Upgrade Hadoop on the slave nodes:
    for host in `cat $HADOOP_HOME/conf/slaves`; do
        echo 'Configuring hadoop on slave node ' $host
        sudo scp -r /usr/local/hadoop-1.2.0 hduser@$host:/usr/local
        echo 'Making symbolic link for Hadoop home directory on host ' $host
        sudo ssh hduser@$host -C "ln -s /usr/local/hadoop-1.2.0 /usr/local/hadoop"
    done
    
  12. Upgrade the NameNode:
    hadoop namenode -upgrade
    This command will convert the checkpoint to the new version format. We need to wait to let it finish.
  13. start-dfs.sh
  14. Get the list of all files on HDFS and compare its difference with the backed up one:
    hadoop dfs -lsr / > dfs.namespace.lsr.new
    diff dfs.namespace.lsr.new dfs.namespace.lsr.backup
    The two files should have the same content if there is no error in the upgrade.
  15. Get a new report of each DataNode in the cluster and compare the file with the backed up one:
    hadoop dfsadmin -report > dfs.datanodes.report.new
    diff dfs.datanodes.report.new dfs.datanodes.report.backup
  16. Get the locations of all data blocks and compare the output with the previous backup:
    hadoop fsck / -files -blocks -locations > dfs.block.locations.fsck.new
    diff dfs.block.locations.fsck.new dfs.block.locations.fsck.backup
  17. start-mapred.sh

Chapter 3 Configuring a Hadoop Cluster

Use the following commands to install Hadoop on all nodes:

for host in master slave1 slave2 slave3 slave4 slave5; do
    echo 'Installing Hadoop on node: ' $host
    sudo rpm -ivh ftp://hadoop.admin/repo/hadoop-1.1.2-1.x86_64.rpm
done    

Run a sample MapReduce job with the following command:

#Option 20 is the number of tasks to run and 100000 specifies the size of the sample for each task.
hadoop jar $HADOOP_HOME/hadoop-examples*.jar pi 20 100000

Run an example teragen job to generate 10 GB of data on HDFS with the following command:

# teragen writes 100-byte rows, so $((1024 * 1024 * 1024 * 10 / 100)) rows amount to 10 GB of data in total.
hadoop jar $HADOOP_HOME/hadoop-examples*.jar teragen $((1024 * 1024 * 1024 * 10/100)) teraout

Common Problems

Can't start HDFS daemons
The NameNode on the master node has not been formatted:
hadoop namenode -format
Check that HDFS has been properly configured and the daemons are running. If the output of the jps command does not contain the NameNode and SecondaryNameNode daemons, we need to check the configuration of HDFS.
jps
Open a new terminal and monitor the NameNode logfile on the master node. Alternatively, the hadoop jobtracker command will give the same error.
tail -f $HADOOP_HOME/logs/hadoop-hduser-namenode-master.log
Cluster is missing slave nodes

Most likely, this problem is caused by hostname resolution. To confirm, check the content of the /etc/hosts file.
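A consistent /etc/hosts on every node might look like the following (example addresses matching the 10.0.0.0/24 network used elsewhere in these notes):

10.0.0.1    master
10.0.0.2    slave1
10.0.0.3    slave2
10.0.0.4    slave3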

MapReduce daemons can't be started
The following two reasons can cause this problem:
  • The HDFS daemons are not running, which causes the MapReduce daemons to keep trying to connect to the NameNode daemon at a regular interval, as illustrated by the following log output:
    13/02/16 11:32:19 INFO ipc.Client: Retrying connect to server:
    master/10.0.0.1:54310. Already tried 0
    time(s); retry
    policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
    sleepTime=1 SECONDS)
    13/02/16 11:32:20 INFO ipc.Client: Retrying connect to server:
    master/10.0.0.1:54310. Already tried 1 time(s); retry policy is
    RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1
    SECONDS)
    13/02/16 11:32:21 INFO ipc.Client: Retrying connect to server:
    master/10.0.0.1:54310. Already tried 2 time(s); retry policy is
    RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1
    SECONDS)
    13/02/16 11:32:22 INFO ipc.Client: Retrying connect to server:
    master/10.0.0.1:54310. Already tried 3 time(s); retry policy is
    RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1
    SECONDS)
    13/02/16 11:32:23 INFO ipc.Client: Retrying connect to server:
    master/10.0.0.1:54310. Already tried 4 time(s); retry policy is
    RetryUpToMaximumCount
    
  • Configuration problems of MapReduce can also prevent the MapReduce daemons from starting. Before starting a cluster, we need to make sure that the total amount of configured task memory is smaller than the total amount of system memory.
    For example, suppose a slave host has 4 GB of memory, and we have configured 6 map slots and 6 reduce slots with 512 MB of memory for each slot. We can compute the total configured task memory with the following formula:
    6 * 512 MB + 6 * 512 MB = 6144 MB = 6 GB
    As 6 GB is larger than the system memory of 4 GB, the system will not start. To clear this problem, we can decrease the number of map slots and reduce slots from 6 to 3 (see the sketch after this list).
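As a sketch of the fix for the 4 GB example above (the property names are the standard MR1 ones; adjust the values to your hardware), the slot counts and per-task heap size are set in $HADOOP_HOME/conf/mapred-site.xml:

<property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>3</value>
</property>
<property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>3</value>
</property>
<property>
    <!-- Heap size for each forked task JVM -->
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
</property>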

Configuring ZooKeeper

Log in to the master node as hduser:
  1. sudo wget ftp://hadoop.admin/repo/zookeeper-3.4.5.tar.gz -P /usr/local
  2. cd /usr/local
    sudo tar xvf zookeeper-3.4.5.tar.gz
  3. sudo ln -s /usr/local/zookeeper-3.4.5 /usr/local/zookeeper
  4. Open ~/.bashrc file and add the following lines:
    ZK_HOME=/usr/local/zookeeper
    export PATH=$ZK_HOME/bin:$PATH
    
  5. . ~/.bashrc
  6. sudo mkdir -pv /hadoop/zookeeper/{data,log}
  7. Create Java configuration file $ZK_HOME/conf/java.env:
    JAVA_HOME=/usr/java/latest
    export PATH=$JAVA_HOME/bin:$PATH
    
  8. Create the $ZK_HOME/conf/zoo.cfg file:
    tickTime=2000
    clientPort=2181
    initLimit=5
    syncLimit=2
    server.1=master:2888:3888
    server.2=slave1:2888:3888
    server.3=slave2:2888:3888
    server.4=slave3:2888:3888
    server.5=slave4:2888:3888
    server.6=slave5:2888:3888
    dataDir=/hadoop/zookeeper/data
    dataLogDir=/hadoop/zookeeper/log
    
  9. Configure ZooKeeper on all slave nodes with the following command (see the note on the myid file after this list):
    for host in `cat $HADOOP_HOME/conf/slaves`; do
        echo 'Configuring ZooKeeper on ' $host
        scp ~/.bashrc hduser@$host:~/
        sudo scp -r /usr/local/zookeeper-3.4.5 hduser@$host:/usr/local/
        echo 'Making symbolic link for ZooKeeper home directory on ' $host
        sudo ssh hduser@$host -C "ln -s /usr/local/zookeeper-3.4.5 /usr/local/zookeeper"
    done
    
  10. zkServer.sh start
  11. zkCli.sh -server master:2181
  12. zkServer.sh stop
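One detail worth noting for step 9: in replicated mode, every server in the quorum needs a myid file under its dataDir containing the server number from the configuration above (1 for master, 2 for slave1, and so on). A sketch for the master node:

    echo 1 | sudo tee /hadoop/zookeeper/data/myid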

Installing Mahout

Log in to the master node as hduser:
  1. sudo wget ftp://hadoop.admin/repo/mahout-distribution-0.7.tar.gz -P /usr/local
  2. cd /usr/local
    sudo tar xvf mahout-distribution-0.7.tar.gz
  3. sudo ln -s /usr/local/mahout-distribution-0.7 /usr/local/mahout
  4. Open the ~/.bashrc file
    export MAHOUT_HOME=/usr/local/mahout
    export PATH=$MAHOUT_HOME/bin:$PATH
    
  5. sudo yum install maven
  6. cd $MAHOUT_HOME
    sudo mvn compile
    sudo mvn install
    The install command will run all the tests by default; sudo mvn -DskipTests install command will ignore the tests.
  7. cd examples
    sudo mvn compile
Verify Mahout configuration:
  1. wget http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data -P ~/
  2. start-dfs.sh
    start-mapred.sh
  3. hadoop fs -mkdir testdata
    hadoop fs -put ~/synthetic_control.data testdata
  4. mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

Chapter 2 Preparing for Hadoop Installation

Generally, for a small to medium-sized cluster with up to a hundred slave nodes, the NameNode, JobTracker, and SecondaryNameNode daemons can be put on the same master machine. When the cluster grows to hundreds or even thousands of slave nodes, it becomes advisable to put these daemons on different machines.

From a safety standpoint, the NameNode and SecondaryNameNode should not be placed on the same host.

Empirically, the following configurations are recommended for a small to medium-sized Hadoop cluster.

Node type Node components Recommended specification
Master Node CPU 2 x Quad Core, 2.0 GHz
RAM 16 GB
Hard drive 2 x 1 TB SATA II 7200 RPM HDD or SSD
Network card 1 Gbps Ethernet
Slave Node CPU 2 x Quad Core
RAM 16 GB
Hard drive 4 x 1 TB HDD
Network card 1 Gbps Ethernet

On top of this baseline, adjust according to whether the main workload is compute-oriented or storage-oriented. Actually, with the master CPU clocked at only 2.0 GHz, the specification listed here is well below what I expected.


In the default configuration, the master node is a single point of failure. High-end computing hardware and secondary power supplies are suggested.


In Hadoop, each slave node simultaneously executes a number of map or reduce tasks. The maximum number of parallel map/reduce tasks is known as the number of map/reduce slots, which is configurable by a Hadoop administrator. Each slot is a computing unit consisting of CPU, memory, and disk I/O resources. When a slave node is assigned a task by the JobTracker, its TaskTracker forks a JVM for that task, allocating a preconfigured amount of computing resources. Each forked JVM also incurs a certain amount of memory overhead. Empirically, a Hadoop job can consume 1 GB to 2 GB of memory for each CPU core. Higher data throughput requirements incur more I/O operations for the majority of Hadoop jobs, which is why higher-end and parallel hard drives can help boost cluster performance. To maximize parallelism, it is advisable to assign two slots for each CPU core. For example, if a slave node has two quad-core CPUs, we can assign 2 x 4 x 2 = 16 (map only, reduce only, or both) slots in total on this node. (The first 2 stands for the number of CPUs of the slave node, the 4 represents the number of cores per CPU, and the other 2 is the number of slots per CPU core.)
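For illustration, the corresponding MR1 slot configuration for the 2 x 4-core example would go into $HADOOP_HOME/conf/mapred-site.xml roughly as follows (a sketch; how the 16 slots are split between map and reduce depends on the workload):

<property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>8</value>
</property>
<property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>8</value>
</property>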

Note:

  1. In MR1, the mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum properties dictated how many map and reduce slots each TaskTracker had. These properties no longer exist in YARN. Instead, YARN uses yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores, which control the amount of memory and CPU on each node, both available to maps and reduces. Essentially, YARN has no TaskTrackers, just generic NodeManagers. Hence, there is no longer a separation between map slots and reduce slots; everything depends on the amount of memory in use/demanded.
  2. YARN configuration reference: http://stackoverflow.com/questions/22069904/controling-and-monitorying-number-of-simultaneous-map-reduce-tasks-in-yarn

Designing the cluster network

Cluster nodes should be on the same network segment; connecting them through a VPN or routing is not recommended.

Nodes on the same rack can be interconnected with a 1 Gbps Ethernet switch. Cluster-level switches then connect the rack switches with faster links, such as 10 Gbps optical fiber links, and other networks such as InfiniBand. The cluster-level switches may also interconnect with other cluster-level switches or even uplink to another higher level of switching infrastructure. As the cluster grows, the network will at the same time become larger and more complex. Connection redundancy for network high availability can also increase its complexity.

Configuring the cluster administrator machine (the following commands are executed on CentOS 6.3)

  1. Log in as hddmin and change the hostname:
    sudo sed -i 's/^HOSTNAME.*$/HOSTNAME=hadoop.admin/' /etc/sysconfig/network
  2. mkdir -v ~/mnt ~/isoimages ~/repo (~/mnt as the mount point for ISO images. ~/isoimages will be used to contain the original image files. ~/repo will be used as the repository folder for network installation)
  3. sudo yum -y install dhcp
    sudo yum -y install vsftpd (install the ftp servers)
  4. rsync rsync://mirror.its.dal.ca/centos/6.3/isos/x86_64/CentOS-6.3-x86_64-netinstall.iso ~/isoimages
    or wget http://mirror.its.dal.ca/centos/6.3/isos/x86_64/CentOS-6.3-x86_64-netinstall.iso -P ~/isoimages
  5. sudo mount -o loop ~/isoimages/CentOS-6.3-x86_64-netinstall.iso ~/mnt
  6. cp -r ~/mnt/* ~/repo
  7. sudo umount ~/mnt
Configure the DHCP Server
  1. /etc/dhcp/dhcpd.conf
    # Domain name
    option domain-name  "hadoop.cluster";
    
    # DNS hostname  or IP address
    option domain-name-servers dlp.server.world;
    
    # Default lease time
    default-lease-time 600;
    
    # Maximum lease time
    max-lease-time  7200;
    
    # Declare the DHCP server to be valid.
    authoritative;
    
    # Network address and subnet mask
    subnet 10.0.0.0 netmask 255.255.255.0 {
    
    # Range of lease IP address, should be based 
        # on the size of the network
        range dynamic-bootp 10.0.0.200 10.0.0.254;
    
        # Broadcast address
        option broadcast-address 10.0.0.255;
    
        # Default gateway
        option routers 10.0.0.1;
    }
        
  2. sudo service dhcpd start
  3. sudo chkconfig --level 3 dhcpd on (make the DHCP server survive a system reboot)
Configure the FTP server
  1. /etc/vsftpd/vsftpd.conf
    # The FTP server will run in standalone mode.
    listen=YES
    
    # Use Anonymous user.
    anonymous_enable=YES
    
    # Disable change root for local users.
    chroot_local_user=NO
    
    # Disable uploading and changing files.
    write_enable=NO
    
    # Enable logging of uploads and downloads.
    xferlog_enable=YES
    
    # Enable port 20 data transfer
    connect_from_port_20=YES
    
    # Specify the directory hosting the Linux installation packages.
    anon_root=~/repo
        
  2. sudo service vsftpd start
  3. ftp hadoop.admin

Creating the kickstart file and boot media

  1. Check the filesystem type of the USB flash drive:
    blkid
  2. If the TYPE attribute is other than vfat, use the following command to clear the first few blocks of the drive:
    dd if=/dev/zero of=/dev/sdb1 bs=1M count=100
Create a kickstart file
  1. sudo yum install system-config-kickstart
  2. ks.cfg
    #! /bin/bash
    # Kickstart for CentOS 6.3 for Hadoop cluster.
    
    #Install system on the machine
    install
    
    # Use ftp as the package repository
    url --url ftp://hadoop.admin/repo
    
    # Use the text installation interface
    text
    
    # Use UTF-8 encoded USA English as the language.
    lang en_US.UTF-8
    
    # Configure time zone.
    timezone America/New_York
    
    # Use USA keyboard
    keyboard us
    
    # Set bootloader location
    bootloader --location=mbr --driveorder=sda --append="rhgb quiet"
    
    # Set root password
    rootpw --password=hadoop
    
    #############################
    # Partion the hard disk
    #############################
    # Clear the master boot record on the hard drive.
    zerombr yes
    
    # Clear existing partitions
    clearpart --all --initlabel
    
    # Clear /boot partition, size is in MB.
    part /boot --fstype ext3 --size 128
    
    # Create the / (root) partition.
    part / --fstype ext3 --size 4096 --grow --maxsize 8192
    
    # Create /var partition.
    part /var --fstype ext3 --size 4096 --grow --maxsize 8192
    
    # Create Hadoop data storage directory
    part /hadoop --fstype ext3 --grow
    
    # Create swap partition, 16GB, double size of the main memory.
    # Change size according to your hardware memory configuration.
    part swap --size 16384
    
    
    #############################
    # Configure Network device
    #############################
    
    # Use DHCP and disable IPv6
    network --onboot yes --device eth0 --bootproto dhcp --noipv6
    
    # Disable firewall.
    firewall --disabled
    
    # Put Selinux in permissive mode.
    selinux --permissive
    
    
    #############################
    # Specify packages to install
    #############################
    
    # Automatically resolve package dependencies,
    # exclude installation of documents, and ignore missing packages.
    %packages --resolvedeps --excludedocs --ignoremissing
    
    # Install core packages.
    @Base
    
    # Don't install OpenJDK.
    -java
    
    # Install wget
    wget
    
    # Install the vim text editor
    vim
    
    # Install the Emacs text editor
    emacs
    
    # Install rsync
    rsync
    
    # Install nmap network mapper.
    nmap
    
    %end
    
    
    ###################################
    # Post installation configuration.
    ###################################
    
    # Enable post process logging
    %post --log=~/install-post.log
    
    # Create Hadoop user hduser with password hduser.
    useradd -m -p hduser hduser
    
    # Create group Hadoop.
    groupadd hadoop
    
    # Change user hduser's current group to hadoop
    usermod -g hadoop hduser
    
    # Tell the nodes hostname and ip address of the admin machine.
    echo "10.0.0.1 hadoop.admin" >> /etc/hosts
    
    # Configure administrative privilege to hadoop group.
    
    # Configure the kernel settings.
    ulimit -u
    
    
    #############################
    # Startup services.
    #############################
    
    service sshd start
    chkconfig sshd on
    
    %end
    
    # Reboot after installation.
    reboot
    
    # Disable first boot configuration.
    firstboot --disable
        
  3. Put the kickstart file into the root directory of the FTP server with the command:
    cp ks.cfg ~/repo

Create a USB boot media

  1. Edit ~/isolinux/grub.conf and add the following content:
    default=0
    splashimage=@SPLASHPATH@
    timeout 0
    hiddenmenu
    title @PRODUCT@ @VERSION@
        kernel @KERNELPATH@ ks=ftp://hadoop.admin/ks.cfg
        initrd @INITRDPATH@
        
  2. Make an ISO file from the isolinux directory using the following commands:
    mkisofs -o CentOS6.3-x86_64-boot.iso \
    -b isolinux/isolinux.bin \
    -c isolinux/boot.cat \
    -no-emul-boot \
    -boot-load-size 4 \
    ~/repo
  3. Write the bootable ISO image into the USB flash drive:
    dd if=~/CentOS6.3-x86_64-boot.iso of=/dev/sdb

Thursday, April 24, 2014

Chapter 1 Big Data and Hadoop

CentOS 6.3
Oracle JDK (Java Development Kit) SE 7
Hadoop 1.1.2
HBase 0.94.5
Hive 0.9.0
Pig 0.10.1
ZooKeeper 3.4.5
Mahout 0.7


The four attributes of Big Data:
volume
velocity
variety
value

The book was written in July 2013.

Wednesday, April 23, 2014

Upgrading to 14.04

When I booted up this morning, the system asked whether to upgrade to 14.04. I chose Yes.
The process was quite fast. After the upgrade and a reboot, I opened the browser and page titles containing Chinese showed up as garbled characters. That is the only bug I have found so far.
After sudo apt-get install ttf-wqy-zenhei and a reboot, that small problem was fixed as well.

Thursday, April 17, 2014

Received a reminder email about my purchased web hosting

Received another email from 美橙; the gist is: with the current crackdown on pornography and illegal content, the company will be cleaning up its hosting space from April to September. The implication, of course, is that if a site's content involves that kind of thing, the hosting will be shut down.
In reality, this has nothing to do with me. It's just that I was already starting to dislike all these restrictions, and now I dislike them even more.

I originally bought the hosting space to make it easy to look things up for myself. Occasionally, when feeling diligent, I would also jot down notes from studying something. But in the past two years those diligent moments have become rarer than a flash in the pan. When renewal comes up in the second half of the year, I probably shouldn't renew.

Thursday, April 3, 2014

Switching systems

Officially switched to Ubuntu. It actually has nothing to do with Microsoft announcing the end of support for Windows XP; the timing just happens to coincide.