Sunday, August 5, 2012

Running Hadoop (Multi-Node Cluster)

FROM: Running Hadoop On Ubuntu Linux (Multi-Node Cluster)



Structure


Here, two machines are used to build a multi-node Hadoop cluster. The simplest approach is to first set up a single-node Hadoop installation on both machines, then designate one of them as the master (since there are only two machines, it will also act as a slave) and the other one as a slave.


Tutorial approach and structure.




Preparation


Configuring single-node clusters first

See Running Hadoop (Single-Node Cluster). It is recommended to use the same settings on both machines, e.g. the installation directory, Java environment, configuration, and so on.

Before starting the configuration below, stop the single-node cluster on both machines.
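For example (a minimal sketch, assuming Hadoop is installed under /usr/local/hadoop as in the single-node tutorial), run on each machine:
[bash gutter="false"]
[hduser@master /usr/local/hadoop]$ bin/stop-all.sh
[hduser@slave /usr/local/hadoop]$ bin/stop-all.sh
[/bash]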

Networking

The most important point here is that the two machines used for testing must be able to reach each other over the network. For convenience, the master's IP address is set to 192.168.0.1 and the slave's IP address to 192.168.0.2. Update the /etc/hosts file on both machines:
[bash gutter="false"]
#/etc/hosts (for master AND slave)
192.168.0.1 master
192.168.0.2 slave
[/bash]
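
A quick connectivity check (a sketch; run from either machine once /etc/hosts has been updated):
[bash gutter="false"]
[hduser@master ~]$ ping -c 1 slave
[hduser@slave ~]$ ping -c 1 master
[/bash]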

SSH access

The hduser account on the master must be able to connect via SSH without a password both to the master itself and to the slave (i.e. ssh master and ssh slave as hduser). If you followed Running Hadoop (Single-Node Cluster), you only need to copy hduser@master's public SSH key ($HOME/.ssh/id_rsa.pub) into hduser@slave's authorized_keys file ($HOME/.ssh/authorized_keys), or use the following SSH command:
[bash]
[hduser@master ~]$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@slave
[/bash]
The command above prompts for hduser@slave's login password and then appends the public SSH key to $HOME/.ssh/authorized_keys.
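If ssh-copy-id is not available, the key can be appended manually instead (a sketch; it assumes the .ssh directory may not exist yet on slave):
[bash]
cat $HOME/.ssh/id_rsa.pub | ssh hduser@slave "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
[/bash]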

As a last step, verify that hduser on master can reach both master and slave:

master to master:
[bash gutter="false"]
[hduser@master ~]$ ssh master
The authenticity of host 'master (192.168.0.1)' can't be established.
RSA key fingerprint is 3b:21:b3:c0:21:5c:7c:54:2f:1e:2d:96:79:eb:7f:95.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'master' (RSA) to the list of known hosts.
Linux master 2.6.20-16-386 #2 Thu Jun 7 20:16:13 UTC 2007 i686
...
[hduser@master ~]$
[/bash]

master to slave:
[bash gutter="false"]
[hduser@master ~]$ ssh slave
The authenticity of host 'slave (192.168.0.2)' can't be established.
RSA key fingerprint is 74:d7:61:86:db:86:8f:31:90:9c:68:b0:13:88:52:72.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'slave' (RSA) to the list of known hosts.
Ubuntu 10.04
...
[hduser@slave ~]$
[/bash]

Hadoop


Cluster overview

The following describes how to configure a multi-node Hadoop cluster with one machine acting as master and one as slave. Since only two machines are available but we still want to exercise data transfer and processing across nodes, the master machine will also act as a slave node.


What the final multi-node cluster will look like.



The master node runs a "master" daemon for each layer: the NameNode for the HDFS storage layer and the JobTracker for the MapReduce processing layer. Both machines run the "slave" daemons: a DataNode for the HDFS layer and a TaskTracker for the MapReduce layer. In short, the "master" daemons are responsible for coordination and management, while the "slave" daemons do the actual data storage and data processing work.

Masters vs Slaves

From the Hadoop 1.x documentation:
Typically one machine in the cluster is designated as the NameNode and another machine as the JobTracker, exclusively. These are the actual "master nodes". The rest of the machines in the cluster act as both DataNode and TaskTracker. These are the slaves or "worker nodes".


Configuration


  1. conf/masters (master only)
    The conf/masters file defines on which machine of the multi-node Hadoop cluster the secondary NameNode will be started. In this example, that is the master machine. Which machines the primary NameNode and the JobTracker run on is determined by where you run the bin/start-dfs.sh and bin/start-mapred.sh scripts (if you use bin/start-all.sh instead, the primary NameNode and the JobTracker will run on the same machine). Note that you can also start a Hadoop daemon manually with bin/hadoop-daemon.sh start [namenode | secondarynamenode | datanode | jobtracker | tasktracker]; in that case the conf/masters and conf/slaves configuration files are not consulted.

    Here is some more detail about conf/masters from the Hadoop HDFS user guide:
    The secondary NameNode merges the fsimage and the edits log files periodically and keeps edits log size within a limit. It is usually run on a different machine than the primary NameNode since its memory requirements are on the same order as the primary NameNode. The secondary NameNode is started by bin/start-dfs.sh on the nodes specified in conf/masters file.


    To repeat: the machine on which you run the bin/start-dfs.sh script becomes the primary NameNode.

    Update conf/masters on the master machine as follows:
    [bash gutter="false"]
    master
    [/bash]

  2. conf/slaves (master only)
    The conf/slaves file lists the hosts, one per line, on which the Hadoop cluster's slave daemons (DataNodes and TaskTrackers) run. In this example, both machines should store and process data, so both master and slave are used as Hadoop slaves.

    Update conf/slaves on the master machine as follows:
    [bash gutter="false"]
    master
    slave
    [/bash]

    If the cluster has additional slave nodes, simply add them to the conf/slaves file, one per line:
    [bash]
    master
    slave
    anotherslave01
    anotherslave02
    anotherslave03
    [/bash]
    Note: the conf/slaves file on the master is only used by scripts such as bin/start-dfs.sh and bin/stop-dfs.sh. For example, if you want to add a new DataNode to a running cluster on the fly (which is not described in this tutorial), you can "manually" start the DataNode daemon on the new slave machine with bin/hadoop-daemon.sh start datanode. Using the conf/slaves file on the master simply makes "full" cluster restarts easier to manage centrally.


  3. conf/*-site.xml (all machines)
    Note: As of Hadoop 0.20.x and 1.x, the configuration settings previously found in hadoop-site.xml were moved to conf/core-site.xml (fs.default.name), conf/mapred-site.xml (mapred.job.tracker) and conf/hdfs-site.xml (dfs.replication).


    If you already configured every machine in the cluster as a single-node cluster following the single-node tutorial, only a few configuration options need to be changed:

    Important: You have to change the configuration files conf/core-site.xml, conf/mapred-site.xml and conf/hdfs-site.xml on ALL machines as follows.

    First, change the fs.default.name property in conf/core-site.xml, which specifies the host and port of the NameNode (the HDFS master). In this example, this is the master machine:
    [bash gutter="false"]
    <!-- In: conf/core-site.xml -->
    <property>
    <name>fs.default.name</name>
    <value>hdfs://master:54310</value>
    <description>The name of the default file system. A URI whose
    scheme and authority determine the FileSystem implementation. The
    uri's scheme determines the config property (fs.SCHEME.impl) naming
    the FileSystem implementation class. The uri's authority is used to
    determine the host, port, etc. for a filesystem.</description>
    </property>
    [/bash]

    Second, change the mapred.job.tracker property in conf/mapred-site.xml, which specifies the host and port of the JobTracker (the MapReduce master). Again, this is the master machine in this example.
    [bash gutter="false"]
    <!-- In: conf/mapred-site.xml -->
    <property>
    <name>mapred.job.tracker</name>
    <value>master:54311</value>
    <description>The host and port that the MapReduce job tracker runs
    at. If "local", then jobs are run in-process as a single map
    and reduce task.
    </description>
    </property>
    [/bash]

    Third, change the dfs.replication property in conf/hdfs-site.xml, which specifies the default block replication, i.e. on how many machines each file's blocks are replicated. If you set this to a value higher than the number of slave nodes (more precisely, the number of DataNodes) in the cluster, you will see a lot of errors like (Zero targets found, forbidden1.size=1) in the log files.

    The default value of dfs.replication is 3. Since we have only 2 slave nodes here, we set it to 2:
    [bash gutter="false"]
    <!-- In: conf/hdfs-site.xml -->
    <property>
    <name>dfs.replication</name>
    <value>2</value>
    <description>Default block replication.
    The actual number of replications can be specified when the file is created.
    The default is used if replication is not specified in create time.
    </description>
    </property>
    [/bash]

  4. Additional settings
    The Hadoop API Overview (at the bottom of that page) lists some other configuration options worth noting:

    In file conf/mapred-site.xml:

    • mapred.local.dir Determines where temporary MapReduce data is written. It also may be a list of directories.

    • mapred.map.tasks As a rule of thumb, use 10x the number of slaves (i.e., the number of TaskTrackers).

    • mapred.reduce.tasks As a rule of thumb, use 2x the number of slave processors (i.e., the number of TaskTrackers).
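
    These might be set in conf/mapred-site.xml roughly as follows (a sketch only; the values assume this two-node cluster with 2 TaskTrackers and should be tuned for your hardware, and the local directory path is just an example):
    [bash gutter="false"]
    <!-- In: conf/mapred-site.xml -->
    <property>
    <name>mapred.local.dir</name>
    <value>/tmp/hadoop/mapred/local</value>
    <description>Directory (or comma-separated list of directories) for temporary MapReduce data.</description>
    </property>
    <property>
    <name>mapred.map.tasks</name>
    <value>20</value>
    <description>Default number of map tasks per job (roughly 10x the number of TaskTrackers).</description>
    </property>
    <property>
    <name>mapred.reduce.tasks</name>
    <value>4</value>
    <description>Default number of reduce tasks per job (roughly 2x the number of slave processors).</description>
    </property>
    [/bash]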





Formatting the HDFS filesystem via the NameNode

Before starting the new Hadoop cluster for the first time, you need to format Hadoop's HDFS filesystem via the NameNode. Note: do not format a running Hadoop cluster, or all data in HDFS will be lost.

Formatting HDFS simply initializes the directory specified by dfs.name.dir. The command is:
[bash]
[hduser@master /usr/local/hadoop]$ bin/hadoop namenode -format
... INFO dfs.Storage: Storage directory /tmp/hadoop/dfs/name has been successfully formatted.
[hduser@master /usr/local/hadoop]$
[/bash]
Background: The HDFS name table is stored on the NameNode's local filesystem, in the directory specified by dfs.name.dir. The NameNode uses the name table to keep track of and coordinate all DataNodes.

Starting the multi-node cluster

Starting the cluster is done in two steps. First, the HDFS daemons are started: the NameNode daemon on master and the DataNode daemons on all slaves (here: master and slave). Second, the MapReduce daemons are started: the JobTracker daemon on master and the TaskTracker daemons on all slaves (here: master and slave).


  1. HDFS daemons
    Run bin/start-dfs.sh on the machine you want the (primary) NameNode to run on. This will also start a DataNode daemon on each of the machines listed in conf/slaves.

    In our case, run bin/start-dfs.sh on master:
    [bash gutter="false"]
    [hduser@master /usr/local/hadoop]$ bin/start-dfs.sh
    starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-namenode-master.out
    slave: Ubuntu 10.04
    slave: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-slave.out
    master: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-master.out
    master: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-secondarynamenode-master.out
    [hduser@master /usr/local/hadoop]$
    [/bash]

    On slave, examine the log file logs/hadoop-hduser-datanode-slave.log to check whether the DataNode started successfully:
    [bash gutter="false"]
    ... INFO org.apache.hadoop.dfs.Storage: Storage directory /tmp/hadoop/dfs/data is not formatted.
    ... INFO org.apache.hadoop.dfs.Storage: Formatting ...
    ... INFO org.apache.hadoop.dfs.DataNode: Opened server at 50010
    ... INFO org.mortbay.util.Credential: Checking Resource aliases
    ... INFO org.mortbay.http.HttpServer: Version Jetty/5.1.4
    ... INFO org.mortbay.util.Container: Started org.mortbay.jetty.servlet.WebApplicationHandler@17a8a02
    ... INFO org.mortbay.util.Container: Started WebApplicationContext[/,/]
    ... INFO org.mortbay.util.Container: Started HttpContext[/logs,/logs]
    ... INFO org.mortbay.util.Container: Started HttpContext[/static,/static]
    ... INFO org.mortbay.http.SocketListener: Started SocketListener on 0.0.0.0:50075
    ... INFO org.mortbay.util.Container: Started org.mortbay.jetty.Server@56a499
    ... INFO org.apache.hadoop.dfs.DataNode: Starting DataNode in: FSDataset{dirpath=' /tmp/hadoop/dfs/data/current'}
    ... INFO org.apache.hadoop.dfs.DataNode: using BLOCKREPORT_INTERVAL of 3538203msec
    [/bash]

    As slave's log output above shows, the DataNode automatically formats the storage directory specified by dfs.data.dir if it is not formatted already, and creates the directory if it does not exist yet.

    At this point, the following Java processes should be running on master:
    [bash]
    [hduser@master /usr/local/hadoop]$ jps
    14799 NameNode
    15314 Jps
    14880 DataNode
    14977 SecondaryNameNode
    [hduser@master /usr/local/hadoop]$
    [/bash]
    and the following on slave:
    [bash]
    [hduser@slave /usr/local/hadoop]$ jps
    15183 DataNode
    15616 Jps
    [hduser@slave /usr/local/hadoop]$
    [/bash]

  2. MapReduce daemons
    Run bin/start-mapred.sh on the machine you want the JobTracker to run on. This will also start a TaskTracker daemon on each of the machines listed in conf/slaves.

    In our case, run bin/start-mapred.sh on master:
    [bash gutter="false"]
    [hduser@master /usr/local/hadoop]$ bin/start-mapred.sh
    starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-jobtracker-master.out
    slave: Ubuntu 10.04
    slave: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-tasktracker-slave.out
    master: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-tasktracker-master.out
    [hduser@master /usr/local/hadoop]$
    [/bash]

    On slave, examine the log file logs/hadoop-hduser-tasktracker-slave.log to check whether the TaskTracker started successfully:
    [bash gutter="false"]
    ... INFO org.mortbay.util.Credential: Checking Resource aliases
    ... INFO org.mortbay.http.HttpServer: Version Jetty/5.1.4
    ... INFO org.mortbay.util.Container: Started org.mortbay.jetty.servlet.WebApplicationHandler@d19bc8
    ... INFO org.mortbay.util.Container: Started WebApplicationContext[/,/]
    ... INFO org.mortbay.util.Container: Started HttpContext[/logs,/logs]
    ... INFO org.mortbay.util.Container: Started HttpContext[/static,/static]
    ... INFO org.mortbay.http.SocketListener: Started SocketListener on 0.0.0.0:50060
    ... INFO org.mortbay.util.Container: Started org.mortbay.jetty.Server@1e63e3d
    ... INFO org.apache.hadoop.ipc.Server: IPC Server listener on 50050: starting
    ... INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50050: starting
    ... INFO org.apache.hadoop.mapred.TaskTracker: TaskTracker up at: 50050
    ... INFO org.apache.hadoop.mapred.TaskTracker: Starting tracker tracker_slave:50050
    ... INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 50050: starting
    ... INFO org.apache.hadoop.mapred.TaskTracker: Starting thread: Map-events fetcher for all reduce tasks on tracker_slave:50050
    [/bash]

    At this point, the following Java processes should be running on master:
    [bash]
    [hduser@master /usr/local/hadoop]$ jps
    16017 Jps
    14799 NameNode
    15686 TaskTracker
    14880 DataNode
    15596 JobTracker
    14977 SecondaryNameNode
    [hduser@master /usr/local/hadoop]$
    [/bash]
    and on slave:
    [bash]
    [hduser@slave /usr/local/hadoop]$ jps
    15183 DataNode
    15897 TaskTracker
    16284 Jps
    [hduser@slave /usr/local/hadoop]$
    [/bash]



Stopping the multi-node cluster

Like starting it, stopping the cluster is also done in two steps, in the reverse order. First, the MapReduce daemons are stopped: the JobTracker on master and the TaskTracker daemons on all slaves (here: master and slave). Second, the HDFS daemons are stopped: the NameNode daemon on master and the DataNode daemons on all slaves (here: master and slave).


  1. MapReduce daemons
    Run bin/stop-mapred.sh on the machine running the JobTracker. This stops not only the JobTracker daemon but also the TaskTracker daemons on the machines listed in conf/slaves.

    In our case, run bin/stop-mapred.sh on master:
    [bash]
    [hduser@master /usr/local/hadoop]$ bin/stop-mapred.sh
    stopping jobtracker
    slave: Ubuntu 10.04
    master: stopping tasktracker
    slave: stopping tasktracker
    [hduser@master /usr/local/hadoop]$
    [/bash]

    At this point, the following Java processes should still be running on master:
    [bash]
    [hduser@master /usr/local/hadoop]$ jps
    14799 NameNode
    18386 Jps
    14880 DataNode
    14977 SecondaryNameNode
    [hduser@master /usr/local/hadoop]$
    [/bash]

    and on slave:
    [bash]
    [hduser@slave /usr/local/hadoop]$ jps
    15183 DataNode
    18636 Jps
    [hduser@slave /usr/local/hadoop]$
    [/bash]

  2. HDFS daemons
    Run bin/stop-dfs.sh on the machine running the NameNode. This stops not only the NameNode daemon but also the DataNode daemons on all the machines listed in conf/slaves.

    In our case, run bin/stop-dfs.sh on master:
    [bash]
    [hduser@master /usr/local/hadoop]$ bin/stop-dfs.sh
    stopping namenode
    slave: Ubuntu 10.04
    slave: stopping datanode
    master: stopping datanode
    master: stopping secondarynamenode
    [hduser@master /usr/local/hadoop]$
    [/bash]

    At this point, the only Java process left running on master should be:
    [bash]
    [hduser@master /usr/local/hadoop]$ jps
    18670 Jps
    [hduser@master /usr/local/hadoop]$
    [/bash]

    and on slave:
    [bash]
    [hduser@slave /usr/local/hadoop]$jps
    18894 Jps
    [hduser@slave /usr/local/hadoop]$
    [/bash]




Running a MapReduce job

Just follow the steps described in Running a MapReduce job in the single-node cluster tutorial.


It is recommended to use a larger data set so that both master and slave get to run Map and Reduce tasks. The test here used Plain Text ebooks from Project Gutenberg, including the three documents already used in the single-node test.


Download the Plain Text UTF-8 versions of these ebooks, copy them into HDFS, run the WordCount MapReduce job on master, and then retrieve the job result from HDFS.
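A minimal sketch of copying the downloaded ebooks into HDFS (it assumes the plain-text files were saved to /tmp/gutenberg on master; the WordCount run itself is shown below):
[bash gutter="false"]
[hduser@master /usr/local/hadoop]$ bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hduser/gutenberg
[hduser@master /usr/local/hadoop]$ bin/hadoop dfs -ls /user/hduser/gutenberg
[/bash]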

Here is example output of the job run on master:
[bash gutter="false"]
[hduser@master /usr/local/hadoop]$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
... INFO mapred.FileInputFormat: Total input paths to process : 7
... INFO mapred.JobClient: Running job: job_0001
... INFO mapred.JobClient: map 0% reduce 0%
... INFO mapred.JobClient: map 28% reduce 0%
... INFO mapred.JobClient: map 57% reduce 0%
... INFO mapred.JobClient: map 71% reduce 0%
... INFO mapred.JobClient: map 100% reduce 9%
... INFO mapred.JobClient: map 100% reduce 68%
... INFO mapred.JobClient: map 100% reduce 100%
.... INFO mapred.JobClient: Job complete: job_0001
... INFO mapred.JobClient: Counters: 11
... INFO mapred.JobClient: org.apache.hadoop.examples.WordCount$Counter
... INFO mapred.JobClient: WORDS=1173099
... INFO mapred.JobClient: VALUES=1368295
... INFO mapred.JobClient: Map-Reduce Framework
... INFO mapred.JobClient: Map input records=136582
... INFO mapred.JobClient: Map output records=1173099
... INFO mapred.JobClient: Map input bytes=6925391
... INFO mapred.JobClient: Map output bytes=11403568
... INFO mapred.JobClient: Combine input records=1173099
... INFO mapred.JobClient: Combine output records=195196
... INFO mapred.JobClient: Reduce input groups=131275
... INFO mapred.JobClient: Reduce input records=195196
... INFO mapred.JobClient: Reduce output records=131275
[hduser@master /usr/local/hadoop]$
[/bash]

The DataNode log on slave shows activity like this...
[bash gutter="false"]
# from logs/hadoop-hduser-datanode-slave.log on slave
... INFO org.apache.hadoop.dfs.DataNode: Received block blk_5693969390309798974 from /192.168.0.1
... INFO org.apache.hadoop.dfs.DataNode: Received block blk_7671491277162757352 from /192.168.0.1
<<>>
... INFO org.apache.hadoop.dfs.DataNode: Served block blk_-7112133651100166921 to /192.168.0.2
... INFO org.apache.hadoop.dfs.DataNode: Served block blk_-7545080504225510279 to /192.168.0.2
... INFO org.apache.hadoop.dfs.DataNode: Served block blk_-4114464184254609514 to /192.168.0.2
... INFO org.apache.hadoop.dfs.DataNode: Served block blk_-4561652742730019659 to /192.168.0.2
<<>>
... INFO org.apache.hadoop.dfs.DataNode: Received block blk_-2075170214887808716 from /192.168.0.2 and mirrored to /192.168.0.1:50010
... INFO org.apache.hadoop.dfs.DataNode: Received block blk_1422409522782401364 from /192.168.0.2 and mirrored to /192.168.0.1:50010
... INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_-2942401177672711226 file /tmp/hadoop/dfs/data/current/blk_-2942401177672711226
... INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_-3019298164878756077 file /tmp/hadoop/dfs/data/current/blk_-3019298164878756077
[/bash]

The TaskTracker log on slave shows activity like this...
[bash gutter="false"]
# from logs/hadoop-hduser-tasktracker-slave.log on slave
... INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction: task_0001_m_000000_0
... INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction: task_0001_m_000001_0
... task_0001_m_000001_0 0.08362164% hdfs://master:54310/user/hduser/gutenberg/ulyss12.txt:0+1561677
... task_0001_m_000000_0 0.07951202% hdfs://master:54310/user/hduser/gutenberg/19699.txt:0+1945731
<<>>
... task_0001_m_000001_0 0.35611463% hdfs://master:54310/user/hduser/gutenberg/ulyss12.txt:0+1561677
... Task task_0001_m_000001_0 is done.
... task_0001_m_000000_0 1.0% hdfs://master:54310/user/hduser/gutenberg/19699.txt:0+1945731
... LaunchTaskAction: task_0001_m_000006_0
... LaunchTaskAction: task_0001_r_000000_0
... task_0001_m_000000_0 1.0% hdfs://master:54310/user/hduser/gutenberg/19699.txt:0+1945731
... Task task_0001_m_000000_0 is done.
... task_0001_m_000006_0 0.6844295% hdfs://master:54310/user/hduser/gutenberg/132.txt:0+343695
... task_0001_r_000000_0 0.095238104% reduce > copy (2 of 7 at 1.68 MB/s) >
... task_0001_m_000006_0 1.0% hdfs://master:54310/user/hduser/gutenberg/132.txt:0+343695
... Task task_0001_m_000006_0 is done.
... task_0001_r_000000_0 0.14285716% reduce > copy (3 of 7 at 1.02 MB/s) >
<<>>
... task_0001_r_000000_0 0.14285716% reduce > copy (3 of 7 at 1.02 MB/s) >
... task_0001_r_000000_0 0.23809525% reduce > copy (5 of 7 at 0.32 MB/s) >
... task_0001_r_000000_0 0.6859089% reduce > reduce
... task_0001_r_000000_0 0.7897389% reduce > reduce
... task_0001_r_000000_0 0.86783284% reduce > reduce
... Task task_0001_r_000000_0 is done.
... Received 'KillJobAction' for job: job_0001
... task_0001_r_000000_0 done; removing files.
... task_0001_m_000000_0 done; removing files.
... task_0001_m_000006_0 done; removing files.
... task_0001_m_000001_0 done; removing files.
[/bash]

To inspect the job's output, follow the retrieve the job result from HDFS step in Running a MapReduce job.
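For example, to print the reducer output directly from HDFS (a sketch; the output file may be named part-r-00000 instead of part-00000 depending on the Hadoop version):
[bash gutter="false"]
[hduser@master /usr/local/hadoop]$ bin/hadoop dfs -cat /user/hduser/gutenberg-output/part-00000
[/bash]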

Caveats


  1. java.io.IOException: Incompatible namespaceIDs
    If you see the error java.io.IOException: Incompatible namespaceIDs in a DataNode's log file (logs/hadoop-hduser-datanode-.log), you have probably hit the issue described in HDFS-107 (formerly HADOOP-1212).

    The full error looks similar to this:
    [bash gutter="false"]
    ... ERROR org.apache.hadoop.dfs.DataNode: java.io.IOException: Incompatible namespaceIDs in /tmp/hadoop/dfs/data: namenode namespaceID = 308967713; datanode namespaceID = 113030094
    at org.apache.hadoop.dfs.DataStorage.doTransition(DataStorage.java:281)
    at org.apache.hadoop.dfs.DataStorage.recoverTransitionRead(DataStorage.java:121)
    at org.apache.hadoop.dfs.DataNode.startDataNode(DataNode.java:230)
    at org.apache.hadoop.dfs.DataNode.(DataNode.java:199)
    at org.apache.hadoop.dfs.DataNode.makeInstance(DataNode.java:1202)
    at org.apache.hadoop.dfs.DataNode.run(DataNode.java:1146)
    at org.apache.hadoop.dfs.DataNode.createDataNode(DataNode.java:1167)
    at org.apache.hadoop.dfs.DataNode.main(DataNode.java:1326)
    [/bash]
    There are two known workarounds:

    • Workaround 1: Start from scratch
      The original author has tested this workaround himself, but it is cumbersome and has unpleasant side effects. Steps (a command sketch follows the list):

      1. Stop the cluster.

      2. Delete the data directory of the problematic DataNode, i.e. the directory specified by dfs.data.dir in conf/hdfs-site.xml; in this example, /tmp/hadoop/dfs/data.

      3. Reformat the NameNode (NOTE: all data in HDFS will be lost during this step!).

      4. Restart the cluster.
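
      A sketch of these steps (it assumes slave is the problematic DataNode and that the default paths from this tutorial are used):
      [bash gutter="false"]
      [hduser@master /usr/local/hadoop]$ bin/stop-all.sh
      [hduser@slave /usr/local/hadoop]$ rm -rf /tmp/hadoop/dfs/data
      [hduser@master /usr/local/hadoop]$ bin/hadoop namenode -format
      [hduser@master /usr/local/hadoop]$ bin/start-all.sh
      [/bash]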



    • Workaround 2: Update the namespaceID of problematic DataNodes
      This workaround was contributed by Jared Stehler and has not been tested by the original author. It is minimally invasive: only one file on each problematic DataNode has to be edited. Steps (a command sketch follows the file notes below):

      1. Stop the DataNode.

      2. Edit the value of namespaceID in /current/VERSION (see the file paths below) to match the current NameNode's value.

      3. Restart the DataNode.



      In this example, the relevant files are:

      • NameNode: /tmp/hadoop/dfs/name/current/VERSION

      • DataNode: /tmp/hadoop/dfs/data/current/VERSION (background: dfs.data.dir defaults to ${hadoop.tmp.dir}/dfs/data)



      Here is an example of what a VERSION file contains:
      [bash gutter="false"]
      # contents of /current/VERSION
      namespaceID=393514426
      storageID=DS-1706792599-10.10.10.1-50010-1204306713481
      cTime=1215607609074
      storageType=DATA_NODE
      layoutVersion=-13
      [/bash]
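
      A sketch of workaround 2 on the problematic DataNode (it assumes slave is the affected node, the paths above, and NN_NAMESPACE_ID standing in for the namespaceID read from the NameNode's VERSION file):
      [bash gutter="false"]
      [hduser@slave /usr/local/hadoop]$ bin/hadoop-daemon.sh stop datanode
      [hduser@slave /usr/local/hadoop]$ sed -i 's/^namespaceID=.*/namespaceID=NN_NAMESPACE_ID/' /tmp/hadoop/dfs/data/current/VERSION
      [hduser@slave /usr/local/hadoop]$ bin/hadoop-daemon.sh start datanode
      [/bash]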


