Wednesday, August 29, 2012

MySQL to MongoDB SQL Mapping

FROM: SQL to Mongo Mapping Chart

Each entry below maps a SQL statement to the equivalent Mongo shell statement (SQL → Mongo).

DDL

  • CREATE TABLE users (id INT AUTO_INCREMENT PRIMARY KEY, a INT, b INT, age INT, name CHAR(32)) → implicit; can also be done explicitly with db.createCollection("users")
  • ALTER TABLE users ADD ... → implicit
  • DROP TABLE users → db.users.drop()
  • CREATE DATABASE db_name → implicit
  • SHOW DATABASES → show dbs
  • USE db_name → use db_name
  • SHOW TABLES → show collections
  • CREATE INDEX idx_name ON users(name) → db.users.ensureIndex({name : 1})
  • CREATE INDEX idx_name_ts ON users(name, ts DESC) → db.users.ensureIndex({name : 1, ts : -1})
  • DROP INDEX idx_name_ts ON users → db.users.dropIndex({name : 1, ts : -1})
  • SHOW INDEXES FROM users → db.users.getIndexes()

DML

  • INSERT INTO users VALUES(3, 5) → db.users.insert({a: 3, b: 5})
  • SELECT a, b FROM users → db.users.find({}, {a : 1, b : 1})
  • SELECT * FROM users → db.users.find()
  • SELECT * FROM users WHERE age = 33 → db.users.find({age : 33})
  • SELECT a, b FROM users WHERE age = 33 → db.users.find({age : 33}, {a : 1, b : 1})
  • SELECT * FROM users WHERE age = 33 ORDER BY name ASC → db.users.find({age : 33}).sort({name : 1})
  • SELECT * FROM users ORDER BY name DESC → db.users.find().sort({name : -1})
  • SELECT * FROM users WHERE age > 33 → db.users.find({age : {$gt : 33}})
  • SELECT * FROM users WHERE age != 33 → db.users.find({age : {$ne : 33}})
  • SELECT * FROM users WHERE age > 33 AND age <= 44 → db.users.find({age : {$gt : 33, $lte : 44}})
  • SELECT * FROM users WHERE name LIKE "%Joe%" → db.users.find({name : /Joe/})
  • SELECT * FROM users WHERE id IN (3, 4, 5) → db.users.find({id : {$in : [3, 4, 5]}})
  • SELECT * FROM users WHERE id NOT IN (3, 4, 5) → db.users.find({id : {$nin : [3, 4, 5]}})
  • SELECT * FROM users WHERE id = 2 → db.users.find({id : {$all : [2]}})
  • SELECT * FROM users WHERE a = 1 AND name = 'Joe' → db.users.find({a : 1, name : 'Joe'})
  • SELECT * FROM users WHERE a = 1 OR b = 2 → db.users.find({$or : [{a : 1}, {b : 2}]})
  • SELECT * FROM users WHERE name LIKE "Joe%" → db.users.find({name : /^Joe/})
  • SELECT * FROM users WHERE name LIKE "%Joe" → db.users.find({name : /Joe$/})
  • SELECT * FROM users LIMIT 20, 10 → db.users.find().limit(10).skip(20)
  • SELECT * FROM users LIMIT 1 → db.users.findOne()
  • SELECT id FROM users u, users_extend e WHERE u.user_id = e.user_id AND e.c = 12345 → db.users.find({"users_extend.c" : 12345}, {_id : 1}) (assumes users_extend is embedded in users)
  • SELECT customers.name FROM customers, orders WHERE orders.id = "q139" AND orders.custid = customers.id →
    var o = db.orders.findOne({_id : "q139"});
    var name = db.customers.findOne({_id : o.custid})
  • EXPLAIN SELECT * FROM users WHERE a = 3 → db.users.find({a : 3}).explain()
  • UPDATE users SET a = 1 WHERE b = 2 → db.users.update({b : 2}, {$set : {a : 1}}, false, true)
  • UPDATE users SET a = a + 2 WHERE b = 2 → db.users.update({b : 2}, {$inc : {a : 2}}, false, true)
  • DELETE FROM users WHERE a = 3 → db.users.remove({a : 3})
  • DELETE FROM users → db.users.remove()


Other statements available in MongoDB (a runnable sketch follows the list):

  • Field exists: db.users.find({a : {$exists : true}})

  • Field does not exist: db.users.find({a : {$exists : false}})

  • Match an array field with a given number of elements: db.users.find({name : {$size : 10}})

  • Query fields of an embedded document: db.users.find({"name.first" : 'Joe', "name.last" : 'David'})


  • var cursor = db.users.find()
    while(cursor.hasNext()) printjson(cursor.next())

  • db.users.find().forEach(printjson)

  • db.users.find().toArray()
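
These can be exercised from the command line with mongo --eval; a minimal sketch, assuming a mongod running on localhost and a disposable test database:
[bash]
# seed a document (the INSERT row from the chart above)
mongo test --eval "db.users.insert({a : 3, b : 5, age : 33, name : 'Joe'})"
# WHERE age > 30 ORDER BY name DESC, printed as JSON
mongo test --eval "db.users.find({age : {\$gt : 30}}).sort({name : -1}).forEach(printjson)"
# EXPLAIN equivalent
mongo test --eval "printjson(db.users.find({age : 33}).explain())"
[/bash]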

Wednesday, August 22, 2012

Gearman Installation and Common Problems

Gearman is a framework for farming out work. Because it provides APIs in many languages, it fits a wide range of settings; compared with Hadoop, Gearman leans toward task distribution and asynchronous execution. Gearman was originally used at LiveJournal to resize images.

Installation


[bash]
# CXX=/usr/bin/g++44 CC=/usr/bin/gcc44 ./configure --prefix=/usr/local/gearman --enable-static-boost --with-mysql=/usr/local/mysql/bin/mysql_config
# make
# make install
[/bash]


Installing the PHP extension


[bash]
[root@localdomain gearman-1.0.2]# /usr/local/php/bin/phpize
[root@localdomain gearman-1.0.2]# ./configure --with-php-config=/usr/local/php/bin/php-config --with-gearman=/usr/local/gearman/
[root@localdomain gearman-1.0.2]# make
[root@localdomain gearman-1.0.2]# make install
[/bash]
Afterwards, edit the PHP configuration file to load the gearman.so extension, restart PHP, and run php -i | grep gearman. Output similar to the following means the installation succeeded.
[bash]
[root@localdomain php]# bin/php -i | grep gearman
gearman
gearman support => enabled
libgearman version => 0.38
PWD => /usr/local/src/gearman-1.0.2
_SERVER["PWD"] => /usr/local/src/gearman-1.0.2
[/bash]

Using MySQL as the persistent queue store and starting the server



  1. Create the database and table on the MySQL server:
    [sql]
    CREATE TABLE `gearman_queue` (
    `unique_key` varchar(64) NOT NULL,
    `function_name` varchar(255) DEFAULT NULL,
    `priority` int(11) DEFAULT NULL,
    `data` longblob,
    `when_to_run` int(11) DEFAULT NULL,
    PRIMARY KEY (`unique_key`)
    ) ENGINE=MyISAM DEFAULT CHARSET=latin1
    [/sql]
    Gearman's MySQL persistent store uses the table name gearman_queue by default; this can be overridden when the server starts. The table definition, such as the storage engine, can also be adjusted for performance.

  2. Start the server
    [bash]
    [root@localdomain gearman]# /usr/local/gearman/sbin/gearmand -p 4730 --log-file=/tmp/gearmand-4730.log --pid-file=/tmp/gearmand-4730.pid -q MySQL --mysql-host=localhost --mysql-user=gearmand --mysql-password=123456 --mysql-db=gearman --verbose DEBUG -d
    [/bash]
    If the table created in the previous step is not named gearman_queue, point the server at it with the --mysql-table=TABLE_NAME option. A quick way to verify persistence is sketched below.
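
    To confirm that jobs are persisted, submit a background job while no worker is attached and look for it in the queue table; a rough sketch, assuming the server and credentials above (binary paths may differ):
    [bash]
    # no worker is registered for "wc", so the job should stay queued
    echo "hello" | /usr/local/gearman/bin/gearman -b -f wc -p 4730
    # the queued job should now appear in the MySQL table
    mysql -ugearmand -p123456 gearman -e "SELECT function_name, unique_key FROM gearman_queue"
    [/bash]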



Command-line tools



  • Start a worker: bin/gearman -w -f wc -- wc -l &

  • Run a client: bin/gearman -f wc < /tmp/gearmand-4730.log

  • Check server status: bin/gearadmin -h [--host] -p [--port] --status

  • List the server's running workers: bin/gearadmin -h [--host] -p [--port] --workers

  • Shut the server down: bin/gearadmin -h [--host] -p [--port] --shutdown

  • Drain a queued job named FUNCTION_NAME: bin/gearman -n -w -f FUNCTION_NAME > /dev/null

  • Drain 20 queued jobs named FUNCTION_NAME: gearman -c 20 -n -w -f FUNCTION_NAME > /dev/null




Common installation problems



  1. Build error: configure: error: cannot find Boost headers version >= 1.39.0
    [bash]
    [root@localdomain gearmand-0.38]# yum search boost
    [root@localdomain gearmand-0.38]# yum install boost.x86_64
    [root@localdomain gearmand-0.38]# yum install boost-devel.x86_64
    [/bash]

  2. uuid/uuid.h: No such file or directory
    Error output:
    [bash]
    [root@localdomain gearmand-0.38]# make
    make all-am
    make[1]: Entering directory `/usr/local/src/gearmand-0.38'
    CXX libgearman/libgearman_libgearman_la-add.lo
    libgearman/add.cc:53:23: error: uuid/uuid.h: No such file or directory
    libgearman/add.cc: In function 'gearman_task_st* add_task(gearman_client_st&, gearman_task_st*, void*, gearman_command_t, const gearman_string_t&, gearman_unique_t&, const gearman_string_t&, time_t, const gearman_actions_t&)':
    libgearman/add.cc:154: error: 'uuid_t' was not declared in this scope
    libgearman/add.cc:154: error: expected ';' before 'uuid'
    libgearman/add.cc:155: error: 'uuid' was not declared in this scope
    libgearman/add.cc:155: error: 'uuid_generate' was not declared in this scope
    libgearman/add.cc:156: error: 'uuid_unparse' was not declared in this scope
    libgearman/add.cc: In function 'gearman_task_st* add_reducer_task(gearman_client_st*, gearman_command_t, gearman_job_priority_t, const gearman_string_t&, const gearman_string_t&, const gearman_unique_t&, const gearman_string_t&, const gearman_actions_t&, time_t, void*)':
    libgearman/add.cc:263: error: 'uuid_t' was not declared in this scope
    libgearman/add.cc:263: error: expected ';' before 'uuid'
    libgearman/add.cc:334: error: 'uuid' was not declared in this scope
    libgearman/add.cc:334: error: 'uuid_generate' was not declared in this scope
    libgearman/add.cc:335: error: 'uuid_unparse' was not declared in this scope
    make[1]: *** [libgearman/libgearman_libgearman_la-add.lo] Error 1
    make[1]: Leaving directory `/usr/local/src/gearmand-0.34'
    make: *** [all] Error 2
    [/bash]
    Fix with yum as follows, or install from source; see uuid/uuid.h: No such file or directory
    [bash]
    [root@localdomain gearmand-0.38]# yum install e4fsprogs.x86_64
    [root@localdomain gearmand-0.38]# yum install e2fsprogs-devel.x86_64
    [/bash]

  3. configure: error: Unable to find libevent
    [bash]
    [root@localdomain gearmand-0.38]# yum install libevent.x86_64
    [root@localdomain gearmand-0.38]# yum install libevent-devel.x86_64
    [/bash]

  4. tr1/cinttypes: No such file or directory
    Error output:
    [bash gutter="false"]
    make[1]: Entering directory `/usr/local/src/gearmand-0.38'
    CXX libgearman/libgearman_libgearman_la-actions.lo
    In file included from ./libgearman/common.h:44,
    from libgearman/actions.cc:39:
    ./libgearman-1.0/gearman.h:53:27: error: tr1/cinttypes: No such file or directory
    make[1]: *** [libgearman/libgearman_libgearman_la-actions.lo] Error 1
    make[1]: Leaving directory `/usr/local/src/gearmand-0.38'
    [/bash]
    Fix:
    [bash]
    [root@localdomain gearmand-0.38]# yum install gcc44 gcc44-c++
    [/bash]

  5. With MySQL persistence for the Gearman server, make install kept failing. It was finally resolved by changing the MySQL includes in libgearman-server/plugins/queue/mysql/queue.cc to absolute paths:
    [bash]
    #include </usr/local/mysql/include/mysql/mysql.h>
    #include </usr/local/mysql/include/mysql/errmsg.h>
    [/bash]

  6. Trouble compiling gearman 0.22 with drizzle persistence on Centos 5.6

  7. ./configure doesn't fail for lack of c++

  8. compiling gearmand tr1/cinttypes in cent os



Notes:

  • Versions: gearmand 0.38, libgearman

  • The system is 64-bit, so the packages above are all 64-bit builds; on a different architecture, install the matching versions instead.

  • When installing with yum, you may get errors about missing repositories or unavailable versions. Add the following to /etc/yum.repos.d/drizzle.repo and re-run the command.
    [bash gutter="false"]
    [drizzle]
    name=drizzle
    baseurl=http://rpm.drizzle.org/7-dev/redhat/$releasever/$basearch/
    enabled=1
    gpgcheck=0

    [drizzle-src]
    name=drizzle-src
    baseurl=http://rpm.drizzle.org/7-dev/redhat/$releasever/source
    enabled=1
    gpgcheck=0

    [drizzle-dev]
    name=drizzle-dev
    baseurl=http://rpm.drizzle.org/7-dev/redhat/5/$basearch/
    enabled=0
    gpgcheck=0
    [/bash]

Hadoop Common Problems and Notes


  1. Warning: $HADOOP_HOME is deprecated.
    Running a Hadoop command produces the warning:
    [bash]
    [hduser@master hadoop]$ ./bin/hadoop
    Warning: $HADOOP_HOME is deprecated.
    [/bash]
    To suppress the warning, set the environment variable HADOOP_HOME_WARN_SUPPRESS=1. The simplest way is to add it to your shell configuration file; assuming the shell is bash:
    [bash]
    vim $HOME/.bashrc
    export HADOOP_HOME_WARN_SUPPRESS=1
    [/bash]

  2. Name node is in safe mode.
    Operations on Hadoop fail with the message Name node is in safe mode, e.g.
    [bash]
    [hduser@master hadoop]$ hadoop fs -put word-input/ wordinput
    put: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot create directory /user/hduser/wordinput. Name node is in safe mode.
    [/bash]
    To leave safe mode immediately, run hadoop dfsadmin -safemode leave.

    Background:
    The NameNode enters safe mode at startup. If the fraction of blocks missing from DataNodes reaches (1 - dfs.safemode.threshold.pct), the system stays in safe mode, i.e. read-only.
    dfs.safemode.threshold.pct (default 0.999f, defined in conf/hdfs-site.xml) means that at startup, HDFS leaves safe mode only once the DataNodes have reported 0.999 of the blocks that satisfy the minimum replication level (default 1, settable via dfs.replication.min); until then it stays read-only. Set to 1, HDFS effectively never leaves safe mode; set to 0, the NameNode skips safe mode at startup.

    Here is a log line from one NameNode startup, where the reported block ratio of 1.0 is above the 0.999 threshold:

    The ratio of reported blocks 1.0000 has reached the threshold 0.9990. Safe mode will be turned off automatically in 18 seconds.


    So besides forcing an exit with hadoop dfsadmin -safemode leave, lowering dfs.safemode.threshold.pct is another way out of safe mode.

    Safe mode is controlled with hadoop dfsadmin -safemode [value] (a scripted example follows this list):

    • enter: enter safe mode.

    • leave: force the NameNode to leave safe mode.

    • get: report whether safe mode is on.

    • wait: block until safe mode ends.
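
    For scripts, the wait subcommand is handy; a small sketch that blocks until the NameNode leaves safe mode, then loads data:
    [bash]
    hadoop dfsadmin -safemode wait
    hadoop fs -put word-input/ wordinput
    [/bash]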



  3. Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out
    Hadoop opens many files during processing. The usual system default is 1024 (check with ulimit -a), which is enough in normal use but too low when Hadoop processes large data volumes. Two files need changes (a quick check follows the list):

    1. The file /etc/security/limits.conf:
      [bash]
      vim /etc/security/limits.conf
      # add
      * soft nofile 102400
      * hard nofile 409600
      [/bash]

    2. The file /etc/pam.d/login
      [bash]
      # on 32-bit machines, add
      session required /lib/security/pam_limits.so
      # on 64-bit machines, add
      session required /lib64/security/pam_limits.so
      [/bash]
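
    After editing both files, log in again and confirm the new limit took effect:
    [bash]
    # should now print 102400 rather than 1024
    ulimit -n
    [/bash]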



  4. Too many fetch-failures
    This error indicates connection problems between cluster nodes.

    1. Check /etc/hosts on every node: besides mapping the local IP to its hostname, it must contain the IPs and hostnames of all nodes in the cluster (a sample follows this list).

    2. Check $HOME/.ssh/authorized_keys: it must contain the public keys of the accessing users.

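
    For the two-host cluster used in the multi-node article below, a complete /etc/hosts could look like this (the IPs are assumptions):
    [bash gutter="false"]
    127.0.0.1   localhost
    192.168.0.1 master
    192.168.0.2 slave
    [/bash]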




  5. Maps run fast, but reduces are very slow and repeatedly fall back to reduce=0%
    Edit conf/hadoop-env.sh and set export HADOOP_HEAPSIZE=4000

  6. DataNodes start but cannot be reached or shut down.
    Before re-formatting a new distributed filesystem, delete the directory given by dfs.name.dir on the NameNode (the local path where the NameNode persists the namespace and transaction log), and also delete the dfs.data.dir directories on every DataNode (the local paths where DataNodes store block data). Hadoop stamps each newly formatted namespace with a creation-time version (see the VERSION file under /tmp/hadoop/dfs/name/current/), so before re-formatting, remove the name directory and every DataNode's dfs.data.dir; this keeps the NameNode and DataNode version info consistent. Careful: these deletions are dangerous!

  7. java.io.IOException: Could not obtain block: blk_*********_*** file=/user/hduser/warehouse/src_***_log/src_***_log
    This usually means a node could not be reached.

  8. java.lang.OutOfMemoryError: Java heap space
    The JVM heap is too small; increase the JVM heap size on all DataNodes, e.g.
    [bash gutter="false"]
    Java -Xms1024m -Xmx4096m
    [/bash]
    As a rule, the JVM's maximum heap should be about half of total memory; with 8 GB of RAM that gives 4096 MB, though this may still not be optimal. (Some prefer 0.8 of physical memory.)

  9. Maps reach 100%, reduces stall around 98%, and the job fails
    Check whether mapred.map.tasks is set too high, which leads to processing a large number of small files, and whether mapred.reduce.parallel.copies is set sensibly.

  10. hadoop.tmp.dir
    After changing hadoop.tmp.dir, re-format HDFS on the master with hadoop namenode -format, and delete the slaves' dfs.data.dir data directories.


  11. dfs.replication and dfs.block.size
    dfs.replication is the number of data replicas; the default is 3. dfs.block.size is the block size in bytes; it must be a multiple of 512, because files are integrity-checked with CRC and the default 512 bytes is the minimum checksum unit (see the snippet below).
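
    Both settings belong in conf/hdfs-site.xml; a sketch (the 64 MB block size is only an illustrative value):
    [bash gutter="false"]
    <property>
    <name>dfs.replication</name>
    <value>3</value>
    </property>
    <property>
    <name>dfs.block.size</name>
    <value>67108864</value> <!-- 64 MB; a multiple of 512 -->
    </property>
    [/bash]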

  12. ipc.Client: Retrying connect to server: localhost/127.0.0.1:54310. Already tried

    • This can be caused by a failed HDFS format, often because the user lacks permission on the hadoop.tmp.dir directory; run:
      [bash]
      # assuming hadoop.tmp.dir is /tmp/hadoop
      $ sudo chown -R hduser /tmp/hadoop
      [/bash]


    • [bash gutter="false"]
      [hduser@hadoop hadoop]# hadoop fs -ls
      12/08/01 15:14:51 INFO ipc.Client: Retrying connect to server: hadoop.master/192.168.0.3:54310. Already tried 0 time(s).
      12/08/01 15:14:52 INFO ipc.Client: Retrying connect to server: hadoop.master/192.168.0.3:54310. Already tried 1 time(s).
      12/08/01 15:14:53 INFO ipc.Client: Retrying connect to server: hadoop.master/192.168.0.3:54310. Already tried 2 time(s).
      12/08/01 15:14:54 INFO ipc.Client: Retrying connect to server: hadoop.master/192.168.0.3:54310. Already tried 3 time(s).
      12/08/01 15:14:55 INFO ipc.Client: Retrying connect to server: hadoop.master/192.168.0.3:54310. Already tried 4 time(s).
      12/08/01 15:14:56 INFO ipc.Client: Retrying connect to server: hadoop.master/192.168.0.3:54310. Already tried 5 time(s).
      12/08/01 15:14:57 INFO ipc.Client: Retrying connect to server: hadoop.master/192.168.0.3:54310. Already tried 6 time(s).
      12/08/01 15:14:58 INFO ipc.Client: Retrying connect to server: hadoop.master/192.168.0.3:54310. Already tried 7 time(s).
      12/08/01 15:14:59 INFO ipc.Client: Retrying connect to server: hadoop.master/192.168.0.3:54310. Already tried 8 time(s).
      12/08/01 15:15:00 INFO ipc.Client: Retrying connect to server: hadoop.master/192.168.0.3:54310. Already tried 9 time(s).
      Bad connection to FS. command aborted. exception: Call to hadoop.master/192.168.0.3:54310 failed on connection exception: java.net.ConnectException: Connection refused
      [/bash]
      The service is not running. Start it on the master:
      [bash]
      [hduser@master /usr/local/hadoop]$ bin/start-dfs.sh
      [/bash]







Sunday, August 5, 2012

Running Hadoop (Multi-Node Cluster)

FROM: Running Hadoop On Ubuntu Linux (Multi-Node Cluster)



Structure


Here, two hosts are used to build a multi-node Hadoop cluster. The simplest approach is to set up single-node Hadoop on each host first, then make one of them the master (and, since there are only two hosts, also a slave) and the other one a slave.


Tutorial approach and structure.




Prerequisites


Configuring single-node cluster first

See Running Hadoop (Single-Node Cluster). Identical settings on both hosts are recommended: install directory, Java environment, configuration, and so on.

Before starting the configuration below, stop the single-node cluster on both hosts.

Networking

The essential point here is that the two test hosts can reach each other. For convenience, give the master the IP 192.168.0.1 and the slave 192.168.0.2, and update /etc/hosts on both hosts:
[bash gutter="false"]
#/etc/hosts (for master AND slave)
192.168.0.1 master
192.168.0.2 slave
[/bash]

SSH access

The hduser account on master must be able to SSH without a password to the local machine and to slave (ssh hduser@localhost, ssh hduser@slave). If you followed Running Hadoop (Single-Node Cluster), just append hduser@master's public SSH key ($HOME/.ssh/id_rsa.pub) to hduser@slave's authorized_keys file ($HOME/.ssh/authorized_keys), or use the following SSH command:
[bash]
[hduser@master ~]$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@slave
[/bash]
The command prompts for hduser@slave's login password, then copies the public SSH key into $HOME/.ssh/authorized_keys.
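
If ssh-copy-id is not available, the key can be appended manually; an equivalent sketch:
[bash]
[hduser@master ~]$ cat $HOME/.ssh/id_rsa.pub | ssh hduser@slave 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'
[/bash]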

As a final step, verify that hduser on master can reach both master and slave:

master to master
[bash gutter="false"]
[hduser@master ~]$ ssh master
The authenticity of host 'master (192.168.0.1)' can't be established.
RSA key fingerprint is 3b:21:b3:c0:21:5c:7c:54:2f:1e:2d:96:79:eb:7f:95.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'master' (RSA) to the list of known hosts.
Linux master 2.6.20-16-386 #2 Thu Jun 7 20:16:13 UTC 2007 i686
...
[hduser@master ~]$
[/bash]

master to slave:
[bash gutter="false"]
[hduser@master ~]$ ssh slave
The authenticity of host 'slave (192.168.0.2)' can't be established.
RSA key fingerprint is 74:d7:61:86:db:86:8f:31:90:9c:68:b0:13:88:52:72.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'slave' (RSA) to the list of known hosts.
Ubuntu 10.04
...
[hduser@slave ~]$
[/bash]

Hadoop


Cluster overview

The following describes how to configure a multi-node Hadoop cluster with one host as master and one as slave. Since there are only two hosts but data transfer and processing across nodes should be exercised, the master host also acts as a slave node.


How the final multi-node cluster will look like.



The master node runs a "master" daemon in each layer: the NameNode for the HDFS storage layer and the JobTracker for the MapReduce processing layer. Both hosts run the "slave" daemons: the DataNode for the HDFS layer and the TaskTracker for the MapReduce layer. Put simply, the "master" daemons coordinate and manage, while the "slave" daemons do the actual data storage and processing.

Masters vs Slaves

From the Hadoop 1.x documentation:
Typically one machine in the cluster is designated as the NameNode and another machine as the JobTracker, exclusively. These are the actual "master nodes". The rest of the machines in the cluster act as both DataNode and TaskTracker. These are the slaves or "worker nodes".


Configuration


  1. conf/masters (master only)
    The conf/masters file defines which host of the multi-node cluster starts the secondary NameNode; in this example, the master host. The hosts running the primary NameNode and the JobTracker are determined by where you run the bin/start-dfs.sh and bin/start-mapred.sh scripts (with bin/start-all.sh, the primary NameNode and the JobTracker end up on the same host). Note that Hadoop daemons can also be started by hand with bin/hadoop-daemon.sh start [namenode | secondarynamenode | datanode | jobtracker | tasktracker], in which case the configuration files conf/masters and conf/slaves are not consulted.

    Some details on conf/masters from the Hadoop HDFS user guide:
    The secondary NameNode merges the fsimage and the edits log files periodically and keeps edits log size within a limit. It is usually run on a different machine than the primary NameNode since its memory requirements are on the same order as the primary NameNode. The secondary NameNode is started by bin/start-dfs.sh on the nodes specified in conf/masters file.


    To repeat: whichever host runs bin/start-dfs.sh becomes the primary NameNode.

    Update conf/masters on master as follows:
    [bash gutter="false"]
    master
    [/bash]

  2. conf/slaves (master only)
    conf/slaves lists the hosts running the slave daemons (DataNodes and TaskTrackers), one per line. In this example both hosts should process and store data, so both master and slave act as Hadoop slaves.

    Update conf/slaves on the master host as follows:
    [bash gutter="false"]
    master
    slave
    [/bash]

    If the cluster has additional slave nodes, just add them to conf/slaves, one per line:
    [bash]
    master
    slave
    anotherslave01
    anotherslave02
    anotherslave03
    [/bash]
    Note: the conf/slaves file on master is used only by scripts like bin/start-dfs.sh or bin/stop-dfs.sh. For example, to add a new DataNode to a running cluster on the fly, simply run bin/hadoop-daemon.sh start datanode on the new slave host. Using the conf/slaves file on the master simply makes "full" cluster restarts easier.


  3. conf/*-site.xml (all machines)
    Note: As of Hadoop 0.20.x and 1.x, the configuration settings previously found in hadoop-site.xml were moved to conf/core-site.xml(fs.default.name), conf/mapred-site.xml(mapred.job.tracker) and conf/hdfs-site.xml(dfs.replication).


    If all hosts in the cluster were previously configured as single-node clusters, only a few options need changing:

    Important: You have to change the configuration files conf/core-site.xml, conf/mapred-site.xml and conf/hdfs-site.xml on ALL machines as follows.

    First, change fs.default.name in conf/core-site.xml, which specifies the NameNode (the HDFS master) host and port; in this example, the master host:
    [bash gutter="false"]
    <!-- In: conf/core-site.xml -->
    <property>
    <name>fs.default.name</name>
    <value>hdfs://master:54310</value>
    <description>The name of the default file system. A URI whose
    scheme and authority determine the FileSystem implementation. The
    uri's scheme determines the config property (fs.SCHEME.impl) naming
    the FileSystem implementation class. The uri's authority is used to
    determine the host, port, etc. for a filesystem.</description>
    </property>
    [/bash]

    Second, change mapred.job.tracker in conf/mapred-site.xml, which specifies the JobTracker (MapReduce master) host and port; in this example, again the master host.
    [bash gutter="false"]
    <!-- In: conf/mapred-site.xml -->
    <property>
    <name>mapred.job.tracker</name>
    <value>master:54311</value>
    <description>The host and port that the MapReduce job tracker runs
    at. If "local", then jobs are run in-process as a single map
    and reduce task.
    </description>
    </property>
    [/bash]

    Third, change the default block replication dfs.replication in conf/hdfs-site.xml, which defines how many hosts keep a copy of each single file. If the value exceeds the number of available slave nodes (i.e. DataNodes), the logs fill with errors like (Zero targets found, forbidden1.size=1).

    The default for dfs.replication is 3. With only 2 slave nodes here, set it to 2:
    [bash gutter="false"]
    <!-- In: conf/hdfs-site.xml -->
    <property>
    <name>dfs.replication</name>
    <value>2</value>
    <description>Default block replication.
    The actual number of replications can be specified when the file is created.
    The default is used if replication is not specified in create time.
    </description>
    </property>
    [/bash]

  4. Additional settings
    The Hadoop API Overview (bottom of the page) lists other configuration options worth noting:

    In conf/mapred-site.xml (a combined sketch follows the list):

    • mapred.local.dir Determines where temporary MapReduce data is written. It also may be a list of directories.

    • mapred.map.tasks As a rule of thumb, use 10x the number of slaves (i.e., the number of TaskTrackers).

    • mapred.reduce.tasks As a rule of thumb, use 2x the number of slave processors (i.e., number of TaskTrackers).
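
    A combined sketch of these settings for the two-TaskTracker cluster in this article (the values are illustrative, not tuned recommendations):
    [bash gutter="false"]
    <!-- In: conf/mapred-site.xml -->
    <property>
    <name>mapred.local.dir</name>
    <value>/tmp/hadoop/mapred/local</value>
    </property>
    <property>
    <name>mapred.map.tasks</name>
    <value>20</value> <!-- ~10x the number of TaskTrackers -->
    </property>
    <property>
    <name>mapred.reduce.tasks</name>
    <value>4</value> <!-- ~2x the number of TaskTrackers -->
    </property>
    [/bash]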





Formatting the HDFS filesystem via the NameNode

Before starting the new Hadoop cluster, format the HDFS filesystem via the NameNode. Note: do not format a running Hadoop filesystem, or all data in HDFS will be lost.

Formatting HDFS really just initializes the directory given by dfs.name.dir. Run:
[bash]
[hduser@master /usr/local/hadoop]$ bin/hadoop namenode -format
... INFO dfs.Storage: Storage directory /tmp/hadoop/dfs/name has been successfully formatted.
[hduser@master /usr/local/hadoop]$
[/bash]
Background: the HDFS name table is stored on the NameNode's local filesystem in the directory specified by dfs.name.dir. The NameNode uses the name table to store tracking and coordination information for the DataNodes.

Starting the multi-node cluster

Starting the cluster takes two steps. First, start the HDFS daemons: the NameNode daemon on master and the DataNode daemons on all slaves (here, master and slave). Second, start the MapReduce daemons: the JobTracker on master and the TaskTracker daemons on all slaves (here, master and slave).


  1. HDFS daemons
    Run bin/start-dfs.sh on the host meant to be the primary NameNode; the hosts listed in conf/slaves then act as DataNodes.

    In this example, run bin/start-dfs.sh on master:
    [bash gutter="false"]
    [hduser@master /usr/local/hadoop]$ bin/start-dfs.sh
    starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-namenode-master.out
    slave: Ubuntu 10.04
    slave: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-slave.out
    master: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-master.out
    master: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-secondarynamenode-master.out
    [hduser@master /usr/local/hadoop]$
    [/bash]

    On slave, check the log file logs/hadoop-hduser-datanode-slave.log to see whether the start succeeded.
    [bash gutter="false"]
    ... INFO org.apache.hadoop.dfs.Storage: Storage directory /tmp/hadoop/dfs/data is not formatted.
    ... INFO org.apache.hadoop.dfs.Storage: Formatting ...
    ... INFO org.apache.hadoop.dfs.DataNode: Opened server at 50010
    ... INFO org.mortbay.util.Credential: Checking Resource aliases
    ... INFO org.mortbay.http.HttpServer: Version Jetty/5.1.4
    ... INFO org.mortbay.util.Container: Started org.mortbay.jetty.servlet.WebApplicationHandler@17a8a02
    ... INFO org.mortbay.util.Container: Started WebApplicationContext[/,/]
    ... INFO org.mortbay.util.Container: Started HttpContext[/logs,/logs]
    ... INFO org.mortbay.util.Container: Started HttpContext[/static,/static]
    ... INFO org.mortbay.http.SocketListener: Started SocketListener on 0.0.0.0:50075
    ... INFO org.mortbay.util.Container: Started org.mortbay.jetty.Server@56a499
    ... INFO org.apache.hadoop.dfs.DataNode: Starting DataNode in: FSDataset{dirpath=' /tmp/hadoop/dfs/data/current'}
    ... INFO org.apache.hadoop.dfs.DataNode: using BLOCKREPORT_INTERVAL of 3538203msec
    [/bash]

    The slave log above shows it formatting the storage directory given by dfs.data.dir; if that directory does not exist before formatting, it is created automatically.

    At this point, the following Java processes should be running on master:
    [bash]
    [hduser@master /usr/local/hadoop]$ jps
    14799 NameNode
    15314 Jps
    14880 DataNode
    14977 SecondaryNameNode
    [hduser@master /usr/local/hadoop]$
    [/bash]
    and these on slave:
    [bash]
    [hduser@slave /usr/local/hadoop]$ jps
    15183 DataNode
    15616 Jps
    [hduser@slave /usr/local/hadoop]$
    [/bash]

  2. MapReduce daemons
    Run bin/start-mapred.sh on the host meant to run the JobTracker; the hosts listed in conf/slaves then run TaskTrackers.

    In this example, run bin/start-mapred.sh on master:
    [bash gutter="false"]
    [hduser@master /usr/local/hadoop]$ bin/start-mapred.sh
    starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-jobtracker-master.out
    slave: Ubuntu 10.04
    slave: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-tasktracker-slave.out
    master: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-tasktracker-master.out
    [hduser@master /usr/local/hadoop]$
    [/bash]

    On slave, check logs/hadoop-hduser-tasktracker-slave.log to see whether the start succeeded.
    [bash gutter="false"]
    ... INFO org.mortbay.util.Credential: Checking Resource aliases
    ... INFO org.mortbay.http.HttpServer: Version Jetty/5.1.4
    ... INFO org.mortbay.util.Container: Started org.mortbay.jetty.servlet.WebApplicationHandler@d19bc8
    ... INFO org.mortbay.util.Container: Started WebApplicationContext[/,/]
    ... INFO org.mortbay.util.Container: Started HttpContext[/logs,/logs]
    ... INFO org.mortbay.util.Container: Started HttpContext[/static,/static]
    ... INFO org.mortbay.http.SocketListener: Started SocketListener on 0.0.0.0:50060
    ... INFO org.mortbay.util.Container: Started org.mortbay.jetty.Server@1e63e3d
    ... INFO org.apache.hadoop.ipc.Server: IPC Server listener on 50050: starting
    ... INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50050: starting
    ... INFO org.apache.hadoop.mapred.TaskTracker: TaskTracker up at: 50050
    ... INFO org.apache.hadoop.mapred.TaskTracker: Starting tracker tracker_slave:50050
    ... INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 50050: starting
    ... INFO org.apache.hadoop.mapred.TaskTracker: Starting thread: Map-events fetcher for all reduce tasks on tracker_slave:50050
    [/bash]

    At this point, master should show the following Java processes:
    [bash]
    [hduser@master /usr/local/hadoop]$ jps
    16017 Jps
    14799 NameNode
    15686 TaskTracker
    14880 DataNode
    15596 JobTracker
    14977 SecondaryNameNode
    [hduser@master /usr/local/hadoop]$
    [/bash]
    and slave these:
    [bash]
    [hduser@slave /usr/local/hadoop]$ jps
    15183 DataNode
    15897 TaskTracker
    16284 Jps
    [hduser@slave /usr/local/hadoop]$
    [/bash]



Stopping the multi-node cluster

Like startup, stopping the cluster takes two steps, in the reverse order. First stop the MapReduce daemons: the JobTracker on master and the TaskTrackers on all slaves (here, master and slave). Then stop the HDFS daemons: the NameNode daemon on master and the DataNode daemons on all slaves (here, master and slave).


  1. MapReduce daemons
    Run bin/stop-mapred.sh on the host running the JobTracker. This stops the JobTracker daemon and also the TaskTracker daemons on the hosts listed in conf/slaves.

    In this example, run bin/stop-mapred.sh on master:
    [bash]
    [hduser@master /usr/local/hadoop]$ bin/stop-mapred.sh
    stopping jobtracker
    slave: Ubuntu 10.04
    master: stopping tasktracker
    slave: stopping tasktracker
    [hduser@master /usr/local/hadoop]$
    [/bash]

    At this point the Java processes on master are:
    [bash]
    [hduser@master /usr/local/hadoop]$ jps
    14799 NameNode
    18386 Jps
    14880 DataNode
    14977 SecondaryNameNode
    [hduser@master /usr/local/hadoop]$
    [/bash]

    and on slave:
    [bash]
    [hduser@slave /usr/local/hadoop]$ jps
    15183 DataNode
    18636 Jps
    [hduser@slave /usr/local/hadoop]$
    [/bash]

  2. HDFS daemons
    Run bin/stop-dfs.sh on the host running the NameNode. This stops the NameNode daemon and also the DataNodes on all hosts listed in conf/slaves.

    In this example, run bin/stop-dfs.sh on master:
    [bash]
    [hduser@master /usr/local/hadoop]$ bin/stop-dfs.sh
    stopping namenode
    slave: Ubuntu 10.04
    slave: stopping datanode
    master: stopping datanode
    master: stopping secondarynamenode
    [hduser@master /usr/local/hadoop]$
    [/bash]

    The remaining Java processes on master are:
    [bash]
    [hduser@master /usr/local/hadoop]$ jps
    18670 Jps
    [hduser@master /usr/local/hadoop]$
    [/bash]

    and on slave:
    [bash]
    [hduser@slave /usr/local/hadoop]$jps
    18894 Jps
    [hduser@slave /usr/local/hadoop]$
    [/bash]




Running a MapReduce job

Run a MapReduce job as described in the single-node cluster tutorial.


A large dataset is recommended so that Map and Reduce tasks run on both master and slave. The Project Gutenberg etexts used for testing include the three documents from the single-node walkthrough.


Download the Plain Text UTF-8 versions of the files, copy them into HDFS, run the WordCount MapReduce job on master, and then inspect the job's output in HDFS.

Example job output on master:
[bash gutter="false"]
[hduser@master /usr/local/hadoop]$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
... INFO mapred.FileInputFormat: Total input paths to process : 7
... INFO mapred.JobClient: Running job: job_0001
... INFO mapred.JobClient: map 0% reduce 0%
... INFO mapred.JobClient: map 28% reduce 0%
... INFO mapred.JobClient: map 57% reduce 0%
... INFO mapred.JobClient: map 71% reduce 0%
... INFO mapred.JobClient: map 100% reduce 9%
... INFO mapred.JobClient: map 100% reduce 68%
... INFO mapred.JobClient: map 100% reduce 100%
.... INFO mapred.JobClient: Job complete: job_0001
... INFO mapred.JobClient: Counters: 11
... INFO mapred.JobClient: org.apache.hadoop.examples.WordCount$Counter
... INFO mapred.JobClient: WORDS=1173099
... INFO mapred.JobClient: VALUES=1368295
... INFO mapred.JobClient: Map-Reduce Framework
... INFO mapred.JobClient: Map input records=136582
... INFO mapred.JobClient: Map output records=1173099
... INFO mapred.JobClient: Map input bytes=6925391
... INFO mapred.JobClient: Map output bytes=11403568
... INFO mapred.JobClient: Combine input records=1173099
... INFO mapred.JobClient: Combine output records=195196
... INFO mapred.JobClient: Reduce input groups=131275
... INFO mapred.JobClient: Reduce input records=195196
... INFO mapred.JobClient: Reduce output records=131275
[hduser@master /usr/local/hadoop]$
[/bash]

DataNode activity on slave...
[bash gutter="false"]
# from logs/hadoop-hduser-datanode-slave.log on slave
... INFO org.apache.hadoop.dfs.DataNode: Received block blk_5693969390309798974 from /192.168.0.1
... INFO org.apache.hadoop.dfs.DataNode: Received block blk_7671491277162757352 from /192.168.0.1
<<>>
... INFO org.apache.hadoop.dfs.DataNode: Served block blk_-7112133651100166921 to /192.168.0.2
... INFO org.apache.hadoop.dfs.DataNode: Served block blk_-7545080504225510279 to /192.168.0.2
... INFO org.apache.hadoop.dfs.DataNode: Served block blk_-4114464184254609514 to /192.168.0.2
... INFO org.apache.hadoop.dfs.DataNode: Served block blk_-4561652742730019659 to /192.168.0.2
<<>>
... INFO org.apache.hadoop.dfs.DataNode: Received block blk_-2075170214887808716 from /192.168.0.2 and mirrored to /192.168.0.1:50010
... INFO org.apache.hadoop.dfs.DataNode: Received block blk_1422409522782401364 from /192.168.0.2 and mirrored to /192.168.0.1:50010
... INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_-2942401177672711226 file /tmp/hadoop/dfs/data/current/blk_-2942401177672711226
... INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_-3019298164878756077 file /tmp/hadoop/dfs/data/current/blk_-3019298164878756077
[/bash]

TaskTracker activity on slave...
[bash gutter="false"]
# from logs/hadoop-hduser-tasktracker-slave.log on slave
... INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction: task_0001_m_000000_0
... INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction: task_0001_m_000001_0
... task_0001_m_000001_0 0.08362164% hdfs://master:54310/user/hduser/gutenberg/ulyss12.txt:0+1561677
... task_0001_m_000000_0 0.07951202% hdfs://master:54310/user/hduser/gutenberg/19699.txt:0+1945731
<<>>
... task_0001_m_000001_0 0.35611463% hdfs://master:54310/user/hduser/gutenberg/ulyss12.txt:0+1561677
... Task task_0001_m_000001_0 is done.
... task_0001_m_000000_0 1.0% hdfs://master:54310/user/hduser/gutenberg/19699.txt:0+1945731
... LaunchTaskAction: task_0001_m_000006_0
... LaunchTaskAction: task_0001_r_000000_0
... task_0001_m_000000_0 1.0% hdfs://master:54310/user/hduser/gutenberg/19699.txt:0+1945731
... Task task_0001_m_000000_0 is done.
... task_0001_m_000006_0 0.6844295% hdfs://master:54310/user/hduser/gutenberg/132.txt:0+343695
... task_0001_r_000000_0 0.095238104% reduce > copy (2 of 7 at 1.68 MB/s) >
... task_0001_m_000006_0 1.0% hdfs://master:54310/user/hduser/gutenberg/132.txt:0+343695
... Task task_0001_m_000006_0 is done.
... task_0001_r_000000_0 0.14285716% reduce > copy (3 of 7 at 1.02 MB/s) >
<<>>
... task_0001_r_000000_0 0.14285716% reduce > copy (3 of 7 at 1.02 MB/s) >
... task_0001_r_000000_0 0.23809525% reduce > copy (5 of 7 at 0.32 MB/s) >
... task_0001_r_000000_0 0.6859089% reduce > reduce
... task_0001_r_000000_0 0.7897389% reduce > reduce
... task_0001_r_000000_0 0.86783284% reduce > reduce
... Task task_0001_r_000000_0 is done.
... Received 'KillJobAction' for job: job_0001
... task_0001_r_000000_0 done; removing files.
... task_0001_m_000000_0 done; removing files.
... task_0001_m_000006_0 done; removing files.
... task_0001_m_000001_0 done; removing files.
[/bash]

To retrieve the job result, see the "retrieve the job result from HDFS" step in Running a MapReduce job.

Caveats


  1. java.io.IOException: Incompatible namespaceIDs
    If you see the error java.io.IOException: Incompatible namespaceIDs in a DataNode's log file (logs/hadoop-hduser-datanode-.log), you are probably hitting the issue described in HDFS-107 (formerly HADOOP-1212).

    The full error looks like this:
    [bash gutter="false"]
    ... ERROR org.apache.hadoop.dfs.DataNode: java.io.IOException: Incompatible namespaceIDs in /tmp/hadoop/dfs/data: namenode namespaceID = 308967713; datanode namespaceID = 113030094
    at org.apache.hadoop.dfs.DataStorage.doTransition(DataStorage.java:281)
    at org.apache.hadoop.dfs.DataStorage.recoverTransitionRead(DataStorage.java:121)
    at org.apache.hadoop.dfs.DataNode.startDataNode(DataNode.java:230)
    at org.apache.hadoop.dfs.DataNode.(DataNode.java:199)
    at org.apache.hadoop.dfs.DataNode.makeInstance(DataNode.java:1202)
    at org.apache.hadoop.dfs.DataNode.run(DataNode.java:1146)
    at org.apache.hadoop.dfs.DataNode.createDataNode(DataNode.java:1167)
    at org.apache.hadoop.dfs.DataNode.main(DataNode.java:1326)
    [/bash]
    Two workarounds:

    • Workaround 1: Start from scratch
      The original author has tested this method, but it is cumbersome and has unpleasant consequences. Steps:

      1. Stop the cluster

      2. Delete the data directory on the problematic DataNode, i.e. the directory given by dfs.data.dir in conf/hdfs-site.xml; in this example, /tmp/hadoop/dfs/data.

      3. Re-format the NameNode (note: all HDFS data is lost after this step!)

      4. Restart the cluster



    • Workaround 2: Updating the namespaceID of problematic DataNodes
      Proposed by Jared Stehler; the original author has not tested it. It is far less invasive: only one file on each problematic DataNode needs editing.

      1. Stop the DataNode

      2. Edit the value of namespaceID in /current/VERSION to match the current NameNode's value.

      3. Restart the DataNode



      In this example, the relevant files are:

      • NameNode: /tmp/hadoop/dfs/name/current/VERSION

      • DataNode: /tmp/hadoop/dfs/data/current/VERSION (Background: dfs.data.dir defaults to ${hadoop.tmp.dir}/dfs/data)



      Example contents of a VERSION file (a scripted edit follows):
      [bash gutter="false"]
      # contents of /current/VERSION
      namespaceID=393514426
      storageID=DS-1706792599-10.10.10.1-50010-1204306713481
      cTime=1215607609074
      storageType=DATA_NODE
      layoutVersion=-13
      [/bash]
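
      The edit can be scripted on the DataNode; a sketch using the NameNode's namespaceID from the error message above (308967713):
      [bash]
      # run on the problematic DataNode, after stopping it
      sed -i 's/^namespaceID=.*/namespaceID=308967713/' /tmp/hadoop/dfs/data/current/VERSION
      [/bash]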



Running Hadoop (Single-Node Cluster)

FROM: Running Hadoop On Ubuntu Linux (Single-Node Cluster)


Hadoop is a Java framework for distributed processing of large data volumes, consisting of MapReduce and HDFS. HDFS is a distributed filesystem designed to run on inexpensive hardware; it is highly fault-tolerant and offers high throughput, providing the storage layer for distributed computation. MapReduce handles splitting up tasks and aggregating the results;
for details, see the paper MapReduce: Simplified Data Processing on Large Clusters

Prerequisites


Sun Java 6

Hadoop requires a working Java 1.5.x (aka 5.0.x) environment; Java 1.6.x (aka 6.0.x aka 6) is recommended. For installation, see here.
Adding a dedicated system user

[bash]
addgroup hadoop
adduser --ingroup hadoop hduser
[/bash]
Configuring SSH

Hadoop controls its nodes over SSH. Assume the test machine already provides an SSH service and allows SSH public-key authentication. The following steps generate an SSH key for the user hduser:
[bash]
su - hduser
ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hduser@localdomain
The key's randomart image is:
[...snipp...]
[/bash]
The second line creates an RSA key pair with an empty passphrase. An empty passphrase is generally not recommended, but here it saves entering the passphrase manually every time Hadoop interacts with its nodes.
[bash]
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
[/bash]
SSH will now use the newly generated key.
As a final step, test that hduser can SSH to the local machine; this step also adds the machine's host key fingerprint to hduser's known_hosts file. If SSH needs special configuration, such as a non-standard port instead of 22, define those options in $HOME/.ssh/config (see the sketch after the notes below).
[bash]
[hduser@localdomain ~]# ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is d7:87:25:47:ae:02:00:eb:1d:75:4f:bb:44:f9:36:26.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
...
[hduser@localdomain ~]
[/bash]
If the connection fails, these notes may help:

  • ssh -vvv localhost enables debugging, so detailed error messages become visible.

  • Check the SSH server configuration /etc/ssh/sshd_config, especially the options PubkeyAuthentication (must be set to yes) and AllowUsers (if active, add hduser to it). After changing anything, restart SSH: /etc/init.d/sshd restart

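As mentioned above, non-default SSH options belong in $HOME/.ssh/config; a minimal sketch for a server on a non-standard port (2222 is an assumed example):
[bash gutter="false"]
Host localhost
    Port 2222
    IdentityFile ~/.ssh/id_rsa
[/bash]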

Disabling IPv6

If IPv6 is not needed, it can be disabled system-wide in /etc/sysctl.conf:
[bash]
#disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
[/bash]
A reboot is required for this change to take effect.

Check whether IPv6 is enabled on the machine with:
[bash]
cat /proc/sys/net/ipv6/conf/all/disable_ipv6
[/bash]
A return value of 0 means IPv6 is enabled; a value of 1 means it is disabled.

Alternatively,
as described in https://issues.apache.org/jira/browse/HADOOP-3437, Hadoop itself can be told to avoid IPv6 by adding the following option to conf/hadoop-env.sh.
[bash]
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
[/bash]

Hadoop


Installation

Apache Download Mirrors下载 Hadoop后,移至目标目录,解压,修改其属主。
[bash]
$ cd /usr/local
$ sudo tar xzf hadoop-1.0.3.tar.gz
$ sudo mv hadoop-1.0.3 hadoop
$ sudo chown -R hduser:hadoop hadoop
[/bash]
Alternatively, the third line can be replaced by a symlink hadoop pointing at the hadoop-1.0.3 directory: ln -s hadoop-1.0.3 hadoop

Update $HOME/.bashrc

Add the following lines to the user hduser's $HOME/.bashrc. If your shell is not bash, update its corresponding configuration file in the same way.
[bash gutter="false"]
# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop

# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-6-sun

# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
hadoop fs -cat $1 | lzop -dc | head -1000 | less
}

# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
[/bash]

Excursus: (HDFS)

From The Hadoop Distributed File System: Architecture and Design
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is part of the Apache Hadoop project, which is part of the Apache Lucene project.


The following picture gives an overview of the most important HDFS components.


HDFS Architecture (source: http://hadoop.apache.org/core/docs/current/hdfs_design.html)




Configuration


  1. hadoop-env.sh
    Only one environment variable needs configuring for Hadoop here: JAVA_HOME. Open Hadoop's environment file /usr/local/hadoop/conf/hadoop-env.sh (assuming the install directory is /usr/local/hadoop) and point JAVA_HOME at the Sun JDK/JRE 6 directory.
    [bash]
    # The java implementation to use. Required.
    export JAVA_HOME=/usr/lib/jvm/jdk-1.6.0
    [/bash]
    Note: on Mac OS X 10.7 it can be set as follows:
    [bash]
    # for our Mac users
    export JAVA_HOME=`/usr/libexec/java_home`
    [/bash]

  2. conf/*-site.xml
    Note: As of Hadoop 0.20x and 1.x, the configuration settings previously found in hadoop-site.xml were moved to core-site.xml(hadoop.tmp.dir, fs.default.name), mapred-site.xml(mapred.job.tracker) and hdfs-site.xml(dfs.replication).


    This part configures where Hadoop stores its data, which network ports it listens on, and so on. HDFS will be used, even though this "cluster" contains only a single node.

    Here, hadoop.tmp.dir points at the /tmp/hadoop directory. By default Hadoop treats hadoop.tmp.dir as the root for both local-filesystem and HDFS temporary storage, so do not be surprised to find directories Hadoop created automatically underneath it.
    [bash]
    $ sudo mkdir /tmp/hadoop
    $ sudo chown hduser:hadoop /tmp/hadoop
    # ...and if you want to tighten up security, chmod from 755 to 750...
    $ sudo chmod 750 /tmp/hadoop
    [/bash]
    Forgetting to set the owner and permissions leads to a java.io.IOException when formatting the NameNode.



Add the following snippets between the <configuration> ... </configuration> tags of the named XML files.

File conf/core-site.xml:
[bash gutter="false"]
<!-- In: conf/core-site.xml -->
<property>
<name>hadoop.tmp.dir</name>
<value>/tmp/hadoop</value>
<description>A base for other temporary directories.</description>
</property>

<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
[/bash]

File conf/mapred-site.xml:
[bash gutter="false"]
<!-- In: conf/mapred-site.xml -->
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
[/bash]

File conf/hdfs-site.xml:
[bash gutter="false"]
<!-- In: conf/hdfs-site.xml -->
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
[/bash]
For any questions about the configuration options, see Getting Started with Hadoop and Hadoop's API Overview.

Formatting the HDFS filesystem via the NameNode

The first step in setting up a Hadoop cluster is formatting the Hadoop filesystem, which is implemented on top of the local filesystems of the cluster.
Do not format a running Hadoop filesystem, as you will lose all the data currently in the cluster (in HDFS).


Run the following command to format it (in effect, to initialize the directory given by the dfs.name.dir option):
[bash]
$ /usr/local/hadoop/bin/hadoop namenode -format
[/bash]

Output:
[bash gutter="false"]
[hduser@localdomain /usr/local/hadoop]$ bin/hadoop namenode -format
10/05/08 16:59:56 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localdomain/127.0.1.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 0.20.2
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
10/05/08 16:59:56 INFO namenode.FSNamesystem: fsOwner=hduser,hadoop
10/05/08 16:59:56 INFO namenode.FSNamesystem: supergroup=supergroup
10/05/08 16:59:56 INFO namenode.FSNamesystem: isPermissionEnabled=true
10/05/08 16:59:56 INFO common.Storage: Image file of size 96 saved in 0 seconds.
10/05/08 16:59:57 INFO common.Storage: Storage directory .../hadoop-hduser/dfs/name has been successfully formatted.
10/05/08 16:59:57 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/
[hduser@localdomain /usr/local/hadoop]$
[/bash]


Starting your single-node cluster

Run:
[bash]
[hduser@localdomain ~]$ /usr/local/hadoop/bin/start-all.sh
[/bash]
This starts the NameNode, DataNode, JobTracker and TaskTracker on the machine.
Output:
[bash gutter="false"]
[hduser@localdomain ~]$ /usr/local/hadoop$ bin/start-all.sh
starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-namenode-localdomain.out
localhost: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-localdomain.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-secondarynamenode-localdomain.out
starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-jobtracker-localdomain.out
localhost: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-tasktracker-localdomain.out
[hduser@localdomain ~]$
[/bash]
A convenient check is to run jps (part of Sun's Java since v1.5.0) and see whether the Hadoop processes are running. For more, see How to debug MapReduce programs.
[bash]
[hduser@localdomin ~]$ /usr/local/hadoop$ jps
2287 TaskTracker
2149 JobTracker
1938 DataNode
2085 SecondaryNameNode
2349 Jps
1788 NameNode
[/bash]
netstat can also be used to check whether Hadoop is listening on the configured ports.
[bash gutter="false"]
[hduser@localdomin ~]$ sudo netstat -plten | grep java
tcp 0 0 0.0.0.0:50070 0.0.0.0:* LISTEN 1001 9236 2471/java
tcp 0 0 0.0.0.0:50010 0.0.0.0:* LISTEN 1001 9998 2628/java
tcp 0 0 0.0.0.0:48159 0.0.0.0:* LISTEN 1001 8496 2628/java
tcp 0 0 0.0.0.0:53121 0.0.0.0:* LISTEN 1001 9228 2857/java
tcp 0 0 127.0.0.1:54310 0.0.0.0:* LISTEN 1001 8143 2471/java
tcp 0 0 127.0.0.1:54311 0.0.0.0:* LISTEN 1001 9230 2857/java
tcp 0 0 0.0.0.0:59305 0.0.0.0:* LISTEN 1001 8141 2471/java
tcp 0 0 0.0.0.0:50060 0.0.0.0:* LISTEN 1001 9857 3005/java
tcp 0 0 0.0.0.0:49900 0.0.0.0:* LISTEN 1001 9037 2785/java
tcp 0 0 0.0.0.0:50030 0.0.0.0:* LISTEN 1001 9773 2857/java
[hduser@localdomin ~]$
[/bash]
If anything goes wrong, check the log files under /logs/.

Stopping your single-node cluster

Run:
[bash]
$ /usr/local/hadoop/bin/stop-all.sh
[/bash]
Output looks like:
[bash gutter="false"]
[hduser@localdoamin /usr/local/hadoop]$ bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
[hduser@localdoamin /usr/local/hadoop]$
[/bash]

Running a MapReduce job

The first MapReduce job run here is WordCount, which reads text files and counts how often each word occurs; both input and output are files. For more information, see what happens behind the scenes on the Hadoop wiki.


  1. Download example input data
    This example uses three ebooks provided by Project Gutenberg.

    Download the Plain Text UTF-8 version of each ebook into the directory /tmp/gutenberg (YMMV):
    [bash gutter="false"]
    [hduser@localdomain /tmp/gutenberg]$ ls -l
    -rw-r--r-- 1 hduser hadoop 674566 Aug 1 16:01 pg20417.txt
    -rw-r--r-- 1 hduser hadoop 1573150 Aug 1 16:01 pg4300.txt
    -rw-r--r-- 1 hduser hadoop 1423801 Aug 1 16:01 pg5000.txt
    [/bash]

  2. Restart the Hadoop cluster
    If the Hadoop cluster is not running, start it:
    [bash]
    $ /usr/local/hadoop/bin/start-all.sh
    [/bash]

  3. Copy local example data to HDFS
    Before running the MapReduce job, copy the files from the local filesystem into Hadoop's HDFS:
    [bash]
    [hduser@localdomin /usr/local/hadoop]$ bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hduser/gutenberg
    [hduser@localdomin /usr/local/hadoop]$ bin/hadoop dfs -ls /user/hduser
    Found 1 items
    drwxr-xr-x - hduser supergroup 0 2010-05-08 17:40 /user/hduser/gutenberg
    hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser/gutenberg
    Found 3 items
    -rw-r--r-- 3 hduser supergroup 674566 2011-03-10 11:38 /user/hduser/gutenberg/pg20417.txt
    -rw-r--r-- 3 hduser supergroup 1573112 2011-03-10 11:38 /user/hduser/gutenberg/pg4300.txt
    -rw-r--r-- 3 hduser supergroup 1423801 2011-03-10 11:38 /user/hduser/gutenberg/pg5000.txt
    [hduser@localdomin /usr/local/hadoop]$
    [/bash]


  4. Run the MapReduce job
    Now run the WordCount job:
    [bash]
    [hduser@localdomin /usr/local/hadoop]$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
    [/bash]
    The command reads all files in the HDFS directory /user/hduser/gutenberg, processes them, and stores the output in the HDFS directory /user/hduser/gutenberg-output.
    Note: if the command fails with an error like the following:
    Exception in thread "main" java.io.IOException: Error opening job jar: hadoop*examples*.jar
    at org.apache.hadoop.util.RunJar.main (RunJar.java: 90)
    Caused by: java.util.zip.ZipException: error in opening zip file

    replace hadoop*examples*.jar with the full name of the Hadoop example JAR and re-run, e.g.:
    [hduser@localdomin /usr/local/hadoop]$ bin/hadoop jar hadoop-examples-1.0.3.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output

    Output looks like:
    [bash gutter="false"]
    [hduser@localdomin /usr/local/hadoop]$ bin/hadoop jar hadoop-examples-1.0.3.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
    10/05/08 17:43:00 INFO input.FileInputFormat: Total input paths to process : 3
    10/05/08 17:43:01 INFO mapred.JobClient: Running job: job_201005081732_0001
    10/05/08 17:43:02 INFO mapred.JobClient: map 0% reduce 0%
    10/05/08 17:43:14 INFO mapred.JobClient: map 66% reduce 0%
    10/05/08 17:43:17 INFO mapred.JobClient: map 100% reduce 0%
    10/05/08 17:43:26 INFO mapred.JobClient: map 100% reduce 100%
    10/05/08 17:43:28 INFO mapred.JobClient: Job complete: job_201005081732_0001
    10/05/08 17:43:28 INFO mapred.JobClient: Counters: 17
    10/05/08 17:43:28 INFO mapred.JobClient: Job Counters
    10/05/08 17:43:28 INFO mapred.JobClient: Launched reduce tasks=1
    10/05/08 17:43:28 INFO mapred.JobClient: Launched map tasks=3
    10/05/08 17:43:28 INFO mapred.JobClient: Data-local map tasks=3
    10/05/08 17:43:28 INFO mapred.JobClient: FileSystemCounters
    10/05/08 17:43:28 INFO mapred.JobClient: FILE_BYTES_READ=2214026
    10/05/08 17:43:28 INFO mapred.JobClient: HDFS_BYTES_READ=3639512
    10/05/08 17:43:28 INFO mapred.JobClient: FILE_BYTES_WRITTEN=3687918
    10/05/08 17:43:28 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=880330
    10/05/08 17:43:28 INFO mapred.JobClient: Map-Reduce Framework
    10/05/08 17:43:28 INFO mapred.JobClient: Reduce input groups=82290
    10/05/08 17:43:28 INFO mapred.JobClient: Combine output records=102286
    10/05/08 17:43:28 INFO mapred.JobClient: Map input records=77934
    10/05/08 17:43:28 INFO mapred.JobClient: Reduce shuffle bytes=1473796
    10/05/08 17:43:28 INFO mapred.JobClient: Reduce output records=82290
    10/05/08 17:43:28 INFO mapred.JobClient: Spilled Records=255874
    10/05/08 17:43:28 INFO mapred.JobClient: Map output bytes=6076267
    10/05/08 17:43:28 INFO mapred.JobClient: Combine input records=629187
    10/05/08 17:43:28 INFO mapred.JobClient: Map output records=629187
    10/05/08 17:43:28 INFO mapred.JobClient: Reduce input records=102286
    [/bash]
    Check that the result was stored under /user/hduser/gutenberg-output in HDFS:
    [bash gutter="false"]
    [hduser@localdomin /usr/local/hadoop]$ bin/hadoop dfs -ls /user/hduser
    Found 2 items
    drwxr-xr-x - hduser supergroup 0 2010-05-08 17:40 /user/hduser/gutenberg
    drwxr-xr-x - hduser supergroup 0 2010-05-08 17:43 /user/hduser/gutenberg-output
    [hduser@localdomin /usr/local/hadoop]$ bin/hadoop dfs -ls /user/hduser/gutenberg-output
    Found 2 items
    drwxr-xr-x - hduser supergroup 0 2010-05-08 17:43 /user/hduser/gutenberg-output/_logs
    -rw-r--r-- 1 hduser supergroup 880802 2010-05-08 17:43 /user/hduser/gutenberg-output/part-r-00000
    [hduser@localdomin /usr/local/hadoop]$
    [/bash]
    To change some Hadoop settings on the fly, such as increasing the number of reduce tasks, use the -D option:
    [bash]
    [hduser@localdomin /usr/local/hadoop]$ bin/hadoop jar hadoop-examples-1.0.3.jar wordcount -D mapred.reduce.tasks=16 /user/hduser/gutenberg /user/hduser/gutenberg-output
    [/bash]
    An important note about mapred.map.tasks: Hadoop does not honor mapred.map.tasks beyond considering it a hint. But it accepts the user specified mapred.reduce.tasks and doesn't manipulate that. You cannot force mapred.map.tasks but you can specify mapred.reduce.tasks.


  5. Retrieve the job result from HDFS
    To inspect the results, copy them from HDFS to the local filesystem, or view them directly with:
    [bash]
    [hduser@localdomain /usr/local/hadoop]$ bin/hadoop dfs -cat /user/hduser/gutenberg-output/part-r-00000
    [/bash]

    To copy the files from HDFS to the local filesystem instead:
    [bash]
    [hduser@localdomain /usr/local/hadoop]$ mkdir /tmp/gutenberg-output
    [hduser@localdomain /usr/local/hadoop]$bin/hadoop dfs -getmerge /user/hduser/gutenberg-output /tmp/gutenberg-output
    [hduser@localdomain /usr/local/hadoop]$head /tmp/gutenberg-output/gutenberg-output
    "(Lo)cra" 1
    "1490 1
    "1498," 1
    "35" 1
    "40," 1
    "A 2
    "AS-IS". 1
    "A_ 1
    "Absoluti 1
    "Alack! 1
    [hduser@localdomain /usr/local/hadoop]$
    [/bash]
    Note the quotation marks (") in the head output above: they are not generated by Hadoop; they come from WordCount's word tokenizer.
    The command fs -getmerge simply concatenates any files it finds in the directory you specify, so the merged file might (and most likely will) not be sorted; a local workaround is sketched below.
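
    A ranked word list can still be produced locally; for example, the ten most frequent words:
    [bash]
    sort -k2 -n -r /tmp/gutenberg-output/gutenberg-output | head -10
    [/bash]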




Hadoop Web Interfaces

By default, Hadoop provides web interfaces with details about the cluster; the relevant settings live in conf/hadoop-default.xml (a quick check from the shell follows the list):

  • http://localhost:50070/ - web UI of the NameNode daemon

  • http://localhost:50030/ - web UI of the JobTracker daemon

  • http://localhost:50060/ - web UI of the TaskTracker daemon

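Whether the web UIs are actually up can also be checked from the shell; a quick sketch:
[bash]
# each should print an HTTP status line while the daemon runs
curl -sI http://localhost:50070/ | head -1
curl -sI http://localhost:50030/ | head -1
curl -sI http://localhost:50060/ | head -1
[/bash]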



  1. NameNode Web Interface (HDFS layer)
    The NameNode web UI shows a cluster overview, including total/remaining capacity and live/dead nodes. It also lets you browse the HDFS namespace, view the contents of its files, and reach the local host's Hadoop log files.

    By default it is available at: http://localhost:50070/


    A screenshot of Hadoop's Name Node web interface.




  2. Job Tracker Web Interface (MapReduce layer)
    The JobTracker web UI shows job statistics for the cluster, e.g. running/completed/failed jobs and job history log files, and provides access to the local host's Hadoop log files.

    By default it is available at: http://localhost:50030/


    A screenshot of Hadoop's Job Tracker web interface.




  3. TaskTracker Web Interface (MapReduce layer)
    The TaskTracker web UI shows running and non-running tasks, and also provides access to the local host's Hadoop log files.

    By default it is available at: http://localhost:50060/


    A screenshot of Hadoop's Task Tracker web interface.




Wednesday, August 1, 2012

Installing the JDK

Installing JDK 1.7.x



  1. Download the 32-bit or 64-bit file with the .tar.gz extension, e.g. "[java-version]-i586.tar.gz" for 32-bit or "[java-version]-x64.tar.gz" for 64-bit.

  2. Unpack the downloaded file:
    [bash]
    tar -xvf jdk-7u5-linux-i586.tar.gz //32 bit
    tar -xvf jdk-7u5-linux-x64.tar.gz //64bit
    [/bash]

  3. Move the unpacked files to /usr/lib/jvm:
    [bash]
    mv jdk1.7.0_03 /usr/lib/jvm/
    [/bash]

  4. Run:
    [bash]
    sudo update-alternatives --install "/usr/bin/java" "java" "/usr/lib/jvm/jdk1.7.0_03/bin/java" 1
    sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/lib/jvm/jdk1.7.0_03/bin/javac" 1
    sudo update-alternatives --install "/usr/bin/javaws" "javaws" "/usr/lib/jvm/jdk1.7.0_03/bin/javaws" 1
    [/bash]

  5. Then run:
    [bash]
    [root@hadoop tmp]# update-alternatives --config java

    There are 3 programs which provide 'java'.

    Selection Command
    -----------------------------------------------
    * 1 /usr/lib/jvm/jre-1.4.2-gcj/bin/java
    + 2 /usr/lib/jvm/jdk1.6.33/bin/java
    3 /usr/lib/jvm/jdk1.7.0_03/bin/java

    Enter to keep the current selection[+], or type selection number:
    [/bash]
    Pressing Enter keeps the current selection (the one marked with "+"); type 3 and press Enter instead, then run java -version to check the current Java version.
    [bash]
    [root@hadoop tmp]# java -version
    java version "1.7.0_03"
    Java(TM) SE Runtime Environment (build 1.7.0_03-b04)
    Java HotSpot(TM) Server VM (build 22.1-b02, mixed mode)
    [/bash]

  6. Configure javac and javaws the same way as in step 5:
    [bash]
    update-alternatives --config javac
    update-alternatives --config javaws
    [/bash]



Installing JDK 1.6.x



  1. Download the 32-bit or 64-bit file with the .bin extension.

  2. After downloading, run:
    [bash]
    chmod +x jdk-[version]-linux-i586.bin
    ./jdk-[version]-linux-i586.bin
    [/bash]
    The unpacked files land in the ./jdk1.6.0_x directory, e.g. for 1.6.0.33.

  3. Move the files to /usr/lib/jvm:
    [bash]
    mv jdk1.6.0.33 /usr/lib/jvm
    [/bash]

  4. webupd8.googlecode.com provides a script that switches the JDK environment to 6:
    [bash]
    wget http://webupd8.googlecode.com/files/update-java-0.5b
    chmod +x update-java-0.5b
    sudo ./update-java-0.5b
    [/bash]
    The 0.5b in update-java-0.5b is the script's own version number, not the Java version.
    The webupd8 PPA's update-java is another option for switching JDKs.


  5. Check that the switch succeeded:
    [bash]
    java -version
    javac -version
    [/bash]


If you do not use the switching script, JDK 1.6.x can also be set up the same way as 1.7.x.
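
Whichever JDK is selected, tools such as Hadoop also expect JAVA_HOME to be set; a sketch for $HOME/.bashrc, matching the 1.7.0_03 path used above:
[bash]
export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_03
export PATH=$JAVA_HOME/bin:$PATH
[/bash]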
