Tuesday, May 6, 2014

Chapter 5 Hardening a Hadoop Cluster

Configuring service-level authorization (SLA)

Configure Hadoop SLA:
  1. Enable service-level authorization in the $HADOOP_HOME/conf/core-site.xml file:
    <property>
        <name>hadoop.security.authorization</name>
        <value>true</value>
    </property>
    
  2. Allow only specific users to submit jobs to the Hadoop cluster in the $HADOOP_HOME/conf/hadoop-policy.xml file:
    <property>
        <name>security.job.submission.protocol.acl</name>
        <value>hduser hadoop</value>
    </property>
  3. Allow only specific users and groups to talk to HDFS in the $HADOOP_HOME/conf/hadoop-policy.xml file:
    <property>
        <name>security.client.protocol.acl</name>
        <value>hduser,hdadmin hadoop</value>
    </property>
  4. Allow only specific DataNodes to communicate with the NameNode in the $HADOOP_HOME/conf/hadoop-policy.xml file:
    <property>
        <name>security.datanode.protocol.acl</name>
        <value>datanode</value>
    </property>
  5. Force the NameNode to reload the ACL configurations:
    hadoop dfsadmin -refreshServiceAcl
  6. Force the JobTracker to reload the ACL configurations:
    hadoop mradmin -refreshServiceAcl
Property                                 Service description
security.client.protocol.acl             Clients accessing HDFS
security.client.datanode.protocol.acl    Client to DataNode for block recovery
security.inter.datanode.protocol.acl     DataNode to DataNode for updating timestamps
security.job.submission.protocol.acl     Client to JobTracker for job submission
security.task.umbilical.protocol.acl     Map and reduce tasks talking to the TaskTracker
security.refresh.policy.protocol.acl     dfsadmin and mradmin refreshing ACL policies

The default value of these properties is *, which means all entities can access the service; in other words, SLA is disabled.
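The ACL value format is a comma-separated list of user names, followed by a space, followed by a comma-separated list of group names. As a sketch, the hypothetical entry below (alice, bob, analysts, and admins are placeholders) would grant HDFS access to exactly those users and groups:
    <property>
        <name>security.client.protocol.acl</name>
        <value>alice,bob analysts,admins</value>
    </property>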

Configuring job authorization with ACLs

Hadoop provides two levels of job authorization: job level and queue level. When job authorization is enabled, the JobTracker authorizes users who submit jobs to the cluster. Users' operations on jobs and queues are also checked by the JobTracker.

Configure job authorization with ACLs:
  1. Enable job ACL authorization in the $HADOOP_HOME/conf/mapred-site.xml file:
    <property>
        <name>mapred.acls.enabled</name>
        <value>true</value>
    </property>
  2. Configure job authorization to only allow specific users and groups to submit jobs to a named queue in the $HADOOP_HOME/conf/mapred-queue-acls.xml file:
    <property>
        <name>mapred.queue.hdqueue.acl-submit-job</name>
        <value>hduser hadoop</value>
    </property>
  3. Configure job authorization to allow specific users and groups to administer jobs in a named queue in the $HADOOP_HOME/conf/mapred-queue-acls.xml file:
    <property>
        <name>mapred.queue.hdqueue.acl-administer-jobs</name>
        <value>hduser hadoop</value>
    </property>
  4. Check the status of queue ACLs:
    hadoop queue -showacls
  5. Configure job authorization to allow only specific users and groups to view the status of a job in the $HADOOP_HOME/conf/mapred-site.xml file (job-level ACLs are job configuration properties rather than queue properties):
    <property>
        <name>mapreduce.job.acl-view-job</name>
        <value>hduser hadoop</value>
    </property>
  6. Configure job authorization to allow only specific users and groups to modify a job in the $HADOOP_HOME/conf/mapred-site.xml file:
    <property>
        <name>mapreduce.job.acl-modify-job</name>
        <value>hduser hadoop</value>
    </property>
  7. Force the NameNode and the JobTracker to reload the ACL configurations:
    hadoop dfsadmin -refreshServiceAcl
    hadoop mradmin -refreshServiceAcl

Job view ACLs control access to job status information, including counters, diagnostic information, logs, job configuration, and so on.

Job modification ACLs can overlap with queue-level ACLs. When this happens, a user's operation will be granted if the user has been listed in either of these ACLs.
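Job-level ACLs can also be supplied per job at submission time through the -D generic option (this requires the job driver to use ToolRunner). A sketch, where wordcount.jar, the WordCount class, and the input/output paths are placeholders:
    hadoop jar wordcount.jar WordCount \
        -Dmapreduce.job.acl-view-job="hduser hadoop" \
        -Dmapreduce.job.acl-modify-job="hduser " \
        input output
Note that a value with a trailing space, such as "hduser ", lists only users and no groups.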

Securing a Hadoop cluster with Kerberos
Recent Hadoop releases have added security features by integrating Kerberos into Hadoop. Kerberos is a network authentication protocol that provides strong authentication for client/server applications. Hadoop uses Kerberos to secure data from unauthorized access. It achieves this by authenticating the underlying Remote Procedure Calls (RPCs).
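The top-level switches live in core-site.xml. A minimal sketch, assuming the KDC, service principals, and keytab files have already been set up:
    <property>
        <name>hadoop.security.authentication</name>
        <value>kerberos</value>
    </property>
    <property>
        <name>hadoop.security.authorization</name>
        <value>true</value>
    </property>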

Recovering from NameNode failure

The NameNode in a Hadoop cluster keeps track of the metadata for the whole HDFS filesystem. Unfortunately, if the metadata of the NameNode is corrupted, for example due to a hard drive failure, the whole cluster becomes unavailable.
In this recipe, we use one machine, master1, as the NameNode, and a second machine, master2, to run the SecondaryNameNode.
The idea is to configure the NameNode to write the edit logs and the filesystem image into two locations: one on a local directory of the NameNode machine, and the other on the SecondaryNameNode machine (the /mnt/snn path used below is assumed to be a mount, for example over NFS, of a master2 directory).
  1. Configure the following property in the $HADOOP_HOME/conf/hdfs-site.xml file:
    <property>
        <name>dfs.name.dir</name>
        <value>/hadoop/dfs/name,/mnt/snn/name</value>
    </property>
  2. Configure the following property in the $HADOOP_HOME/conf/core-site.xml file:
    <property>
        <name>fs.default.name</name>
        <value>hdfs://master1:54310</value>
    </property>
  3. Sync the configuration files to master2:
    scp $HADOOP_HOME/conf/hdfs-site.xml master2:$HADOOP_HOME/conf/
    scp $HADOOP_HOME/conf/slaves master2:$HADOOP_HOME/conf/
  4. Copy the configuration file to all the slave nodes in the cluster:
    for host in `cat $HADOOP_HOME/conf/slaves`; do
        echo 'Sync configuration files to ' $host
        scp $HADOOP_HOME/conf/core-site.xml $host:$HADOOP_HOME/conf
    done
  5. Start the Hadoop cluster on master1:
    start-all.sh
Once the NameNode on master1 fails, we can use the following steps to recover:
  1. Log in to master2:
    ssh hduser@master2
  2. Stop the cluster by running stop-all.sh on master1:
    ssh master1 -C "stop-all.sh"
  3. Configure the $HADOOP_HOME/conf/core-site.xml file to use master2 as the NameNode:
    <property>
        <name>fs.default.name</name>
        <value>hdfs://master2:54310</value>
    </property>
  4. Copy the configurations to the slave nodes in the cluster:
    for host in `cat $HADOOP_HOME/conf/slaves`; do
        echo 'Sync configuration files to ' $host
        scp $HADOOP_HOME/conf/core-site.xml $host:$HADOOP_HOME/conf
    done
    
  5. Start the cluster from master2:
    start-all.sh

Strictly speaking, the HDFS SecondaryNameNode daemon is not a backup NameNode. Its role is to periodically fetch the filesystem metadata image file and the edit log files into the directory specified by the fs.checkpoint.dir property. In case of a NameNode failure, these checkpoint files can be used to recover the HDFS filesystem.
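The checkpoint location and frequency are configurable. A sketch of the relevant properties, where the /hadoop/dfs/namesecondary path is an assumption and fs.checkpoint.period is in seconds (3600 is the default):
    <property>
        <name>fs.checkpoint.dir</name>
        <value>/hadoop/dfs/namesecondary</value>
    </property>
    <property>
        <name>fs.checkpoint.period</name>
        <value>3600</value>
    </property>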

NameNode resilience with multiple hard drives
Configure the NameNode with multiple hard drives:
  1. Install, format, and mount the new hard drive onto the machine; suppose the mount point is /hadoop1/.
  2. Create the metadata directory on the new drive:
    mkdir -p /hadoop1/dfs/name
  3. Configure the following property in the $HADOOP_HOME/conf/hdfs-site.xml file:
    <property>
        <name>dfs.name.dir</name>
        <value>/hadoop/dfs/name,/hadoop1/dfs/name</value>
    </property>
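After restarting HDFS, both directories should receive identical metadata. A quick sanity check, assuming the default fsimage file layout:
    md5sum /hadoop/dfs/name/current/fsimage /hadoop1/dfs/name/current/fsimage
The two checksums should match.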
Recover from the NameNode failure:
  1. Stop the cluster:
    stop-all.sh
  2. Replace the failed directory with the surviving one in the $HADOOP_HOME/conf/hdfs-site.xml file:
    <property>
        <name>dfs.name.dir</name>
        <value>/hadoop1/dfs/name</value>
    </property>
  3. Start the cluster again:
    start-all.sh
Recovering NameNode from the checkpoint of a SecondaryNameNode
  1. Log in to the NameNode machine:
    ssh hduser@master1
  2. Add the following line into the $HADOOP_HOME/conf/masters file:
    master2
    By doing this, we configure the SecondaryNameNode to run on master2.
  3. Configure the following property in the $HADOOP_HOME/conf/hdfs-site.xml file:
    <property>
        <name>dfs.name.dir</name>
        <value>/hadoop/dfs/name</value>
    </property>
  4. for host in `cat $HADOOP_HOME/conf/slaves`; do
        echo 'Sync configuration files to ' $host
        scp $HADOOP_HOME/conf/hdfs-site.xml $host:$HADOOP_HOME/conf
    done
    
  5. start-all.sh
In case the NameNode fails, we can use the following steps to recover:
  1. Stop the cluster:
    stop-all.sh
  2. Prepare a new machine for running the NameNode. (The preparation should include properly configuring Hadoop. It is recommended that the new NameNode machine have the same configuration as the failed one.)
  3. Format the new NameNode:
    hadoop namenode -format
  4. Copy the VERSION file from a DataNode so that the namespace ID matches the existing data:
    scp slave1:/hadoop/dfs/data/current/VERSION* /hadoop/dfs/name/current/VERSION
  5. Copy the checkpoint image from the SecondaryNameNode:
    scp master2:/hadoop/dfs/namesecondary/image/fsimage /hadoop/dfs/name/fsimage
  6. Copy the current edit logs from the SecondaryNameNode:
    scp master2:/hadoop/dfs/namesecondary/current/* /hadoop/dfs/name/current
  7. Convert the checkpoint to the new version format:
    hadoop namenode -upgrade
  8. start-all.sh
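Once the recovered NameNode is running, it is worth verifying filesystem health before resuming jobs; a minimal check:
    hadoop fsck /
    hadoop dfsadmin -report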

Configuring NameNode high availability

  1. Log in to the NameNode machine:
    ssh hduser@master1
  2. Configure a logical name service in the $HADOOP_CONF_DIR/hdfs-site.xml file:
    <property>
        <name>dfs.nameservices</name>
        <value>hdcluster</value>
    </property>
  3. Specify the NameNode IDs for the configured name service:
    <property>
        <name>dfs.ha.namenodes.hdcluster</name>
        <value>namenode1,namenode2</value>
    </property>
  4. Configure the RPC address for namenode1 on the master1 host:
    <property>
        <name>dfs.namenode.rpc-address.hdcluster.namenode1</name>
        <value>master1:54310</value>
    </property>
  5. Configure the RPC address for namenode2 on the master2 host:
    <property>
        <name>dfs.namenode.rpc-address.hdcluster.namenode2</name>
        <value>master2:54310</value>
    </property>
  6. Configure the HTTP web UI addresses of the two NameNodes:
    <property>
        <name>dfs.namenode.http-address.hdcluster.namenode1</name>
        <value>master1:50070</value>
    </property>
    <property>
        <name>dfs.namenode.http-address.hdcluster.namenode2</name>
        <value>master2:50070</value>
    </property>
  7. Configure the NameNode shared edits directory:
    <property>
        <name>dfs.namenode.shared.edits.dir</name>
        <value>qjournal://master1:8485;master1:8486;master2:8486;master2:8485/hdcluster</value>
    </property>
  8. Configure the Quorum Journal Node directory for storing edit logs on the local filesystem:
    <property>
        <name>dfs.journalnode.edits.dir</name>
        <value>/hadoop/journaledits</value>
    </property>
  9. Configure the client failover proxy provider for the NameNode HA:
    <property>
        <name>dfs.client.failover.proxy.provider.hdcluster</name>
        <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
    </property>
  10. Configure the fencing method:
    <property>
        <name>dfs.ha.fencing.methods</name>
        <value>sshfence</value>
    </property>
  11. Configure the private key file for the sshfence method (shell variables are not expanded in the XML configuration, so the path must be spelled out):
    <property>
        <name>dfs.ha.fencing.ssh.private-key-files</name>
        <value>/home/hduser/.ssh/id_rsa</value>
    </property>
  12. Configure the SSH connection timeout, in milliseconds:
    <property>
        <name>dfs.ha.fencing.ssh.connect-timeout</name>
        <value>50000</value>
    </property>
  13. Enable automatic failover:
    <property>
        <name>dfs.ha.automatic-failover.enabled</name>
        <value>true</value>
    </property>
    This configuration enables automatic failover for all the name service IDs. If we want to enable automatic failover only for a specific name service ID, for example hdcluster, we can configure the following property instead:
    <property>
        <name>dfs.ha.automatic-failover.enabled.hdcluster</name>
        <value>true</value>
    </property>
  14. Configure the ZooKeeper services in the $HADOOP_CONF_DIR/core-site.xml file:
    <property>
        <name>ha.zookeeper.quorum</name>
        <value>master1:2181,master2:2181</value>
    </property>
  15. Sync the configurations to all the nodes in the cluster:
    for host in `cat $HADOOP_CONF_DIR/slaves`; do
        echo 'sync configuration files to ' $host
        scp $HADOOP_CONF_DIR/hdfs-site.xml $host:$HADOOP_CONF_DIR/
        scp $HADOOP_CONF_DIR/core-site.xml $host:$HADOOP_CONF_DIR/
    done
    
  16. Initialize the HA state in ZooKeeper:
    hdfs zkfc -formatZK
  17. Start HDFS:
    start-dfs.sh
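Once HDFS is up, the HA state can be inspected with the haadmin tool; a sketch using the NameNode IDs configured above:
    hdfs haadmin -getServiceState namenode1
    hdfs haadmin -getServiceState namenode2
    # With automatic failover enabled, failovers are driven by the ZKFCs;
    # the following manual failover applies when automatic failover is off:
    hdfs haadmin -failover namenode1 namenode2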
ZooKeeper plays an important role in the NameNode HA implementation, so securing ZooKeeper itself is a necessary concern.
Configure a secured ZooKeeper:
  1. Log in to the NameNode machine:
    ssh hduser@master1
  2. Configure the ZooKeeper authentication file in the $HADOOP_CONF_DIR/core-site.xml file (the expected contents of zkauth.txt are sketched after this recipe):
    <property>
        <name>ha.zookeeper.auth</name>
        <value>@$HADOOP_CONF_DIR/zkauth.txt</value>
    </property>
    
  3. Generate the ZooKeeper ACL digest corresponding to the authentication credentials:
    java -cp $ZK_HOME/lib/*:$ZK_HOME/zookeeper-*.jar org.apache.zookeeper.server.auth.DigestAuthenticationProvider zkuser:password
    We will get output similar to the following:
    zkuser:password->zkuser:a4XNgljR6VhODbC7jysuQ4gBt98=
  4. Add the encrypted password to the $HADOOP_CONF_DIR/zkacl.txt file:
    digest:zkuser:a4XNgljR6VhODbC7jysuQ4gBt98=
  5. Sync the configuration to master2:
    scp $HADOOP_CONF_DIR/zkacl.txt master2:$HADOOP_CONF_DIR/
    scp $HADOOP_CONF_DIR/zkauth.txt master2:$HADOOP_CONF_DIR/
  6. Format the HA state in ZooKeeper:
    hdfs zkfc -formatZK
  7. Test the configuration with the ZooKeeper command-line client:
    zkCli.sh
  8. Start HDFS:
    start-dfs.sh
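For reference, ha.zookeeper.auth points at a file holding the plain credentials in scheme:user:password form; a sketch of $HADOOP_CONF_DIR/zkauth.txt matching the digest generated above:
    digest:zkuser:password
A companion ha.zookeeper.acl property pointing at the ACL file is presumably also needed (an assumption based on the zkacl.txt file used in steps 4 and 5); note that stock Hadoop examples append a permission suffix such as :rwcda to the ACL entry:
    <property>
        <name>ha.zookeeper.acl</name>
        <value>@$HADOOP_CONF_DIR/zkacl.txt</value>
    </property>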

Configuring HDFS federation

Configure HDFS federation in the $HADOOP_CONF_DIR/hdfs-site.xml:
  1. ssh hduser@master1
  2. Specify a list of NameNode service IDs:
    <property>
        <name>dfs.nameservices</name>
        <value>namenode1,namenode2</value>
    </property>
  3. Configure the NameNode RPC and HTTP URIs for namenode1:
    <property>
        <name>dfs.namenode.rpc-address.namenode1</name>
        <value>master1:54310</value>
    </property>
    <property>
        <name>dfs.namenode.http-address.namenode1</name>
        <value>master1:50070</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address.namenode1</name>
        <value>master1:50071</value>
    </property>
    The previous configurations assume that the NameNode daemon and the NameNode HTTP and secondary HTTP daemons run on the host master1.
  4. Specify the NameNode RPC and HTTP URIs for namenode2:
    <property>
        <name>dfs.namenode.rpc-address.namenode2</name>
        <value>master2:54310</value>
    </property>
    <property>
        <name>dfs.namenode.http-address.namenode2</name>
        <value>master2:50070</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address.namenode2</name>
        <value>master2:50071</value>
    </property>
  5. Sync the configuration to all the nodes in the cluster:
    for host in `cat $HADOOP_CONF_DIR/slaves`; do
        echo 'Sync configuration files to ' $host
        scp $HADOOP_CONF_DIR/hdfs-site.xml $host:$HADOOP_CONF_DIR/
    done
    
  6. Format namenode1 on master1:
    hdfs namenode -format -clusterId hdcluster
    In the above command, the -clusterId option should be unique in the environment. A unique cluster ID is automatically generated if one is not specified.
  7. Similarly, format namenode2 on master2:
    hdfs namenode -format -clusterId hdcluster
    The cluster ID for this NameNode must be the same as the one specified for namenode1 in order for both NameNodes to be in the same cluster.
  8. Now, start/stop the HDFS cluster with the following commands on either of the NameNode hosts:
    start-dfs.sh
    stop-dfs.sh

On a non-federated HDFS cluster, all the DataNodes register with and send heartbeats to the single NameNode. On a federated HDFS cluster, all the DataNodes register with all the NameNodes in the cluster, and heartbeats and block reports are sent to all of these NameNodes.

A federated HDFS cluster is composed of one or multiple namespace volumes, each consisting of a namespace and the block pool that belongs to that namespace. A namespace volume is the unit of management in the cluster; for example, cluster management operations such as delete and upgrade operate on a namespace volume. In addition, federated NameNodes can isolate namespaces for different applications or situations.
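Each namespace is addressed by clients through its own NameNode URI; a quick sketch against the two NameNodes configured above:
    hadoop fs -ls hdfs://master1:54310/
    hadoop fs -ls hdfs://master2:54310/
A unified client-side view over multiple namespaces can additionally be built with ViewFS mount tables (the viewfs:// scheme).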

Daemon             Property                            Description
NameNode           dfs.namenode.rpc-address            NameNode RPC communication with clients
                   dfs.namenode.servicerpc-address     NameNode RPC communication with HDFS services
                   dfs.namenode.http-address           NameNode HTTP web UI address
                   dfs.namenode.https-address          NameNode secured HTTP web UI address
                   dfs.namenode.name.dir               NameNode local metadata directory
                   dfs.namenode.edits.dir              Local directory for NameNode edit logs
                   dfs.namenode.checkpoint.dir         SecondaryNameNode local checkpoint directory
                   dfs.namenode.checkpoint.edits.dir   Directory for SecondaryNameNode edit logs
SecondaryNameNode  dfs.secondary.namenode.keytab.file  The SecondaryNameNode keytab file
BackupNode         dfs.namenode.backup.address         The address of the BackupNode
                   dfs.secondary.namenode.keytab.file  The BackupNode keytab file
Decommissioning a NameNode from the cluster
Add the NameNode ID into the $HADOOP_CONF_DIR/namenode_exclude.txt file. For example, if we want to decommission namenode1 from the cluster, the content of the file should be:
namenode1
Distribute the exclude file to all the NameNodes:
distribute-exclude.sh $HADOOP_CONF_DIR/namenode_exclude.txt
Refresh the NameNode list:
refresh-namenodes.sh
Running the balancer
hadoop-daemon.sh --config $HADOOP_HOME/conf --script hdfs start balancer -policy datanode
This command balances the data blocks at the DataNode level, which is the default policy. The other balancing policy is blockpool, which balances the storage at the block pool level as well as at the DataNode level.
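To balance at the block pool level instead, the same command can be run with the blockpool policy:
    hadoop-daemon.sh --config $HADOOP_HOME/conf --script hdfs start balancer -policy blockpool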
Adding a new NameNode
  1. Log in to the new NameNode machine:
    ssh hduser@master3
  2. Configure MRv2 on the master3 node.
  3. Add the following lines into the $HADOOP_CONF_DIR/hdfs-site.xml file:
    <property>
        <name>dfs.nameservices</name>
        <value>namenode1,namenode2,namenode3</value>
    </property>
    <property>
        <name>dfs.namenode.rpc-address.namenode1</name>
        <value>master1:54310</value>
    </property>
    <property>
        <name>dfs.namenode.http-address.namenode1</name>
        <value>master1:50070</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address.namenode1</name>
        <value>master1:50071</value>
    </property>
    <property>
        <name>dfs.namenode.rpc-address.namenode2</name>
        <value>master2:54310</value>
    </property>
    <property>
        <name>dfs.namenode.http-address.namenode2</name>
        <value>master2:50070</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address.namenode2</name>
        <value>master2:50071</value>
    </property>
    <property>
        <name>dfs.namenode.rpc-address.namenode3</name>
        <value>master3:54310</value>
    </property>
    <property>
        <name>dfs.namenode.http-address.namenode3</name>
        <value>master3:50070</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address.namenode3</name>
        <value>master3:50071</value>
    </property>
  4. Format namenode3:
    hdfs namenode -format -clusterId hdcluster
  5. Sync the configuration into all the other NameNodes:
    scp $HADOOP_CONF_DIR/hdfs-site.xml master1:$HADOOP_CONF_DIR/
    scp $HADOOP_CONF_DIR/hdfs-site.xml master2:$HADOOP_CONF_DIR/
  6. Sync the configuration into all the slave nodes in the cluster:
    for host in `cat $HADOOP_CONF_DIR/slaves`; do
        echo 'Sync configuration files to ' $host
        scp $HADOOP_CONF_DIR/hdfs-site.xml $host:$HADOOP_CONF_DIR/
    done
    
  7. start-dfs.sh
  8. Tell the DataNodes about the NameNode change (50020 is the default DataNode IPC port):
    for host in `cat $HADOOP_CONF_DIR/slaves`; do
        echo 'Processing on host' $host
        hdfs dfsadmin -refreshNamenodes $host:50020
    done
    
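To confirm that the DataNodes have registered with namenode3, we can check the cluster report and the new NameNode's web UI; a quick sanity check:
    hdfs dfsadmin -report
    curl http://master3:50070/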
