Tuesday, May 6, 2014

Chapter 5 Hardening a Hadoop Cluster

Configuring service-level authorization (SLA)

Configure Hadoop SLA:
  1. Enable service-level authorization in the $HADOOP_HOME/conf/core-site.xml file:
    <property>
        <name>hadoop.security.authorization</name>
        <value>true</value>
    </property>
    
  2. Allow only specific users to submit jobs to the Hadoop cluster in the $HADOOP_HOME/conf/hadoop-policy.xml file:
    <property>
        <name>security.job.submission.protocol.acl</name>
        <value>hduser hadoop</value>
    </property>
  3. Allow only specific users and groups to talk to HDFS in the $HADOOP_HOME/conf/hadoop-policy.xml file:
    <property>
        <name>security.client.protocol.acl</name>
        <value>hduser,hdadmin hadoop</value>
    </property>
  4. Allow only specific DataNodes to communicate with the NameNode in the $HADOOP_HOME/conf/hadoop-policy.xml file:
    <property>
        <name>security.datanode.protocol.acl</name>
        <value>datanode</value>
    </property>
  5. Force the NameNode to reload the ACL configurations:
    hadoop dfsadmin -refreshServiceAcl
  6. Force the JobTracker to reload the ACL configurations:
    hadoop mradmin -refreshServiceAcl
Property                                 Service description
security.client.protocol.acl             Clients accessing HDFS
security.client.datanode.protocol.acl    Client to DataNode for block recovery
security.inter.datanode.protocol.acl     DataNode to DataNode for updating timestamps
security.job.submission.protocol.acl     Client to JobTracker for job submission
security.task.umbilical.protocol.acl     Map and reduce tasks talking to the TaskTracker
security.refresh.policy.protocol.acl     dfsadmin and mradmin refreshing ACL policies

The default value of these properties is *, which means all entities can access the service; in other words, SLA is disabled.
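The ACL value format is a comma-separated list of user names, followed by a space, followed by a comma-separated list of group names. As a sketch, the hypothetical entry below (alice, bob, analysts, and admins are placeholders) would grant HDFS access to exactly those users and groups:
    <property>
        <name>security.client.protocol.acl</name>
        <value>alice,bob analysts,admins</value>
    </property>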

Configuring job authorization with ACLs

Hadoop provides two levels of job authorization: job level and queue level. When job authorization is enabled, the JobTracker authorizes users who submit jobs to the cluster. Users' operations on jobs and queues are also checked by the JobTracker.

Configure job authorization with ACLs:
  1. Enable job ACL authorization in the $HADOOP_HOME/conf/mapred-site.xml file:
    <property>
        <name>mapred.acls.enabled</name>
        <value>true</value>
    </property>
  2. Configure job authorization to only allow specific users and groups to submit jobs to a named queue in the $HADOOP_HOME/conf/mapred-queue-acls.xml file:
    <property>
        <name>mapred.queue.hdqueue.acl-submit-job</name>
        <value>hduser hadoop</value>
    </property>
  3. Configure job authorization to allow specific users and groups to administer jobs in a named queue in the $HADOOP_HOME/conf/mapred-queue-acls.xml file:
    <property>
        <name>mapred.queue.hdqueue.acl-administer-jobs</name>
        <value>hduser hadoop</value>
    </property>
  4. Check the status of queue ACLs:
    hadoop queue -showacls
  5. Configure job authorization to allow only specific users and groups to view the status of a job in the $HADOOP_HOME/conf/mapred-site.xml file (job-level ACLs are job configuration properties rather than queue properties):
    <property>
        <name>mapreduce.job.acl-view-job</name>
        <value>hduser hadoop</value>
    </property>
  6. Configure job authorization to allow only specific users and groups to modify a job in the $HADOOP_HOME/conf/mapred-site.xml file:
    <property>
        <name>mapreduce.job.acl-modify-job</name>
        <value>hduser hadoop</value>
    </property>
  7. Force the NameNode and the JobTracker to reload the ACL configurations:
    hadoop dfsadmin -refreshServiceAcl
    hadoop mradmin -refreshServiceAcl

Job view ACLs control access to job status information, including counters, diagnostic information, logs, job configuration, and so on.

Job modification ACLs can overlap with queue-level ACLs. When this happens, a user's operation will be granted if the user has been listed in either of these ACLs.
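Job-level ACLs can also be supplied per job at submission time through the -D generic option (this requires the job driver to use ToolRunner). A sketch, where wordcount.jar, the WordCount class, and the input/output paths are placeholders:
    hadoop jar wordcount.jar WordCount \
        -Dmapreduce.job.acl-view-job="hduser hadoop" \
        -Dmapreduce.job.acl-modify-job="hduser " \
        input output
Note that a value with a trailing space, such as "hduser ", lists only users and no groups.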

Securing a Hadoop cluster with Kerberos
Recent Hadoop releases have added security features by integrating Kerberos into Hadoop. Kerberos is a network authentication protocol that provides strong authentication for client/server applications. Hadoop uses Kerberos to secure data from unauthorized access. It achieves this by authenticating the underlying Remote Procedure Calls (RPCs).
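The top-level switches live in core-site.xml. A minimal sketch, assuming the KDC, service principals, and keytab files have already been set up:
    <property>
        <name>hadoop.security.authentication</name>
        <value>kerberos</value>
    </property>
    <property>
        <name>hadoop.security.authorization</name>
        <value>true</value>
    </property>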

Recovering from NameNode failure

The NameNode in a Hadoop cluster keeps track of the metadata for the whole HDFS filesystem. Unfortunately, if the metadata of the NameNode is corrupted, for example due to a hard drive failure, the whole cluster becomes unavailable.
In this recipe, we use one machine, master1, as the NameNode, and a second machine, master2, to run the SecondaryNameNode.
The idea is to configure the NameNode to write the edit logs and the filesystem image into two locations: one on a local directory of the NameNode machine, and the other on the SecondaryNameNode machine (the /mnt/snn path used below is assumed to be a mount, for example over NFS, of a master2 directory).
  1. Configure the following property in the $HADOOP_HOME/conf/hdfs-site.xml file:
    <property>
        <name>dfs.name.dir</name>
        <value>/hadoop/dfs/name,/mnt/snn/name</value>
    </property>
  2. Configure the following property in the $HADOOP_HOME/conf/core-site.xml file:
    <property>
        <name>fs.default.name</name>
        <value>hdfs://master1:54310</value>
    </property>
  3. Sync the configuration files to master2:
    scp $HADOOP_HOME/conf/hdfs-site.xml master2:$HADOOP_HOME/conf/
    scp $HADOOP_HOME/conf/slaves master2:$HADOOP_HOME/conf/
  4. Copy the configuration file to all the slave nodes in the cluster:
    for host in `cat $HADOOP_HOME/conf/slaves`; do
        echo 'Sync configuration files to ' $host
        scp $HADOOP_HOME/conf/core-site.xml $host:$HADOOP_HOME/conf
    done
  5. Start the Hadoop cluster on master1:
    start-all.sh
Once the NameNode on master1 fails, we can use the following steps to recover:
  1. Log in to master2:
    ssh hduser@master2
  2. Stop the cluster by running stop-all.sh on master1:
    ssh master1 -C "stop-all.sh"
  3. Configure the $HADOOP_HOME/conf/core-site.xml file to use master2 as the NameNode:
    <property>
        <name>fs.default.name</name>
        <value>hdfs://master2:54310</value>
    </property>
  4. Copy the configurations to the slave nodes in the cluster:
    for host in `cat $HADOOP_HOME/conf/slaves`; do
        echo 'Sync configuration files to ' $host
        scp $HADOOP_HOME/conf/core-site.xml $host:$HADOOP_HOME/conf
    done
    
  5. Start the cluster from master2:
    start-all.sh

Strictly speaking, the HDFS SecondaryNameNode daemon is not a backup NameNode. Its role is to periodically fetch the filesystem metadata image file and the edit log files into the directory specified by the fs.checkpoint.dir property. In case of a NameNode failure, these checkpoint files can be used to recover the HDFS filesystem.
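The checkpoint location and frequency are configurable. A sketch of the relevant properties, where the /hadoop/dfs/namesecondary path is an assumption and fs.checkpoint.period is in seconds (3600 is the default):
    <property>
        <name>fs.checkpoint.dir</name>
        <value>/hadoop/dfs/namesecondary</value>
    </property>
    <property>
        <name>fs.checkpoint.period</name>
        <value>3600</value>
    </property>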

NameNode resilience with multiple hard drives
Configure the NameNode with multiple hard drives:
  1. Install, format, and mount the new hard drive onto the machine; suppose the mount point is /hadoop1/.
  2. Create the metadata directory on the new drive:
    mkdir -p /hadoop1/dfs/name
  3. Configure the following property in the $HADOOP_HOME/conf/hdfs-site.xml file:
    <property>
        <name>dfs.name.dir</name>
        <value>/hadoop/dfs/name,/hadoop1/dfs/name</value>
    </property>
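After restarting HDFS, both directories should receive identical metadata. A quick sanity check, assuming the default fsimage file layout:
    md5sum /hadoop/dfs/name/current/fsimage /hadoop1/dfs/name/current/fsimage
The two checksums should match.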
Recover from the NameNode failure:
  1. Stop the cluster:
    stop-all.sh
  2. Replace the failed directory with the surviving one in the $HADOOP_HOME/conf/hdfs-site.xml file:
    <property>
        <name>dfs.name.dir</name>
        <value>/hadoop1/dfs/name</value>
    </property>
  3. Start the cluster again:
    start-all.sh
Recovering NameNode from the checkpoint of a SecondaryNameNode
  1. Log in to the NameNode machine:
    ssh hduser@master1
  2. Add the following line into the $HADOOP_HOME/conf/masters file:
    master2
    By doing this, we configure the SecondaryNameNode to run on master2.
  3. Configure the following property in the $HADOOP_HOME/conf/hdfs-site.xml file:
    <property>
        <name>dfs.name.dir</name>
        <value>/hadoop/dfs/name</value>
    </property>
  4. for host in `cat $HADOOP_HOME/conf/slaves`; do
        echo 'Sync configuration files to ' $host
        scp $HADOOP_HOME/conf/hdfs-site.xml $host:$HADOOP_HOME/conf
    done
    
  5. start-all.sh
In case the NameNode fails, we can use the following steps to recover:
  1. Stop the cluster:
    stop-all.sh
  2. Prepare a new machine for running the NameNode. (The preparation should include properly configuring Hadoop. It is recommended that the new NameNode machine have the same configuration as the failed one.)
  3. Format the new NameNode:
    hadoop namenode -format
  4. Copy the VERSION file from a DataNode so that the namespace ID matches the existing data:
    scp slave1:/hadoop/dfs/data/current/VERSION* /hadoop/dfs/name/current/VERSION
  5. Copy the checkpoint image from the SecondaryNameNode:
    scp master2:/hadoop/dfs/namesecondary/image/fsimage /hadoop/dfs/name/fsimage
  6. Copy the current edit logs from the SecondaryNameNode:
    scp master2:/hadoop/dfs/namesecondary/current/* /hadoop/dfs/name/current
  7. Convert the checkpoint to the new version format:
    hadoop namenode -upgrade
  8. start-all.sh
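Once the recovered NameNode is running, it is worth verifying filesystem health before resuming jobs; a minimal check:
    hadoop fsck /
    hadoop dfsadmin -report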

Configuring NameNode high availability

  1. Log in to the NameNode machine:
    ssh hduser@master1
  2. Configure a logical name service in the $HADOOP_CONF_DIR/hdfs-site.xml file:
    <property>
        <name>dfs.nameservices</name>
        <value>hdcluster</value>
    </property>
  3. Specify the NameNode IDs for the configured name service:
    <property>
        <name>dfs.ha.namenodes.hdcluster</name>
        <value>namenode1,namenode2</value>
    </property>
  4. Configure the RPC address for namenode1 on the master1 host:
    <property>
        <name>dfs.namenode.rpc-address.hdcluster.namenode1</name>
        <value>master1:54310</value>
    </property>
  5. Configure the RPC address for namenode2 on the master2 host:
    <property>
        <name>dfs.namenode.rpc-address.hdcluster.namenode2</name>
        <value>master2:54310</value>
    </property>
  6. Configure the HTTP web UI addresses of the two NameNodes:
    <property>
        <name>dfs.namenode.http-address.hdcluster.namenode1</name>
        <value>master1:50070</value>
    </property>
    <property>
        <name>dfs.namenode.http-address.hdcluster.namenode2</name>
        <value>master2:50070</value>
    </property>
  7. Configure the NameNode shared edits directory:
    <property>
        <name>dfs.namenode.shared.edits.dir</name>
        <value>qjournal://master1:8485;master1:8486;master2:8486;master2:8485/hdcluster</value>
    </property>
  8. Configure the Quorum Journal Node directory for storing edit logs on the local filesystem:
    <property>
        <name>dfs.journalnode.edits.dir</name>
        <value>/hadoop/journaledits</value>
    </property>
  9. Configure the client failover proxy provider for the NameNode HA:
    <property>
        <name>dfs.client.failover.proxy.provider.hdcluster</name>
        <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
    </property>
  10. Configure the fencing method:
    <property>
        <name>dfs.ha.fencing.methods</name>
        <value>sshfence</value>
    </property>
  11. Configure the private key file for the sshfence method (shell variables are not expanded in the XML configuration, so the path must be spelled out):
    <property>
        <name>dfs.ha.fencing.ssh.private-key-files</name>
        <value>/home/hduser/.ssh/id_rsa</value>
    </property>
  12. Configure the SSH connection timeout, in milliseconds:
    <property>
        <name>dfs.ha.fencing.ssh.connect-timeout</name>
        <value>50000</value>
    </property>
  13. Enable automatic failover:
    <property>
        <name>dfs.ha.automatic-failover.enabled</name>
        <value>true</value>
    </property>
    This configuration enables automatic failover for all the name service IDs. If we want to enable automatic failover only for a specific name service ID, for example hdcluster, we can configure the following property instead:
    <property>
        <name>dfs.ha.automatic-failover.enabled.hdcluster</name>
        <value>true</value>
    </property>
  14. Configure the ZooKeeper services in the $HADOOP_CONF_DIR/core-site.xml file:
    <property>
        <name>ha.zookeeper.quorum</name>
        <value>master1:2181,master2:2181</value>
    </property>
  15. Sync the configurations to all the nodes in the cluster:
    for host in `cat $HADOOP_CONF_DIR/slaves`; do
        echo 'sync configuration files to ' $host
        scp $HADOOP_CONF_DIR/hdfs-site.xml $host:$HADOOP_CONF_DIR/
        scp $HADOOP_CONF_DIR/core-site.xml $host:$HADOOP_CONF_DIR/
    done
    
  16. Initialize the HA state in ZooKeeper:
    hdfs zkfc -formatZK
  17. Start HDFS:
    start-dfs.sh
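Once HDFS is up, the HA state can be inspected with the haadmin tool; a sketch using the NameNode IDs configured above:
    hdfs haadmin -getServiceState namenode1
    hdfs haadmin -getServiceState namenode2
    # With automatic failover enabled, failovers are driven by the ZKFCs;
    # the following manual failover applies when automatic failover is off:
    hdfs haadmin -failover namenode1 namenode2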
ZooKeeper plays an important role in the NameNode HA implementation, so securing ZooKeeper itself is a necessary concern.
Configure a secured ZooKeeper:
  1. Log in to the NameNode machine:
    ssh hduser@master1
  2. Configure the ZooKeeper authentication file in the $HADOOP_CONF_DIR/core-site.xml file (the expected contents of zkauth.txt are sketched after this recipe):
    <property>
        <name>ha.zookeeper.auth</name>
        <value>@$HADOOP_CONF_DIR/zkauth.txt</value>
    </property>
    
  3. Generate the ZooKeeper ACL digest corresponding to the authentication credentials:
    java -cp $ZK_HOME/lib/*:$ZK_HOME/zookeeper-*.jar org.apache.zookeeper.server.auth.DigestAuthenticationProvider zkuser:password
    We will get output similar to the following:
    zkuser:password->zkuser:a4XNgljR6VhODbC7jysuQ4gBt98=
  4. Add the encrypted password to the $HADOOP_CONF_DIR/zkacl.txt file:
    digest:zkuser:a4XNgljR6VhODbC7jysuQ4gBt98=
  5. Sync the configuration to master2:
    scp $HADOOP_CONF_DIR/zkacl.txt master2:$HADOOP_CONF_DIR/
    scp $HADOOP_CONF_DIR/zkauth.txt master2:$HADOOP_CONF_DIR/
  6. Format the HA state in ZooKeeper:
    hdfs zkfc -formatZK
  7. Test the configuration with the ZooKeeper command-line client:
    zkCli.sh
  8. Start HDFS:
    start-dfs.sh
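For reference, ha.zookeeper.auth points at a file holding the plain credentials in scheme:user:password form; a sketch of $HADOOP_CONF_DIR/zkauth.txt matching the digest generated above:
    digest:zkuser:password
A companion ha.zookeeper.acl property pointing at the ACL file is presumably also needed (an assumption based on the zkacl.txt file used in steps 4 and 5); note that stock Hadoop examples append a permission suffix such as :rwcda to the ACL entry:
    <property>
        <name>ha.zookeeper.acl</name>
        <value>@$HADOOP_CONF_DIR/zkacl.txt</value>
    </property>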

Configuring HDFS federation

Configure HDFS federation in the $HADOOP_CONF_DIR/hdfs-site.xml:
  1. ssh hduser@master1
  2. Specify a list of NameNode service IDs:
    <property>
        <name>dfs.nameservices</name>
        <value>namenode1,namenode2</value>
    </property>
  3. Configure the NameNode RPC and HTTP URIs for namenode1:
    <property>
        <name>dfs.namenode.rpc-address.namenode1</name>
        <value>master1:54310</value>
    </property>
    <property>
        <name>dfs.namenode.http-address.namenode1</name>
        <value>master1:50070</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address.namenode1</name>
        <value>master1:50071</value>
    </property>
    The previous configurations assume that the NameNode daemon and the NameNode HTTP and secondary HTTP daemons run on the host master1.
  4. Specify the NameNode RPC and HTTP URIs for namenode2:
    <property>
        <name>dfs.namenode.rpc-address.namenode2</name>
        <value>master2:54310</value>
    </property>
    <property>
        <name>dfs.namenode.http-address.namenode2</name>
        <value>master2:50070</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address.namenode2</name>
        <value>master2:50071</value>
    </property>
  5. Sync the configuration to all the nodes in the cluster:
    for host in `cat $HADOOP_CONF_DIR/slaves`; do
        echo 'Sync configuration files to ' $host
        scp $HADOOP_CONF_DIR/hdfs-site.xml $host:$HADOOP_CONF_DIR/
    done
    
  6. Format namenode1 on master1:
    hdfs namenode -format -clusterId hdcluster
    In the above command, the -clusterId option should be unique in the environment. A unique cluster ID is automatically generated if one is not specified.
  7. Similarly, format namenode2 on master2:
    hdfs namenode -format -clusterId hdcluster
    The cluster ID for this NameNode must be the same as the one specified for namenode1 in order for both NameNodes to be in the same cluster.
  8. Now, start/stop the HDFS cluster with the following commands on either of the NameNode hosts:
    start-dfs.sh
    stop-dfs.sh

On a non-federated HDFS cluster, all the DataNodes register with and send heartbeats to the single NameNode. On a federated HDFS cluster, all the DataNodes register with all the NameNodes in the cluster, and heartbeats and block reports are sent to all of these NameNodes.

A federated HDFS cluster is composed of one or multiple namespace volumes, each consisting of a namespace and the block pool that belongs to that namespace. A namespace volume is the unit of management in the cluster; for example, cluster management operations such as delete and upgrade operate on a namespace volume. In addition, federated NameNodes can isolate namespaces for different applications or situations.
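Each namespace is addressed by clients through its own NameNode URI; a quick sketch against the two NameNodes configured above:
    hadoop fs -ls hdfs://master1:54310/
    hadoop fs -ls hdfs://master2:54310/
A unified client-side view over multiple namespaces can additionally be built with ViewFS mount tables (the viewfs:// scheme).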

Daemon             Property                            Description
NameNode           dfs.namenode.rpc-address            NameNode RPC communication with clients
                   dfs.namenode.servicerpc-address     NameNode RPC communication with HDFS services
                   dfs.namenode.http-address           NameNode HTTP web UI address
                   dfs.namenode.https-address          NameNode secured HTTP web UI address
                   dfs.namenode.name.dir               NameNode local metadata directory
                   dfs.namenode.edits.dir              Local directory for NameNode edit logs
                   dfs.namenode.checkpoint.dir         SecondaryNameNode local checkpoint directory
                   dfs.namenode.checkpoint.edits.dir   Directory for SecondaryNameNode edit logs
SecondaryNameNode  dfs.secondary.namenode.keytab.file  The SecondaryNameNode keytab file
BackupNode         dfs.namenode.backup.address         The address of the BackupNode
                   dfs.secondary.namenode.keytab.file  The BackupNode keytab file
Decommissioning a NameNode from the cluster
Add the NameNode ID into the $HADOOP_CONF_DIR/namenode_exclude.txt file. For example, if we want to decommission namenode1 from the cluster, the content of the file should be:
namenode1
Distribute the exclude file to all the NameNodes:
distribute-exclude.sh $HADOOP_CONF_DIR/namenode_exclude.txt
Refresh the NameNode list:
refresh-namenodes.sh
Running the balancer
hadoop-daemon.sh --config $HADOOP_HOME/conf --script hdfs start balancer -policy datanode
This command balances the data blocks at the DataNode level, which is the default policy. The other balancing policy is blockpool, which balances the storage at the block pool level as well as at the DataNode level.
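To balance at the block pool level instead, the same command can be run with the blockpool policy:
    hadoop-daemon.sh --config $HADOOP_HOME/conf --script hdfs start balancer -policy blockpool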
Adding a new NameNode
  1. Log in to the new NameNode machine:
    ssh hduser@master3
  2. Configure MRv2 on the master3 node.
  3. Add the following lines into the $HADOOP_CONF_DIR/hdfs-site.xml file:
    <property>
        <name>dfs.nameservices</name>
        <value>namenode1,namenode2,namenode3</value>
    </property>
    <property>
        <name>dfs.namenode.rpc-address.namenode1</name>
        <value>master1:54310</value>
    </property>
    <property>
        <name>dfs.namenode.http-address.namenode1</name>
        <value>master1:50070</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address.namenode1</name>
        <value>master1:50071</value>
    </property>
    <property>
        <name>dfs.namenode.rpc-address.namenode2</name>
        <value>master2:54310</value>
    </property>
    <property>
        <name>dfs.namenode.http-address.namenode2</name>
        <value>master2:50070</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address.namenode2</name>
        <value>master2:50071</value>
    </property>
    <property>
        <name>dfs.namenode.rpc-address.namenode3</name>
        <value>master3:54310</value>
    </property>
    <property>
        <name>dfs.namenode.http-address.namenode3</name>
        <value>master3:50070</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address.namenode3</name>
        <value>master3:50071</value>
    </property>
  4. Format namenode3:
    hdfs namenode -format -clusterId hdcluster
  5. Sync the configuration into all the other NameNodes:
    scp $HADOOP_CONF_DIR/hdfs-site.xml master1:$HADOOP_CONF_DIR/
    scp $HADOOP_CONF_DIR/hdfs-site.xml master2:$HADOOP_CONF_DIR/
  6. Sync the configuration into all the slave nodes in the cluster:
    for host in `cat $HADOOP_CONF_DIR/slaves`; do
        echo 'Sync configuration files to ' $host
        scp $HADOOP_CONF_DIR/hdfs-site.xml $host:$HADOOP_CONF_DIR/
    done
    
  7. start-dfs.sh
  8. Tell the DataNodes about the NameNode change (50020 is the default DataNode IPC port):
    for host in `cat $HADOOP_CONF_DIR/slaves`; do
        echo 'Processing on host' $host
        hdfs dfsadmin -refreshNamenodes $host:50020
    done
    
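To confirm that the DataNodes have registered with namenode3, we can check the cluster report and the new NameNode's web UI; a quick sanity check:
    hdfs dfsadmin -report
    curl http://master3:50070/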
