Benchmarking and profiling a Hadoop cluster
Perform HDFS benchmarks:
- Generate test files on the filesystem:
hadoop jar $HADOOP_HOME/hadoop-test-*.jar testfilesystem -files 10 -megaBytes 10
This command will generate 10 files of 10 MB each for testing.
- Benchmark the distributed write consistency on the distributed filesystem:
hadoop jar $HADOOP_HOME/hadoop-test-*.jar DistributedFSCheck -write -nrFiles 10 -fileSize 50
This command will write 10 files (controlled by the -nrFiles option) of 50 MB each (controlled by the -fileSize option) with random content to HDFS.
- Similarly, we can benchmark the distributed read consistency on the distributed filesystem:
hadoop jar $HADOOP_HOME/hadoop-test-*.jar DistributedFSCheck -read -nrFiles 10 -fileSize 50
This command will read 10 files of 50 MB each from the cluster and generate a result file.
From the output of these two commands (the read and write consistency checks), we can see that writing is more expensive than reading for an HDFS cluster. This is because write operations involve additional work, such as computing and recording the checksums of the data blocks.
Perform MapReduce cluster benchmarks
- Benchmark MapReduce jobs:
hadoop jar $HADOOP_HOME/hadoop-test-*.jar mapredtest 5 1000
The mapredtest benchmark performs a load test on the MapReduce computing framework. The test uses random integers, which are generated, written to files, read back from the files, and compared with the original values.
- Test the reliability of the MapReduce distributed computing framework:
hadoop jar $HADOOP_HOME/hadoop-test-*.jar MRReliabilityTest -libjars $HADOOP_HOME/hadoop-examples-*.jar
This command will intentionally cause task and TaskTracker failures on a running job.
The failed tasks will be re-executed on a different TaskTracker, so killing one or a few tasks should not fail a job if the cluster is resilient to failures. If a job fails after a few killed tasks, it is possibly because MapReduce is not reliable enough or not resilient to failures, and reliability tuning (such as adding more TaskTracker nodes) is needed.
- Benchmark MapReduce with a large number of small jobs:
hadoop jar $HADOOP_HOME/hadoop-test-*.jar mrbench -numRuns 20
mrbench executes a small job a number of times (20 in this command) and checks whether these small jobs are responsive and run efficiently on the cluster.
- Benchmark the MapReduce load generator:
hadoop jar $HADOOP_HOME/hadoop-test-*.jar loadgen -m 100 -r 20 -keepmap 50 -keepred 50 -indir input -outdir output
- Do a stress test on the NameNode:
hadoop jar $HADOOP_HOME/hadoop-test-*.jar nnbench -create_write
- Test Hadoop performance with large non-splittable files:
hadoop jar $HADOOP_HOME/hadoop-test-*.jar testbigmapoutput -input input -output output -create 2048
- Test threaded map spills:
hadoop jar $HADOOP_HOME/hadoop-test-*.jar threadedmapbench
threadedmapbench is a MapReduce benchmark that compares the performance of maps with multiple spills to maps with only one spill.
Benchmark Hadoop sort
- Generate some random text data:
hadoop jar $HADOOP_HOME/hadoop-examples-*.jar randomwriter random.writer.out
- Sort the generated random data:
hadoop jar $HADOOP_HOME/hadoop-examples-*.jar sort random.writer.out random.writer.out.sorted
- Validate the MapReduce sort algorithm:
hadoop jar $HADOOP_HOME/hadoop-test-*.jar testmapredsort -m 50 -r 5 -sortInput random.writer.out -sortOutput random.writer.out.sorted
This command will validate the accuracy of the sort algorithm.
Analyzing job history with Rumen
The current Rumen implementation includes two components: TraceBuilder and Folder. TraceBuilder takes job history as input and generates easily parsed JSON files. Folder is a utility for manipulating input traces; most of the time, it is used to scale the job traces summarized by TraceBuilder.
- Use TraceBuilder to extract the gold trace from the Hadoop job history files. The syntax of the command is:
hadoop org.apache.hadoop.tools.rumen.TraceBuilder [options]
hadoop org.apache.hadoop.tools.rumen.TraceBuilder -recursive file:///tmp/jobtraces.json file:///tmp/topology.out file:///usr/local/hadoop/logs/history/done
- The second step of using Rumen is to scale the data generated in the previous step:
hadoop org.apache.hadoop.tools.rumen.Folder [options] [input] [output]
For example, to scale the runtime of the job trace generated in the previous step to 50 minutes:
hadoop org.apache.hadoop.tools.rumen.Folder -output-duration 50m -input-cycle 20m file:///tmp/jobtraces.json file:///tmp/job-scaled-50min.json
Benchmarking a Hadoop cluster with GridMix2
GridMix is a tool for benchmarking Hadoop clusters. It generates a number of synthetic MapReduce jobs and builds a model based on the performance of these jobs. Resource profiles of the cluster are modeled based on the job execution metrics.
- ssh hduser@master
- cd $HADOOP_HOME/src/benchmarks/gridmix2
- ant
- cp build/gridmix.jar
- Open the gridmix-env-2 file:
export HADOOP_VERSION=hadoop-1.1.2
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/conf
export USE_REAL_DATASET=FALSE
export APP_JAR=${HADOOP_HOME}/hadoop-test-*.jar
export EXAMPLE_JAR=${HADOOP_HOME}/hadoop-examples-*.jar
export STREAMING_JAR=${HADOOP_HOME}/contrib/streaming/hadoop-streaming-*.jar
- Open the gridmix_config.xml GridMix2 configuration file and adjust the benchmark by changing its properties.
- Make the generateGridmix2Data.sh script executable:
chmod +x generateGridmix2Data.sh
- Generate data with the following command:
./generateGridmix2Data.sh
This command will generate data on HDFS. By default, the generated data will be compressed with a block compression ratio of 4, and three jobs will be started.
- chmod +x rungridmix_2
- ./rungridmix_2
streamSort.smallJobs.numOfJobs 10,5
streamSort.smallJobs.numOfReduces 6,3
These two properties specify that we will use 10 small streamSort jobs with 6 reducers and 5 small streamSort jobs with 3 reducers.
The data required by GridMix is generated by running the generateGridmix2Data.sh script. We can configure this script to change, for example, the size of the generated data file. Then, when the rungridmix_2 script is executed, a number of jobs will be generated and submitted in batch mode. In the end, the running time of these jobs is computed. GridMix2 is shipped with the following representative jobs: streamSort, javaSort, webdataSort, combiner, and monsterSort. These jobs can be classified into the following categories:
- Three-stage MapReduce jobs, motivated by multistage or pipelined MapReduce jobs.
- Large sorts of variable key/value sizes, motivated by the processing of large datasets.
- Reference select jobs, motivated by jobs that sample from a large reference dataset.
- API text sort jobs, motivated by the application of the MapReduce APIs for sorting.
Benchmarking a Hadoop cluster with GridMix1
- cd $HADOOP_HOME/src/benchmarks/gridmix
- Change the configuration in the gridmix-env file:
export HADOOP_HOME=/usr/local/hadoop
export GRID_MIX_HOME=$HADOOP_HOME/src/benchmarks/gridmix
export APP_JAR=${HADOOP_HOME}/hadoop-test-*.jar
export EXAMPLE_JAR=${HADOOP_HOME}/hadoop-examples-*.jar
export STREAMING_JAR=${HADOOP_HOME}/contrib/streaming/hadoop-streaming-*.jar
export GRID_MIX_DATA=/gridmix1/data
export GRID_MIX_PROG=/gridmix1/programs
The last two environment variables, GRID_MIX_DATA and GRID_MIX_PROG, specify two directories on HDFS, so the generated data and programs will be stored on HDFS.
- chmod +x generateData.sh
- sh ./generateData.sh
- Run a small javasort job:
sh javasort/text-sort.small
Benchmarking a Hadoop cluster with GridMix3
- Copy the required JAR files to the Hadoop lib directory:
cp $HADOOP_HOME/hadoop-tools-*.jar $HADOOP_HOME/lib
cp $HADOOP_HOME/contrib/gridmix/hadoop-gridmix-*.jar $HADOOP_HOME/lib
The hadoop-tools-*.jar file contains tools such as Rumen, which is needed by GridMix3, and the hadoop-gridmix-*.jar file contains the GridMix3 benchmark tool. The GridMix3 job mix for a Hadoop cluster is typically described with a job trace file, which is generated from the job configuration files using Rumen.
- Use Rumen to generate a job trace file:
hadoop org.apache.hadoop.tools.rumen.TraceBuilder -recursive file:///tmp/jobtrace.json file:///tmp/topology.out file:///usr/local/hadoop/logs/history/done
- Run the GridMix3 benchmark:
hadoop org.apache.hadoop.mapred.gridmix.Gridmix -generate 100m gridmixdata /tmp/jobtrace.json
- To see the usage and the available parameters for GridMix3:
hadoop org.apache.hadoop.mapred.gridmix.Gridmix
Using Hadoop Vaidya to identify performance problems
Hadoop Vaidya is an open source, rule-based performance diagnostic framework for Apache Hadoop. Each rule can identify a specific performance problem. For example, Hadoop cluster administrators can use Vaidya to identify slowly progressing jobs that are wasting cluster resources, and Hadoop users can use Vaidya to identify configuration mistakes in their submitted jobs. Perform the following steps to use Hadoop Vaidya on the master node:
- Locate the directory for the job configuration file you want to analyze. The default location of this folder is $HADOOP_HOME/logs
- Locate the job history files under the job history directory:
find $HADOOP_HOME/logs -name 'job_201304012330_0001*'
- Use Vaidya to analyze the job trace files:
sh $HADOOP_HOME/contrib/vaidya/bin/vaidya.sh -jobconf file:///usr/local/hadoop/logs/history/job_20130412330_0002_conf.xml -joblog file:///usr/local/hadoop/logs/history/job_20130412330_002_1364874504561_hduser_TeraGen -report report.txt
HiBench is a benchmarking suite for Hadoop. It has nine typical workloads, including micro benchmarks, HDFS benchmarks, web search, machine learning, and data analytics benchmarks. For example, it supports benchmarks for Nutch (a text indexing software package), PageRank (the PageRank algorithm), the Mahout machine learning algorithms, and Hive queries.
Balancing data blocks for a Hadoop cluster
HDFS stores data blocks on DataNode machines. As Hadoop processes jobs, data is generated and deleted, and over time some DataNodes can host many more data blocks than others. This unbalanced distribution of data over the cluster is called data skew.
Data skew is a big problem for a Hadoop cluster. When the JobTracker assigns tasks to TaskTrackers, it follows the general rule of data locality, which means map tasks are preferentially assigned to the hosts where their data blocks reside. If the data block distribution is skewed, in other words, if the data blocks are located on only a small percentage of the DataNodes, then only the nodes holding data blocks can satisfy the data locality rule. If the JobTracker assigns tasks to nodes that do not host the data locally, the data must be transferred from remote machines to the TaskTracker machine. This data transfer consumes a large amount of network bandwidth, degrading the overall performance of the cluster.
Balance HDFS data blocks with balancer:
- Check the data skew through the web UI.
- Use the following command to balance the data blocks on the DataNode machines:
hadoop balancer -threshold 0.2
This command will take some time to finish, depending on the state of the distributed filesystem and the value of the -threshold option. The -threshold option specifies the threshold, expressed as a percentage of cluster capacity, for deciding whether the cluster is balanced; its default value is 10. A smaller value for this option leads to a more even distribution of the data blocks, but also requires more time to finish. Setting this option to 0 is not recommended because it is not practical to achieve an ideal balance.
- Alternatively, we can start the Hadoop balancer daemon to automatically balance the data blocks on HDFS:
start-balancer.sh
The balancer will move data blocks among the DataNodes according to the space utilization.
PS:
start-balancer.sh also accepts the -threshold parameter, which specifies the balancing threshold. The default is 10, measured against each DataNode's actual HDFS usage as a fraction of the cluster's total HDFS capacity.
For example, if a DataNode's HDFS usage is 1.2 GB and the cluster's total HDFS capacity is 10 TB (that is, 10,000 GB), the ratio is 1.2/10000 = 0.00012. When the balancer is run with a -threshold parameter smaller than 0.00012, the cluster will be balanced, so the command could be: start-balancer.sh -threshold 0.0001
start-balancer.sh can be started on the NameNode or a DataNode and can be stopped at any time with stop-balancer.sh. The default balancing bandwidth is 1 MB/s.
The Hadoop balancer balances data blocks on HDFS according to a preconfigured threshold value, which sets the target for deciding whether the cluster is balanced. A node is considered balanced if the difference between its space utilization and the space utilization of the cluster is less than the threshold.
Sometimes, we want to limit the bandwidth used by the balancer. Hadoop defines a property, dfs.balance.bandwidthPerSec, which determines the maximum speed at which data blocks are moved from one DataNode to another. Its default value is 1 MB/s. Configuring this property to a higher value makes balancing faster, but uses more resources. For example, to change the value of this property to 10 MB/s, we can change it in $HADOOP_HOME/conf/hdfs-site.xml.
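As an illustrative sketch (note that this property takes its value in bytes per second, not megabytes), raising the balancer bandwidth to roughly 10 MB/s in $HADOOP_HOME/conf/hdfs-site.xml might look like this:

```xml
<!-- $HADOOP_HOME/conf/hdfs-site.xml -->
<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <!-- 10 MB/s expressed in bytes per second: 10 * 1024 * 1024 -->
  <value>10485760</value>
</property>
```

The DataNodes typically need to be restarted for a new value of this property to take effect.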
Choosing a Proper Block Size
HDFS stores data as data blocks distributed on multiple machines. When a large file is put onto HDFS, it is first split into a number of data blocks, which are then distributed by the NameNode to the DataNodes in the cluster. The granularity of the data blocks can affect the distribution and parallel execution of tasks. Based on the properties of the jobs being executed, one block size might result in better performance than another.
Configure the proper HDFS block size:
- Run a typical job on the configured cluster.
hadoop jar $HADOOP_HOME/hadoop-examples-*.jar terasort input output
- Use Rumen to generate job traces from the job history file and the job logfile:
hadoop org.apache.hadoop.tools.rumen.TraceBuilder file:///tmp/jobtraces.json file:///tmp/topology.out file:///usr/local/hadoop/logs/history/done/job_201304012206_0002_conf.xml file:///usr/local/hadoop/logs/history/done/job_201304012206_0002_log - Use GridMix3 to generate a Hadoop cluster benchmark with different block size:
hadoop org.apache.hadoop.mapred.gridmix.Gridmix -generate 10m input jobtraces.json
- stop-all.sh
- Change dfs.block.size in the $HADOOP_HOME/conf/hdfs-site.xml.
- start-all.sh
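For reference, a hedged sketch of a dfs.block.size entry in $HADOOP_HOME/conf/hdfs-site.xml; the value is in bytes, and it only affects files written after the change (existing blocks keep their original size):

```xml
<property>
  <name>dfs.block.size</name>
  <!-- 128 MB = 134217728 bytes; larger blocks mean fewer, longer map tasks -->
  <value>134217728</value>
</property>
```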
Using Compression for input and output
A typical MapReduce job uses parallel map tasks to load data from external storage devices, such as hard drives, into main memory. When a job finishes, the reduce tasks write the result data back to the hard drive. In this way, during the life cycle of a MapReduce job, many data copies are created as data is relayed between the hard drive and main memory. Sometimes, the data is also copied over the network from a remote node. Copying data to and from hard drives and transferring it over the network are expensive operations. To reduce the cost of these operations, Hadoop introduced compression of the data.
Data compression in Hadoop is done by a compression codec, which is a program that encodes and decodes data streams. Although compression and decompression add some cost to the system, the advantages far outweigh the disadvantages.
Configure input and output data compression in the $HADOOP_HOME/conf/mapred-site.xml on the Hadoop master machine.
- stop-all.sh
- Enable output compression:
mapred.output.compress true
- Specify the output compression codec:
mapred.output.compression.codec org.apache.hadoop.io.compress.GzipCodec
- Change the output compression type for sequence file output:
mapred.output.compression.type BLOCK
- Configure the map output compression:
mapred.compress.map.output true
- Similar to the codec configuration for the MapReduce job output, we can configure the compression codec for the map task output, the default of which is org.apache.hadoop.io.compress.DefaultCodec:
mapred.map.output.compression.codec org.apache.hadoop.io.compress.GzipCodec
- Copy the configuration file from the master node to all the slave nodes:
for host in `cat $HADOOP_HOME/conf/slaves`; do
  echo 'Copying mapred-site.xml file to host:' $host
  scp $HADOOP_HOME/conf/mapred-site.xml $host:$HADOOP_HOME/conf/
done
- start-all.sh
| Codec name | Java class |
| --- | --- |
| DefaultCodec | org.apache.hadoop.io.compress.DefaultCodec |
| GzipCodec | org.apache.hadoop.io.compress.GzipCodec |
| BZip2Codec | org.apache.hadoop.io.compress.BZip2Codec |
| SnappyCodec | org.apache.hadoop.io.compress.SnappyCodec |
| LzoCodec | org.apache.hadoop.io.compress.LzoCodec |
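Putting the compression steps above together, the corresponding entries in $HADOOP_HOME/conf/mapred-site.xml might look like the following sketch; GzipCodec is used here purely as an example, and any codec from the preceding table can be substituted:

```xml
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
<property>
  <!-- applies to SequenceFile output: NONE, RECORD, or BLOCK -->
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
</property>
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
```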
Configuring Speculative Execution
Speculative execution is a proactive performance-boosting strategy in which the JobTracker executes one task on two TaskTracker instances; when either instance finishes, the other is killed. By default, speculative execution is turned on. Speculative execution can improve the performance of MapReduce jobs by reducing the execution time of slowly progressing tasks. For example, on a heterogeneous Hadoop cluster with different hardware configurations, low-performance computing nodes can greatly prolong the execution time of a MapReduce job. Speculative execution can remedy this problem by prioritizing the high-performance nodes for MapReduce task execution, hence shortening the MapReduce execution time.
On the other hand, speculative execution can negatively affect the performance of the cluster when a lot of resources are used for speculative execution. For example, many tasks will have to wait for slots that are used for speculative execution.
Configure Hadoop speculative execution in the $HADOOP_HOME/conf/mapred-site.xml:
- stop-mapred.sh
- Disable the map task speculative execution:
mapred.map.tasks.speculative.execution false
- Disable the reduce task speculative execution:
mapred.reduce.tasks.speculative.execution false
- Configure the maximum percentage of concurrently running speculative tasks:
mapreduce.job.speculative.speculativecap 0.2
- Configure the job speculative execution threshold for slow tasks:
mapreduce.job.speculative.slowtaskthreshold 1.0
- Configure the threshold for a TaskTracker to speculatively execute slow tasks:
mapreduce.job.speculative.slownodethreshold 1.0
- Sync the configuration to the slaves.
- start-mapred.sh
Speculative execution can also be disabled per job in the driver code:
Configuration conf = new Configuration();
conf.set("mapred.map.tasks.speculative.execution", "false");
conf.set("mapred.reduce.tasks.speculative.execution", "false");
The three properties mapreduce.job.speculative.speculativecap, mapreduce.job.speculative.slowtaskthreshold, and mapreduce.job.speculative.slownodethreshold control when the JobTracker should start a speculative task. Specifically, a speculative task for a regular task will start when the following conditions are met:
- Speculative execution is enabled.
- The completion rate, in percentage, of the regular task is less than mapreduce.job.speculative.slowtaskthreshold times the mean completion rate of all other tasks.
- The completion rate, in percentage, of the regular task is less than mapreduce.job.speculative.slownodethreshold times the mean completion rate of all other tasks on the current TaskTracker.
- The number of launched speculative tasks is smaller than the cap configured with mapreduce.job.speculative.speculativecap.
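As a sketch, the speculative execution settings above expressed as $HADOOP_HOME/conf/mapred-site.xml entries:

```xml
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>false</value>
</property>
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
</property>
<property>
  <!-- at most 20% of running tasks may be speculative copies -->
  <name>mapreduce.job.speculative.speculativecap</name>
  <value>0.2</value>
</property>
```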
Setting a proper number of map and reduce slots for the TaskTracker
The number of map and reduce slots determines the number of concurrent map/reduce tasks for a TaskTracker, which forks multiple JVMs to run these tasks.
Configure map/reduce slots for a TaskTracker:
- stop-mapred.sh
- Configure the map slots in the $HADOOP_HOME/conf/mapred-site.xml:
mapred.tasktracker.map.tasks.maximum 4
- Similarly, configure the number of reduce slots for a TaskTracker:
mapred.tasktracker.reduce.tasks.maximum 4
- Configure the memory usage for each slot:
mapred.child.java.opts -Xmx1024m
- Sync the configuration to all the slave nodes.
- start-mapred.sh
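The slot settings above can be sketched as mapred-site.xml entries; the values shown are examples and should be sized to the node's cores and RAM:

```xml
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <!-- heap per task JVM; 4 + 4 slots at -Xmx1024m needs roughly 8 GB per node -->
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>
```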
Tuning the JobTracker Configuration
Configure the JobTracker in the $HADOOP_HOME/conf/mapred-site.xml:
- stop-mapred.sh
- Configure the maximum number of tasks for a job:
mapred.jobtracker.maxtasks.per.job 3000
- Configure the JobTracker to recover upon restart:
mapred.jobtracker.restart.recover true
- Configure the block size for the job history file:
mapred.jobtracker.job.history.block.size 3145728
- Configure the task scheduler for the JobTracker:
mapred.jobtracker.taskScheduler org.apache.hadoop.mapred.JobQueueTaskScheduler
- Configure the maximum number of running tasks for a job:
mapred.jobtracker.taskScheduler.maxRunningTasksPerJob 20
Tuning the TaskTracker Configuration
Configure the TaskTracker properties in the $HADOOP_HOME/conf/mapred-site.xml:
- stop-mapred.sh
- Configure the MapReduce cluster heartbeat interval:
mapred.tasktracker.expiry.interval 600000
- Configure the sleep time before sending the SIGKILL signal:
mapred.tasktracker.tasks.sleeptime-before-sigkill 6000
- Enable the TaskTracker memory management:
mapred.tasktracker.tasks.maxmemory true
- Configure the TaskTracker index cache size to be 20 MB:
mapred.tasktracker.indexcache.mb 20
- Configure the monitoring interval for the TaskTracker's task memory manager:
mapred.tasktracker.taskmemorymanager.monitoring-interval 5000
- Configure the TaskTracker to send an out-of-band heartbeat on task completion:
mapreduce.tasktracker.outofband.heartbeat true
- Configure the maximum number of retries for a failed map task:
mapred.map.max.attempts 4
- Configure the maximum number of retries for a failed reduce task:
mapred.reduce.max.attempts 4
- Sync the configuration from the master node to all the slave nodes.
- start-mapred.sh
Tuning shuffle, merge, and sort parameters
In a MapReduce job, map task outputs are collected in in-memory JVM buffers. The size of the in-memory buffer determines how much data can be merged and sorted at once. Too small a buffer size can cause a large number of spill operations, incurring significant overhead.
Configure the sorting parameters in the $HADOOP_HOME/conf/mapred-site.xml:
- stop-mapred.sh
- Configure the buffer size, in megabytes, for sorting:
io.sort.mb 100
- Configure the merge factor:
io.sort.factor 100
- Change the percentage of the buffer dedicated to record collection:
io.sort.record.percent 0.05
- Change the spill factor for the buffers:
io.sort.spill.percent 0.8
- Configure the in-memory merge threshold:
mapred.inmem.merge.threshold 1000
- Configure the percentage of the maximum heap size allocated to storing map outputs during the shuffle:
mapred.job.shuffle.buffer.percent 0.70
- Configure the threshold at which the in-memory merge starts:
mapred.job.shuffle.merge.percent 0.66
- Configure the percentage of memory used to retain map outputs during the reduce phase:
mapred.job.reduce.input.buffer.percent 0.0
- Configure the maximum number of retries in case of fetch failures:
mapreduce.reduce.shuffle.maxfetchfailures 10
- Sync the configuration from the master node to all the slave nodes.
- start-mapred.sh
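As an illustration, the core sort settings above as mapred-site.xml entries; note that io.sort.mb must fit inside the task heap configured by mapred.child.java.opts:

```xml
<property>
  <!-- in-memory sort buffer, in megabytes -->
  <name>io.sort.mb</name>
  <value>100</value>
</property>
<property>
  <!-- number of streams merged at once -->
  <name>io.sort.factor</name>
  <value>100</value>
</property>
<property>
  <!-- start spilling to disk when the buffer is 80% full -->
  <name>io.sort.spill.percent</name>
  <value>0.8</value>
</property>
```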
Configure memory for a Hadoop cluster
Configure memory properties for a Hadoop cluster in the $HADOOP_HOME/conf/mapred-site.xml:
- stop-mapred.sh
- Configure the virtual memory size, in megabytes, for a map task as used by a scheduler:
mapred.cluster.map.memory.mb 200
- Similarly, configure the virtual memory size, in megabytes, for a reduce task as used by a scheduler:
mapred.cluster.reduce.memory.mb 512
- Configure the maximum virtual memory size for a map task as used by a scheduler:
mapred.cluster.max.map.memory.mb 512
- Configure the maximum virtual memory size for a reduce task as used by a scheduler:
mapred.cluster.max.reduce.memory.mb 512
- Configure the maximum virtual memory size for a single map task of the job as used by a scheduler:
mapred.job.map.memory.mb 0.8
- Configure the maximum virtual memory size for a single reduce task of the job as used by a scheduler:
mapred.job.reduce.memory.mb 0.8
- Sync the configuration from the master to all the slave nodes.
- start-mapred.sh
Setting a proper number of parallel copies
When all or some of the map tasks have finished, the map outputs are copied from the map task nodes to the reduce task nodes. The parallel copying strategy is used to increase the transfer throughput.
Configure the number of parallel copies in the $HADOOP_HOME/conf/mapred-site.xml:
- stop-mapred.sh
- Configure the number of parallel copies used in the reduce shuffle:
mapred.reduce.parallel.copies 20
- Sync the configuration file from the master node to the slave nodes.
- start-mapred.sh
Tuning JVM parameters
Configure JVM parameters in the $HADOOP_HOME/conf/mapred-site.xml:
- stop-all.sh
- Configure the JVM options for child task processes:
mapred.child.java.opts -Xmx512M
- Sync the configuration file from the master node to the slave nodes.
- start-all.sh
Configuring JVM Reuse
MapReduce tasks are executed by JVM processes forked by the TaskTracker. The creation of a JVM, which includes the initialization of the execution environment, is costly, especially when the number of tasks is large. In the default configuration, the number of JVMs needed to finish a job equals the number of tasks; in other words, the default setting uses one JVM to execute one task. When the execution of a task completes, its JVM is killed by the TaskTracker.
JVM Reuse is an optimization of reusing JVMs for multiple tasks. If it is enabled, multiple tasks can be executed sequentially with one JVM.
Configure JVM Reuse in the $HADOOP_HOME/conf/mapred-site.xml:
- stop-mapred.sh
- Configure the number of tasks that may be run per JVM:
mapred.job.reuse.jvm.num.tasks 2
- Sync the configuration file from the master node to the slave nodes.
- start-mapred.sh
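A minimal mapred-site.xml sketch for JVM reuse; a value of -1 allows an unlimited number of tasks to reuse one JVM:

```xml
<property>
  <!-- 2 tasks per JVM; set to -1 for unlimited reuse -->
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>2</value>
</property>
```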
Configuring the reducer initialization time
Reduce tasks can be started when a certain percentage of the map tasks have finished. Setting this property to a smaller value makes the reduce tasks start earlier, occupying computing slots sooner. On the other hand, if the value is set too large, for example, very close to 1, the reduce tasks will have to wait for the majority of the map tasks to finish, prolonging the job execution time.
Configure the reducer initialization time in the $HADOOP_HOME/conf/mapred-site.xml:
- stop-mapred.sh
- Configure the fraction of map tasks that must complete before the reducers start:
mapred.reduce.slowstart.completed.maps 0.05
- Sync the configuration file to all the slave nodes.
- start-mapred.sh