Wednesday, May 14, 2014

Reuse JVM in Hadoop MapReduce job

Q:
I know we can set the property "mapred.job.reuse.jvm.num.tasks" to reuse JVMs. My questions are:
(1) How do I decide what value to set here: -1, or some positive integer?
(2) Is it a good idea to always reuse JVMs and set this property to -1 in MapReduce jobs?
Thank you very much!
A:
If you have very small tasks that are certain to run one after another, it is useful to set this property to -1 (meaning that a spawned JVM will be reused an unlimited number of times). You then spawn only as many JVMs as there are task slots available to your job in the cluster, instead of one JVM per task.
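To make the JVM-count comparison concrete, here is a minimal sketch in plain Java. It is a rough model, not Hadoop code: the class and method names are hypothetical, the task and slot counts are made-up figures, and it assumes tasks are spread evenly over the slots and that every JVM is reused until it reaches its limit.

```java
// Rough model only (not part of Hadoop): estimates how many task JVMs a job
// spawns for a given value of mapred.job.reuse.jvm.num.tasks.
final class JvmReuseEstimate {

    /**
     * @param tasks      total number of tasks in the job
     * @param slots      task slots available to the job at once (assumed figure)
     * @param reuseLimit -1 for unlimited reuse, otherwise max tasks per JVM
     * @return estimated number of JVMs spawned under the even-spread assumption
     */
    static long jvmsSpawned(long tasks, long slots, long reuseLimit) {
        if (tasks <= 0 || slots <= 0) {
            throw new IllegalArgumentException("tasks and slots must be positive");
        }
        if (reuseLimit == 0 || reuseLimit < -1) {
            throw new IllegalArgumentException("reuseLimit must be -1 or a positive integer");
        }
        // At most this many JVMs are alive at the same time.
        long concurrent = Math.min(tasks, slots);
        if (reuseLimit == -1) {
            // Unlimited reuse: one JVM per busy slot for the whole job.
            return concurrent;
        }
        // Limited reuse: a fresh JVM is needed after every reuseLimit tasks.
        long byLimit = (tasks + reuseLimit - 1) / reuseLimit; // ceiling division
        return Math.max(concurrent, byLimit);
    }

    public static void main(String[] args) {
        System.out.println(jvmsSpawned(1000, 20, 1));  // no reuse: 1000
        System.out.println(jvmsSpawned(1000, 20, 3));  // reuse for 3 tasks: 334
        System.out.println(jvmsSpawned(1000, 20, -1)); // unlimited reuse: 20
    }
}
```

With these assumed numbers, unlimited reuse cuts the JVM count from one per task (1000) down to one per slot (20).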

This is a huge performance improvement. In long-running jobs, however, the time needed to set up a new JVM is a very small percentage of the total runtime, so reuse does not give you a big performance boost there.
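The difference between short and long tasks can be illustrated with a little arithmetic. The sketch below is plain Java with hypothetical names, and the one-second JVM startup time and the task durations are assumed figures chosen only for illustration, not measurements.

```java
// Illustrative arithmetic only: what share of a task's wall-clock time goes
// to starting a fresh JVM when no reuse is configured.
final class JvmStartupShare {

    /**
     * @param startupSeconds time to spawn a new task JVM (assumed figure)
     * @param taskSeconds    time the task itself runs
     * @return fraction of total time spent on JVM startup, between 0 and 1
     */
    static double startupShare(double startupSeconds, double taskSeconds) {
        if (startupSeconds < 0 || taskSeconds <= 0) {
            throw new IllegalArgumentException("durations must be positive");
        }
        return startupSeconds / (startupSeconds + taskSeconds);
    }

    public static void main(String[] args) {
        // A 5-second task with a 1-second JVM startup: about 16.7% overhead.
        System.out.printf("short task: %.1f%%%n", 100 * startupShare(1.0, 5.0));
        // A 10-minute task with the same startup: about 0.2% overhead.
        System.out.printf("long task:  %.1f%%%n", 100 * startupShare(1.0, 600.0));
    }
}
```

Under these assumptions, reuse can remove a noticeable share of the runtime for short tasks, while for long tasks the saving is negligible.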

Also, for long-running tasks it is good to recreate the task process, because issues like heap fragmentation can degrade your performance.

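In addition, if you have some jobs of medium running time, you could reuse each JVM for just 2-3 tasks, which is a good trade-off.
In short, the answer says that on systems with small tasks, mapred.job.reuse.jvm.num.tasks improves performance a great deal; when tasks are large, it is best not to enable it.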

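PS: In Hadoop 2.x.x this parameter has been renamed to mapreduce.job.jvm.numtasks. Perhaps it is only because first impressions stick, but mapred.job.reuse.jvm.num.tasks still seems easier to understand to me.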
