Project OMO: Optimize MapReduce Overlap with a Good Start (Reduce) and a Good Finish (Map).



The Problem:

We consider that a large scale Hadoop cluster is launched to serve a large volume of MapReduce jobs.

  • A typical MapReduce job includes many identical map tasks and much fewer reduce tasks.
  • A Hadoop cluster includes a master node and multiple slave nodes.

Each node is configured with multiple slots which indicate its capacity of serving tasks. A slot can be set as a map slot or reduce slot to serve one map task or reduce task, respectively.

For MapReduce jobs, the intermediate output of the map phase serves as the input of the reduce phase. Since reduce tasks need the output of map tasks, thus cannot be finished before the map phase is done. However, reduce tasks can start earlier before the completion of the map phase (for transferring/shuffling the intermediate data). These factors make the scheduling design extremely challenging yet the existing products have not thoroughly addressed these issues:

  • Issue 1: The impact on the makespan of a job with different value of slowstart (a fractional value representing the threshold for the map phase's progress exceeding which reduce tasks will be allowed to execute).

  • Issue 2: Impact on the makespan of a job with the misalignment of the tailing map tasks.
Our Solution:

We plan to develop a new strategy, OMO, to optimize the overlap between map and reduce phases. We observe that this overlapping period plays an important role in the MapReduce process and a good alignment of map and reduce phases can reduce the job execution time.

Our solution includes two new techniques:
  • Lazy start of reduce tasks: estimate execution time of the map phase and the shuffling step of the reduce phase, and derives the best time to start the reduce phase in order to minimize the gap from the end of the map phase to the end of the shuffling phase.
  • Batch finish of map tasks: mitigate the extra overhead caused by the misalignment of the tailing map tasks by increasing the execution priority of the tailing map tasks in order to finish them in a wave.
In addition, we introduce a new monitoring component that records the number of slots released in the past. This information serves the new techniques in our solution to predict the slot release frequency in the future.