Introduction
MapReduce is a data-parallel framework, originally created by Google [1], for running big-data applications on large clusters of computers. The system attempts to decrease interprocess communication by moving computation toward the data. It also transparently tolerates data and computation faults, both of which are highly probable in large-scale cloud settings. Consequently, developers recognize MapReduce for its scalability, fault tolerance, and elasticity. Today, many implementations of the MapReduce programming model are available to cloud users, and many of them rely on Apache Hadoop.
Since its debut, MapReduce has been frequently associated with Hadoop [3, 4], an open-source implementation of MapReduce. Academic, government, and industrial use of Hadoop is growing rapidly [2]. For instance, Yahoo! uses it for around 80% to 90% of its jobs [5]. Others, such as Facebook and Microsoft, have also advocated for the framework [6]. Academia currently uses Hadoop for seismic simulation, natural-language processing, and web data mining, among other applications [7, 8].
Hadoop MapReduce takes on most of the work involved in implementing cloud programs through three main strategies:
- It automatically breaks down jobs into distributed tasks to effectively exploit task parallelism.
- It considers data locality and variations in overall system workload to schedule jobs and tasks efficiently on participating cluster nodes.
- It transparently tolerates both data and task failures.
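The first of these strategies, breaking a job into independent tasks, can be illustrated with a minimal single-process sketch of the classic word-count example. This is not Hadoop's API: the function names (`map_task`, `shuffle`, `reduce_task`, `run_job`) and the `num_splits` parameter are hypothetical, and the sketch runs the map tasks one after another where a real cluster would schedule them in parallel on different nodes. It also omits locality-aware scheduling and fault tolerance.

```python
from collections import defaultdict


def map_task(split):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def shuffle(mapped_outputs):
    """Group the intermediate values by key across all map outputs."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_task(key, values):
    """Combine all values collected for one key into a single count."""
    return key, sum(values)


def run_job(lines, num_splits=2):
    """Run a toy word-count job: split, map, shuffle, then reduce."""
    # Break the input into splits; each split feeds one independent map task.
    splits = [lines[i::num_splits] for i in range(num_splits)]
    # Map tasks share no state, so a cluster could run them in parallel.
    mapped = [map_task(split) for split in splits]
    grouped = shuffle(mapped)
    # One reduce call per distinct key.
    return dict(reduce_task(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    text = ["the quick brown fox", "the lazy dog", "the fox"]
    print(run_job(text))
```

Because each map task reads only its own split and each reduce call sees only the values for one key, a framework is free to place these tasks on whichever nodes it chooses and to rerun any one of them after a failure without affecting the others.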
Learning objectives
In this module, you will:
- Identify the underlying distributed programming model of MapReduce
- Explain how MapReduce can exploit data parallelism
- Identify the input and output of map and reduce tasks
- Define task elasticity, and indicate its importance for effective job scheduling
- Explain the map and reduce task-scheduling strategies in Hadoop MapReduce
- List the elements of the YARN architecture, and identify the role of each element
- Summarize the lifecycle of a MapReduce job in YARN
- Compare and contrast the architectures and the resource allocators of YARN and the previous Hadoop MapReduce
- Indicate how job and task scheduling differ in YARN as opposed to the previous Hadoop MapReduce
Prerequisites
- Understand what cloud computing is, including cloud service models and common cloud providers
- Know the technologies that enable cloud computing
- Understand how cloud service providers pay for and bill for cloud resources
- Know what datacenters are and why they exist
- Know how datacenters are set up, powered, and provisioned
- Understand how cloud resources are provisioned and metered
- Be familiar with the concept of virtualization
- Know the different types of virtualization
- Understand CPU virtualization
- Understand memory virtualization
- Understand I/O virtualization
- Know about the different types of data and how they're stored
- Be familiar with distributed file systems and how they work
- Be familiar with NoSQL databases and object storage, and how they work
- Know what distributed programming is and why it's useful for the cloud
References
1. J. Dean and S. Ghemawat (Dec. 2004). MapReduce: Simplified Data Processing on Large Clusters. OSDI.
2. H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, and S. Babu (2011). Starfish: A Self-Tuning System for Big Data Analytics. CIDR.
3. Z. Fadika and M. Govindaraju (Dec. 2010). LEMO-MR: Low Overhead and Elastic MapReduce Implementation Optimized for Memory and CPU-Intensive Applications. IEEE 2nd Int. Conf. on Cloud Computing Technology and Science (CloudCom).
4. Hadoop.
5. N. B. Rizvandi, A. Y. Zomaya, A. J. Boloori, and J. Taheri (2011). Preliminary Results: Modeling Relation between Total Execution Time of MapReduce Applications and Number of Mappers/Reducers. Tech. Rep. 679, The University of Sydney.
6. S. Ibrahim, H. Jin, L. Lu, S. Wu, B. He, and L. Qi (Dec. 2010). LEEN: Locality/Fairness-Aware Key Partitioning for MapReduce in the Cloud. CloudCom.
7. M. Hammoud and M. F. Sakr (2011). Locality-Aware Reduce Task Scheduling for MapReduce. CloudCom.
8. M. Zaharia, A. Konwinski, A. Joseph, R. Katz, and I. Stoica (2008). Improving MapReduce Performance in Heterogeneous Environments. OSDI.