Distributed computing on the cloud: MapReduce
MapReduce was a breakthrough in big data processing that has since become mainstream and been significantly improved upon. Learn how MapReduce works.
Learning objectives
In this module, you will:
- Identify the underlying distributed programming model of MapReduce
- Explain how MapReduce can exploit data parallelism
- Identify the input and output of map and reduce tasks
- Define task elasticity, and indicate its importance for effective job scheduling
- Explain the map and reduce task-scheduling strategies in Hadoop MapReduce
- List the elements of the YARN architecture, and identify the role of each element
- Summarize the lifecycle of a MapReduce job in YARN
- Compare and contrast the architectures and the resource allocators of YARN and the previous Hadoop MapReduce
- Indicate how job and task scheduling in YARN differ from those in the previous Hadoop MapReduce
In partnership with Dr. Majd Sakr and Carnegie Mellon University.
Prerequisites
- Understand what cloud computing is, including cloud service models and common cloud providers
- Know the technologies that enable cloud computing
- Understand how cloud service providers pay for and bill for cloud resources
- Know what datacenters are and why they exist
- Know how datacenters are set up, powered, and provisioned
- Understand how cloud resources are provisioned and metered
- Be familiar with the concept of virtualization
- Know the different types of virtualization
- Understand CPU virtualization
- Understand memory virtualization
- Understand I/O virtualization
- Know about the different types of data and how they're stored
- Be familiar with distributed file systems and how they work
- Be familiar with NoSQL databases and object storage, and how they work
- Know what distributed programming is and why it's useful for the cloud