Distributed computing on the cloud: MapReduce

Beginner
Developer
Student
Azure

MapReduce was a breakthrough in big data processing that has since become mainstream and been significantly improved upon. Learn how MapReduce works.

Learning objectives

In this module, you will:

  • Identify the underlying distributed programming model of MapReduce
  • Explain how MapReduce can exploit data parallelism
  • Identify the input and output of map and reduce tasks
  • Define task elasticity, and indicate its importance for effective job scheduling
  • Explain the map and reduce task-scheduling strategies in Hadoop MapReduce
  • List the elements of the YARN architecture, and identify the role of each element
  • Summarize the lifecycle of a MapReduce job in YARN
  • Compare and contrast the architectures and the resource allocators of YARN and the earlier Hadoop MapReduce
  • Indicate how job and task scheduling in YARN differ from job and task scheduling in the earlier Hadoop MapReduce
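
As a preview of the map and reduce inputs and outputs named in the objectives above, here is a minimal single-process sketch of the MapReduce programming model in Python. It is not the Hadoop API. The word-count job and the names map_task, reduce_task, and run_job are assumptions chosen for illustration, and a real framework would run the tasks in parallel across many machines.

```python
from collections import defaultdict


def map_task(offset, line):
    """Map: take one input (key, value) record and emit intermediate (key, value) pairs.

    Here the input key is a line offset (unused) and the value is a line of text.
    """
    for word in line.lower().split():
        yield word, 1


def reduce_task(word, counts):
    """Reduce: take one intermediate key with all of its values and emit one output pair."""
    return word, sum(counts)


def run_job(lines):
    """Hypothetical driver: run the map tasks, group by key (the shuffle), then reduce."""
    groups = defaultdict(list)
    for offset, line in enumerate(lines):
        # Each line is independent of the others, which is the data parallelism
        # a real framework exploits by running map tasks concurrently.
        for key, value in map_task(offset, line):
            groups[key].append(value)
    # Every reduce task receives a single key together with all values emitted for it.
    return dict(reduce_task(key, values) for key, values in sorted(groups.items()))


result = run_job(["the quick brown fox", "the lazy dog", "The fox"])
print(result)
```

Map tasks consume input records and produce intermediate key-value pairs. Reduce tasks consume one key with its grouped values and produce the final output.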

In partnership with Dr. Majd Sakr and Carnegie Mellon University.

Prerequisites

  • Understand what cloud computing is, including cloud service models and common cloud providers
  • Know the technologies that enable cloud computing
  • Understand how cloud service providers pay for and bill for cloud services
  • Know what datacenters are and why they exist
  • Know how datacenters are set up, powered, and provisioned
  • Understand how cloud resources are provisioned and metered
  • Be familiar with the concept of virtualization
  • Know the different types of virtualization
  • Understand CPU virtualization
  • Understand memory virtualization
  • Understand I/O virtualization
  • Know about the different types of data and how they're stored
  • Be familiar with distributed file systems and how they work
  • Be familiar with NoSQL databases and object storage, and how they work
  • Know what distributed programming is and why it's useful for the cloud