Perform advanced streaming data transformations with Apache Spark and Kafka in Azure HDInsight

Intermediate
Data Engineer
Data Scientist
Azure HDInsight

In this module, you will learn how to create real-time streaming data analytics pipelines and applications in the cloud by using Azure HDInsight with Apache Kafka and Apache Spark.

Learning objectives

By the end of this module, you will understand:

  • When to use Apache Spark and Kafka with HDInsight
  • How Spark Structured Streaming works
  • The architecture of a Kafka and Spark solution
  • How to provision HDInsight, create a Kafka producer, and stream Kafka data to a Jupyter notebook
  • How to replicate data to a secondary cluster

Prerequisites

The following prerequisites should be completed:

  • Successfully log in to the Azure portal
  • Understand the Azure storage options
  • Understand the Azure compute options
  • Create and configure an HDInsight cluster in the Azure portal