Extract, transform, and load (ETL) using HDInsight

Extract, transform, and load your big data on demand by using Hadoop MapReduce and Apache Spark on HDInsight clusters that are created as needed.

Potential use cases

Azure HDInsight can be used for many big data processing scenarios, with either historical data (data that's already collected and stored) or real-time data (data that's streamed directly from the source). These scenarios are summarized in Scenarios for using HDInsight. This solution idea covers the data flow for an ETL use case.

Architecture

Architecture diagram

Dataflow

The data flows through the architecture as follows:

  1. Using Azure Data Factory, establish linked services to the source systems and data stores. Azure Data Factory pipelines support more than 90 connectors, including generic protocols for data sources that don't have a native connector.

  2. Load data from the source systems into Azure Data Lake Storage with the Copy Data tool.

  3. Azure Data Factory can create an on-demand HDInsight cluster. Start by creating an on-demand HDInsight linked service. Next, create a pipeline and use the appropriate HDInsight activity for the Hadoop framework in use (Hive, MapReduce, Spark, and so on). The first sketch after this list shows one way to script these steps.

  4. Trigger the pipeline in Azure Data Factory. The architecture assumes that Azure Data Lake Storage is used as the file system by the Hadoop script that the HDInsight activity from step 3 runs. An on-demand HDInsight cluster executes the script and writes the output to a curated area of the data lake. The second sketch after this list shows what such a transform script might look like.
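The following Python sketch shows one way to script steps 3 and 4 with the azure-mgmt-datafactory SDK. The resource group, factory name, linked service names, credentials, and script path are hypothetical placeholders, and model names and constructor signatures vary somewhat across SDK versions, so treat this as an outline rather than a definitive implementation.

```python
# Minimal sketch of steps 3-4 with azure-mgmt-datafactory.
# All resource names, credentials, and paths are hypothetical placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    HDInsightHiveActivity,
    HDInsightOnDemandLinkedService,
    LinkedServiceReference,
    LinkedServiceResource,
    PipelineResource,
    SecureString,
)

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "etl-rg"    # hypothetical
FACTORY_NAME = "etl-adf"     # hypothetical

client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

def ls_ref(name: str) -> LinkedServiceReference:
    # Recent SDK versions expect the explicit type discriminator.
    return LinkedServiceReference(type="LinkedServiceReference", reference_name=name)

# Step 3a: on-demand HDInsight linked service. Data Factory creates the
# cluster when the pipeline runs and deletes it after the time to live expires.
on_demand_hdi = LinkedServiceResource(
    properties=HDInsightOnDemandLinkedService(
        cluster_size=4,
        time_to_live="00:15:00",
        version="4.0",
        linked_service_name=ls_ref("StorageLinkedService"),  # from step 1
        host_subscription_id=SUBSCRIPTION_ID,
        tenant="<tenant-id>",
        cluster_resource_group=RESOURCE_GROUP,
        service_principal_id="<app-id>",
        service_principal_key=SecureString(value="<app-key>"),
    )
)
client.linked_services.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "OnDemandHDInsightLS", on_demand_hdi
)

# Step 3b: a pipeline with a Hive activity that runs a script from storage.
hive = HDInsightHiveActivity(
    name="TransformWithHive",
    linked_service_name=ls_ref("OnDemandHDInsightLS"),
    script_path="scripts/transform.hql",  # hypothetical script location
    script_linked_service=ls_ref("StorageLinkedService"),
)
client.pipelines.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "EtlPipeline", PipelineResource(activities=[hive])
)

# Step 4: trigger the pipeline. The on-demand cluster spins up, runs the
# script against the data lake, and writes to the curated zone.
run = client.pipelines.create_run(RESOURCE_GROUP, FACTORY_NAME, "EtlPipeline")
print(f"Pipeline run started: {run.run_id}")
```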
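The transform script itself can be written for any of the supported Hadoop frameworks. As a minimal sketch, the following PySpark script reads raw files that the Copy Data tool landed in the data lake (step 2), applies a simple aggregation, and writes the result to a curated zone (step 4). The storage account, container, and column names are hypothetical, and the sketch assumes the cluster is configured with access to the ADLS Gen2 account.

```python
# transform.py - minimal PySpark sketch of the transform step.
# Paths and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-transform").getOrCreate()

# Raw zone: files landed by the Data Factory copy in step 2.
raw = spark.read.option("header", "true").csv(
    "abfss://raw@etldatalake.dfs.core.windows.net/sales/"
)

# Example transform: type the amount column, drop incomplete rows,
# and aggregate to a daily total.
curated = (
    raw.withColumn("amount", F.col("amount").cast("double"))
       .dropna(subset=["order_date", "amount"])
       .groupBy("order_date")
       .agg(F.sum("amount").alias("daily_total"))
)

# Curated zone: columnar output for downstream analytics.
curated.write.mode("overwrite").parquet(
    "abfss://curated@etldatalake.dfs.core.windows.net/sales_daily/"
)

spark.stop()
```

To run a Spark script like this from the pipeline, the Hive activity in the previous sketch would be swapped for the corresponding Spark activity.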

Components

  • Azure Data Factory - Cloud-scale data integration service for orchestrating data flow.
  • Azure Data Lake Storage - Scalable, cost-effective cloud storage for big data processing.
  • Apache Hadoop - Distributed processing framework for big data.
  • Apache Spark - Distributed processing framework for big data that supports in-memory processing to boost performance.
  • Azure HDInsight - Cloud distribution of Hadoop components.

Contributors

This article is maintained by Microsoft. It was originally written by the following contributors.

Principal author:

Next steps

Learn more about the component technologies:

Explore related architectures: