This solution idea illustrates how to extract, transform, and load (ETL) big data on demand by running Hadoop MapReduce and Apache Spark on Azure HDInsight clusters.
Architecture
Dataflow
The data flows through the architecture as follows:
1. Using Azure Data Factory, create linked services to the source systems and data stores. Azure Data Factory pipelines support more than 90 connectors, including generic protocols for data sources that don't have a native connector.
2. Load data from the source systems into Azure Data Lake Storage by using the Copy Data tool.
3. Azure Data Factory can create an on-demand HDInsight cluster. Start by creating an on-demand HDInsight linked service. Next, create a pipeline and use the HDInsight activity that matches the Hadoop framework you're using (for example, Hive, MapReduce, or Spark).
4. Trigger the pipeline in Azure Data Factory. The architecture assumes that Azure Data Lake Storage is the file system for the Hadoop script run by the HDInsight activity that you created in step 3. The on-demand HDInsight cluster runs the script and writes the output to a curated area of the data lake.
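The on-demand linked service and the HDInsight activity described in the dataflow are defined in Azure Data Factory as JSON. The following Python sketch builds minimal versions of both definitions as dictionaries, so the shape of each is easier to see. It's an illustration under stated assumptions, not a complete deployment: the helper functions, the names `OnDemandHDInsight`, `ClusterStorage`, and `CurateDataPipeline`, the script paths, and every value in angle brackets are hypothetical placeholders. Only a subset of the available properties is shown, and publishing the definitions and triggering the pipeline aren't covered.

```python
import json

# Data Factory activity type for each Hadoop framework named in the dataflow.
ACTIVITY_TYPES = {
    "hive": "HDInsightHive",
    "mapreduce": "HDInsightMapReduce",
    "spark": "HDInsightSpark",
}


def on_demand_hdinsight_linked_service(name, storage_linked_service,
                                       cluster_type="hadoop", cluster_size=4,
                                       time_to_live="00:15:00"):
    """Build a minimal on-demand HDInsight linked service definition."""
    return {
        "name": name,
        "properties": {
            "type": "HDInsightOnDemand",
            "typeProperties": {
                "clusterType": cluster_type,
                "clusterSize": cluster_size,
                # Idle time after which the on-demand cluster is deleted.
                "timeToLive": time_to_live,
                # Placeholders: replace with values from your own subscription.
                "hostSubscriptionId": "<subscription-id>",
                "clusterResourceGroup": "<resource-group>",
                "servicePrincipalId": "<service-principal-id>",
                "servicePrincipalKey": {
                    "type": "SecureString",
                    "value": "<service-principal-key>",
                },
                "tenant": "<tenant-id>",
                # Storage linked service that the cluster uses (hypothetical name).
                "linkedServiceName": {
                    "referenceName": storage_linked_service,
                    "type": "LinkedServiceReference",
                },
            },
        },
    }


def hdinsight_pipeline(name, framework, compute_linked_service, type_properties):
    """Build a pipeline with one HDInsight activity for the given framework."""
    activity_type = ACTIVITY_TYPES[framework.lower()]
    return {
        "name": name,
        "properties": {
            "activities": [
                {
                    "name": f"Run{activity_type}",
                    "type": activity_type,
                    "linkedServiceName": {
                        "referenceName": compute_linked_service,
                        "type": "LinkedServiceReference",
                    },
                    "typeProperties": type_properties,
                }
            ]
        },
    }


linked_service = on_demand_hdinsight_linked_service(
    "OnDemandHDInsight", "ClusterStorage", cluster_type="spark"
)
pipeline = hdinsight_pipeline(
    "CurateDataPipeline",
    "spark",
    linked_service["name"],
    # Hypothetical location of the Spark script in the cluster's storage.
    {"rootPath": "scripts/spark", "entryFilePath": "curate.py"},
)
print(json.dumps(pipeline, indent=2))
```

Switching the `framework` argument to `"hive"` or `"mapreduce"` selects the corresponding activity type. The `typeProperties` then need the properties that the chosen activity expects, such as a script path for Hive or a class name and JAR path for MapReduce.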
Components
- Azure Data Factory - Cloud-scale data integration service for orchestrating data flow.
- Azure Data Lake Storage - Scalable and cost-effective cloud storage for big data processing.
- Apache Hadoop - Big data distributed processing framework.
- Apache Spark - Big data distributed processing framework that supports in-memory processing to boost performance for big data applications.
- Azure HDInsight - Cloud distribution of Hadoop components.
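Because the data lake acts as the file system for the Hadoop script, the script addresses its input and output by storage URI. The following sketch shows how those URIs are formed, assuming the data lake is an Azure Data Lake Storage Gen2 account, which uses the `abfss://` scheme. The helper `abfss_uri`, the account name `contosodatalake`, and the `raw` and `curated` container names are hypothetical.

```python
def abfss_uri(account, file_system, path):
    """Return the ABFS URI for a path in a Data Lake Storage Gen2 account."""
    return f"abfss://{file_system}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


# Hypothetical landing and curated areas of the data lake.
raw_path = abfss_uri("contosodatalake", "raw", "/sales/2024/")
curated_path = abfss_uri("contosodatalake", "curated", "sales/daily/")

print(raw_path)
print(curated_path)
```

A Hive, MapReduce, or Spark job would read from a location like `raw_path` and write its results to a location like `curated_path`.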
Scenario details
This solution idea describes the data flow for an ETL use case.
Potential use cases
You can use Azure HDInsight for various big data processing scenarios. The data can be historical (already collected and stored) or real-time (streamed directly from the source). For more information about processing such data, see Scenarios for using HDInsight.
Contributors
This article is maintained by Microsoft. It was originally written by the following contributors.
Principal author:
- Jon Dobrzeniecki | Cloud Solution Architect
Next steps
Learn more about the component technologies:
- Tutorial: Create on-demand Apache Hadoop clusters in HDInsight using Azure Data Factory
- Introduction to Azure Data Factory
- Introduction to Azure Data Lake Storage Gen2
- Load data into Azure Data Lake Storage Gen2 with Azure Data Factory
- What is Apache Hadoop in Azure HDInsight?
- Invoke MapReduce Programs from Data Factory
- Use MapReduce in Apache Hadoop on HDInsight
- What is Apache Spark in Azure HDInsight
Related resources
Explore related architectures: