Tutorial: Create an end-to-end data pipeline to derive sales insights in Azure HDInsight
In this tutorial, you build an end-to-end data pipeline that performs extract, transform, and load (ETL) operations. The pipeline uses Apache Spark and Apache Hive clusters running on Azure HDInsight for querying and manipulating the data. You also use technologies like Azure Data Lake Storage Gen2 for data storage and Power BI for visualization.
This data pipeline combines data from various stores, removes unwanted data, appends new data, and loads the data back to your storage to visualize business insights. For more information about ETL pipelines, see Extract, transform, and load at scale.
If you don't have an Azure subscription, create a free account before you begin.
If you use PowerShell to trigger the Azure Data Factory pipeline, you need the Az PowerShell module.
You also need Power BI Desktop to visualize business insights at the end of this tutorial.
Create resources
This section shows you how to create resources.
Clone the repository with scripts and data
Sign in to your Azure subscription. If you plan to use Azure Cloud Shell, select Try it in the upper-right corner of the code block. Otherwise, enter the following command:
Azure CLI
az login
# If you have multiple subscriptions, set the one to use
# az account set --subscription "SUBSCRIPTIONID"
Ensure that you're a member of the Azure role Owner. Replace user@contoso.com with your account, and then enter the command:
Azure CLI
az role assignment list \
    --assignee "user@contoso.com" \
    --role "Owner"
If a record isn't returned, you aren't a member and can't complete this tutorial.
The template deployment creates the following resources:
A user-assigned managed identity. This identity gives the HDInsight clusters access to the Data Lake Storage Gen2 account.
An Apache Spark cluster. This cluster is used to clean up and transform the raw data.
An Apache Hive Interactive Query cluster. You can use this cluster to query the sales data and visualize it with Power BI.
An Azure virtual network supported by network security group rules. This virtual network allows the clusters to communicate and secures their communications.
Cluster creation can take around 20 minutes.
The default password for Secure Shell (SSH) protocol access to the clusters is Thisisapassword1. If you want to change the password, go to the ./templates/resourcesparameters_remainder.json file and change the password for the sparksshPassword, sparkClusterLoginPassword, llapClusterLoginPassword, and llapsshPassword parameters.
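The commands that start the template deployment depend on the repository's scripts, but as an illustrative sketch, a deployment of this kind can be started with az deployment group create. The template file name below is a placeholder; resourcesparameters_remainder.json is the parameters file mentioned above.
Azure CLI
# Illustrative sketch only; the template file name is a placeholder.
az deployment group create \
    --resource-group RESOURCEGROUP \
    --template-file ./templates/resources.json \
    --parameters ./templates/resourcesparameters_remainder.json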
Verify deployment and collect resource information
If you want to check the status of your deployment, go to the resource group in the Azure portal. Under Settings, select Deployments. Then select your deployment. Here you can see the resources that successfully deployed and the resources that are still in progress.
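You can also check deployment status from the command line. The following Azure CLI call is an illustrative alternative to the portal steps:
Azure CLI
# List the deployments in the resource group and their provisioning state
az deployment group list \
    --resource-group RESOURCEGROUP \
    --query "[].{name:name, state:properties.provisioningState}" \
    --output table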
To view the names of the clusters, list the HDInsight clusters in your resource group.
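One way to do this is with the Azure CLI; the following is an illustrative example that assumes RESOURCEGROUP is replaced with your resource group name:
Azure CLI
# List the HDInsight clusters in the resource group
az hdinsight list \
    --resource-group RESOURCEGROUP \
    --query "[].name" \
    --output table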
Create a data factory
Azure Data Factory is a tool that helps automate data pipelines in Azure. It's not the only way to accomplish these tasks, but it's a great way to automate the processes. For more information on Data Factory, see the Data Factory documentation.
This data factory has one pipeline with two activities:
The first activity copies the data from Blob Storage to the Data Lake Storage Gen2 storage account to mimic data ingestion.
The second activity transforms the data in the Spark cluster. The script transforms the data by removing unwanted columns. It also appends a new column that calculates the revenue that a single transaction generates.
To set up your Data Factory pipeline, run the setup script included in the repository. You should still be in the hdinsight-sales-insights-etl directory. The script does the following tasks (an illustrative sketch of these steps appears after the list):
Fills in the actual name of your Data Lake Storage Gen2 storage account in the sparktransform.py and query.hql files.
Obtains storage keys for the Data Lake Storage Gen2 and Blob Storage accounts.
Creates another resource deployment to create a Data Factory pipeline with its associated linked services and activities. It passes the storage keys as parameters to the template file so that the linked services can access the storage accounts correctly.
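The following Azure CLI commands are an illustrative sketch of those three tasks. The storage account names, the STORAGEACCOUNTNAME placeholder token, the template file name, and the template parameter names are assumptions for illustration; the script in the repository may use different names.
Azure CLI
# Illustrative sketch; names and parameters below are placeholders.

# 1. Fill in the Data Lake Storage Gen2 account name in the transform and query scripts
sed -i "s/STORAGEACCOUNTNAME/adlsgen2account/g" sparktransform.py query.hql

# 2. Obtain the storage keys for the Data Lake Storage Gen2 and Blob Storage accounts
ADLS_KEY=$(az storage account keys list \
    --account-name adlsgen2account \
    --resource-group RESOURCEGROUP \
    --query "[0].value" --output tsv)
BLOB_KEY=$(az storage account keys list \
    --account-name blobaccount \
    --resource-group RESOURCEGROUP \
    --query "[0].value" --output tsv)

# 3. Deploy the Data Factory pipeline, passing the storage keys to the template
az deployment group create \
    --resource-group RESOURCEGROUP \
    --template-file ./templates/adftemplate.json \
    --parameters adlsKey="$ADLS_KEY" blobKey="$BLOB_KEY"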
Run the data pipeline
This section shows you how to run the data pipeline.
Trigger the Data Factory activities
The first activity in the Data Factory pipeline that you created moves the data from Blob Storage to Data Lake Storage Gen2. The second activity applies the Spark transformations on the data and saves the transformed .csv files to a new location. The entire pipeline might take a few minutes to finish.
To retrieve the Data Factory name, list the Data Factory resources in your resource group.
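One illustrative way to do this with the Azure CLI (RESOURCEGROUP is a placeholder for your resource group name):
Azure CLI
# List the Data Factory instances in the resource group
az resource list \
    --resource-group RESOURCEGROUP \
    --resource-type "Microsoft.DataFactory/factories" \
    --query "[].name" \
    --output tsv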
To trigger the pipeline, you can use the Data Factory authoring UI in the Azure portal, or you can use PowerShell. To trigger the Data Factory pipeline in PowerShell, replace RESOURCEGROUP and DataFactoryName with the appropriate values, and then run the following commands:
PowerShell
# If you have multiple subscriptions, set the one to use
# Select-AzSubscription -SubscriptionId "<SUBSCRIPTIONID>"
$resourceGroup = "RESOURCEGROUP"
$dataFactory = "DataFactoryName"

$pipeline = Invoke-AzDataFactoryV2Pipeline `
    -ResourceGroupName $resourceGroup `
    -DataFactoryName $dataFactory `
    -PipelineName "IngestAndTransform"

Get-AzDataFactoryV2PipelineRun `
    -ResourceGroupName $resourceGroup `
    -DataFactoryName $dataFactory `
    -PipelineRunId $pipeline
Re-execute Get-AzDataFactoryV2PipelineRun as needed to monitor progress.
To verify that the pipeline has run, take one of the following steps:
Go to the Monitor section in your data factory through the portal.
In Azure Storage Explorer, go to your Data Lake Storage Gen2 storage account. Go to the files file system, and then go to the transformed folder. Check the folder contents to see if the pipeline succeeded.
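As an optional, illustrative command-line check (STORAGEACCOUNTNAME is a placeholder for your Data Lake Storage Gen2 account name), you can also list the contents of the transformed folder with the Azure CLI:
Azure CLI
# List the transformed output in the 'files' file system
az storage fs file list \
    --file-system files \
    --path transformed \
    --account-name STORAGEACCOUNTNAME \
    --auth-mode login \
    --output table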
Visualize the data in Power BI
From Navigator, select sales or sales_raw to preview the data. After the data is loaded, you can experiment with the dashboard that you want to create. To get started with Power BI dashboards, see the Power BI documentation.